Similar Articles
20 similar articles retrieved.
1.
SUMMARY: In the segment-by-segment approach to sequence alignment, pairwise and multiple alignments are generated by comparing gap-free segments of the sequences under study. This method is particularly efficient at detecting local homologies, and it has been used to identify functional regions in large genomic sequences. Herein, an algorithm is outlined that calculates optimal pairwise segment-by-segment alignments in essentially linear space. AVAILABILITY: The program is available at the Bielefeld Bioinformatics Server (BiBiServ) at http://bibiserv.techfak.uni-bielefeld.de/dialign/
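To make the segment-based idea concrete, here is a minimal sketch (not the DIALIGN implementation) that scores gap-free segment pairs diagonal by diagonal, using constant working memory per diagonal; the match/mismatch scores are illustrative assumptions.

```python
# Minimal illustration of segment-by-segment comparison (not the DIALIGN
# algorithm itself): for each diagonal of the comparison matrix, find the
# maximum-scoring gap-free segment pair with a Kadane-style scan, so only
# O(1) working memory is needed per diagonal.

def best_gapfree_segment(a, b, match=1, mismatch=-1):
    best = (0, 0, 0, 0)  # (score, start_in_a, start_in_b, length)
    for d in range(-(len(b) - 1), len(a)):    # diagonal offset i - j = d
        i, j = (d, 0) if d >= 0 else (0, -d)
        score, si, sj, length = 0, i, j, 0
        while i < len(a) and j < len(b):
            score += match if a[i] == b[j] else mismatch
            length += 1
            if score <= 0:                    # restart after this position
                score, si, sj, length = 0, i + 1, j + 1, 0
            elif score > best[0]:
                best = (score, si, sj, length)
            i += 1
            j += 1
    return best  # highest-scoring gap-free segment pair

print(best_gapfree_segment("ACGTACGT", "TTACGTAA"))  # (5, 0, 2, 5)
```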

2.
The explosive growth in biological data in recent years has led to the development of new methods to identify DNA sequences. Many algorithms have recently been developed to search DNA for sequences that occur uniquely. This paper considers the application of the Burrows-Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm to search for unique DNA sequences in a set of genes. The algorithm is applicable to the identification of yeast species and other DNA sequence sets.
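As a hedged illustration of how the BWT supports this kind of search (a toy sketch, not the paper's algorithm): an FM-index-style backward search counts the occurrences of a pattern, and a count of exactly one marks the pattern as unique. The naive suffix sorting below is for brevity only.

```python
# Minimal BWT-based substring counting (a sketch, not the paper's method).
# A pattern occurring exactly once in the text is "unique". Suffix sorting
# here is the naive O(n^2 log n) approach, fine for small examples.

def bwt_index(text):
    text += "$"                               # unique terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def count_occurrences(bwt, C, pattern):
    """Backward search: returns how often pattern occurs in the text."""
    lo, hi = 0, len(bwt)                      # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = bwt_index("ACGTACGA")
print(count_occurrences(bwt, C, "ACG"))       # 2 -> not unique
print(count_occurrences(bwt, C, "CGT"))       # 1 -> unique
```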

3.
4.
General-purpose graphics processing units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that can exploit intrinsic fine-grain parallelism. This paper presents the design and implementation on GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler transform. We compare this algorithm with state-of-the-art implementations of the same algorithm on standard CPUs under the same I/O conditions. Excluding disk transfers, the GPU implementation of the algorithm shows a speedup greater than 12 compared to CPU execution. The implementation exploits parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU, showing good performance scalability limited only by the size of the GPU's internal memory.

5.
A space-efficient algorithm for local similarities
Existing dynamic-programming algorithms for identifying similar regions of two sequences require time and space proportional to the product of the sequence lengths. Often this space requirement is more limiting than the time requirement. We describe a dynamic-programming local-similarity algorithm that needs only space proportional to the sum of the sequence lengths. The method can also find repeats within a single long sequence. To illustrate the algorithm's potential, we discuss comparison of a 73 360 nucleotide sequence containing the human β-like globin gene cluster and a corresponding 44 594 nucleotide sequence for rabbit, a problem well beyond the capabilities of other dynamic-programming software.
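The following sketch shows the simpler half of the idea: computing the optimal local-similarity score with only two dynamic-programming rows, so memory grows with the sum rather than the product of the sequence lengths. Recovering the alignment itself in linear space requires the divide-and-conquer step of the full method; the scoring parameters here are illustrative.

```python
# Local-similarity (Smith-Waterman) score in linear space: keep only two
# rows of the DP matrix. Recovering the alignment itself in linear space
# needs the Hirschberg-style divide and conquer of the full method.

def local_similarity_score(a, b, match=2, mismatch=-1, gap=-2):
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # local alignment: restart
                          prev[j - 1] + s,      # (mis)match
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

print(local_similarity_score("ACACACTA", "AGCACACA"))
```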

6.

Key message

We propose a novel computational method for genomic selection that combines identical-by-state (IBS)-based Haseman–Elston (HE) regression and best linear prediction (BLP), called HE-BLP.

Abstract

Genomic best linear unbiased prediction (GBLUP) has been widely used in whole-genome prediction for breeding programs. To determine the total genetic variance of a training population, a linear mixed model (LMM) must be solved via restricted maximum likelihood (REML), whose computational complexity is cubic in the sample size. We propose a novel computational method combining identical-by-state (IBS)-based Haseman–Elston (HE) regression and best linear prediction (BLP), called HE-BLP. With this method, the total genetic variance can be estimated by solving a simple HE linear regression, whose computational complexity is only quadratic in the sample size; it is therefore suitable for large-scale genomic data, except when environmental effects must be estimated simultaneously, which the method does not support. In Monte Carlo simulation studies, the heritability estimated via HE was identical to that estimated via REML, and the prediction accuracies of HE-BLP and traditional GBLUP were also quite similar when quantitative trait loci (QTLs) were randomly distributed along the genome and their effects followed a normal distribution. In addition, the kernel row number (KRN) trait in a maize IBM population was used to evaluate the performance of the two methods; the results showed similar prediction accuracy of breeding values despite slightly different heritability estimates via HE and REML, probably due to the underlying genetic architecture. HE-BLP is thus a promising genomic selection method for even larger genomic data sets in cases where environmental effects can be ignored. The software for HE regression and the simulation program are available online in the Genetic Analysis Repository (GEAR; https://github.com/gc5k/GEAR/wiki).
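A minimal sketch of the HE-regression step (illustrative, not the GEAR implementation): assuming standardized phenotypes, the expected cross-product of two individuals' phenotypes is proportional to their relationship-matrix entry, so heritability can be read off as a regression slope over all pairs, an O(n²) computation.

```python
# Minimal sketch of GRM-based Haseman-Elston regression (not the GEAR
# implementation): with standardized phenotypes, E[y_i * y_j] = h^2 * G_ij
# for i != j, so heritability is the slope of a simple regression over the
# n*(n-1)/2 pairs -- an O(n^2) computation.
import numpy as np

def he_heritability(Z, y):
    n, p = Z.shape
    Zs = (Z - Z.mean(0)) / Z.std(0)           # standardize markers
    G = Zs @ Zs.T / p                         # genomic relationship matrix
    ys = (y - y.mean()) / y.std()
    iu = np.triu_indices(n, k=1)              # off-diagonal pairs only
    x, t = G[iu], np.outer(ys, ys)[iu]
    return float(x @ t / (x @ x))             # least-squares slope, ~h^2

# Toy simulation: QTL effects normal, heritability 0.5.
rng = np.random.default_rng(1)
n, p, h2 = 500, 2000, 0.5
Z = rng.binomial(2, 0.5, size=(n, p)).astype(float)
Zstd = (Z - Z.mean(0)) / Z.std(0)
beta = rng.normal(0, np.sqrt(h2 / p), p)
y = Zstd @ beta + rng.normal(0, np.sqrt(1 - h2), n)
print(he_heritability(Z, y))                  # should be near 0.5
```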

7.
The Integrative Genomics Viewer (IGV) for iPad, based on the popular IGV application for desktop and laptop computers, supports researchers who wish to take advantage of the mobility of today’s tablet computers to view genomic data and present findings to colleagues.

8.
A protocol for the construction of a microsatellite-enriched genomic library
An improved protocol for constructing microsatellite-enriched libraries was developed. The procedure depends on digesting genomic DNA with a restriction enzyme that generates blunt-ends, and on ligating linkers that, when dimerized, create a restriction site for a different blunt-end producing restriction enzyme. Efficient ligation of linkers to the genomic DNA fragments is achieved by including restriction enzymes in the ligation reaction that eliminate unwanted ligation products. After ligation, the reaction mixture is subjected to subtractive hybridization without purification. DNA fragments containing microsatellites are captured by biotin-labeled oligonucleotide repeats and recovered using streptavidin-coated beads. The recovered fragments are amplified by PCR using the linker sequence as primer, and cloned directly into a plasmid vector. The linker has the sequence GTTT on the 5′ end, which promotes efficient adenylation of the 3′ ends of the PCR products. Consequently, the amplified fragments could be cloned into vectors without purification. This procedure enables efficient enrichment and cloning of microsatellite sequences, resulting in a library with a low level of redundancy.

9.
MOTIVATION: Most diseases are caused by sets of gene defects that occur in complex association. The association scheme of expressed genes can be modelled by genetic networks, which are efficient tools for understanding the dynamics of pathogenic processes by modelling the molecular reality of cellular conditions. In this sense, a genetic network consists of, first, a set of genes of specified cells, tissues or species and, second, causal relations between these genes that determine the functional condition of the biological system, e.g. under disease. A relation between two genes exists if both are directly or indirectly associated with the disease [8]. Our goal is to characterize diseases (especially autoimmune diseases such as chronic pancreatitis (CP), multiple sclerosis (MS) and rheumatoid arthritis (RA)) by genetic networks generated by a computer system. We introduce this practice as a bioinformatics approach to target finding.

10.
A statistical framework for genomic data fusion
MOTIVATION: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. RESULTS: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods provide efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimal kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained on all of these data to recognize particular classes of proteins--membrane proteins and ribosomal proteins--performs significantly better than the same algorithm trained on any single type of data. AVAILABILITY: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm
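A hedged sketch of the kernel-fusion idea (the paper learns the kernel weights by semidefinite programming; here the weights are fixed and uniform, and the two data views are synthetic):

```python
# Sketch of kernel-based data fusion. Each data view yields one kernel;
# the combined kernel is their weighted sum, fed to an SVM with a
# precomputed kernel. The SDP step that optimizes the weights is omitted.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)

# Two synthetic "views" of the same entities (e.g. expression, sequence).
view1 = labels[:, None] + rng.normal(0, 1.0, (n, 50))
view2 = labels[:, None] + rng.normal(0, 2.0, (n, 30))

def linear_kernel(X):
    K = X @ X.T
    return K / np.mean(np.diag(K))            # crude scale normalization

weights = [0.5, 0.5]                          # SDP would optimize these
K = sum(w * linear_kernel(X) for w, X in zip(weights, (view1, view2)))

train, test = slice(0, 150), slice(150, n)
clf = SVC(kernel="precomputed").fit(K[train, train], labels[train])
print(clf.score(K[test, train], labels[test]))
```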

11.

Background

A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed.

Results

We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{ℓσ log n, K log m}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure.

Conclusions

Our experimental results show that our algorithms are practical, that they can exploit multiple cores via a parallel traversal of the suffix-link tree, and that they are competitive in both space and time with the state of the art.
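The union-find structure on reads, which the framework uses to merge reads into cluster candidates, is the standard textbook structure; a minimal sketch (not the paper's implementation):

```python
# Union-find with path halving and union by rank -- the standard
# structure the framework relies on for merging reads into clusters
# (a textbook sketch, not the paper's implementation).

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

# Reads sharing sufficient sequence evidence get merged into one cluster.
uf = UnionFind(5)
for a, b in [(0, 1), (1, 2), (3, 4)]:         # toy overlap evidence
    uf.union(a, b)
print([uf.find(i) for i in range(5)])         # two clusters
```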

12.

Key message

New methods that incorporate the main and interaction effects of high-dimensional markers and of high-dimensional environmental covariates gave increased prediction accuracy of grain yield in wheat across and within environments.

Abstract

In most agricultural crops the effects of genes on traits are modulated by environmental conditions, leading to genotype-by-environment interaction (G × E). Modern genotyping technologies allow characterizing genomes in great detail, and modern information systems can generate large volumes of environmental data. In principle, G × E can be accounted for using interactions between markers and environmental covariates (ECs). However, when genotypic and environmental information is high dimensional, modeling all possible interactions explicitly becomes infeasible. In this article we show how to model interactions between high-dimensional sets of markers and ECs using covariance functions. The model presented here consists of a (random) reaction norm in which the genetic and environmental gradients are described as linear functions of markers and of ECs, respectively. We assessed the proposed method using data from Arvalis, consisting of 139 wheat lines genotyped with 2,395 SNPs and evaluated for grain yield over 8 years at various locations within northern France. A total of 68 ECs, defined based on five phases of the phenology of the crop, were used in the analysis. Interaction terms accounted for a sizable proportion (16 %) of the within-environment yield variance, and the prediction accuracy of models including interaction terms was substantially higher (17–34 %) than that of models based on main effects only. Breeding for target environmental conditions has become a central priority of most breeding programs. Methods like the one presented here, which can capitalize on the wealth of genomic and environmental information available, will become increasingly important.
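A minimal sketch of the covariance structure described above, under the assumption that the interaction kernel is the cell-wise (Hadamard) product of the marker-based and EC-based kernels; matrices and scalings are illustrative, not the authors' code:

```python
# Sketch of the reaction-norm covariance structure: main-effect kernels
# from markers (G) and environmental covariates (E), plus an interaction
# kernel as their Hadamard product. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_envs, p, q = 50, 8, 500, 68
X = rng.binomial(2, 0.5, (n_lines, p)).astype(float)   # markers per line
W = rng.normal(size=(n_envs, q))                       # ECs per environment

# One record per line-environment combination.
line = np.repeat(np.arange(n_lines), n_envs)
env = np.tile(np.arange(n_envs), n_lines)

def scaled_kernel(M):
    Ms = (M - M.mean(0)) / M.std(0)
    return Ms @ Ms.T / M.shape[1]

G = scaled_kernel(X)[np.ix_(line, line)]   # genomic main effect
E = scaled_kernel(W)[np.ix_(env, env)]     # environmental main effect
GxE = G * E                                # interaction: Hadamard product

# These kernels would enter a mixed model such as
#   y = mu + g + e + ge + residual, with g ~ N(0, s2g*G), ge ~ N(0, s2ge*GxE)
print(G.shape, E.shape, GxE.shape)         # all (400, 400)
```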

13.
M Ghandi, MA Beer. PLoS ONE 2012, 7(8): e38695
Data normalization is a crucial preliminary step in analyzing genomic datasets. The goal of normalization is to remove global variation so that readings across different experiments are comparable. In addition, most genomic loci have non-uniform sensitivity to any given assay because of variation in local sequence properties. In microarray experiments, this non-uniform sensitivity is due to different DNA hybridization and cross-hybridization efficiencies, known as the probe effect. In this paper we introduce a new scheme, called Group Normalization (GN), that removes both global and local biases in one integrated step: the normalized probe signal is determined by finding a set of reference probes with similar responses. Compared to conventional normalization methods such as quantile normalization and physically motivated probe-effect models, the proposed method is general in that it does not require the assumption that the underlying signal distribution be identical for the treatment and control, and it is flexible enough to correct for nonlinear and higher-order probe effects. The Group Normalization algorithm is computationally efficient and easy to implement. We also describe a variant of the Group Normalization algorithm, called Cross Normalization, which efficiently amplifies biologically relevant differences between any two genomic datasets.
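A speculative sketch of the reference-probe idea only (the similarity measure, neighborhood size k, and use of control experiments are assumptions for illustration, not the authors' specification):

```python
# Sketch of the reference-probe idea behind Group Normalization: each
# probe is standardized against the k probes whose behavior across a set
# of control experiments is most similar to its own. The similarity
# measure (Euclidean distance) and k are illustrative assumptions.
import numpy as np

def group_normalize(signal, controls, k=50):
    """signal: (n_probes,) treatment readings;
    controls: (n_probes, n_control_experiments) reference readings."""
    n = len(signal)
    normalized = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(controls - controls[i], axis=1)
        ref = np.argsort(d)[1:k + 1]          # k most similar probes
        mu, sd = signal[ref].mean(), signal[ref].std()
        normalized[i] = (signal[i] - mu) / (sd if sd > 0 else 1.0)
    return normalized

rng = np.random.default_rng(0)
controls = rng.normal(size=(1000, 5))
probe_effect = controls.mean(1)               # locally varying sensitivity
signal = probe_effect + rng.normal(0, 0.1, 1000)
print(group_normalize(signal, controls, k=50)[:5])
```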

14.
J Zilsel, P H Ma, J T Beatty. Gene 1992, 120(1): 89-92
We present and derive a formula that is useful for the design and evaluation of gene cloning experiments in which a complete gene library of the entire genome of an organism is desired. The formula n = ln(1 − φ) / ln(1 − f) (in which n is the number of recombinant clones required to ensure a probability, φ, of obtaining at least one of each of all possible gene sequences, and f is the fraction of the genome contained in an average-sized DNA fragment) applies to the construction of libraries in which at least one copy of all the genetic information of a genome is required. The use of this formula for quantitative evaluation of genomic libraries should give greater assurance that a given library is complete.
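The formula translates directly into a one-line computation; the genome and insert sizes below are illustrative:

```python
# The library-size formula as code: number of clones n needed so that,
# with probability phi, every sequence is represented at least once.
import math

def clones_required(phi, f):
    """phi: desired completeness probability; f: fraction of the genome
    in an average insert (insert size / genome size)."""
    return math.ceil(math.log(1 - phi) / math.log(1 - f))

# E.g. 4 Mb genome, 4 kb average inserts (f = 1e-3), 99% probability:
print(clones_required(0.99, 4e3 / 4e6))       # about 4603 clones
```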

15.
Random forests for genomic data analysis
Chen X, Ishwaran H. Genomics 2012, 99(6): 323-329
Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to "large p, small n" problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progress of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.
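A typical "large p, small n" workflow of the kind the review covers, sketched with scikit-learn on synthetic data:

```python
# Sketch of a "large p, small n" random-forest analysis: classification
# plus variable importance for feature selection. The data are synthetic;
# only the workflow mirrors the review.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p, informative = 100, 5000, 10             # far more features than samples
X = rng.normal(size=(n, p))
y = (X[:, :informative].sum(axis=1) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# Variable selection: rank features by impurity-based (Gini) importance.
top = np.argsort(rf.feature_importances_)[::-1][:10]
print("Top features:", sorted(top))           # mostly among the first 10
```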

16.
17.
Multidimensional scaling for large genomic data sets

Background  

Multidimensional scaling (MDS) aims to represent high-dimensional data in a low-dimensional space while preserving the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure; for large data sets, it can effectively reduce information-retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene-network research. However, although a number of studies have applied MDS techniques to genomics research, the number of data points that can be analyzed is restricted by the high computational complexity of MDS. In general, a non-metric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N²), making it difficult to process data sets with a large number of genes N, such as whole-genome microarray data.
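For reference, classical metric MDS via double centering and eigendecomposition, the baseline whose cost the paper addresses; this is a textbook sketch, not the authors' scalable method:

```python
# Classical (metric) MDS: double-center the squared distance matrix and
# embed with the top eigenvectors. The O(N^2) distance matrix (and the
# even costlier eigendecomposition) is exactly the bottleneck discussed
# above; this sketch is the baseline, not the paper's method.
import numpy as np

def classical_mds(X, dims=2):
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ D2 @ J                     # Gram matrix from distances
    vals, vecs = np.linalg.eigh(B)            # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dims]       # keep the top `dims`
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
Y = classical_mds(X, dims=2)
print(Y.shape)                                # (200, 2)
```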

18.
19.
20.