共查询到20条相似文献,搜索用时 15 毫秒
1.
Battagliero S Puglia G Vicario S Rubino F Scioscia G Leo P 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(5):1196-1207
The increasing use of phylogeny in biological studies is limited by the need to make available more efficient tools for computing distances between trees. The geodesic tree distance-introduced by Billera, Holmes, and Vogtmann-combines both the tree topology and edge lengths into a single metric. Despite the conceptual simplicity of the geodesic tree distance, algorithms to compute it don't scale well to large, real-world phylogenetic trees composed of hundred or even thousand leaves. In this paper, we propose the geodesic distance as an effective tool for exploring the likelihood profile in the space of phylogenetic trees, and we give a cubic time algorithm, GeoHeuristic, in order to compute an approximation of the distance. We compare it with the GTP algorithm, which calculates the exact distance, and the cone path length, which is another approximation, showing that GeoHeuristic achieves a quite good trade-off between accuracy (relative error always lower than 0.0001) and efficiency. We also prove the equivalence among GeoHeuristic, cone path, and Robinson-Foulds distances when assuming branch lengths equal to unity and we show empirically that, under this restriction, these distances are almost always equal to the actual geodesic. 相似文献
2.
J.R. Chasnov 《Theoretical population biology》2010,77(4):270-278
A fast algorithm for computing recombination is developed for model organisms with selection on haploids. Haplotype frequencies are transformed to marginal frequencies; random mating and recombination are computed; marginal frequencies are transformed back to haplotype frequencies. With L diallelic loci, this algorithm is theoretically a factor of a constant times (3/8)L faster than standard computations with selection on diploids, and up to 16 recombining loci have been computed. This algorithm is then applied to model the opposing evolutionary forces of multilocus epistatic selection and recombination. Selection is assumed to favor haplotypes with specific alleles either all present or all absent. When the number of linked loci exceeds a critical value, a jump bifurcation occurs in the two-dimensional parameter space of the selection coefficient s and the recombination frequency r. The equilibrium solution jumps from high to low mean fitness with increasing r or decreasing s. These numerical results display an unexpected and dramatic nonlinear effect occurring in linkage models with a large number of loci. 相似文献
3.
4.
IQPNNI: moving fast through tree space and stopping in time 总被引:12,自引:0,他引:12
An efficient tree reconstruction method (IQPNNI) is introduced to reconstruct a phylogenetic tree based on DNA or amino acid sequence data. Our approach combines various fast algorithms to generate a list of potential candidate trees. The key ingredient is the definition of so-called important quartets (IQs), which allow the computation of an intermediate tree in O(n(2)) time for n sequences. The resulting tree is then further optimized by applying the nearest neighbor interchange (NNI) operation. Subsequently a random fraction of the sequences is deleted from the best tree found so far. The deleted sequences are then re-inserted in the smaller tree using the important quartet puzzling (IQP) algorithm. These steps are repeated several times and the best tree, with respect to the likelihood criterion, is considered as the inferred phylogenetic tree. Moreover, we suggest a rule which indicates when to stop the search. Simulations show that IQPNNI gives a slightly better accuracy than other programs tested. Moreover, we applied the approach to 218 small subunit rRNA sequences and 500 rbcL sequences. We found trees with higher likelihood compared to the results by others. A program to reconstruct DNA or amino acid based phylogenetic trees is available online (http://www.bi.uni-duesseldorf.de/software/iqpnni). 相似文献
5.
Approximate geodesic distances reveal biologically relevant structures in microarray data 总被引:1,自引:0,他引:1
MOTIVATION: Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. RESULTS: We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant, visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account. 相似文献
6.
Sébastien Angibaud Guillaume Fertin Irena Rusu Stéphane Vialette 《Journal of computational biology》2007,14(4):379-393
Computing genomic distances between whole genomes is a fundamental problem in comparative genomics. Recent researches have resulted in different genomic distance definitions, for example, number of breakpoints, number of common intervals, number of conserved intervals, and Maximum Adjacency Disruption number. Unfortunately, it turns out that, in presence of duplications, most problems are NP-hard, and hence several heuristics have been recently proposed. However, while it is relatively easy to compare heuristics between them, until now very little is known about the absolute accuracy of these heuristics. Therefore, there is a great need for algorithmic approaches that compute exact solutions for these genomic distances. In this paper, we present a novel generic pseudo-boolean approach for computing the exact genomic distance between two whole genomes in presence of duplications, and put strong emphasis on common intervals under the maximum matching model. Of particular importance, we show three heuristics which provide very good results on a well-known public dataset of gamma-Proteobacteria. 相似文献
7.
A new fast algorithm for solving the minimum spanning tree problem based on DNA molecules computation 总被引:1,自引:0,他引:1
The minimum spanning tree (MST) problem is to find minimum edge connected subsets containing all the vertex of a given undirected graph. It is a vitally important NP-complete problem in graph theory and applied mathematics, having numerous real life applications. Moreover in previous studies, DNA molecular operations usually were used to solve NP-complete head-to-tail path search problems, rarely for NP-hard problems with multi-lateral path solutions result, such as the minimum spanning tree problem. In this paper, we present a new fast DNA algorithm for solving the MST problem using DNA molecular operations. For an undirected graph with n vertex and m edges, we reasonably design flexible length DNA strands representing the vertex and edges, take appropriate steps and get the solutions of the MST problem in proper length range and O(3m + n) time complexity. We extend the application of DNA molecular operations and simultaneity simplify the complexity of the computation. Results of computer simulative experiments show that the proposed method updates some of the best known values with very short time and that the proposed method provides a better performance with solution accuracy over existing algorithms. 相似文献
8.
A fast method for computing high-significance disease association in large population-based studies
下载免费PDF全文

Because of rapid progress in genotyping techniques, many large-scale, genomewide disease-association studies are now under way. Typically, the disorders examined are multifactorial, and, therefore, researchers seeking association must consider interactions among loci and between loci and other factors. One of the challenges of large disease-association studies is obtaining accurate estimates of the significance of discovered associations. The linkage disequilibrium between SNPs makes the tests highly dependent, and dependency worsens when interactions are tested. The standard way of assigning significance (P value) is by a permutation test. Unfortunately, in large studies, it is prohibitively slow to compute low P values by this method. We present here a faster algorithm for accurately calculating low P values in case-control association studies. Unlike with several previous methods, we do not assume a specific distribution of the traits, given the genotypes. Our method is based on importance sampling and on accounting for the decay in linkage disequilibrium along the chromosome. The algorithm is dramatically faster than the standard permutation test. On data sets mimicking medium-to-large association studies, it speeds up computation by a factor of 5,000-100,000, sometimes reducing running times from years to minutes. Thus, our method significantly increases the problem-size range for which accurate, meaningful association results are attainable. 相似文献
9.
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background. 相似文献
10.
Ming Fang 《TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik》2012,125(8):1727-1734
The recent technology of the single-nucleotide-polymorphism (SNP) array makes it possible to genotype millions of SNP markers on genome, which in turn requires to develop fast and efficient method for fine-scale quantitative trait loci (QTL) mapping. The single-marker association (SMA) is the simplest method for fine-scale QTL mapping, but it usually shows many false-positive signals and has low QTL-detection power. Compared with SMA, the haplotype-based method of Meuwissen and Goddard who assume QTL effect to be random and estimate variance components (VC) with identity-by-descent (IBD) matrices that inferred from unknown historic population is more powerful for fine-scale QTL mapping; furthermore, their method also tends to show continuous QTL-detection profile to diminish many false-positive signals. However, as we know, the variance component estimation is usually very time consuming and difficult to converge. Thus, an extremely fast EMF (Expectation-Maximization algorithm under Fixed effect model) is proposed in this research, which assumes a biallelic QTL and uses an expectation-maximization (EM) algorithm to solve model effects. The results of simulation experiments showed that (1) EMF was computationally much faster than VC method; (2) EMF and VC performed similarly in QTL detection power and parameter estimations, and both outperformed the paired-marker analysis and SMA. However, the power of EMF would be lower than that of VC if the QTL was multiallelic. 相似文献
11.
A graphics processing unit (GPU) has been widely used to accelerate discrete optimization problems. In this paper, we introduce a novel hybrid parallel algorithm to generate a shortest addition chain for a positive integer e. The main idea of the proposed algorithm is to divide the search tree into a sequence of three subtrees: top, middle, and bottom. The top subtree works using a branch and bound depth first strategy. The middle subtree works using a branch and bound breadth first strategy, while the bottom subtree works using a parallel (GPU) branch and bound depth first strategy. Our experimental results show that, compared to the fastest sequential algorithm for generating a shortest addition chain, we improve the generation by about 70% using one GPU (NVIDIA GeForce GTX 770). For generating all shortest addition chains, the percentage of the improvement is about 50%. 相似文献
12.
MOTIVATION: Graph drawing algorithms are often used for visualizing relational information, but a naive implementation of a graph drawing algorithm encounters real difficulties when drawing large-scale graphs such as protein interaction networks. RESULTS: We have developed a new, extremely fast layout algorithm for visualizing large-scale protein interaction networks in the three-dimensional space. The algorithm (1) first finds a layout of connected components of an entire network, (2) finds a global layout of nodes with respect to pivot nodes within a connected component and (3) refines the local layout of each connected component by first relocating midnodes with respect to their cutvertices and direct neighbors of the cutvertices and then by relocating all nodes with respect to their neighbors within distance 2. Advantages of this algorithm over classical graph drawing methods include: (1) it is an order of magnitude faster, (2) it can directly visualize data from protein interaction databases and (3) it provides several abstraction and comparison operations for effectively analyzing large-scale protein interaction networks. AVAILABILITY: http://wilab.inha.ac.kr/interviewer/ 相似文献
13.
Crawford OH 《Bioinformatics (Oxford, England)》1999,15(1):66-71
MOTIVATION: Sequences for new proteins are being determined at a rapid rate, as a result of the Human Genome Project, and related genome research. The ability to predict the three-dimensional structure of proteins from sequence alone would be useful in discovering and understanding their function. Threading, or fold recognition, aims to predict the tertiary structure of a protein by aligning its amino acid sequence with a large number of structures, and finding the best fit. This approach depends on obtaining good performance from both the scoring function, which simulates the free energy for given trial alignments, and the threading algorithm, which searches for the lowest-score alignment. It appears that current scoring functions and threading algorithms need improvement. RESULTS: This paper presents a new threading algorithm. Numerical tests demonstrate that it is more powerful than two popular approximate algorithms, and much faster than exact methods. 相似文献
14.
15.
We have implemented a Fast Fourier Summation algorithm for tomographic reconstruction of three-dimensional biological data sets obtained via transmission electron microscopy. We designed the fast algorithm to reproduce results obtained by the direct summation algorithm (also known as filtered or R-weighted backprojection). For two-dimensional images, the new algorithm scales as O(N(theta)M log M)+O(MN log N) operations, where N(theta) is the number of projection angles and M x N is the size of the reconstructed image. Three-dimensional reconstructions are constructed from sequences of two-dimensional reconstructions. We demonstrate the algorithm on real data sets. For typical sizes of data sets, the new algorithm is 1.5-2.5 times faster than using direct summation in the space domain. The speed advantage is even greater as the size of the data sets grows. The new algorithm allows us to use higher order spline interpolation of the data without additional computational cost. The algorithm has been incorporated into a commonly used package for tomographic reconstruction. 相似文献
16.
By integrating the underlying developmental mechanisms for the phenotypic formation of traits into a mapping framework, functional mapping has emerged as an important statistical approach for mapping complex traits. In this note, we explore the feasibility of using the simplex algorithm as an alternative to solve the mixture-based likelihood for functional mapping of complex traits. The results from the simplex algorithm are consistent with those from the traditional EM algorithm, but the simplex algorithm has considerably reduced computational times. Moreover, because of its nonderivative nature and easy implementation with current software, the simplex algorithm enjoys an advantage over the EM algorithm in the dynamic modeling and analysis of complex traits. 相似文献
17.
The SEQUEST program was the first and remains one of the most widely used tools for assigning a peptide sequence within a database to a tandem mass spectrum. The cross correlation score is the primary score function implemented within SEQUEST and it is this score that makes the tool particularly sensitive. Unfortunately, this score is computationally expensive to calculate, and thus, to make the score manageable, SEQUEST uses a less sensitive but fast preliminary score and restricts the cross correlation to just the top 500 peptides returned by the preliminary score. Classically, the cross correlation score has been calculated using Fast Fourier Transforms (FFT) to generate the full correlation function. We describe an alternate method of calculating the cross correlation score that does not require FFTs and can be computed efficiently in a fraction of the time. The fast calculation allows all candidate peptides to be scored by the cross correlation function, potentially mitigating the need for the preliminary score, and enables an E-value significance calculation based on the cross correlation score distribution calculated on all candidate peptide sequences obtained from a sequence database. 相似文献
18.
Genome-wide association studies (GWAS) have identified a large amount of single-nucleotide polymorphisms (SNPs) associated with complex traits. A recently developed linear mixed model for estimating heritability by simultaneously fitting all SNPs suggests that common variants can explain a substantial fraction of heritability, which hints at the low power of single variant analysis typically used in GWAS. Consequently, many multi-locus shrinkage models have been proposed under a Bayesian framework. However, most use Markov Chain Monte Carlo (MCMC) algorithm, which are time-consuming and challenging to apply to GWAS data. Here, we propose a fast algorithm of Bayesian adaptive lasso using variational inference (BAL-VI). Extensive simulations and real data analysis indicate that our model outperforms the well-known Bayesian lasso and Bayesian adaptive lasso models in accuracy and speed. BAL-VI can complete a simultaneous analysis of a lung cancer GWAS data with ~3400 subjects and ~570,000 SNPs in about half a day. 相似文献
19.
We have developed a pruning algorithm for likelihood estimation of a tree of populations. This algorithm enables us to compute the likelihood for large trees. Thus, it gives an efficient way of obtaining the maximum-likelihood estimate (MLE) for a given tree topology. Our method utilizes the differences accumulated by random genetic drift in allele count data from single-nucleotide polymorphisms (SNPs), ignoring the effect of mutation after divergence from the common ancestral population. The computation of the maximum-likelihood tree involves both maximizing likelihood over branch lengths of a given topology and comparing the maximum-likelihood across topologies. Here our focus is the maximization of likelihood over branch lengths of a given topology. The pruning algorithm computes arrays of probabilities at the root of the tree from the data at the tips of the tree; at the root, the arrays determine the likelihood. The arrays consist of probabilities related to the number of coalescences and allele counts for the partially coalesced lineages. Computing these probabilities requires an unusual two-stage algorithm. Our computation is exact and avoids time-consuming Monte Carlo methods. We can also correct for ascertainment bias. 相似文献
20.
Mario PL Calus 《遗传、选种与进化》2014,46(1):24