共查询到20条相似文献,搜索用时 0 毫秒
1.
A stepwise algorithm for finding minimum evolution trees 总被引:1,自引:6,他引:1
A stepwise algorithm for reconstructing minimum evolution (ME) trees from
evolutionary distance data is proposed. In each step, a taxon that
potentially has a neighbor (another taxon connected to it with a single
interior node) is first chosen and then its true neighbor searched
iteratively. For m taxa, at most (m-1)!/2 trees are examined and the tree
with the minimum sum of branch lengths (S) is chosen as the final tree.
This algorithm provides simple strategies for restricting the tree space
searched and allows us to implement efficient ways of dynamically computing
the ordinary least squares estimates of S for the topologies examined.
Using computer simulation, we found that the efficiency of the ME method in
recovering the correct tree is similar to that of the neighbor-joining
method (Saitou and Nei 1987). A more exhaustive search is unlikely to
improve the efficiency of the ME method in finding the correct tree because
the correct tree is almost always included in the tree space searched with
this stepwise algorithm. The new algorithm finds trees for which S values
may not be significantly different from that of the ME tree if the correct
tree contains very small interior branches or if the pairwise distance
estimates have large sampling errors. These topologies form a set of
plausible alternatives to the ME tree and can be compared with each other
using statistical tests based on the minimum evolution principle. The new
algorithm makes it possible to use the ME method for large data sets.
相似文献
2.
ABSTRACT: BACKGROUND: Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. RESULTS: This paper proposes a novel gene selection method with rich biomedical meaning based on Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, a HBSA-based ensemble classifier is constructed using majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets including three pairs of cross-platform datasets indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. CONCLUSIONS: It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtype and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. Moreover, they are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network. 相似文献
3.
4.
Keith JM Adams P Bryant D Kroese DP Mitchelson KR Cochran DA Lala GH 《Bioinformatics (Oxford, England)》2002,18(11):1494-1499
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases. 相似文献
5.
Cluster Computing - Minimum latency problem (MLP) is an NP-Hard combinatorial optimization problem. Metaheuristics use perturbation and randomization to arrive at a satisfactory solution under time... 相似文献
6.
Background
Distance matrix methods constitute a major family of phylogenetic estimation methods, and the minimum evolution (ME) principle (aiming at recovering the phylogeny with shortest length) is one of the most commonly used optimality criteria for estimating phylogenetic trees. The major difficulty for its application is that the number of possible phylogenies grows exponentially with the number of taxa analyzed and the minimum evolution principle is known to belong to the -hard class of problems. 相似文献7.
Heinrich Kuhn 《Flexible Services and Manufacturing Journal》1995,7(3):229-254
The paper considers the loading problem in flexible manufacturing systems (FMSs). This problem involves the assignment to the machine tools of all operations and associated cutting tools required for part types that have been selected to be produced simultaneously. The loading problem is first formulated as a linear mixed 0–1 program with the objective to minimize the greatest workload assigned to each machine. A heuristic procedure is presented in which an assignment of operations to machine tools is obtained by solving a parameterized generalized assignment problem with an objective function that approximates the use of tool slots required by the operations assigned to the machines. The algorithm is coded in FORTRAN and tested on an IBM-compatible personal computer. Computational results are presented for different test problems to demonstrate the efficiency and effectiveness of the suggested procedure. 相似文献
8.
Dubrova E Teslenko M 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(5):1393-1399
This paper addresses the problem of finding attractors in synchronous Boolean networks. The existing Boolean decision diagram-based algorithms have limited capacity due to the excessive memory requirements of decision diagrams. The simulation-based algorithms can be applied to larger networks, however, they are incomplete. We present an algorithm, which uses a SAT-based bounded model checking to find all attractors in a Boolean network. The efficiency of the presented algorithm is evaluated by analyzing seven networks models of real biological processes, as well as 150,000 randomly generated Boolean networks of sizes between 100 and 7,000. The results show that our approach has a potential to handle an order of magnitude larger models than currently possible. 相似文献
9.
This paper describes a generic algorithm for finding restrictionsites within DNA sequences. The genericity ofthe algorithm is made possible through the use of set theory.Basic elements of DNA sequences, i.e. nucleotides (bases), arerepresented in sets, and DNA sequences, whether specific, ambiguousor even protein-coding, are represented as sequences of thosesets. The set intersection operation demonstrates its abilityto perform pattern-matching correctly on various DNA sequences.The performance analysis showed that the degree of complexityof the pattern matching is reduced from exponential to linear.An example is given to show the actual and potential restrictionsites, derived by the generic algorithm, in the DNA sequencetemplate coding for a synthetic calmodulin. Received on October 2, 1990; accepted on December 18, 1990 相似文献
10.
11.
12.
Accurate phylogenetic reconstruction methods are inherently computationally heavy and therefore are limited to relatively small numbers of taxa. Supertree construction is the task of amalgamating small trees over partial sets into a big tree over the complete taxa set. The need for fast and accurate supertree methods has become crucial due to the enormous number of new genomic sequences generated by modern technology and the desire to use them for classification purposes. In particular, the Assembling the Tree of Life (ATOL) program aims at constructing the evolutionary history of all living organisms on Earth. When dealing with unrooted trees, a quartet - an unrooted tree over four taxa - is the most basic piece of phylogenetic information. Therefore, quartet amalgamation stands at the heart of any supertree problem as it concerns combining many minimal pieces of information into a single, coherent, and more comprehensive piece of information.We have devised an extremely fast algorithm for quartet amalgamation and implemented it in a very efficient code. The new code can handle over a hundred millions of quartet trees over several hundreds of taxa with very high accuracy. 相似文献
13.
14.
A dynamic programming algorithm for finding alternative RNA secondary structures. 总被引:6,自引:8,他引:6
下载免费PDF全文

Dynamic programming algorithms that predict RNA secondary structure by minimizing the free energy have had one important limitation. They were able to predict only one optimal structure. Given the uncertainties of the thermodynamic data and the effects of proteins and other environmental factors on structure, the optimal structure predicted by these methods may not have biological significance. We present a dynamic programming algorithm that can determine optimal and suboptimal secondary structures for an RNA. The power and utility of the method is demonstrated in the folding of the intervening sequence of the rRNA of Tetrahymena. By first identifying the major secondary structures corresponding to the lowest free energy minima, a secondary structure of possible biological significance is derived. 相似文献
15.
R.J. Pankhurst 《Mathematical biosciences》1983,65(2):209-218
A computer program is described which finds sets of diagnostic characters for the recognition of species. Unlike previous algorithms, it finds all the possible sets requested and will also run with reasonable demands on computer time and storage. The program will search for sets with a specified size range and with a given minimum number of diagnostic characters to distinguish a taxon from all the others. 相似文献
16.
SUMMARY: We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length. AVAILABILITY: The software is implemented as a Java applet at http://www.mrc-bsu.cam.ac.uk/personal/thomas/phylo_comparison/comparison_page.html. It is also available on request from the authors. 相似文献
17.
18.
19.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(DN(3pow(epsilon)-1)log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences. 相似文献
20.
Yoon BJ 《BMC bioinformatics》2011,12(Z1):S18