首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this article the question of reconstructing a phylogeny from additive distance data is addressed. Previous algorithms used the complete distance matrix of then OTUs (Operational Taxonomic Unit), that corresponds to the tips of the tree. This usedO(n 2) computing time. It is shown that this is wasteful for biologically reasonable trees. If the tree has internal nodes with degrees that are bounded onO(n*log(n)) algorithm is possible. It is also shown if the nodes can have unbounded degrees the problem hasn 2 as lower bound.  相似文献   

2.
3.
We developed a recurrence relation that counts the number of tandem duplication trees (either rooted or unrooted) that are consistent with a set of n tandemly repeated sequences generated under the standard unequal recombination (or crossover) model of tandem duplications. The number of rooted duplication trees is exactly twice the number of unrooted trees, which means that on average only two positions for a root on a duplication tree are possible. Using the recurrence, we tabulated these numbers for small values of n. We also developed an asymptotic formula that for large n provides estimates for these numbers. These numbers give a priori probabilities for phylogenies of the repeated sequences to be duplication trees. This work extends earlier studies where exhaustive counts of the numbers for small n were obtained. One application showed the significance of finding that most maximum-parsimony trees constructed from repeat sequences from human immunoglobins and T-cell receptors were tandem duplication trees. Those findings provided strong support to the proposed mechanisms of tandem gene duplication. The recurrence relation also suggests efficient algorithms to recognize duplication trees and to generate random duplication trees for simulation. We present a linear-time recognition algorithm.  相似文献   

4.
The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch (1977). Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any other program.  相似文献   

5.
The problem of reconstructing a species supertree from a given set of protein, gene, and regulatorysite trees is the subject of this study. Under the traditional formulation, this problem is proven to be NP-hard. We propose a reformulation: to seek for a supertree, most of the clades of which contribute to the original protein trees. In such a variant, the problem seems to be biologically natural and a fast algorithm can be developed for its solution. The algorithm was tested on artificial and biological sets of protein trees, and it proved to be efficient even under the assumption of horizontal gene transfer. When horizontal transfer is not allowed, the algorithm correctness is proved mathematically; the time necessary for repeating the algorithm is assessed, and, in the worst case scenario, it is of the order n 3 · |V 0|3, where n is the number of gene trees and |V 0| is the number of tree species. Our software for supertree construction, examples of computations, and instructions can be freely accessed at . Events associated with horizontal gene transfer are not included either in this study or in any variant of the software. A general case is described in the authors’ report (journal Problems of Information Transmission, 2011).  相似文献   

6.
We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least O( radical 5(k)) states and leads to a time complexity of O(5(k)L(k)), where L is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to O(2(k)L(k)) and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.  相似文献   

7.
近年来,随着高通量染色体构象捕获(Hi-C)等技术的发展和高通量测序成本的降低,全基因组交互作用的数据量快速增长,交互作用图谱分辨率不断提高,促使染色体和基因组三维结构建模的研究取得了很大进展,已经提出了几种从染色体构象捕捉数据中构建单个染色体或整个基因组结构的方法。文中通过对在 Hi-C 数据基础上对染色体三维结构重建的相关文献进行分析,总结了重建染色体三维空间结构的经典算法3DMax的原理,并且提出了一种新的随机梯度上升算法:XNadam,是Nadam优化方法的一个变体,将其应用于3DMax算法中,以便提高3DMax算法的性能,从而用于预测染色体三维结构。  相似文献   

8.
An efficient code searching for sequence homology and DNA duplication   总被引:1,自引:0,他引:1  
This paper presents a very simple and efficient algorithm that searches for sequence homology and gene duplication. The code finds the best alignment of two, short or long, sequences without having to specify how many unmatched bases are allowed to be looped out. Following Needleman & Wunsch (1970), Sellers (1974), Sankoff (1972) and Smith, Waterman & Fitch (1981) gap constraints are incorporated into the program and may be employed. The availability of such a fast computer code enables detection of homologous genes which may fulfill a different function at present, but have arisen from a common gene-ancestor. The code is extremely fast and runs in O(n3/2) units of time. It was applied to the phi X174 sequence.  相似文献   

9.

Background

With the advent of paired-end high throughput sequencing, it is now possible to identify various types of structural variation on a genome-wide scale. Although many methods have been proposed for structural variation detection, most do not provide precise boundaries for identified variants. In this paper, we propose a new method, Distribution Based detection of Duplication Boundaries (DB2), for accurate detection of tandem duplication breakpoints, an important class of structural variation, with high precision and recall.

Results

Our computational experiments on simulated data show that DB2 outperforms state-of-the-art methods in terms of finding breakpoints of tandem duplications, with a higher positive predictive value (precision) in calling the duplications’ presence. In particular, DB2’s prediction of tandem duplications is correct 99% of the time even for very noisy data, while narrowing down the space of possible breakpoints within a margin of 15 to 20 bps on the average. Most of the existing methods provide boundaries in ranges that extend to hundreds of bases with lower precision values. Our method is also highly robust to varying properties of the sequencing library and to the sizes of the tandem duplications, as shown by its stable precision, recall and mean boundary mismatch performance. We demonstrate our method’s efficacy using both simulated paired-end reads, and those generated from a melanoma sample and two ovarian cancer samples. Newly discovered tandem duplications are validated using PCR and Sanger sequencing.

Conclusions

Our method, DB2, uses discordantly aligned reads, taking into account the distribution of fragment length to predict tandem duplications along with their breakpoints on a donor genome. The proposed method fine tunes the breakpoint calls by applying a novel probabilistic framework that incorporates the empirical fragment length distribution to score each feasible breakpoint. DB2 is implemented in Java programming language and is freely available at http://mendel.gene.cwru.edu/laframboiselab/software.php.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-175) contains supplementary material, which is available to authorized users.  相似文献   

10.
11.
12.
An algorithm for approximate tandem repeats.   总被引:4,自引:0,他引:4  
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = umacro ?, for which the Hamming distance of umacro and ? is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of umacro and ? is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.  相似文献   

13.
Tree structures are useful for describing and analyzing biological objects and processes. Consequently, there is a need to design metrics and algorithms to compare trees. A natural comparison metric is the "Tree Edit Distance," the number of simple edit (insert/delete) operations needed to transform one tree into the other. Rooted-ordered trees, where the order between the siblings is significant, can be compared in polynomial time. Rooted-unordered trees are used to describe processes or objects where the topology, rather than the order or the identity of each node, is important. For example, in immunology, rooted-unordered trees describe the process of immunoglobulin (antibody) gene diversification in the germinal center over time. Comparing such trees has been proven to be a difficult computational problem that belongs to the set of NP-Complete problems. Comparing two trees can be viewed as a search problem in graphs. A* is a search algorithm that explores the search space in an efficient order. Using a good lower bound estimation of the degree of difference between the two trees, A* can reduce search time dramatically. We have designed and implemented a variant of the A* search algorithm suitable for calculating tree edit distance. We show here that A* is able to perform an edit distance measurement in reasonable time for trees with dozens of nodes.  相似文献   

14.
Assessment of crown condition is a useful tool for monitoring forest health and is used widely to assess tree dieback and decline. Multiple methods used to assess eucalypt crown condition are difficult to compare across studies. Furthermore, the relative effectiveness of the available methods has not been evaluated. The objective of this study was to find an accurate, precise and efficient method for the assessment of crown condition in eucalypts. Four widely used methods were used to assess the crown condition of 516 eucalypt trees from Tasmania and Western Australia. Two of the methods assessed individual crown condition parameters (single‐parameter methods) which could then be added to give an overall score for condition (additive parameter). The other two methods use a single score to encompass many crown parameters (combined‐parameter methods). A selection of trees was scored on multiple occasions and by multiple assessors to determine repeatability and reproducibility. Data were analysed by Spearman's correlations and principal components analysis. All scored parameters were positively correlated to varying degrees. All parameters were distributed along a single‐principal components analysis axis, with the parameters from the single‐parameter methods having the greatest weightings. In order to address the objective of this study five criteria were developed for consideration of parameters that assess crown condition: (i) capacity to assess dieback (high correlation with other crown parameters, and capacity to indicate the presence of dieback symptoms in the crown); (ii) observer bias (sensitivity to minor change and small difference between observers); (iii) repeatability (sensitivity to minor change and small difference between years); (iv) capacity to assess different species; and (v) efficiency to score. Methods that best met these criteria used additive parameters derived from the crown parameters of primary crown dieback, epicormic growth and either crown shrinkage or dead branches. The single most useful parameter for assessment of eucalypt crown condition was primary crown dieback. This parameter was found to be the most accurate and precise measure of crown condition and is efficient to score. Primary crown dieback is recommended as the standard method for assessment of crown condition of eucalypt trees.  相似文献   

15.

Background

Mate selection can be used as a framework to balance key technical, cost and logistical issues while implementing a breeding program at a tactical level. The resulting mating lists accommodate optimal contributions of parents to future generations, in conjunction with other factors such as progeny inbreeding, connection between herds, use of reproductive technologies, management of the genetic distribution of nominated traits, and management of allele/genotype frequencies for nominated QTL/markers.

Methods

This paper describes a mate selection algorithm that is widely used and presents an extension that makes it possible to apply constraints on certain matings, as dictated through a group mating permission matrix.

Results

This full algorithm leads to simpler applications, and to computing speed for the scenario tested, which is several hundred times faster than the previous strategy of penalising solutions that break constraints.

Conclusions

The much higher speed of the method presented here extends the use of mate selection and enables implementation in relatively large programs across breeding units.  相似文献   

16.

Background  

The comparison of homologous sequences from different species is an essential approach to reconstruct the evolutionary history of species and of the genes they harbour in their genomes. Several complete mitochondrial and nuclear genomes are now available, increasing the importance of using multiple sequence alignment algorithms in comparative genomics. MtDNA has long been used in phylogenetic analysis and errors in the alignments can lead to errors in the interpretation of evolutionary information. Although a large number of multiple sequence alignment algorithms have been proposed to date, they all deal with linear DNA and cannot handle directly circular DNA. Researchers interested in aligning circular DNA sequences must first rotate them to the "right" place using an essentially manual process, before they can use multiple sequence alignment tools.  相似文献   

17.

Background

The growing wealth of public available gene expression data has made the systemic studies of how genes interact in a cell become more feasible. Liquid association (LA) describes the extent to which coexpression of two genes may vary based on the expression level of a third gene (the controller gene). However, genome-wide application has been difficult and resource-intensive. We propose a new screening algorithm for more efficient processing of LA estimation on a genome-wide scale and apply its use to a Saccharomyces cerevisiae data set.

Results

On a test subset of the data, the fast screening algorithm achieved >99.8% agreement with the exhaustive search of LA values, while reduced run time by 81–93 %. Using a well-known yeast cell-cycle data set with 6,178 genes, we identified triplet combinations with significantly large LA values. In an exploratory gene set enrichment analysis, the top terms for the controller genes in these triplets with large LA values are involved in some of the most fundamental processes in yeast such as energy regulation, transportation, and sporulation.

Conclusion

In summary, in this paper we propose a novel, efficient algorithm to explore LA on a genome-wide scale and identified triplets of interest in cell cycle pathways using the proposed method in a yeast data set. A software package named fastLiquidAssociation for implementing the algorithm is available through http://www.bioconductor.org.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0371-5) contains supplementary material, which is available to authorized users.  相似文献   

18.
Zinc finger genes in mammalian genomes are frequently found to occur in clusters with cluster members appearing in a tandem array on the chromosome. It has been suggested that in situ gene duplication events are primarily responsible for the evolution of such clusters. The problem of inferring the series of duplication events responsible for producing clustered families is different from the standard phylogeny problem. In this paper, we study this inference problem using a graph called duplication model that captures the series of duplication events while taking into account the observed order of the genes on the chromosome. We provide algorithms to reconstruct a duplication model for a given data set. We use our method to hypothesize the series of duplication events that may have produced the ZNF45 family that appears on human chromosome 19.  相似文献   

19.
We have developed a method to determine the three-dimensional structure of a protein molecule from such a set of distance constraints as can be determined by nuclear magnetic resonance studies. The currently popular methods for distance geometry based on the use of the metric matrix are applicable only to small systems. The method developed here is applicable to large molecules, such as proteins, with all atoms treated explicitly. This method works in the space of variable dihedral angles and determines a three-dimensional structure by minimization of a target function. We avoid difficulties hitherto inherent in this type of approach by two new devices: the use of variable target functions; and a method of rapid calculation of the gradient of the target functions. The method is applied to the determination of the structures of a small globular protein, bovine pancreatic trypsin inhibitor, from several artificial sets of distance constraints extracted from the X-ray crystal structure of this molecule. When a good set of constraints was available for both short- and long-range distances, the crystal structure was regenerated nearly exactly. When some ambiguities, such as those expected in experimental information, are allowed, the protein conformation can be determined up to a few local deformations. These ambiguities are mainly associated with the low resolving power of the short-range information.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号