Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Distance-based algorithms are a common technique in the construction of phylogenetic trees from taxonomic sequence data. The first step in the implementation of these algorithms is the calculation of a pairwise distance matrix to give a measure of the evolutionary change between any pair of the extant taxa. A standard technique is to use the log det formula to construct pairwise distances from aligned sequence data. We review a distance measure valid for the most general models, and show how the log det formula can be used as an estimator thereof. We then show that the foundation upon which the log det formula is constructed can be generalized to produce a previously unknown estimator which improves the consistency of the distance matrices constructed from the log det formula. This distance estimator provides a consistent technique for constructing quartets from phylogenetic sequence data under the assumption of the most general Markov model of sequence evolution.
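
As a rough illustration of the standard log det approach the abstract starts from (not the new estimator it proposes, whose formula is not given here), the sketch below computes a normalised log det (paralinear) distance for two aligned DNA sequences. It assumes the commonly cited form d = -(1/4)[ln det F - (1/2) ln(det Dx det Dy)], where F is the joint frequency matrix of site patterns and Dx, Dy are diagonal matrices of the marginal base frequencies. The function names are hypothetical.

```python
import math

NUCS = "ACGT"

def determinant(matrix):
    """Determinant of a small square matrix via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def logdet_distance(seq_x, seq_y):
    """Normalised log det distance between two aligned DNA sequences (assumed formula)."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts = [[0.0] * 4 for _ in range(4)]
    sites = 0
    for x, y in zip(seq_x.upper(), seq_y.upper()):
        if x in NUCS and y in NUCS:  # skip gaps and ambiguous characters
            counts[NUCS.index(x)][NUCS.index(y)] += 1
            sites += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    joint = [[c / sites for c in row] for row in counts]
    row_freq = [sum(row) for row in joint]
    col_freq = [sum(joint[i][j] for i in range(4)) for j in range(4)]
    det_f = determinant(joint)
    norm = math.prod(row_freq) * math.prod(col_freq)
    if det_f <= 0 or norm <= 0:
        raise ValueError("log det distance is undefined for these sequences")
    return -0.25 * (math.log(det_f) - 0.5 * math.log(norm))
```

The distance is undefined when any base is absent from either sequence or when the joint matrix is singular, which is why the sketch raises an error rather than returning a value in those cases.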

2.
H Tyson. Génome, 1992, 35(2): 360-371
Optimum alignment in all pairwise combinations among a group of amino acid sequences generated a distance matrix. These distances were clustered to evaluate relationships among the sequences. The degree of relationship among sequences was also evaluated by calculating specific distances from the distance matrix and examining correlations between patterns of specific distances for pairs of sequences. The sequences examined were a group of 20 amino acid sequences of scorpion toxins originally published and analyzed by M.J. Dufton and H. Rochat in 1984. Alignment gap penalties were constant for all 190 pairwise sequence alignments and were chosen after assessing the impact of changing penalties on resultant distances. The total distances generated by the 190 pairwise sequence alignments were clustered using complete (farthest neighbour) linkage. The square, symmetrical input distance matrix is analogous to diallel cross data where reciprocal and parental values are absent. Diallel analysis methods provided analogues for the distance matrix to genetical specific combining abilities, namely specific distances between all sequence pairs that are independent of the average distances shown by individual sequences. Correlation of specific distance patterns, with transformation to modified z values and a stringent probability level, was used to delineate subgroups of related sequences. These were compared with complete linkage clustering results. Excellent agreement between the two approaches was found. Three originally outlying sequences were placed within the four new subgroups.

3.
Clearcut: a fast implementation of relaxed neighbor joining   Total citations: 1 (self-citations: 0, citations by others: 1)
SUMMARY: Clearcut is an open source implementation of the relaxed neighbor joining (RNJ) algorithm. While traditional neighbor joining (NJ) remains a popular method for distance-based phylogenetic tree reconstruction, it suffers from O(N^3) time complexity, where N represents the number of taxa in the input. Due to this steep asymptotic time complexity, NJ cannot reasonably handle very large datasets. In contrast, RNJ realizes a typical-case time complexity on the order of N^2 log N without any significant qualitative difference in output. RNJ is particularly useful when inferring a very large tree or a large number of trees. In addition, RNJ retains the desirable property that it will always reconstruct the true tree given a matrix of additive pairwise distances. Clearcut implements RNJ as a C program, which takes either a set of aligned sequences or a pre-computed distance matrix as input and produces a phylogenetic tree. Alternatively, Clearcut can reconstruct phylogenies using an extremely fast standard NJ implementation. AVAILABILITY: Clearcut source code is available for download at: http://bioinformatics.hungry.com/clearcut
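
To make the cubic cost of traditional NJ concrete, here is a minimal sketch of the pair-selection step of standard neighbor joining. It is not Clearcut's code and does not implement RNJ. It uses the usual criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of row i, and scans all pairs for the minimum. Repeating this O(N^2) scan once per join, over roughly N joins, gives the O(N^3) total. The function name is hypothetical.

```python
def nj_pick_pair(dist):
    """Return the pair (i, j) minimising the standard NJ Q criterion, with its Q value.

    dist is a symmetric matrix (list of lists) with zeros on the diagonal.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    row_sums = [sum(row) for row in dist]
    best_pair = None
    best_q = float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_q, best_pair = q, (i, j)
    return best_pair, best_q
```

A full NJ run would then merge the chosen pair into a new node, recompute distances to that node, and call the selection step again on the reduced matrix.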

4.
Estimating dispersal, a key parameter for population ecology and management, is notoriously difficult. The use of pedigree assignments, aided by likelihood-based software, has become popular to estimate dispersal rate and distance. However, the partial sampling of populations may produce false assignments. Further, it is unknown how the accuracy of assignment is affected by the genealogical relationships of individuals and is reflected by software-derived assignment probabilities. Inspired by a project managing invasive American mink (Neovison vison), we estimated individual dispersal distances using inferred pairwise relationships of culled individuals. Additionally, we simulated scenarios to investigate the accuracy of pairwise inferences. Estimates of dispersal distance varied greatly when derived from different inferred pairwise relationships, with mother–offspring relationship being the shortest (average = 21 km) and the most accurate. Pairs assigned as maternal half-siblings were inaccurate, with 64%–97% falsely assigned, implying that estimates for these relationships in the wild population were unreliable. The false assignment rate was unrelated to the software-derived assignment probabilities at high dispersal rates. Assignments were more accurate when the inferred parents were older and immigrants and when dispersal rates between subpopulations were low (1% and 2%). Using 30 instead of 15 loci increased pairwise reliability, but half-sibling assignments were still inaccurate (>59% falsely assigned). The most reliable approach when using inferred pairwise relationships in polygamous species would be not to use half-sibling relationship types. Our simulation approach provides guidance for the application of pedigree inferences under partial sampling and is applicable to other systems where pedigree assignments are used for ecological inference.

5.
We develop a new approach to estimate a matrix of pairwise evolutionary distances from a codon-based alignment based on a codon evolutionary model. The method first computes a standard distance matrix for each of the three codon positions. Then these three distance matrices are weighted according to an estimate of the global evolutionary rate of each codon position and averaged into a unique distance matrix. Using a large set of both real and simulated codon-based alignments of nucleotide sequences, we show that this approach leads to distance matrices that have a significantly better treelikeness compared to those obtained by standard nucleotide evolutionary distances. We also propose an alternative weighting to eliminate the part of the noise often associated with some codon positions, particularly the third position, which is known to have a fast evolutionary rate. Simulation results show that fast distance-based tree reconstruction algorithms on distance matrices based on this codon position weighting can lead to phylogenetic trees that are at least as accurate as, if not more accurate than, those inferred by maximum likelihood. Finally, a well-known multigene dataset composed of eight yeast species and 106 codon-based alignments is reanalyzed and shows that our codon evolutionary distances allow building a phylogenetic tree which is similar to those obtained by non-distance-based methods (e.g., maximum parsimony and maximum likelihood) and also significantly improved compared to standard nucleotide evolutionary distance estimates.
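
The two-step procedure described above (one distance matrix per codon position, then a weighted average) can be sketched as follows. This is only an illustration under simplifying assumptions: it uses the uncorrected p-distance in place of a model-based distance, and the weights are supplied by the caller because the abstract does not state how the rate-based weights are computed. All function names are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned strings, ignoring gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def codon_position_matrices(seqs):
    """Build one pairwise p-distance matrix for each of the three codon positions."""
    n = len(seqs)
    matrices = []
    for pos in range(3):
        columns = [s[pos::3] for s in seqs]  # every third site, starting at pos
        matrices.append([
            [0.0 if i == j else p_distance(columns[i], columns[j]) for j in range(n)]
            for i in range(n)
        ])
    return matrices

def weighted_average(matrices, weights):
    """Combine the per-position matrices into one matrix using caller-supplied weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) / total for j in range(n)]
        for i in range(n)
    ]
```

Setting the third weight to zero, for example `weighted_average(mats, [1, 1, 0])`, corresponds loosely to the alternative weighting that suppresses noise from the third codon position.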

6.
The field of phylogenetic tree estimation has been dominated by three broad classes of methods: distance-based approaches, parsimony and likelihood-based methods (including maximum likelihood (ML) and Bayesian approaches). Here we introduce two new approaches to tree inference: pairwise likelihood estimation and a distance-based method that estimates the number of substitutions along the paths through the tree. Our results include the derivation of the formulae for the probability that two leaves will be identical at a site given a number of substitutions along the path connecting them. We also derive the posterior probability of the number of substitutions along a path between two sequences. The calculations for the posterior probabilities are exact for group-based, symmetric models of character evolution, but are only approximate for more general models.

7.
An extension to the Leslie matrix is presented in which the age of transformation from immature to adult has a log-normal distribution. The major effect of this is shown to be on the second largest eigenvalue. The ratio of the largest to the second largest eigenvalue |λ1/λ2|, which is an index of the rate of approach to the stable age distribution, is greater in the new model, even though the value of λ1 is effectively the same. The differences in the models are most pronounced where the population is subjected to a harvesting regime.
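
To illustrate the eigenvalue ratio |λ1/λ2| mentioned above, the sketch below works it out for the simplest ordinary Leslie matrix, with two age classes. It does not model the log-normal extension the abstract describes. For the matrix [[f1, f2], [s1, 0]], with fertilities f1, f2 and survival s1, the eigenvalues solve λ^2 - f1 λ - f2 s1 = 0. The function names and parameter values are assumptions chosen for illustration.

```python
import math

def leslie_eigenvalues_2x2(f1, f2, s1):
    """Eigenvalues of the two-age-class Leslie matrix [[f1, f2], [s1, 0]].

    They are the roots of lambda^2 - f1*lambda - f2*s1 = 0, returned as
    (dominant, subdominant).
    """
    if f1 < 0 or f2 < 0 or s1 < 0:
        raise ValueError("fertilities and survival must be non-negative")
    root = math.sqrt(f1 * f1 + 4.0 * f2 * s1)
    return (f1 + root) / 2.0, (f1 - root) / 2.0

def eigenvalue_ratio(f1, f2, s1):
    """Ratio |lambda1 / lambda2|; larger values mean faster approach to the stable age distribution."""
    lam1, lam2 = leslie_eigenvalues_2x2(f1, f2, s1)
    if lam2 == 0.0:
        return float("inf")
    return abs(lam1 / lam2)
```

With f1 = 0 the two eigenvalues have equal magnitude, the ratio is 1, and the age structure oscillates instead of converging.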

8.
We have determined, via 1H-n.m.r., the solution conformation of the collagen-binding b-domain of the bovine seminal fluid protein PDC-109 (PDC-109/b). The structure determination is based on 341 interproton distance estimates and 42 dihedral angle estimates: a set of 24 initial structures were computed; 12 using the variable target function program DIANA, and 12 using the metric matrix program DISGEO. These structures were optimized by restrained energy minimization and dynamic simulated annealing using the CHARMM and X-PLOR programs. The average pairwise root-mean-square difference (r.m.s.d.) between the optimized DIANA (DISGEO) structures is 0.71 Å (0.82 Å) for the backbone atoms, and 1.73 Å (2.03 Å) for all atoms. Both sets of structures exhibit the same global fold, secondary structure and placement of most non-polar side-chains. Two central antiparallel beta-sheets, which lie roughly perpendicular to each other, and two irregular loops support a large, partially exposed, hydrophobic surface that defines a putative binding site. A test of a hybrid relaxation matrix-based distance refinement protocol (MIDGE program) was performed using a normalized 250 millisecond NOESY spectrum. The resulting distances were input to the molecular mechanics/dynamics procedures mentioned above in order to optimize the DIANA structures. Our results indicate that relaxation matrix refinement of distances is most useful when used conservatively for identifying underestimated distance constraints. 1H-n.m.r. monitored ligand titration experiments revealed definite, albeit weak, binding interactions for phenethylamine and leucine analogs (Ka ≤ 25 M⁻¹). Residues perturbed by ligand binding include Tyr7, Trp26, Tyr33, Asp34 and Trp39. These results suggest that PDC-109/b may recognize specific leucine and/or isoleucine-containing sequences within collagen.
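
The average pairwise r.m.s.d. reported above can be illustrated with a small sketch. It assumes the structures have already been superimposed (no optimal rotation or translation is fitted here, which a real analysis would need) and that each structure is a list of (x, y, z) coordinates for the same atoms in the same order. The function names are hypothetical.

```python
import math
from itertools import combinations

def rmsd(coords_a, coords_b):
    """Root-mean-square difference between two equally sized, pre-superimposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def mean_pairwise_rmsd(structures):
    """Average r.m.s.d. over all unordered pairs of structures in an ensemble."""
    pairs = list(combinations(structures, 2))
    if not pairs:
        raise ValueError("need at least two structures")
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)
```

For an ensemble of 12 structures, as in each of the DIANA and DISGEO sets, the average runs over 66 pairs.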

9.
10.
11.
Streptococcus suis is an important pathogen of swine which occasionally infects humans as well. There are 35 serotypes known for this organism, and it would be desirable to develop rapid methods to identify and differentiate the strains of this species. To that effect, partial chaperonin 60 gene sequences were determined for the 35 serotype reference strains of S. suis. Analysis of a pairwise distance matrix showed that the distances ranged from 0 to 0.275 when values were calculated by the maximum-likelihood method. For five of the strains the distances from serotype 1 were greater than 0.1, and for two of these strains the distances were more than 0.25, suggesting that they belong to a different species. Most of the nucleotide differences were silent; alignment of protein sequences showed that there were only 11 distinct sequences for the 35 strains under study. The chaperonin 60 gene phylogenetic tree was similar to the previously published tree based on 16S rRNA sequences, and it was also observed that strains with identical chaperonin 60 gene sequences tended to have identical 16S rRNA sequences. The chaperonin 60 gene sequences provided a higher level of discrimination between serotypes than the 16S rRNA sequences provided and could form the basis for a diagnostic protocol.

12.
We present a method for estimating the most general reversible substitution matrix corresponding to a given collection of pairwise aligned DNA sequences. This matrix can then be used to calculate evolutionary distances between pairs of sequences in the collection. If only two sequences are considered, our method is equivalent to that of Lanave et al. (1984). The main novelty of our approach is in combining data from different sequence pairs. We describe a weighting method for pairs of taxa related by a known tree that results in uniform weights for all branches. Our method for estimating the rate matrix results in fast execution times, even on large data sets, and does not require knowledge of the phylogenetic relationships among sequences. In a test case on a primate pseudogene, the matrix we arrived at resembles one obtained using maximum likelihood, and the resulting distance measure is shown to have better linearity than is obtained in a less general model.

13.
Freeing phylogenies from artifacts of alignment.   Total citations: 1 (self-citations: 0, citations by others: 1)
Widely used methods for phylogenetic inference, both those that require and those that produce alignments, share certain weaknesses. These weaknesses are discussed, and a method that lacks them is introduced. For each pair of sequences in the data set, the method utilizes both insertion-deletion and amino acid replacement information to estimate a pairwise evolutionary distance. It is also possible to allow regional heterogeneity of replacement rates. Because a likelihood framework is adopted, the standard deviation of each pairwise distance can be estimated. The distance matrix and standard error estimates are used to infer a phylogenetic tree. As an example, this method is used on 10 widely diverged sequences of the second largest RNA polymerase subunit. A pseudo-bootstrap technique is devised to assess the validity of the inferred phylogenetic tree.

15.
Most molecular analyses, including phylogenetic inference, are based on sequence alignments. We present an algorithm that estimates relatedness between biomolecules without the requirement of sequence alignment by using a protein frequency matrix that is reduced by singular value decomposition (SVD), in a latent semantic index information retrieval system. Two databases were used: one with 832 proteins from 13 mitochondrial gene families and another composed of 1000 sequences from nine types of proteins retrieved from GenBank. First, 208 sequences from the first database and 200 from the second were randomly selected and compared using edit distance between each pair of sequences and the respective cosines and Euclidean distances from SVD. Correlation between cosine and edit distance was -0.32 (P < 0.01) and between Euclidean distance and edit distance was +0.70 (P < 0.01). In order to check the ability of SVD to classify sequences according to their categories, we used a sample of 202 sequences from the 13 gene families as queries (test set), and the other proteins (630) were used to generate the frequency matrix (training set). The classification algorithm applies a voting scheme based on the five sequences most similar to each query. With a 3-peptide frequency matrix, all 202 queries were correctly classified (accuracy = 100%). This algorithm is very attractive, because sequence alignments are neither generated nor required. In order to achieve results similar to those obtained with edit distance analysis, we recommend that Euclidean distance be used as a similarity measure for protein sequences in latent semantic indexing methods.
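
The alignment-free comparison described above rests on representing each protein as a vector of peptide frequencies and comparing vectors by cosine or Euclidean distance. The sketch below shows only that part, using raw 3-peptide counts. It omits the SVD reduction and the voting classifier, so it is a simplified stand-in and not the authors' pipeline. The function names are hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-peptides (k consecutive residues) in a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(counts_a[key] * counts_b[key] for key in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("sequence too short to contain any k-peptide")
    return dot / (norm_a * norm_b)

def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two sparse count vectors."""
    keys = counts_a.keys() | counts_b.keys()
    return math.sqrt(sum((counts_a[key] - counts_b[key]) ** 2 for key in keys))
```

Because no alignment is computed, the cost per pair depends only on sequence length and on the number of distinct peptides observed.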

16.
The hyper-variable V4 and V9 regions of the small subunit (SSU) rDNA have been targeted for assessing environmental diversity of microbial eukaryotes using next generation sequencing technologies. Here, we explore how the genetic distances among these short fragments compare with the distances obtained from near full-length SSU-rDNA sequences by comparing all pairwise estimates, as well as within and among species of ciliates. Results show that pairwise distances from V4 more closely match the near full-length SSU-rDNA and are more comparable with previous studies based on much longer SSU-rDNA fragments than pairwise distances from V9. Thus, studies that use the V4 will estimate similar values of phylotype richness and community structure as would have been estimated using the full-length SSU-rDNA.

17.
In density-independent models, the population growth rate λ measures population performance, and the perturbation analysis of λ (its sensitivity and elasticity) plays an important role in demography. In density-dependent models, the invasion exponent λ_I replaces λ as a measure of population performance. The perturbation analysis of λ_I reveals the effects of environmental changes and management actions, gives the direction and intensity of density-dependent natural selection on life history traits, and permits calculation of the sampling variance of the invasion exponent. Because density-dependent models require more data than density-independent models, it is tempting to look for substitutes for the invasion exponent, the sensitivity and elasticity of which can be calculated from a density-independent model. Here we examine the accuracy of two such substitutes: the dominant eigenvalue of the projection matrix evaluated at equilibrium (An) and the dominant eigenvalue of the matrix averaged over the attractor (A). Using a two-stage model that represents a wide range of life history types, we find that the elasticities of An or A often agree to within less than 5% error with those of the invasion exponent, even when population dynamics are chaotic. The exceptions are for semelparous life histories, especially when density-dependence affects fertility. This suggests that the elasticity analysis of density-independent models near equilibrium, or averaged over the attractor, provides useful information about the elasticity of the invasion exponent in density-dependent models.

18.
DNA hybridization in animal taxonomy: a critique from first principles   Total citations: 2 (self-citations: 0, citations by others: 2)
DNA hybridization is a "distance method" for phylogenetic reconstruction and, as such, shares a set of assumptions, advantages, and problems with other techniques that do not directly employ character data. The technique purports to measure the average percent mismatch of homologous nucleotide sequences between the single-copy genomes of species. This measurement, as any other, is subject to considerations of accuracy and precision. While replicate measurements and technical modifications can improve precision, the accuracy of such measurements is limited by the equivalence of genomes under comparison. Such routine events in genome evolution as gene duplication and deletion may complicate the interpretation of DNA hybridization distances. Beyond measurement limitations, the most serious potential distortions of distances are due to biased sequence sampling and homoplasy. These problems, however, do not necessarily preclude phylogenetic reconstruction, and their effects may be mitigated by numerical corrections. Homoplasy, in particular, is a difficulty faced by all methods of phylogenetic inference. If such distortions can be eliminated, mitigated by correction, or shown to be trivial, pairwise tree-construction strategies should provide reliable estimates of phylogeny.

19.
Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of β-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.

20.