Similar references (20 results)
1.
Version 1.5 of the computer program TNT completely integrates landmark data into phylogenetic analysis. Landmark data consist of coordinates (in two or three dimensions) for the terminal taxa; TNT reconstructs shapes for the internal nodes such that the difference between ancestor and descendant shapes, summed over all tree branches, is minimized; this sum is used as the tree score. Landmark data can be analysed alone or in combination with standard characters; all the applicable commands and options in TNT can be used transparently after reading a landmark data set. The program continues to implement all the types of analyses of former versions, including discrete and continuous characters (which can now be read at any scale and are automatically rescaled by TNT). Using algorithms described in this paper, searches on landmark data can be made tens to hundreds of times faster than was previously possible (from T to 3T times faster, where T is the number of taxa), thus making phylogenetic analysis of landmarks feasible even on standard personal computers.
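The branch-sum scoring criterion described above can be illustrated with a minimal sketch (toy coordinates and invented names; this is not TNT's optimized implementation, which itself reconstructs the ancestral shapes that minimize the sum):

```python
import math

def shape_difference(a, b):
    """Difference between two landmark configurations:
    sum of Euclidean distances between corresponding landmarks."""
    return sum(math.dist(p, q) for p, q in zip(a, b))

def tree_score(branches, shapes):
    """Tree score: ancestor-descendant shape differences summed over all branches."""
    return sum(shape_difference(shapes[anc], shapes[desc])
               for anc, desc in branches)

# Toy example: one reconstructed ancestor and two terminals,
# each described by two 2-D landmarks.
shapes = {
    "anc": [(0.0, 0.0), (1.0, 0.0)],
    "t1":  [(0.0, 0.0), (1.0, 1.0)],
    "t2":  [(0.0, 1.0), (1.0, 0.0)],
}
branches = [("anc", "t1"), ("anc", "t2")]
print(tree_score(branches, shapes))  # 2.0
```

The sketch only evaluates a given assignment of ancestral coordinates; TNT searches over those coordinates so that the sum is minimal for each candidate tree.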

2.
We present SequenceMatrix, software that is designed to facilitate the assembly and analysis of multi-gene datasets. Genes are concatenated by dragging and dropping FASTA, NEXUS, or TNT files with aligned sequences into the program window. A multi-gene dataset is concatenated and displayed in a spreadsheet; each sequence is represented by a cell that provides information on sequence length, number of indels, the number of ambiguous bases ("Ns"), and the availability of codon information. Alternatively, GenBank numbers for the sequences can be displayed and exported. Matrices with hundreds of genes and taxa can be concatenated within minutes and exported in TNT, NEXUS, or PHYLIP formats, preserving both character set and codon information for TNT and NEXUS files. SequenceMatrix also creates taxon sets listing taxa with a minimum number of characters or gene fragments, which helps assess preliminary datasets. Entire taxa, whole gene fragments, or individual sequences for a particular gene and species can be excluded from export. Data matrices can be re-split into their component genes and the gene fragments can be exported as individual gene files. SequenceMatrix also includes two tools that help to identify sequences that may have been compromised through laboratory contamination or data management error. One tool lists identical or near-identical sequences within genes, while the other compares the pairwise distance pattern of one gene against the pattern for all remaining genes combined. SequenceMatrix is Java-based and compatible with the Microsoft Windows, Apple MacOS X and Linux operating systems. The software is freely available from http://code.google.com/p/sequencematrix/. © The Willi Hennig Society 2010.
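The core concatenation step, padding taxa that lack a gene with missing-data symbols, can be sketched as follows (hypothetical helper names and toy sequences; SequenceMatrix itself is a Java GUI application):

```python
def read_fasta(text):
    """Parse FASTA text into a {taxon_name: sequence} dict."""
    seqs, name = {}, None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = ""
        else:
            seqs[name] += line.strip()
    return seqs

def concatenate(gene_alignments):
    """Concatenate per-gene alignments into a supermatrix;
    taxa missing from a gene are padded with '?'."""
    taxa = sorted({t for gene in gene_alignments for t in gene})
    matrix = {t: "" for t in taxa}
    for gene in gene_alignments:
        length = len(next(iter(gene.values())))  # aligned, so all equal
        for t in taxa:
            matrix[t] += gene.get(t, "?" * length)
    return matrix

gene1 = read_fasta(">A\nACGT\n>B\nAC-T\n")
gene2 = read_fasta(">A\nGG\n>C\nGA\n")
m = concatenate([gene1, gene2])
print(m)  # {'A': 'ACGTGG', 'B': 'AC-T??', 'C': '????GA'}
```

Character-set boundaries (here, positions 1-4 for gene1 and 5-6 for gene2) would additionally be recorded when exporting to NEXUS or TNT.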

3.
Roshan et al. recently described a "divide-and-conquer" technique for parsimony analysis of large data sets, Rec-I-DCM3, and stated that it compares very favorably to results using the program TNT. Their technique is based on selecting subsets of taxa to create reduced data sets or subproblems, finding most-parsimonious trees for each reduced data set, recombining all parts together, and then performing global TBR swapping on the combined tree. Here, we contrast this approach with sectorial searches, a divide-and-conquer algorithm implemented in TNT. This algorithm also uses a guide tree to create subproblems, but it creates them with the first-pass state sets of the nodes that join the selected sectors to the rest of the topology; this allows exact length calculations for the entire topology (that is, any solution N steps shorter than the original for the reduced subproblem must also be N steps shorter for the entire topology). We show here that, for sectors of similar size analyzed with the same search algorithms, subdividing data sets with sectorial searches produces better results than subdividing with Rec-I-DCM3. Roshan et al.'s claim that Rec-I-DCM3 outperforms the techniques in TNT stems from a poor experimental design and from the algorithmic settings used for the runs in TNT. In particular, for finding trees at or very close to the minimum known length of the analyzed data sets, TNT clearly outperforms Rec-I-DCM3. Finally, we show that the performance of Rec-I-DCM3 is bounded by the efficiency of the TBR implementation for the complete data set, as this method behaves (after some number of iterations) more as a technique for cyclic perturbations and improvements than as a divide-and-conquer strategy.

4.
Continuous characters analyzed as such
Quantitative and continuous characters have rarely been included in cladistic analyses of morphological data; when included, they have always been discretized, using a variety of ad hoc methods. As continuous characters are typically additive, they can be optimized with well-known algorithms, so that with a proper implementation they can easily be analyzed without discretization. The program TNT has recently incorporated algorithms for the analysis of continuous characters. One problem that has been pointed out with existing methods of discretization is that they can attribute different states to terminals that do not differ significantly, or vice versa. With the implementation in TNT, this problem is diminished (or avoided entirely) by simply assigning to each terminal a range that goes from the mean minus one (or two) SE to the mean plus one (or two) SE; given normal distributions, terminals whose ranges do not overlap thus differ significantly (more significantly if using more than 1 SE). Three real data sets (for scorpions, spiders and lizards) comprising both discrete and quantitative characters are analyzed to study the performance of continuous characters. One of the matrices has a reduced number of continuous characters, and thus the continuous characters analyzed by themselves produce only poorly resolved trees; the support for many of the groups supported by the discrete characters alone, however, is increased when the continuous characters are added to the analysis. The other two matrices have larger numbers of continuous characters, so that the results of separate analyses for the discrete and the continuous characters can be compared more meaningfully. In both cases, the continuous characters (analyzed alone) result in trees that are relatively similar to the trees produced by the discrete characters alone. These results suggest that continuous characters do indeed carry phylogenetic information, and that (if they have been observed) there is no real reason to exclude them from the analysis. © The Willi Hennig Society 2006.
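The mean ± SE range construction and the overlap criterion described above are simple arithmetic and can be reproduced directly (a sketch of the idea with invented values, not TNT's code):

```python
def char_range(mean, se, k=1):
    """Range assigned to a terminal: mean +/- k standard errors."""
    return (mean - k * se, mean + k * se)

def overlap(r1, r2):
    """True if two (low, high) intervals overlap."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

a = char_range(10.0, 0.5)   # (9.5, 10.5)
b = char_range(11.2, 0.4)   # no overlap with a
c = char_range(10.4, 0.3)   # overlaps a
print(overlap(a, b))  # False -> terminals differ significantly
print(overlap(a, c))  # True  -> not distinguishable at 1 SE
```

With k=2 the ranges widen, so fewer pairs of terminals are scored as differing, which is the "more significant" setting mentioned in the abstract.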

5.
This paper presents a pipeline, implemented in an open-source program called GB→TNT (GenBank-to-TNT), for creating large molecular matrices, starting from GenBank files and finishing with TNT matrices that incorporate taxonomic information in the terminal names. GB→TNT is designed to retrieve a defined genomic region from a bulk of sequences included in a GenBank file. The user defines the genomic region to be retrieved and several filters (genome, length of the sequence, taxonomic group, etc.); each genomic region represents a different data block in the final TNT matrix. GB→TNT first generates FASTA files from the input GenBank files, then creates an alignment for each of those (by calling an alignment program), and finally merges all the aligned files into a single TNT matrix. The new version of TNT can make use of the taxonomic information contained in the terminal names, allowing easy diagnosis of results, evaluation of fit between the trees and the taxonomy, and automatic labelling or colouring of tree branches with the taxonomic groups they represent. © The Willi Hennig Society 2012.
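The filtering step of such a pipeline, from parsed records to per-region FASTA text, might look like this (the record fields and function name are hypothetical; GB→TNT's actual filters also cover genome type and taxonomic group):

```python
def records_to_fasta(records, region, min_len=0):
    """Keep records for one genomic region (optionally above a minimum
    length) and emit FASTA text with taxon and accession in the header."""
    lines = []
    for rec in records:
        if rec["gene"] == region and len(rec["seq"]) >= min_len:
            lines.append(f">{rec['taxon']}_{rec['accession']}")
            lines.append(rec["seq"])
    return "\n".join(lines)

# Invented mini-records standing in for parsed GenBank entries.
records = [
    {"accession": "AY123", "taxon": "Aus_bus", "gene": "COI", "seq": "ACGTACGT"},
    {"accession": "AY124", "taxon": "Cus_dus", "gene": "16S", "seq": "GGCC"},
    {"accession": "AY125", "taxon": "Eus_fus", "gene": "COI", "seq": "ACGAACGA"},
]
print(records_to_fasta(records, "COI"))
```

Each region's FASTA file would then be aligned externally and the aligned blocks merged into one TNT matrix, one data block per region.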

6.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

7.
8.
Individual-based data sets tracking organisms over space and time are fundamental to answering broad questions in ecology and evolution. A 'permanent' genetic tag circumvents the need to invasively mark or tag animals, especially when there are few phenotypic differences among individuals. However, genetic tracking of individuals does not come without limits; correctly matching genotypes and the error rates associated with laboratory work can make it difficult to parse out matched individuals. In addition, defining a sampling design that effectively matches individuals in the wild can be a challenge for researchers. Here, we combine the two objectives of defining sampling design and reducing genotyping error in an efficient Python-based computer-modelling program, wisepair. We describe the methods used to develop the computer program and assess its effectiveness through three empirical data sets, with and without reference genotypes. Our results show that wisepair outperformed similar genotype-matching programs using previously published reference-genotype data of diurnal poison frogs (Allobates femoralis) and without-reference (faecal) genotype sample data sets of harbour seals (Phoca vitulina) and Eurasian otters (Lutra lutra). In addition, due to limited sampling effort in the harbour seal data, we present optimal sampling designs for future projects. wisepair sacrifices little relative to available methods, as it incorporates sample rerun error data, allelic pairwise comparisons and probabilistic simulations to determine matching thresholds. Our program is the only tool available to researchers for defining the parameters of genetic tracking studies a priori.
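Genotype matching with an error tolerance, the core task such programs address, can be sketched as below (invented sample data and a fixed threshold; wisepair additionally uses rerun-error data and probabilistic simulations to choose the threshold):

```python
def mismatches(g1, g2):
    """Count loci whose unordered allele pairs differ between two genotypes."""
    return sum(frozenset(a) != frozenset(b) for a, b in zip(g1, g2))

def match_pairs(samples, threshold=1):
    """Pairs of sample IDs whose genotypes differ at <= threshold loci,
    i.e. likely recaptures of the same individual despite genotyping error."""
    ids = sorted(samples)
    return [(i, j) for n, i in enumerate(ids) for j in ids[n + 1:]
            if mismatches(samples[i], samples[j]) <= threshold]

# Three microsatellite genotypes, three loci each (allele sizes are toy values).
samples = {
    "s1": [(120, 124), (88, 90), (200, 202)],
    "s2": [(124, 120), (88, 90), (200, 204)],  # one mismatch: tolerated
    "s3": [(118, 126), (92, 94), (198, 202)],
}
print(match_pairs(samples))  # [('s1', 's2')]
```

Setting the threshold too low splits true recaptures into separate individuals; too high, and distinct individuals are merged, which is why the matching threshold is worth calibrating by simulation.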

9.
NeEstimator v2 is a completely revised and updated implementation of software that produces estimates of contemporary effective population size, using several different methods and a single input file. NeEstimator v2 includes three single-sample estimators (updated versions of the linkage disequilibrium and heterozygote-excess methods, and a new method based on molecular coancestry), as well as the two-sample (moment-based temporal) method. New features include the following: (i) an improved method for accounting for missing data; (ii) options for screening out rare alleles; (iii) confidence intervals for all methods; (iv) the ability to analyse data sets with large numbers of genetic markers (10,000 or more); (v) options for batch processing large numbers of different data sets, which will facilitate cross-method comparisons using simulated data; and (vi) correction of temporal estimates when sampled individuals are not removed from the population (Plan I sampling). The user is given considerable control over the input data and over the composition and format of the output files. The freely available software has a new Java interface and runs under MacOS, Linux and Windows.

10.
The molecular clock presents a means of estimating evolutionary rates and timescales using genetic data. These estimates can lead to important insights into evolutionary processes and mechanisms, as well as providing a framework for further biological analyses. To deal with rate variation among genes and among lineages, a diverse range of molecular-clock methods have been developed. These methods have been implemented in various software packages and differ in their statistical properties, ability to handle different models of rate variation, capacity to incorporate various forms of calibrating information and tractability for analysing large data sets. Choosing a suitable molecular-clock model can be a challenging exercise, but a number of model-selection techniques are available. In this review, we describe the different forms of evolutionary rate heterogeneity and explain how they can be accommodated in molecular-clock analyses. We provide an outline of the various clock methods and models that are available, including the strict clock, local clocks, discrete clocks and relaxed clocks. Techniques for calibration and clock-model selection are also described, along with methods for handling multilocus data sets. We conclude our review with some comments about the future of molecular clocks.
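Under the simplest model reviewed here, the strict clock, a calibrated rate converts genetic distance into time; because a pairwise distance accumulates along two diverging lineages, t = d / (2r). A one-line sketch with illustrative numbers:

```python
def divergence_time(distance, rate):
    """Strict-clock divergence time. The pairwise distance accumulates
    along both lineages since the split, hence the factor of 2."""
    return distance / (2.0 * rate)

# 4% sequence divergence at a per-lineage rate of 0.01
# substitutions/site per million years -> split 2 Myr ago.
print(divergence_time(0.04, 0.01))  # 2.0
```

Relaxed and local clocks generalize this by letting the rate r vary across branches, which is why calibration and model selection become the central practical questions.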

11.
Several extensions to implied weighting, recently implemented in TNT, allow a better treatment of data sets combining morphological and molecular data, as well as those comprising large numbers of missing entries (e.g. palaeontological matrices, or combined matrices with some genes sequenced for few taxa). As there have been recent suggestions that molecular matrices may be better analysed using equal weights (rather than implied weighting), a simple way is proposed to apply implied weighting to only some characters (e.g. morphology), leaving the other characters with a constant weight (e.g. molecules). The new methods also allow weighting entire partitions according to their average homoplasy, giving each of the characters in the partition the same weight (this can be used for dynamically weighting, for example, entire genes, or first, second and third positions collectively). Such an approach is easily implemented in schemes like successive weighting, but in the case of implied weighting it poses some particular problems. The approach has the peculiar implication that the inclusion of uninformative characters influences the results (by influencing the implied weights for the partitions). Lastly, the concern that characters with many missing entries may receive artificially inflated weights (because they necessarily display less homoplasy) can be addressed by allowing the use of different weighting functions for different characters, such that the cost of additional transformations decreases more rapidly for characters with more missing entries (thus effectively assuming that the unobserved entries are likely to also display some unobserved homoplasy). The conceptual and practical aspects of all these problems, as well as details of the implementation in TNT, are discussed.
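The standard implied-weighting fitting function (Goloboff's f = k/(k + h), where h is a character's extra steps, i.e. homoplasy, and k is the concavity constant) is easy to compute directly; the extensions discussed above vary the function per character or apply it to whole partitions:

```python
def implied_weight_fit(homoplasy, k=3.0):
    """Implied-weighting fit of one character: f = k / (k + h).
    A character with no homoplasy fits perfectly (f = 1); fit decays
    more gently for larger concavity constants k."""
    return k / (k + homoplasy)

def tree_fit(extra_steps_per_char, k=3.0):
    """Total fit of a tree: per-character fits summed (maximized in searches)."""
    return sum(implied_weight_fit(h, k) for h in extra_steps_per_char)

print(implied_weight_fit(0))            # 1.0
print(implied_weight_fit(3))            # 0.5
print(tree_fit([0, 1, 3]))              # 1.0 + 0.75 + 0.5 = 2.25
```

Using a milder function (larger k, or a per-character k) for characters with many missing entries is exactly the kind of adjustment the abstract describes for avoiding artificially inflated weights.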

12.
In a recent study, the phylogeny of Caseidae (a herbivorous family of Palaeozoic synapsids belonging to the paraphyletic grade known as pelycosaurs) was analysed with a dataset employing more than three hundred continuous morphological characters in an effort to follow the principles of total evidence. Continuous characters are a source of great debate, with disagreements surrounding their suitability for and treatment in phylogenetic analysis. A number of shortcomings were identified in the handling of continuous characters in this study of caseids, including the use of gap weighting to discretize the characters and potential issues with redundancy and character non-independence. Therefore, an alternative treatment for these characters is suggested here. First, rather than using gap weighting, the continuous characters were analysed in the program TNT, in which the raw values can be treated as continuous rather than discrete. Second, prior to the phylogenetic analysis, the continuous characters were subjected to a log-ratio principal component analysis, and then the principal components were included in the character matrix rather than the raw ratios. Analysing the original data in TNT produced little difference in the results, but using the principal components as continuous characters resulted in alternative positions for Caseopsis agilis, Ennatosaurus tecton and Caseoides sanangeloensis. The differences are judged to be due to the reduced redundancy of the characters, the smaller number of principal components not overwhelming the discrete characters and the use of a scaling method which allows principal components with a higher variance to have a greater influence on the analysis. The positions of highly fragmentary fossils depended heavily on the method used to treat the missing characters in the principal component analysis, and so the method proposed here is not recommended for analysing very incomplete taxa.

13.
Gene set analysis methods are popular tools for identifying differentially expressed gene sets in microarray data. Most existing methods use a permutation test to assess significance for each gene set. The permutation test's assumption of exchangeable samples is often not satisfied for time-series data and complex experimental designs, and in addition it requires a certain number of samples to compute p-values accurately. The method presented here uses a rotation test rather than a permutation test to assess significance. The rotation test can compute accurate p-values also for very small sample sizes. The method can handle complex designs and is particularly suited for longitudinal microarray data where the samples may have complex correlation structures. Dependencies between genes, modeled with the use of gene networks, are incorporated in the estimation of correlations between samples. In addition, the method can test for both gene sets that are differentially expressed and gene sets that show strong time trends. We show on simulated longitudinal data that the ability to identify important gene sets may be improved by taking the correlation structure between samples into account. Applied to real data, the method identifies both gene sets with constant expression and gene sets with strong time trends.
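The core of a rotation test, comparing an observed statistic against the same statistic computed on the data transformed by random orthogonal matrices, can be sketched with NumPy (a generic illustration with toy data, not the authors' gene-set method, which additionally models gene-network correlations):

```python
import numpy as np

def random_rotation(n, rng):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix,
    with column signs fixed so the distribution is uniform (Haar)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def rotation_pvalue(y, stat, n_rot=999, seed=0):
    """Rotation-test p-value: fraction of randomly rotated data vectors
    whose statistic reaches the observed one (with add-one smoothing)."""
    rng = np.random.default_rng(seed)
    observed = stat(y)
    exceed = sum(stat(random_rotation(len(y), rng) @ y) >= observed
                 for _ in range(n_rot))
    return (1 + exceed) / (1 + n_rot)

# Strong, consistent mean shift in six samples: random rotations of the
# data vector almost never reproduce so large a mean, so p is small.
y = np.array([2.1, 1.8, 2.4, 1.9, 2.2, 2.0])
p = rotation_pvalue(y, stat=lambda v: abs(v.mean()))
print(p < 0.05)
```

Because rotations are continuous, the null distribution is not limited to the n! permutations of the sample labels, which is what makes accurate p-values possible at very small sample sizes.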

14.
15.
We present POY version 4, an open source program for the phylogenetic analysis of morphological, prealigned sequence, unaligned sequence, and genomic data. POY allows phylogenetic inference in which not only substitutions but also insertions, deletions, and rearrangement events are permitted (computed using the breakpoint or inversion distance). Compared with previous versions, POY 4 provides greater flexibility, a larger number of supported parameter sets, numerous execution time improvements, a vastly improved user interface, greater quality control, and extensive documentation. We introduce POY's basic features, and present a simple example illustrating the performance improvements over previous versions of the application. © The Willi Hennig Society 2009.

16.
We established a genomic model of quantitative trait with genomic additive and dominance relationships that parallels the traditional quantitative genetics model, which partitions a genotypic value as breeding value plus dominance deviation and calculates additive and dominance relationships using pedigree information. Based on this genomic model, two sets of computationally complementary but mathematically identical mixed model methods were developed for genomic best linear unbiased prediction (GBLUP) and genomic restricted maximum likelihood estimation (GREML) of additive and dominance effects using SNP markers. These two sets are referred to as the CE and QM sets, where the CE set was designed for large numbers of markers and the QM set was designed for large numbers of individuals. GBLUP and associated accuracy formulations for individuals in training and validation data sets were derived for breeding values, dominance deviations and genotypic values. Simulation study showed that GREML and GBLUP generally were able to capture small additive and dominance effects that each accounted for 0.00005–0.0003 of the phenotypic variance and GREML was able to differentiate true additive and dominance heritability levels. GBLUP of the total genetic value as the summation of additive and dominance effects had higher prediction accuracy than either additive or dominance GBLUP, causal variants had the highest accuracy of GREML and GBLUP, and predicted accuracies were in agreement with observed accuracies. Genomic additive and dominance relationship matrices using SNP markers were consistent with theoretical expectations. The GREML and GBLUP methods can be an effective tool for assessing the type and magnitude of genetic effects affecting a phenotype and for predicting the total genetic value at the whole genome level.
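A common construction of the genomic additive relationship matrix underlying GBLUP is VanRaden's G = ZZ'/(2 Σ p(1-p)), with Z the allele-count matrix centred at twice the allele frequencies (shown here as a generic sketch with toy genotypes, not necessarily the exact CE or QM formulation of the authors):

```python
import numpy as np

def grm(genotypes):
    """Genomic additive relationship matrix (VanRaden's first method):
    G = Z Z' / (2 * sum p(1-p)), where Z = M - 2p and M is the
    individuals-by-SNPs matrix of allele counts coded 0/1/2."""
    m = np.asarray(genotypes, dtype=float)
    p = m.mean(axis=0) / 2.0          # observed allele frequencies
    z = m - 2.0 * p                   # centre each SNP column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))

geno = [[0, 1, 2, 1],                 # three individuals, four SNPs
        [1, 1, 2, 0],
        [2, 0, 0, 1]]
G = grm(geno)
print(np.round(G, 2))
```

G is symmetric and, because each Z column is centred, its rows sum to zero when frequencies are estimated from the sample itself; a dominance relationship matrix is built analogously from a dominance coding of the genotypes.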

17.
We present here a new version of the Arlequin program, available in three different forms: a Windows graphical version (Winarl35), a console version of Arlequin (arlecore), and a specific console version to compute summary statistics (arlsumstat). The command-line versions run under both Linux and Windows. The main innovations of the new version include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans. Command-line versions are designed to handle large series of files, and arlsumstat can be used to generate summary statistics from simulated data sets within an Approximate Bayesian Computation framework.

18.
Next-generation sequencing is a common method for analysing microbial community diversity and composition. Configuring an appropriate sequence-processing strategy within the variety of tools and methods is a nontrivial task and can considerably influence the resulting community characteristics. We analysed the V4 region of 18S rRNA gene sequences of marine samples by 454 pyrosequencing. Along this process, we generated several data sets with QIIME, mothur, and a custom-made pipeline based on DNAStar and the phylogenetic tree-based PhyloAssigner. For all processing strategies, default parameter settings and punctual variations were used. Our results revealed strong differences in the total number of operational taxonomic units (OTUs), indicating that sequence preprocessing and clustering had a major impact on protist diversity estimates. However, diversity estimates of the abundant biosphere (abundance of ≥1%) were reproducible across all processing-pipeline versions. A qualitative comparison of diatom genera emphasized strong differences between the pipelines, with the phylogenetic placement of sequences coming closest to light-microscopy-based diatom identification. We conclude that diversity studies using different sequence-processing strategies are comparable if the focus is on higher taxonomic levels, and if abundance thresholds are used to filter out OTUs of the rare biosphere.

19.
Geography and landscape are important determinants of genetic variation in natural populations, and several ancestry estimation methods have been proposed to investigate population structure using genetic and geographic data simultaneously. Those approaches are often based on computer-intensive stochastic simulations and do not scale with the dimensions of the data sets generated by high-throughput sequencing technologies. There is a growing demand for faster algorithms able to analyse genomewide patterns of population genetic variation in their geographic context. In this study, we present TESS3, a major update of the spatial ancestry estimation program TESS. By combining matrix factorization and spatial statistical methods, TESS3 provides estimates of ancestry coefficients with accuracy comparable to TESS and with run-times much faster than the Bayesian version. In addition, the TESS3 program can be used to perform genome scans for selection, and separate adaptive from nonadaptive genetic variation using ancestral allele frequency differentiation tests. The main features of TESS3 are illustrated using simulated data and analysing genomic data from European lines of the plant species Arabidopsis thaliana.

20.
spads 1.0 (for 'Spatial and Population Analysis of DNA Sequences') is a population genetic toolbox for characterizing genetic variability within and among populations from DNA sequences. In view of the drastic increase in genetic information available through sequencing methods, spads was specifically designed to deal with multilocus data sets of DNA sequences. It computes several summary statistics from populations or groups of populations, performs input-file conversions for other population genetic programs, and implements locus-by-locus and multilocus versions of two clustering algorithms to study the genetic structure of populations. The toolbox also includes two Matlab and R functions, Gdispal and Gdivpal, to display differentiation and diversity patterns across landscapes. These functions generate interpolating surfaces based on multilocus distance and diversity indices. In the case of multiple loci, such surfaces can represent a useful alternative to the multiple pie-chart maps traditionally used in phylogeography to represent the spatial distribution of genetic diversity. These coloured surfaces can also be used to compare different data sets, or different diversity and/or distance measures estimated on the same data set.
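A typical summary statistic computed by such toolboxes is nucleotide diversity (π), the average pairwise difference per site among aligned sequences (a generic sketch with toy sequences; spads itself is a Matlab toolbox with many more statistics):

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Nucleotide diversity (pi): mean per-site difference over all
    pairs of equal-length aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    diffs = [sum(a != b for a, b in zip(s1, s2)) / len(s1)
             for s1, s2 in pairs]
    return sum(diffs) / len(pairs)

pop = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
print(nucleotide_diversity(pop))  # 0.5/3 of a difference per site, ~0.1667
```

Computed locus by locus, values like this feed directly into the locus-by-locus versus multilocus comparisons, and into interpolated diversity surfaces like those drawn by Gdivpal.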

