Similar articles
Found 20 similar articles (search time: 15 ms)
1.
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and for building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by solving a linear system. As this system is sparse, its practical solution requires O(n^a k^a) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topological accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652-664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.
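The idea of combining per-gene distance matrices with partially overlapping taxon sets can be illustrated with a deliberately simplified sketch. It is not the SDM least-squares procedure described in the abstract: as a crude stand-in for SDM's deformation step, each matrix is merely rescaled to unit mean distance before averaging. The function names and the dictionary layout are assumptions made for illustration.

```python
def pair(a, b):
    """Unordered taxon pair used as a dictionary key (hypothetical helper)."""
    return frozenset((a, b))


def rescale_to_unit_mean(matrix):
    """Divide all distances by their mean so fast and slow genes become comparable.

    Assumption: this crude rescaling replaces SDM's least-squares deformation.
    """
    mean = sum(matrix.values()) / len(matrix)
    return {key: dist / mean for key, dist in matrix.items()}


def average_supermatrix(matrices):
    """Average each pairwise distance over the matrices that contain that pair."""
    totals, counts = {}, {}
    for matrix in matrices:
        for key, dist in rescale_to_unit_mean(matrix).items():
            totals[key] = totals.get(key, 0.0) + dist
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Two toy genes with the same relative distances; gene2 evolves twice as fast.
gene1 = {pair("A", "B"): 1.0, pair("A", "C"): 2.0, pair("B", "C"): 3.0}
gene2 = {pair("A", "B"): 2.0, pair("A", "C"): 4.0, pair("B", "C"): 6.0}
supermatrix = average_supermatrix([gene1, gene2])
```

After rescaling, the two toy genes agree exactly, so the averaged supermatrix keeps their shared relative distances. Pairs of taxa that never occur together in any source matrix are simply absent from the result.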

2.
Estimating geographical ranges of intra-specific evolutionary lineages is crucial to the fields of biogeography, evolution, and biodiversity conservation. Models of isolation mechanisms often consider multiple distances in order to explain genetic divergence. Yet, the available methods to estimate the geographical ranges of lineages are based on direct geographical distances, neglecting other distance metrics that can better explain the spatial genetic structure. We extended the phylogeographical interpolation method (phylin) in order to accommodate user-defined distance metrics and to incorporate the uncertainty associated with genetic distance calculation. These new features were tested with simulated and empirical data sets. Multiple distance matrices were generated including geographical, resistance, and environmental distances to derive maps of lineage occurrence. The new additions to this method improved the ability to predict lineage occurrence, even with low sample size. We used a regression framework to quantify the relationship between the genetic divergence and competing distance matrices representing potential isolation processes that are subsequently used in the interpolation process. Including uncertainty in tree topology and the different distance matrices improved the robustness of the variograms, allowing a better fit of the theoretical model of spatial dependence. The improvements to the method increase its potential application in other fields. Accurately mapping genetic divergence can help to locate potential contact zones between lineages as well as barriers to gene flow, which is of broad interest in biogeographical and evolutionary studies. Additionally, conservation efforts could benefit from the integration of genetic variation and landscape features in a spatially explicit framework.

3.
Ensemble forecasting is advocated as a way of reducing uncertainty in species distribution modeling (SDM). This is because it is expected to balance accuracy and robustness of SDM models. However, few data are available regarding the spatial similarity of the combined distribution maps generated by different consensus approaches. Here, using eight niche-based models, nine split-sample calibration bouts (or nine random model-training subsets), and nine climate change scenarios, the distributions of 32 forest tree species in China were simulated under current and future climate conditions. The forecasting ensembles were combined to determine final consensual prediction maps for target species using three simple consensus approaches (average, frequency, and median [PCA]). Species’ geographic ranges changed (in area and shifting distance) in response to climate change. The three consensual projections did not differ significantly in the magnitude or direction of these changes, but they did differ in their spatial similarity. Incongruent areas were observed primarily at the edges of species’ ranges. Multiple stepwise regression models showed three factors (niche marginality, niche specialization, and niche model accuracy) to be related to the observed variations in consensual prediction maps among consensus approaches. Spatial correspondence among prediction maps was the highest when niche model accuracy was high and marginality and specialization were low. The difference in spatial predictions suggested that more attention should be paid to the range of spatial uncertainty before any decisions regarding specialist species can be made based on map outputs. The niche properties and single-model predictive performance provide promising insights that may further understanding of uncertainties in SDM.

4.
Full genome data sets are currently being explored on a regular basis to infer phylogenetic trees, but there are often discordances among the trees produced by different genes. An important goal in phylogenomics is to identify which individual genes and species produce the same phylogenetic tree and are thus likely to share the same evolutionary history. On the other hand, it is also essential to identify which genes and species produce discordant topologies and therefore evolve in a different way or represent noise in the data. The latter are outlier genes or species and they can provide a wealth of information on potentially interesting biological processes, such as incomplete lineage sorting, hybridization, and horizontal gene transfers. Here, we propose a new method to explore the genomic tree space and detect outlier genes and species based on multiple co-inertia analysis (MCOA), which efficiently captures and compares the similarities in the phylogenetic topologies produced by individual genes. Our method allows the rapid identification of outlier genes and species by extracting the similarities and discrepancies, in terms of the pairwise distances, between all the species in all the trees, simultaneously. This is achieved by using MCOA, which finds successive decomposition axes from individual ordinations (i.e., derived from distance matrices) that maximize a covariance function. The method is freely available as a set of R functions. The source code and tutorial can be found online at http://phylomcoa.cgenomics.org.

5.
Phylogenetic trees based on gene content   (cited 2 times: 0 self-citations, 2 by others)
Comparing gene content between species can be a useful approach for reconstructing phylogenetic trees. In this paper, we derive a maximum-likelihood estimation of evolutionary distance between species under a simple model of gene genesis and gene loss. Using simulated data on a biological tree with 107 taxa (and on a number of randomly generated trees), we compare the accuracy of tree reconstruction using this ML distance measure to an earlier ad hoc distance. We then compare these distance-based approaches to a character-based tree reconstruction method (Dollo parsimony) which seems well suited to the analysis of gene content data. To simplify simulations, we give a formal proof of the well-known 'fact' that the Dollo parsimony score is independent of the choice of root. Our results show a consistent trend, with the character-based method and ML distance measure outperforming the earlier ad hoc distance method. AVAILABILITY: http://www.ab.informatik.uni-tuebingen.de/software/genecontent/welcome_en.html

6.
Qi Y, Sun H, Sun Q, Pan L. Genomics, 2011, 97(5): 326-329
Microarrays allow researchers to examine the expression of thousands of genes simultaneously. However, identification of genes differentially expressed in microarray experiments is challenging. With an optimal test statistic, we rank genes and estimate a threshold above which genes are considered to be differentially expressed (DE). This overcomes the difficulty that many statistical methods have in determining cut-off values in ranking analysis. Experiments demonstrate that our method performs well and avoids the problems with graphical examination and multiple hypotheses testing that affect alternative approaches. Compared with well-known methods, our method is more sensitive to data sets with small differentially expressed values and is not biased in favor of data sets based on certain distribution models.

7.
Phylogenetic inference based on matrix representation of trees.   (cited 14 times: 0 self-citations, 14 by others)
Rooted phylogenetic trees can be represented as matrices in which the rows correspond to termini, and columns correspond to internal nodes (elements of the n-tree). Parsimony analysis of such a matrix will fully recover the topology of the original tree. The maximum size of the represented matrix depends only on the number of termini in the tree; for a tree derived from molecular sequences, the represented matrix may be orders of magnitude smaller than the original data matrix. Representations of multiple trees (which may or may not have identical termini) can readily be combined into a single matrix; columns of discrete-character-state data can be added and, if desired, weighted differentially. Parsimony analysis of the resulting composite matrix yields a hybrid supertree which typically provides greater resolution than conventional consensus trees. Use of this method is illustrated with examples involving multiple tRNA genes in organelles and multiple protein-coding genes in eukaryotes.
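The matrix representation described above (rows for termini, columns for internal nodes) can be sketched in a few lines. The nested-tuple tree format, the function names, the choice to skip the uninformative root column, and the use of "?" for taxa absent from a source tree are assumptions made for illustration rather than details taken from the abstract.

```python
def clades(tree):
    """Return (leaf set, list of internal-node leaf sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the internal node spanning all leaves seen so far
    return leaves, found


def matrix_representation(trees):
    """Encode rooted trees as one binary character matrix, one column per clade."""
    all_taxa = sorted(set().union(*(clades(t)[0] for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        leaves, found = clades(tree)
        for clade in found:
            if clade == leaves:
                continue  # the root clade contains every leaf and carries no signal
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this source tree
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two toy source trees with overlapping but non-identical termini.
tree1 = ((("A", "B"), "C"), "D")
tree2 = (("B", "C"), "E")
matrix = matrix_representation([tree1, tree2])
```

The resulting rows could then be passed to any parsimony program; that step is outside the scope of this sketch.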

8.
On gene ranking using replicated microarray time course data   (cited 1 time: 0 self-citations, 1 by others)
Tai YC, Speed TP. Biometrics, 2009, 65(1): 40-51
Summary. Consider the ranking of genes using data from replicated microarray time course experiments, where there are multiple biological conditions, and the genes of interest are those whose temporal profiles differ across conditions. We derive a multisample multivariate empirical Bayes' statistic for ranking genes in the order of differential expression, from both longitudinal and cross-sectional replicated developmental microarray time course data. Our longitudinal multisample model assumes that time course replicates are independent and identically distributed multivariate normal vectors. On the other hand, we construct a cross-sectional model using a normal regression framework with any appropriate basis for the design matrices. In both cases, we use natural conjugate priors in our empirical Bayes' setting which guarantee closed form solutions for the posterior odds. The simulations and two case studies using published worm and mouse microarray time course datasets indicate that the proposed approaches perform satisfactorily.

9.
It is now possible to construct genome-scale metabolic networks for particular microorganisms. Extreme pathway analysis is a useful method for analyzing the phenotypic capabilities of these networks. Many extreme pathways are needed to fully describe the functional capabilities of genome-scale metabolic networks, and therefore, a need exists to develop methods to study these large sets of extreme pathways. Singular value decomposition (SVD) of matrices of extreme pathways was used to develop a conceptual framework for the interpretation of large sets of extreme pathways and the steady-state flux solution space they define. The key results of this study were: (1) convex steady-state solution cones describing the potential functions of biochemical networks can be studied using the modes generated by SVD; (2) Helicobacter pylori has a more rigid metabolic network (i.e., a lower dimensional solution space and a more dominant first singular value) than Haemophilus influenzae for the production of amino acids; and (3) SVD allows for direct comparison of different solution cones resulting from the production of different amino acids. SVD was used to identify key network branch points that may identify key control points for regulation. Therefore, SVD of matrices of extreme pathways has proved to be a useful method for analyzing the steady-state solution space of genome-scale metabolic networks.

10.
11.
The amplified fragment length polymorphisms (AFLP) method has become an attractive tool in phylogenetics due to the ease with which large numbers of characters can be generated. In contrast to sequence-based phylogenetic approaches, AFLP data consist of anonymous multilocus markers. However, potential artificial amplifications or amplification failures of fragments contained in the AFLP data set will reduce AFLP reliability especially in phylogenetic inferences. In the present study, we introduce a new automated scoring approach, called “AMARE” (AFLP MAtrix REduction). The approach is based on replicates and makes marker selection dependent on marker reproducibility to control for scoring errors. To demonstrate the effectiveness of our approach we record error rate estimations, resolution scores, PCoA and stemminess calculations. As in general the true tree (i.e. the species phylogeny) is not known, we tested AMARE with empirical, already published AFLP data sets, and compared tree topologies of different AMARE generated character matrices to existing phylogenetic trees and/or other independent sources such as morphological and geographical data. It turns out that the selection of masked character matrices with highest resolution scores gave similar or even better phylogenetic results than the original AFLP data sets.

12.
Gene prioritization through genomic data fusion   (cited 4 times: 0 self-citations, 4 by others)
The identification of genes involved in health and disease remains a challenge. We describe a bioinformatics approach, together with a freely accessible, interactive and flexible software termed Endeavour, to prioritize candidate genes underlying biological processes or diseases, based on their similarity to known genes involved in these phenomena. Unlike previous approaches, ours generates distinct prioritizations for multiple heterogeneous data sources, which are then integrated, or fused, into a global ranking using order statistics. In addition, it offers the flexibility of including additional data sources. Validation of our approach revealed it was able to efficiently prioritize 627 genes in disease data sets and 76 genes in biological pathway sets, identify candidates of 16 mono- or polygenic diseases, and discover regulatory genes of myeloid differentiation. Furthermore, the approach identified a novel gene involved in craniofacial development from a 2-Mb chromosomal region, deleted in some patients with DiGeorge-like birth defects. The approach described here offers an alternative integrative method for gene discovery.

13.

Background

In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Due to the fixed-height cutting, those clusters may not unravel significant functional coherence hidden deeper in the tree. Besides that, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC. Thus, the identified subgroups may be difficult to interpret in relation to that information.

Results

We develop a novel framework for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework, called guided piecewise snipping, utilizes both molecular data and clinical information to decompose the HC tree into clusters. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) which not only represents a structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.

Conclusions

The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters. This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree possibly at variable heights while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data like survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R-package HCsnip.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.

14.
Craniometric measurements represent a useful tool for studying the differentiation of mammal populations. However, the fragility of skulls often leads to incomplete data matrices. Damaged specimens or incomplete sets of measurements are usually discarded prior to statistical analysis. We assessed the performance of two strategies that avoid elimination of observations: (1) pairwise deletion of missing cells, and (2) estimation of missing data using available measurements. The effects of these distinct approaches on the computation of inter-individual distances and population differentiation analyses were evaluated using craniometric measurements obtained from insular populations of deer mice, Peromyscus maniculatus (Wagner, 1845). In our simulations, Euclidean distances were greatly altered by pairwise deletion, whereas Gower’s distance coefficient corrected for missing data provided accurate results. Among the different estimation methods compared in this paper, the regression-based approximations weighted by coefficients of determination (r²) outperformed the competing approaches. We further show that incomplete sets of craniometric measurements can be used to compute distance matrices, provided that an appropriate coefficient is selected. However, the application of estimation procedures provides a flexible approach that allows researchers to analyse incomplete data sets.
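A Gower-type distance that tolerates missing measurements can be sketched as follows. This is a minimal illustration assuming quantitative variables only, missing values coded as `None`, and variable ranges supplied by the caller. The function name and the toy numbers are hypothetical and are not taken from the study.

```python
def gower_distance(x, y, ranges):
    """Mean range-scaled absolute difference over variables measured in both specimens.

    Variables missing (None) in either specimen are skipped, and the sum is
    divided by the number of variables actually compared.
    """
    total, used = 0.0, 0
    for xi, yi, value_range in zip(x, y, ranges):
        if xi is None or yi is None:
            continue  # this measurement is unavailable for the pair
        total += abs(xi - yi) / value_range
        used += 1
    if used == 0:
        raise ValueError("no variable was measured in both specimens")
    return total / used


# Toy skulls with three measurements; the third is missing for skull_a.
skull_a = [20.0, 10.0, None]
skull_b = [22.0, 11.0, 5.0]
ranges = [4.0, 2.0, 1.0]  # assumed range of each variable across the sample
distance = gower_distance(skull_a, skull_b, ranges)
```

Because the sum is divided by the number of variables actually compared, a specimen with a broken skull is not artificially pulled closer to the others, which is the kind of distortion the abstract reports for Euclidean distances under pairwise deletion.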

15.
It is well known among phylogeneticists that adding an extra taxon (e.g. species) to a data set can alter the structure of the optimal phylogenetic tree in surprising ways. However, little is known about this “rogue taxon” effect. In this paper we characterize the behavior of balanced minimum evolution (BME) phylogenetics on data sets of this type using tools from polyhedral geometry. First we show that for any distance matrix there exist distances to a “rogue taxon” such that the BME-optimal tree for the data set with the new taxon does not contain any nontrivial splits (bipartitions) of the optimal tree for the original data. Second, we prove a theorem which restricts the topology of BME-optimal trees for data sets of this type, thus showing that a rogue taxon cannot have an arbitrary effect on the optimal tree. Third, we computationally construct polyhedral cones that give complete answers for BME rogue taxon behavior when our original data fits a tree on four, five, and six taxa. We use these cones to derive sufficient conditions for rogue taxon behavior for four taxa, and to understand the frequency of the rogue taxon effect via simulation.

16.
We propose and study the notion of dense regions for the analysis of categorized gene expression data and present some searching algorithms for discovering them. The algorithms can be applied to any categorical data matrices derived from gene expression level matrices. We demonstrate that dense regions are simple but useful and statistically significant patterns that can be used to 1) identify genes and/or samples of interest and 2) eliminate genes and/or samples corresponding to outliers, noise, or abnormalities. Some theoretical studies on the properties of the dense regions are presented which allow us to characterize dense regions into several classes and to derive tailor-made algorithms for different classes of regions. Moreover, an empirical simulation study on the distribution of the size of dense regions is carried out which is then used to assess the significance of dense regions and to derive effective pruning methods to speed up the searching algorithms. Real microarray data sets are employed to test our methods. Comparisons with six other well-known clustering algorithms using synthetic and real data are also conducted which confirm the superiority of our methods in discovering dense regions. The DRIFT code and a tutorial are available as supplemental material, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm.

17.
Distance-based methods have been a valuable tool for ecologists for decades. Indirectly, distance-based ordination and cluster analysis, in particular, have been widely practiced as they allow the visualization of a multivariate data set in a few dimensions. The explicitly distance-based Mantel test and multiple regression on distance matrices (MRM) add hypothesis testing to the toolbox. One concern for ecologists wishing to use these methods lies in deciding whether to combine data vectors into a compound multivariate dissimilarity or to analyze them individually. For Euclidean distances on scaled data, the correlation of a pair of multivariate distance matrices can be calculated from the correlations between the two sets of individual distance matrices if one set is orthogonal, demonstrating a clear link between individual and compound distances. The choice between Mantel and MRM should be driven by ecological hypotheses rather than mathematical concerns. The relationship between individual and compound distance matrices also provides a means for calculating the maximum possible value of the Mantel statistic, which can be considerably less than 1 for a given analysis. These relationships are demonstrated with simulated data. Although these mathematical relationships are only strictly true for Euclidean distances when one set of variables is orthogonal, simulations show that they are approximately true for weakly correlated variables and Bray–Curtis dissimilarities.

18.
This paper has two complementary purposes: first, to present a method to perform multiple regression on distance matrices, with permutation testing appropriate for path-length matrices representing evolutionary trees, and then, to apply this method to study the joint evolution of brain, behavior and other characteristics in marsupials. To understand the computation method, consider that the dependent matrix is unfolded as a vector y; similarly, consider X to be a table containing the independent matrices, also unfolded as vectors. A multiple regression is computed to express y as a function of X. The parameters of this regression (R² and partial regression coefficients) are tested by permutations, as follows. When the dependent matrix variable y represents a simple distance or similarity matrix, permutations are performed in the same manner as the Mantel permutational test. When it is an ultrametric matrix representing a dendrogram, we use the double-permutation method (Lapointe and Legendre 1990, 1991). When it is a path-length matrix representing an additive tree (cladogram), we use the triple-permutation method (Lapointe and Legendre 1992). The independent matrix variables in X are kept fixed with respect to one another during the permutations. Selection of predictors can be accomplished by forward selection, backward elimination, or a stepwise procedure. A phylogenetic tree, derived from marsupial brain morphology data (28 species), is compared to trees depicting the evolution of diet, sociability, locomotion, and habitat in these animals, as well as their taxonomy and geographical relationships. A model is derived in which brain evolution can be predicted from taxonomy, diet, sociability and locomotion (R² = 0.75). A new tree, derived from the “predicted” data, shows a lot of similarity to the brain evolution tree. The meanings of the taxonomy, diet, sociability, and locomotion predictors are discussed and conclusions are drawn about the evolution of brain and behavior in marsupials.
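The unfolding and permutation steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: it covers only the simple Mantel-style permutation for an ordinary distance matrix (not the double- or triple-permutation variants), it tests R² only, and every function name and toy matrix is hypothetical.

```python
import random


def unfold(matrix):
    """Upper triangle of a symmetric matrix as a flat vector."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def permute(matrix, order):
    """Apply the same permutation to rows and columns (Mantel-style)."""
    return [[matrix[i][j] for j in order] for i in order]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of the least-squares fit of y on the predictors plus an intercept."""
    columns = [[1.0] * len(y)] + predictors
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * v for p, v in zip(ci, y)) for ci in columns]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[k] for b, col in zip(beta, columns)) for k in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def mrm_test(dependent, independents, permutations=199, seed=0):
    """Return (observed R^2, permutation p-value); only the dependent matrix is permuted."""
    rng = random.Random(seed)
    predictors = [unfold(m) for m in independents]  # X stays fixed throughout
    observed = r_squared(unfold(dependent), predictors)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(len(dependent)))
        rng.shuffle(order)
        if r_squared(unfold(permute(dependent, order)), predictors) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Toy 4-object matrices; Y is constructed as exactly 2 * X1 + X2.
X1 = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
X2 = [[0, 1, 4, 2], [1, 0, 3, 5], [4, 3, 0, 1], [2, 5, 1, 0]]
Y = [[0, 3, 8, 8], [3, 0, 5, 9], [8, 5, 0, 3], [8, 9, 3, 0]]
r2, p_value = mrm_test(Y, [X1, X2], permutations=99, seed=1)
```

Permuting rows and columns of the dependent matrix together keeps each object's distances intact, which is why the unfolded entries cannot simply be shuffled as independent observations.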

19.
Until recently, phylogenetic analyses have been routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are now based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular data sets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model), assuming that branch lengths are proportional among genes (proportional model), or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino acid data sets, our results suggest that, depending on the data set chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all the data sets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models we analyzed alternative mammalian tree topologies, and we describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.

20.
Different genes often have different phylogenetic histories. Even within regions having the same phylogenetic history, the mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology, but the branch lengths vary (with possibly different tree shapes). Furthering work of Kolaczkowski and Thornton (2004, Nature 431: 980-984) and Chang (1996, Math. Biosci. 134: 189-216), we show examples where maximum likelihood (under a homogeneous model) is an inconsistent estimator of the tree. We then explore the prospects of phylogenetic inference under a heterogeneous model. In some models, there are examples where phylogenetic inference under any method is impossible - despite the fact that there is a common tree topology. In particular, there are nonidentifiable mixture distributions, i.e., multiple topologies generate identical mixture distributions. We address which evolutionary models have nonidentifiable mixture distributions and prove that the following duality theorem holds for most DNA substitution models. The model has either: (i) nonidentifiability - two different tree topologies can produce identical mixture distributions, and hence distinguishing between the two topologies is impossible; or (ii) linear tests - there exist linear tests which identify the common tree topology for character data generated by a mixture distribution. The theorem holds for models whose transition matrices can be parameterized by open sets, which includes most of the popular models, such as Tamura-Nei and Kimura's 2-parameter model. The duality theorem relies on our notion of linear tests, which are related to Lake's linear invariants.
