Similar literature
 20 similar articles found (search time: 31 ms)
1.
2.
MOTIVATION: Current self-organizing map (SOM) approaches to gene expression pattern clustering require the user to predefine the expected number of clusters. Hierarchical clustering methods used in this area do not provide a unique partitioning of the data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates the merits of hierarchical clustering with the robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm, applied to DNA microarray data sets of two types of cancers, has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm, tested on leukemia microarray data containing three leukemia types, was able to determine three major clusters and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low; it is therefore labelled as an uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test, performed using colon cancer microarray data, automatically derived two clusters, which is consistent with the number of classes in the data (cancerous and normal). AVAILABILITY: Java software for the dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

3.
Different genes often have different phylogenetic histories. Even within regions having the same phylogenetic history, the mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology, but the branch lengths vary (with possibly different tree shapes). Furthering work of Kolaczkowski and Thornton (2004, Nature 431: 980-984) and Chang (1996, Math. Biosci. 134: 189-216), we show examples where maximum likelihood (under a homogeneous model) is an inconsistent estimator of the tree. We then explore the prospects of phylogenetic inference under a heterogeneous model. In some models, there are examples where phylogenetic inference under any method is impossible - despite the fact that there is a common tree topology. In particular, there are nonidentifiable mixture distributions, i.e., multiple topologies generate identical mixture distributions. We address which evolutionary models have nonidentifiable mixture distributions and prove that the following duality theorem holds for most DNA substitution models. The model has either: (i) nonidentifiability - two different tree topologies can produce identical mixture distributions, and hence distinguishing between the two topologies is impossible; or (ii) linear tests - there exist linear tests which identify the common tree topology for character data generated by a mixture distribution. The theorem holds for models whose transition matrices can be parameterized by open sets, which includes most of the popular models, such as Tamura-Nei and Kimura's 2-parameter model. The duality theorem relies on our notion of linear tests, which are related to Lake's linear invariants.
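
The mixture setting in this abstract can be written compactly. The notation below (T, \theta_i, w_i, \mu, k) is introduced here for illustration only and is not taken from the abstract; it is a sketch of the standard definition, not the paper's exact formulation.

```latex
% Mixture over k branch-length settings on one common topology T:
\mu_{T}(w,\theta) \;=\; \sum_{i=1}^{k} w_i \,\mu_{T,\theta_i},
\qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1 .

% Nonidentifiability: two different topologies give the same mixture.
\exists\, T \ne T' ,\; (w,\theta),\; (w',\theta') :
\quad \mu_{T}(w,\theta) \;=\; \mu_{T'}(w',\theta') .
```

Here \mu_{T,\theta_i} denotes the distribution over site patterns generated on topology T with branch-length parameters \theta_i. When the second condition holds, no method can tell T from T' using data drawn from the mixture.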

4.
OBJECTIVE: To prospectively review brush smears obtained during endoscopic retrograde cholangiopancreatography (ERCP) primarily from the biliary tree. STUDY DESIGN: A total of 175 specimens from 147 patients were included in the study. The smears, prepared directly from the endoscopic brush, were stained by the Papanicolaou technique and analyzed for standard cytologic features. RESULTS: The smears were categorized into benign/reactive, significant atypia and suspicious/positive. The consistent features seen in suspicious or positive smears were tightly cohesive, small, three-dimensional cell clusters that formed cell balls. The cells in the clusters displayed features of malignant cells. CONCLUSION: ERCP-guided brushing is a safe diagnostic procedure for the evaluation of biliary tree lesions. Small, three-dimensional epithelial clusters with marked atypia signify malignancy and warrant the diagnosis of a malignant neoplasm even when only one or two such clusters are seen in the smears. Single cells, cytoplasmic vacuoles and prominent nucleoli are not essential for a diagnosis of malignancy.

5.

Background

In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Due to the fixed-height cutting, those clusters may not unravel significant functional coherence hidden deeper in the tree. Besides that, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC. Thus, the identified subgroups may be difficult to interpret in relation to that information.

Results

We develop a novel framework for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework, called guided piecewise snipping, utilizes both molecular data and clinical information to decompose the HC tree into clusters. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) which does not only represent a structure deemed to underlie the data from which HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.

Conclusions

The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters. This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree, possibly at variable heights, while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data like survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R-package HCsnip.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.
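
The difference between a fixed-height cut and a variable-height cut of an HC tree, as described in this abstract, can be sketched in a few lines of Python. This is an illustration only and does not reproduce HCsnip: the nested-tuple tree encoding, the function names, the sample labels, and the "all samples share one clinical label" rule are all assumptions made for the example.

```python
# A tree node is either a leaf label (str) or a tuple (merge_height, left, right).

def leaves(node):
    """Return the leaf labels under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def cut_fixed(node, height):
    """Fixed-height cut: every subtree merged at or below `height` is one cluster."""
    if isinstance(node, str) or node[0] <= height:
        return [leaves(node)]
    _, left, right = node
    return cut_fixed(left, height) + cut_fixed(right, height)

def cut_variable(node, keep_together):
    """Variable-height cut: descend until the caller's rule accepts a subtree."""
    if isinstance(node, str) or keep_together(node):
        return [leaves(node)]
    _, left, right = node
    return cut_variable(left, keep_together) + cut_variable(right, keep_together)

# Hypothetical dendrogram over six samples.
tree = (4.0, (1.0, "s1", "s2"), (3.0, (0.5, "s3", "s4"), (2.5, "s5", "s6")))

# Hypothetical clinical labels used as the guiding information.
clinical = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "B", "s6": "B"}

def same_label(node):
    """Toy guiding rule: accept a subtree if all its samples share one label."""
    return len({clinical[s] for s in leaves(node)}) == 1

fixed_clusters = cut_fixed(tree, 2.0)
guided_clusters = cut_variable(tree, same_label)
```

In this toy tree the fixed cut at height 2.0 splits s5 and s6 into singletons, while the guided cut keeps s3 to s6 together because they merge above the cut height but agree on the clinical label.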

6.
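Effect of outgroup selection on phylogenetic reconstruction of Halictidae (Hymenoptera: Apoidea)   Cited by: 1 (self-citations: 0; citations by others: 1)
Outgroups are used to root trees and to infer ancestral character states. Typically, multiple taxa from the sister group of the ingroup are jointly selected as the outgroup. To test this practice empirically, we adopted three outgroup selection strategies: a single taxon from the sister group, multiple taxa from the sister group, and multiple taxa from successive sister groups. Using the phylogenetic reconstruction of Halictidae (Hymenoptera: Apoidea) as an example, we evaluated the effects of these three strategies on tree topology, including maximum likelihood, maximum parsimony, and Bayesian trees. Preliminary results show that, compared with the other two strategies, using multiple taxa from the sister group as the outgroup is more conducive to recovering the now widely accepted phylogenetic relationships of Halictidae. Compared with the maximum likelihood and Bayesian methods, maximum parsimony yielded more consistent topologies across the different outgroup selection strategies, although it did not resolve the phylogenetic relationships of Halictidae well.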

7.
We have developed a phylogenetic tree reconstruction method that detects and reports multiple topologically distant low-cost solutions. Our method is a generalization of the neighbor-joining method of Saitou and Nei and affords a more thorough sampling of the solution space by keeping track of multiple partial solutions during its execution. The scope of the solution space sampling is controlled by a pair of user-specified parameters--the total number of alternate solutions and the number of alternate solutions that are randomly selected--effecting a smooth trade-off between run time and solution quality and diversity. This method can discover topologically distinct low-cost solutions. In tests on biological and synthetic data sets using either the least-squares distance or minimum-evolution criterion, the method consistently performed as well as, or better than, both the neighbor-joining heuristic and the PHYLIP implementation of the Fitch-Margoliash distance measure. In addition, the method identified alternative tree topologies with costs within 1% or 2% of the best, but with topological distances of 9 or more partitions from the best solution (16 taxa); with 32 taxa, topologies were obtained 17 (least-squares) and 22 (minimum-evolution) partitions from the best topology when 200 partial solutions were retained. Thus, the method can find lower-cost tree topologies and near-best tree topologies that are significantly different from the best topology.
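
As a rough illustration of the idea of keeping several candidate joins instead of only the best one, the sketch below scores every pair of taxa with the standard neighbor-joining criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and returns the `keep` lowest-scoring pairs. The function name, the `keep` parameter, and the small distance matrix are assumptions for the example; the abstract does not specify how the authors rank or randomly select partial solutions.

```python
from itertools import combinations

def nj_candidates(dist, names, keep=1):
    """Rank taxon pairs by the neighbor-joining Q criterion (lowest first).

    Classic neighbor joining would join only the first pair; keeping more
    than one is a simplified stand-in for tracking multiple partial solutions.
    """
    n = len(names)
    row_sum = [sum(row) for row in dist]
    scored = []
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
        scored.append((q, names[i], names[j]))
    scored.sort()
    return scored[:keep]

# Hypothetical symmetric distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

best_two = nj_candidates(dist, names, keep=2)
```

With this matrix the best join is (a, b) with Q = -50, and the runner-up (d, e) with Q = -48 is the kind of near-optimal alternative a single greedy pass would discard.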

8.
Nye TM. Systematic Biology, 2008, 57(5): 785-794
Phylogenetic analysis very commonly produces several alternative trees for a given fixed set of taxa. For example, different sets of orthologous genes may be analyzed, or the analysis may sample from a distribution of probable trees. This article describes an approach to comparing and visualizing multiple alternative phylogenies via the idea of a "tree of trees" or "meta-tree." A meta-tree clusters phylogenies with similar topologies together in the same way that a phylogeny clusters species with similar DNA sequences. Leaf nodes on a meta-tree correspond to the original set of phylogenies given by some analysis, whereas interior nodes correspond to certain consensus topologies. The construction of meta-trees is motivated by analogy with construction of a most parsimonious tree for DNA data, but instead of using DNA letters, in a meta-tree the characters are partitions or splits of the set of taxa. An efficient algorithm for meta-tree construction is described that makes use of a known relationship between the majority consensus and parsimony in terms of gain and loss of splits. To illustrate these ideas, meta-trees are constructed for two datasets: a set of gene trees for species of yeast and trees from a bootstrap analysis of a set of gene trees in ray-finned fish. A software tool for constructing meta-trees and comparing alternative phylogenies is available online, and the source code can be obtained from the author.
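
The two building blocks this abstract mentions, splits of the taxon set and the majority consensus, can be sketched as follows. This is not the meta-tree algorithm itself: the nested-tuple tree encoding, the function names, and the three example trees are assumptions for illustration. Each split is stored as the side that does not contain the alphabetically first taxon, so the same bipartition always has the same representation.

```python
from collections import Counter

def leaf_set(node):
    """Return the set of taxa below a node (a leaf is a str, a clade a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def splits(tree):
    """Return the non-trivial splits (bipartitions) induced by a tree's edges."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # reference taxon used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = taxa - side
        if len(side) >= 2 and len(other) >= 2:
            found.add(other if ref in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def majority_splits(trees):
    """Return the splits present in more than half of the trees."""
    counts = Counter(s for t in trees for s in splits(t))
    return {s for s, c in counts.items() if c > len(trees) / 2}

def split_distance(t1, t2):
    """Count splits found in exactly one of the two trees (Robinson-Foulds style)."""
    return len(splits(t1) ^ splits(t2))

# Three hypothetical trees on the same five taxa.
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "B"), ("D", ("C", "E")))
t3 = (("A", "C"), ("B", ("D", "E")))

consensus = majority_splits([t1, t2, t3])
```

Treating each split as a present/absent character is what allows trees to be compared by gain and loss of splits, in the way the abstract describes for meta-tree interior nodes.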

9.
Development of methods for estimating species trees from multilocus data is a current challenge in evolutionary biology. We propose a method for estimating the species tree topology and branch lengths using approximate Bayesian computation (ABC). The method takes as data a sample of observed rooted gene tree topologies, and then iterates through the following sequence of steps: First, a randomly selected species tree is used to compute the distribution of rooted gene tree topologies. This distribution is then compared to the observed gene topology frequencies, and if the fit between the observed and the predicted distributions is close enough, the proposed species tree is retained. Repeating this many times leads to a collection of retained species trees that are then used to form the estimate of the overall species tree. We test the performance of the method, which we call ST-ABC, using both simulated and empirical data. The simulation study examines both symmetric and asymmetric species trees over a range of branch lengths and sample sizes. The results from the simulation study show that the model performs very well, giving accurate estimates for both the topology and the branch lengths across the conditions studied, and that a sample size of 25 loci appears to be adequate for the method. Further, we apply the method to two empirical cases: a 4-taxon data set for primates and a 7-taxon data set for yeast. In both cases, we find that estimates obtained with ST-ABC agree with previous studies. The method provides efficient estimation of the species tree, and does not require sequence data, but rather the observed distribution of rooted gene topologies without branch lengths. Therefore, this method is a useful alternative to other currently available methods for species tree estimation.
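
The propose, predict, compare, and retain loop described in this abstract can be sketched for the smallest possible case, three taxa. Under the multispecies coalescent with one internal branch of length t (in coalescent units), the gene tree matching the species tree has probability 1 - (2/3)exp(-t) and each of the other two has probability (1/3)exp(-t). Everything else here is an assumption for illustration: the uniform priors, the maximum-absolute-difference distance, the tolerance `eps`, and the function names are not taken from ST-ABC.

```python
import math
import random

TOPOLOGIES = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]

def gene_tree_distribution(species_topology, t):
    """Rooted gene tree topology probabilities for a 3-taxon species tree."""
    minor = math.exp(-t) / 3.0
    return {
        topo: (1.0 - 2.0 * minor) if topo == species_topology else minor
        for topo in TOPOLOGIES
    }

def abc_rejection(observed, n_proposals=3000, eps=0.05, max_t=3.0, seed=0):
    """Keep proposed (topology, branch length) pairs whose prediction fits."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_proposals):
        topo = rng.choice(TOPOLOGIES)   # assumed uniform prior on topology
        t = rng.uniform(0.0, max_t)     # assumed uniform prior on branch length
        predicted = gene_tree_distribution(topo, t)
        gap = max(abs(predicted[k] - observed[k]) for k in TOPOLOGIES)
        if gap < eps:
            kept.append((topo, t))
    return kept

# Hypothetical observed gene tree topology frequencies.
observed = {"((A,B),C)": 0.8, "((A,C),B)": 0.1, "((B,C),A)": 0.1}
retained = abc_rejection(observed)
```

The retained proposals then serve as an approximate posterior sample: here they all share the topology ((A,B),C), and their branch lengths cluster around the value of t that reproduces the observed frequencies.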

10.
Yu Y, Degnan JH, Nakhleh L. PLoS Genetics, 2012, 8(4): e1002660
Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.

11.
12.
We examine the impact of likelihood surface characteristics on phylogenetic inference. Amino acid data sets simulated from topologies with branch length features chosen to represent varying degrees of difficulty for likelihood maximization are analyzed. We present situations where the tree found to achieve the global maximum in likelihood is often not equal to the true tree. We use the program covSEARCH to demonstrate how the use of adaptively sized pools of candidate trees that are updated using confidence tests results in solution sets that are highly likely to contain the true tree. This approach requires more computation than traditional maximum likelihood methods, hence covSEARCH is best suited to small to medium-sized alignments or large alignments with some constrained nodes. The majority rule consensus tree computed from the confidence sets also proves to be different from the generating topology. Although low phylogenetic signal in the input alignment can result in large confidence sets of trees, some biological information can still be obtained based on nodes that exhibit high support within the confidence set. Two real data examples are analyzed: mammal mitochondrial proteins and a small tubulin alignment. We conclude that the technique of confidence set optimization can significantly improve the robustness of phylogenetic inference at a reasonable computational cost. Additionally, when either very short internal branches or very long terminal branches are present, confident resolution of specific bipartitions or subtrees, rather than whole-tree phylogenies, may be the most realistic goal for phylogenetic methods. [Reviewing Editor: Dr. Nicolas Galtier]

13.
SUMMARY: LumberJack is a phylogenetic tool intended to serve two purposes: to facilitate sampling treespace to find likely tree topologies quickly, and to map phylogenetic signal onto regions of an alignment in a revealing way. LumberJack creates non-random jackknifed alignments by progressively sliding a window of omission along the alignment. A neighbor-joining tree is built from the full alignment and from each jackknifed alignment, and then the likelihood for each topology (given the original full alignment) is calculated. To determine whether any of the topologies generated is significantly more likely than the others, Kishino-Hasegawa, Shimodaira-Hasegawa and ELW tests are implemented. AVAILABILITY AND SUPPLEMENTARY INFORMATION: http://www.plantbio.uga.edu/~russell/software.html
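
The "sliding window of omission" step can be sketched as below. This covers only the generation of jackknifed alignments, not tree building or the likelihood tests, and it is not LumberJack's code: the function name, the `window` and `step` parameters, and the dictionary-of-strings alignment format are assumptions for the example.

```python
def jackknifed_alignments(alignment, window, step):
    """Yield (start, alignment) pairs with columns [start, start + window) removed.

    `alignment` maps sequence names to equal-length aligned strings.
    """
    length = len(next(iter(alignment.values())))
    for start in range(0, length - window + 1, step):
        end = start + window
        yield start, {
            name: seq[:start] + seq[end:] for name, seq in alignment.items()
        }

# Hypothetical two-sequence alignment, six columns long.
alignment = {"x": "ACGTAC", "y": "ACGAAC"}
jackknifed = list(jackknifed_alignments(alignment, window=2, step=2))
```

Each reduced alignment would then be passed to a tree-building step. Because the start position of the omitted window is kept, any change in the resulting topology can be traced back to a specific region of the alignment.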

14.
Polytomies and Bayesian phylogenetic inference   Cited by: 16 (self-citations: 0; citations by others: 16)
Bayesian phylogenetic analyses are now very popular in systematics and molecular evolution because they allow the use of much more realistic models than currently possible with maximum likelihood methods. There are, however, a growing number of examples in which large Bayesian posterior clade probabilities are associated with very short branch lengths and low values for non-Bayesian measures of support such as nonparametric bootstrapping. For the four-taxon case when the true tree is the star phylogeny, Bayesian analyses become increasingly unpredictable in their preference for one of the three possible resolved tree topologies as data set size increases. This leads to the prediction that hard (or near-hard) polytomies in nature will cause unpredictable behavior in Bayesian analyses, with arbitrary resolutions of the polytomy receiving very high posterior probabilities in some cases. We present a simple solution to this problem involving a reversible-jump Markov chain Monte Carlo (MCMC) algorithm that allows exploration of all of tree space, including unresolved tree topologies with one or more polytomies. The reversible-jump MCMC approach allows prior distributions to place some weight on less-resolved tree topologies, which eliminates misleadingly high posteriors associated with arbitrary resolutions of hard polytomies. Fortunately, assigning some prior probability to polytomous tree topologies does not appear to come with a significant cost in terms of the ability to assess the level of support for edges that do exist in the true tree. Methods are discussed for applying arbitrary prior distributions to tree topologies of varying resolution, and an empirical example showing evidence of polytomies is analyzed and discussed.

15.
Background: We re-evaluate our RNA-As-Graphs clustering approach, using our expanded graph library and new RNA structures, to identify potential RNA-like topologies for design. Our coarse-grained approach represents RNA secondary structures as tree and dual graphs, with vertices and edges corresponding to RNA helices and loops. The graph theoretical framework facilitates graph enumeration, partitioning, and clustering approaches to study RNA structure and its applications. Methods: Clustering graph topologies based on features derived from graph Laplacian matrices and known RNA structures allows us to classify topologies into ‘existing’ or hypothetical, and the latter into ‘RNA-like’ or ‘non-RNA-like’ topologies. Here we update our list of existing tree graph topologies and RAG-3D database of atomic fragments to include newly determined RNA structures. We then use linear and quadratic regression, optionally with dimensionality reduction, to derive graph features and apply several clustering algorithms on our tree-graph library and recently expanded dual-graph library to classify them into the three groups. Results: The unsupervised PAM and K-means clustering approaches correctly classify 72–77% of all existing graph topologies and 75–82% of newly added ones as RNA-like. For supervised k-NN clustering, the cross-validation accuracy ranges from 57 to 81%. Conclusions: Using linear regression with unsupervised clustering, or quadratic regression with supervised clustering, provides better accuracies than supervised/linear clustering. All accuracies are better than random, especially for newly added existing topologies, thus lending credibility to our approach. General significance: Our updated RAG-3D database and motif classification by clustering present new RNA substructures and RNA-like motifs as novel design candidates.
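
The starting point for the graph features mentioned in this abstract is the graph Laplacian, L = D - A (degree matrix minus adjacency matrix). The sketch below only builds L for a small tree graph given as an edge list. The function name and the example graphs are assumptions, and the regression and clustering steps from the abstract are not reproduced here; computing eigenvalues of L would require a numerical library.

```python
def laplacian(n_vertices, edges):
    """Return the graph Laplacian L = D - A as a list of integer rows."""
    lap = [[0] * n_vertices for _ in range(n_vertices)]
    for u, v in edges:
        lap[u][u] += 1   # each edge adds one to both endpoint degrees
        lap[v][v] += 1
        lap[u][v] -= 1   # and subtracts one at the two adjacency positions
        lap[v][u] -= 1
    return lap

# Two hypothetical tree graphs: a 3-vertex path and a 4-vertex star.
path3 = laplacian(3, [(0, 1), (1, 2)])
star4 = laplacian(4, [(0, 1), (0, 2), (0, 3)])
```

Every row of a Laplacian sums to zero, which is a quick sanity check on the construction.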

16.
Large-scale gene amplifications may have facilitated the evolution of morphological innovations that accompanied the origin of vertebrates. This hypothesis predicts that the genomes of extant jawless fish, scions of deeply branching vertebrate lineages, should bear a record of these events. Previous work suggests that nonvertebrate chordates have a single Hox cluster, but that gnathostome vertebrates have four or more Hox clusters. Did the duplication events that produced multiple vertebrate Hox clusters occur before or after the divergence of agnathan and gnathostome lineages? Can investigation of lamprey Hox clusters illuminate the origins of the four gnathostome Hox clusters? To approach these questions, we cloned and sequenced 13 Hox cluster genes from cDNA and genomic libraries in the lamprey, Petromyzon marinus. The results suggest that the lamprey has at least four Hox clusters and support the model that gnathostome Hox clusters arose by a two-round-no-cluster-loss mechanism, with tree topology [(AB)(CD)]. A three-round model, however, is not rigorously excluded by the data and, for this model, the tree topologies [(D(C(AB)))] and [(C(D(AB)))] are most parsimonious. Gene phylogenies suggest that at least one Hox cluster duplication occurred in the lamprey lineage after it diverged from the gnathostome lineage. The results argue against two or more rounds of duplication before the divergence of agnathan and gnathostome vertebrates. If Hox clusters were duplicated in whole-genome duplication events, then these data suggest that, at most, one whole genome duplication occurred before the evolution of vertebrate developmental innovations.

17.
DNA sequences coding for 81% of the ompA gene from 24 chlamydial strains, representing all chlamydial species, were determined from DNA amplified by polymerase chain reactions. Chlamydial strains of serovars and strains with similar chromosomal restriction fragment length polymorphism had identical ompA DNA sequences. The ompA sequences were segregated into 23 different ompA alleles and aligned with each other, and phylogenetic relationships among them were inferred by neighbor-joining and maximum parsimony analyses. The neighbor-joining method produced a single phylogram which was rooted at the branch between two major clusters. One cluster included all Chlamydia trachomatis ompA alleles (trachoma group). The second cluster was composed of three major groups of ompA alleles: psittacosis group (alleles MN, 6BC, A22/M, B577, LW508, FEPN, and GPIC), pneumonia group (Chlamydia pneumoniae AR388 with the allele KOALA), and polyarthritis group (ruminant and porcine chlamydial alleles LW613, 66P130, L71, and 1710S with propensity for polyarthritis). These groups were distinguished through specific DNA sequence signatures. Maximum parsimony analysis yielded two equally most parsimonious phylograms with topologies similar to the ompA tree of neighbor joining. Two phylograms constructed from chlamydial genomic DNA distances had topologies identical to that of the ompA phylogram with respect to branching of the chlamydial species. Human serovars of C. trachomatis with essentially identical genomes represented a single taxonomic unit, while they were divergent in the ompA tree. Consistent with the ompA phylogeny, the porcine isolate S45, previously considered to be Chlamydia psittaci, was identified as C. trachomatis through biochemical characteristics. These data demonstrate that chlamydial ompA allelic relationships, except for human serovars of C. trachomatis, are cognate with chromosomal phylogenies.

18.
Ren F, Ogishima S, Tanaka H. Gene, 2003, 317(1-2): 89-95
A new method for reconstructing phylogenetic relationships of within-host (patient) viral evolution from noncontemporaneous samples is presented. This method has two important features: noncontemporaneous viral samples can be dealt with by a simple computing algorithm, and both neutral and adaptive evolution patterns occurring during the process of viral evolution can be estimated. In our previous study, we proposed a preliminary formulation of this algorithm that was based on the maximum likelihood method. However, that preliminary formulation was difficult to use because the calculation of the likelihood required an extremely large amount of time and the number of possible tree topologies increased exponentially according to the increase in the number of viral variants. In this paper, we propose another new algorithm, referred to as a distance-based sequential-linking algorithm, in which the neighbor-joining method is employed for reconstruction of the longitudinal phylogenetic tree from serial viral samples. This algorithm is applied to a longitudinal data set of the env gene (V3 region) of human immunodeficiency virus type 1 (HIV-1) obtained over 7 years after the infection of a single patient. The results suggest that this method can successfully reconstruct a longitudinal phylogenetic tree from noncontemporaneous viral samples within a reasonable calculation time. This revised method proved to be a useful tool for estimating the dynamic process of within-host viral evolution.

19.
The genetic relationship among the Escherichia coli pathotypes was investigated. We used random amplified polymorphic DNA (RAPD) data for constructing a dendrogram of 73 strains of diarrheagenic E. coli. A phylogenetic tree encompassing 15 serotypes from different pathotypes was constructed using multilocus sequence typing data. Phylogram clusters were used for validating RAPD data on the clonality of enteropathogenic E. coli (EPEC) O serogroup strains. Both analyses showed very similar topologies, characterized by the presence of two major groups: group A includes EPEC H6 and H34 strains and group B contains the other EPEC strains plus all serotypes belonging to atypical EPEC, enteroaggregative E. coli (EAEC) and enterohemorrhagic E. coli (EHEC). These results confirm the existence of two evolutionary divergent groups in EPEC: one is genetically and serologically very homogeneous whereas the other harbors EPEC and non-EPEC serotypes. The same situation was found for EAEC and EHEC.

20.
Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations. However, there is no consensus on which method to use. Most methods make strong assumptions which may bias the choice of polymorphisms and result in computational complexity which limits the analysis to a few samples/polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results to minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a "common" polymorphism as ancient without an internal check on whether it either is homoplasic or is identified as ancient because of sampling bias (from oversampling the population with the polymorphism). A signature of this problem is that different methods applied to the same data, or the same method applied to different datasets, result in different tree topologies. When the results of such analyses are combined, the consensus trees have a low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data and consensus ensemble clustering creates stable haplogroup clusters. The tree is obtained from the bifurcating network obtained when the data are split into k = 2,3,4,...,kmax clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations, is fast, is stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample choice and haplogroup size bias. The internal branches of our tree have a 90% consensus accuracy.
In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a "European R clade" containing the haplogroups H, V, H/V, J, T, and U and a "Eurasian N subclade" including haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in nonnearest locations in agreement with their expected large TMRCA from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree. For recent branches, the consensus accuracy for these methods is in the range of 1-20%. From a comparison of our haplogroups to two chimp and one bonobo sequences, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 +/- 14,000 years before present.
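
One common ingredient of consensus ensemble clustering is a co-association table: for every pair of samples, the fraction of clustering runs that place them in the same cluster. The sketch below computes that table from label lists. It is a generic illustration under assumed inputs, not the procedure used in this study: the function name, the hand-written example runs, and the choice of a plain co-clustering fraction are all assumptions.

```python
from itertools import combinations

def coassociation(runs):
    """Return {(i, j): fraction of runs in which samples i and j co-cluster}.

    `runs` is a list of label lists, one per clustering run; cluster labels
    need not be consistent between runs.
    """
    n_items = len(runs[0])
    together = {pair: 0 for pair in combinations(range(n_items), 2)}
    for labels in runs:
        for i, j in together:
            if labels[i] == labels[j]:
                together[(i, j)] += 1
    return {pair: count / len(runs) for pair, count in together.items()}

# Three hypothetical clustering runs over four samples.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
stability = coassociation(runs)
```

Pairs with a fraction near 1 are stable under resampling, and pairs near 0 are consistently separated. Repeating this for k = 2, 3, ..., kmax shows which groups persist as the number of clusters grows.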
