Similar literature
 20 similar articles found.
1.
The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population‐genetic data analysis. Application of model‐based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model‐based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model‐based analyses of population structure in population genetics and molecular ecology.
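The grouping of replicate runs into modes can be illustrated with a minimal sketch of a generic Markov clustering (MCL) loop. This is not Clumpak's code: the function name `markov_cluster`, the inflation value, the thresholds, and the toy similarity matrix are all assumptions made for illustration, and in practice the similarities would come from Clumpp.

```python
def markov_cluster(similarity, inflation=2.0, iterations=50, tol=1e-9):
    """Group items with a basic MCL loop (hypothetical helper, not Clumpak's API)."""
    n = len(similarity)

    def normalize(matrix):
        # Scale every column so that it sums to 1 (a column-stochastic matrix).
        out = [[0.0] * n for _ in range(n)]
        for j in range(n):
            col = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                out[i][j] = matrix[i][j] / col if col > 0 else 0.0
        return out

    m = normalize(similarity)
    for _ in range(iterations):
        # Expansion: square the matrix (two steps of the random walk).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize, which
        # strengthens strong connections and weakens weak ones.
        inflated = normalize([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break

    # Items linked by a non-negligible entry of the final matrix share a cluster.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > 1e-6:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy similarity matrix for five replicate runs (made-up values).
run_similarity = [
    [1.00, 0.95, 0.95, 0.05, 0.05],
    [0.95, 1.00, 0.95, 0.05, 0.05],
    [0.95, 0.95, 1.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.00, 0.90],
    [0.05, 0.05, 0.05, 0.90, 1.00],
]
modes = markov_cluster(run_similarity)
print(modes)
```

Each returned group would correspond to one mode, for which a consensus solution could then be computed.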

2.
Finite mixture models can provide insights into behavioral patterns as a source of heterogeneity in the dynamics of time-course gene expression data by reducing the high dimensionality and clarifying the major components of the underlying structure of the data in terms of unobservable latent variables. The latent structure of the dynamic transition process of gene expression changes over time can be represented by Markov processes. This paper addresses key problems in the analysis of large gene expression data sets that describe systemic temporal response cascades and dynamic changes to therapeutic doses in multiple tissues, such as liver, skeletal muscle, and kidney from the same animals. A Bayesian finite Markov mixture model with a Dirichlet prior is developed for the identification of differentially expressed time-related genes and dynamic clusters. The deviance information criterion is applied to determine the number of components for model comparison and selection. The proposed Bayesian models are applied to multiple tissue polygenetic temporal gene expression data and compared to a Bayesian model‐based clustering method named CAGED. Results show that the proposed Bayesian finite Markov mixture model captures the dynamic changes and patterns of irregular, complex temporal data well. (© 2009 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

3.
Bayesian clustering methods have emerged as a popular tool for assessing hybridization using genetic markers. Simulation studies have shown these methods perform well under certain conditions; however, these methods have not been evaluated using empirical data sets with individuals of known ancestry. We evaluated the performance of two clustering programs, baps and structure, with genetic data from a reintroduced red wolf (Canis rufus) population in North Carolina, USA. Red wolves hybridize with coyotes (C. latrans), and a single hybridization event resulted in introgression of coyote genes into the red wolf population. A detailed pedigree has been reconstructed for the wild red wolf population that includes individuals of 50–100% red wolf ancestry, providing an ideal case study for evaluating the ability of these methods to estimate admixture. Using 17 microsatellite loci, we tested the programs using different training set compositions and varying numbers of loci. structure was more likely than baps to detect an admixed genotype and correctly estimate an individual's true ancestry composition. However, structure was more likely to misclassify a pure individual as a hybrid. Both programs were outperformed by a maximum‐likelihood‐based test designed specifically for this system, which never misclassified a hybrid (50–75% red wolf) as a red wolf or vice versa. Training set composition and the number of loci both had an impact on accuracy but their relative importance varied depending on the program. Our findings demonstrate the importance of evaluating methods used for detecting admixture in the context of endangered species management.

4.
Bayesian mixture model based clustering of replicated microarray data   Cited by: 3 (self-citations: 0; citations by others: 3)
MOTIVATION: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering molecular mechanisms underlying biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicates variability. RESULTS: We developed different variants of Bayesian mixture based clustering procedures for clustering gene expression data with experimental replicates. In this approach, the statistical distribution of microarray data is described by a Bayesian mixture model. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. We define infinite and finite Bayesian mixture models with different between-replicates variance structures and investigate their utility by analyzing synthetic and real-world datasets. Results of our analyses demonstrate that (1) improvements in precision achieved by performing only two experimental replicates can be dramatic when the between-replicates variability is high, (2) precise modeling of intra-gene variability is important for accurate identification of co-expressed genes and (3) the infinite mixture model with the 'elliptical' between-replicates variance structure performed overall better than any other method tested. We also introduce a heuristic modification to the Gibbs sampler based on the 'reverse annealing' principle. This modification effectively overcomes the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions.
Finally, we demonstrate that the Bayesian infinite mixture model with 'elliptical' variance structure is capable of identifying the underlying structure of the data without knowing the 'correct' number of clusters. AVAILABILITY: The MS Windows based program named Gaussian Infinite Mixture Modeling (GIMM) implementing the Gibbs sampler and corresponding C++ code are available at http://homepages.uc.edu/~medvedm/GIMM.htm SUPPLEMENTAL INFORMATION: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

5.
Missing values are commonly observed in three-way three-mode multi-environment trial (MET) data from plant breeding programs. We proposed modifications of models for estimating missing observations in these data arrays, and developed a novel approach based on hierarchical clustering. Multiple imputation (MI) was used in four ways: multiple agglomerative hierarchical clustering, a normal distribution model, a normal regression model, and predictive mean matching. The latter three models used both Bayesian and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing, and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis estimated missing values slightly more accurately than those using non-Bayesian analysis, but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the best overall performance.

6.
The projection of age‐stratified cancer incidence and mortality rates is of great interest due to demographic changes, but also therapeutic and diagnostic developments. Bayesian age–period–cohort (APC) models are well suited for the analysis of such data, but are not yet used in the routine practice of epidemiologists. Reasons may include that Bayesian APC models have been criticized for producing prediction intervals that are too wide. Furthermore, the fitting of Bayesian APC models is usually done using Markov chain Monte Carlo (MCMC), which introduces complex convergence concerns and may be subject to additional technical problems. In this paper we address both concerns, developing efficient MCMC‐free software for routine use in epidemiological applications. We apply Bayesian APC models to annual lung cancer data for females in five different countries, previously analyzed in the literature. To assess the predictive quality, we omit the observations from the last 10 years and compare the projections with the actual observed data based on the absolute error and the continuous ranked probability score. Further, we assess calibration of the one‐step‐ahead predictive distributions. In our application, the probabilistic forecasts obtained by the Bayesian APC model are well calibrated and not too wide. A comparison to projections obtained by a generalized Lee–Carter model is also given. The methodology is implemented in the user‐friendly R‐package BAPC using integrated nested Laplace approximations.
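The two scores named in this abstract can be sketched as follows. This is a minimal illustration, assuming the predictive distribution is represented by a finite set of draws and using the standard sample-based form of the continuous ranked probability score, CRPS = E|X - y| - 0.5 E|X - X'|. The function names and the toy numbers are hypothetical and are not taken from the BAPC package.

```python
def absolute_error(point_forecast, observed):
    """Absolute error between a point projection and the observed value."""
    return abs(point_forecast - observed)


def crps_from_samples(samples, observed):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over the predictive draws."""
    n = len(samples)
    spread_to_observation = sum(abs(x - observed) for x in samples) / n
    spread_within_forecast = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return spread_to_observation - spread_within_forecast


# Made-up predictive draws for one projected rate and its observed value.
draws = [10.0, 12.0, 14.0]
score = crps_from_samples(draws, 12.0)
print(score)
```

Lower values indicate better forecasts, and for a forecast that consists of a single draw the CRPS reduces to the absolute error.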

7.
Many conservation genetics studies in fishes define populations based on capture location. In salmonid fishes, this traditional a priori designation is made by spawning stream, with subsequent post hoc approaches used to define units of conservation. In this study of bull trout from southwestern Alberta, we provide evidence that a model-based Bayesian genetic clustering method may provide a more parsimonious alternative to designating population structure and units of conservation in comparison to traditional methods. The clustering method captured a hierarchical model of population structure, in which seven local populations were nested within three genetic archipelagos. This was in contrast to using simple FST-based approaches between thirteen a priori designated populations, which found significant differences for nearly every pairwise comparison. In addition, assignment test results from Bayesian clustering revealed that movement may be common between sampling locations. These clustering methods are easy to use, intuitive and provide substantial information on populations of fish; this study provides an example of their utility for local fisheries management and conservation.

8.
On the basis of simulated data, this study compares the relative performances of the Bayesian clustering computer programs structure, geneland, geneclust and a new program named tess. While all four programs can detect population genetic structure from multilocus genotypes, only the last three simultaneously analyse geographical data. The programs are compared with respect to their abilities to infer the number of populations, to estimate membership probabilities, and to detect genetic discontinuities and clinal variation. The results suggest that combining analyses using tess and structure offers a convenient way to address inference of spatial population structure.

9.
We investigated the spatial genetic structure of the tiger meta‐population in the Satpura–Maikal landscape of central India using population‐ and individual‐based genetic clustering methods on multilocus genotypic data from 273 individuals. The Satpura–Maikal landscape is classified as a global‐priority Tiger Conservation Landscape (TCL) due to its potential for providing sufficient habitat that will allow the long‐term persistence of tigers. We found that the tiger meta‐population in the Satpura–Maikal landscape has high genetic variation and very low genetic subdivision. Individual‐based Bayesian clustering algorithms reveal two highly admixed genetic populations. We attribute this to forest connectivity and high gene flow in this landscape. However, deforestation, road widening, and mining may sever this connectivity, impede gene exchange, and further exacerbate the genetic division of tigers in central India.

10.
Dropouts are common in longitudinal studies. If the dropout probability depends on the missing observations at or after dropout, this type of dropout is called informative (or nonignorable) dropout (ID). Failure to accommodate such a dropout mechanism in the model will bias the parameter estimates. We propose a conditional autoregressive model for longitudinal binary data with an ID model such that the probabilities of positive outcomes as well as the drop‐out indicator on each occasion are logit linear in some covariates and outcomes. This model, which adopts a marginal model for outcomes and a conditional model for dropouts, is called a selection model. To allow for heterogeneity and clustering effects, the outcome model is extended to incorporate mixture and random effects. Lastly, the model is further extended to a novel model that models the outcome and dropout jointly such that their dependency is formulated through an odds ratio function. Parameters are estimated by a Bayesian approach implemented using the user‐friendly Bayesian software WinBUGS. A methadone clinic dataset is analyzed to illustrate the proposed models. Results show that the treatment time effect is still significant but weaker after allowing for an ID process in the data. Finally, the effect of drop‐out on parameter estimates is evaluated through simulation studies.

11.
Hu Y, Guo Y, Qi D, Zhan X, Wu H, Bruford MW, Wei F. Molecular Ecology, 2011, 20(13): 2662-2675
Clarification of the genetic structure and population history of a species can shed light on the impacts of landscapes, historical climate change and contemporary human activities and thus enables evidence‐based conservation decisions for endangered organisms. The red panda (Ailurus fulgens) is an endangered species distributed along the edge of the Qinghai‐Tibetan Plateau and is currently subject to habitat loss, fragmentation and population decline, thus representing a good model to test the influences of the above‐mentioned factors on a plateau edge species. We combined nine microsatellite loci and 551 bp of mitochondrial control region (mtDNA CR) to explore the genetic structure and demographic history of this species. A total of 123 individuals were sampled from 23 locations across five populations. High levels of genetic variation were identified for both mtDNA and microsatellites. Phylogeographic analyses indicated little geographic structure, suggesting historically wide gene flow. However, microsatellite‐based Bayesian clustering clearly identified three groups (Qionglai‐Liangshan, Xiaoxiangling and Gaoligong‐Tibet). A significant isolation‐by‐distance pattern was detected only after removing Xiaoxiangling. For mtDNA data, there was no statistical support for a historical population expansion or contraction for the whole sample or any population except Xiaoxiangling, where a signal of contraction was detected. However, Bayesian simulations of population history using microsatellite data did pinpoint population declines for Qionglai, Xiaoxiangling and Gaoligong, demonstrating significant influences of human activity on demography. The unique history of the Xiaoxiangling population plays a critical role in shaping the genetic structure of this species, and large‐scale habitat loss and fragmentation are hampering gene flow among populations.
The implications of our findings for the biogeography of the Qinghai‐Tibetan Plateau, subspecies classification and conservation of red pandas are discussed.

12.
Count data sets are traditionally analyzed using the ordinary Poisson distribution. However, the applicability of such a model is limited, as it can be too restrictive to handle specific data structures. The need therefore arises for alternative models that accommodate, for example, (a) zero‐modification (inflation or deflation at the frequency of zeros), (b) overdispersion, and (c) individual heterogeneity arising from clustering or repeated (correlated) measurements made on the same subject. Cases (a)–(b) and (b)–(c) are often treated together in the statistical literature with several practical applications, but models supporting all at once are less common. Hence, this paper's primary goal was to jointly address these issues by deriving a mixed‐effects regression model based on the hurdle version of the Poisson–Lindley distribution. In this framework, the zero‐modification is incorporated by assuming that a binary probability model determines which outcomes are zero‐valued, and a zero‐truncated process is responsible for generating positive observations. Approximate posterior inferences for the model parameters were obtained from a fully Bayesian approach based on the Adaptive Metropolis algorithm. Intensive Monte Carlo simulation studies were performed to assess the empirical properties of the Bayesian estimators. The proposed model was considered for the analysis of a real data set, and its competitiveness regarding some well‐established mixed‐effects models for count data was evaluated. A sensitivity analysis to detect observations that may impact parameter estimates was performed based on standard divergence measures. The Bayesian p‐value and the randomized quantile residuals were considered for model diagnostics.
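The hurdle construction described above can be sketched for a single observation without covariates or random effects. The sketch assumes the usual one-parameter Poisson–Lindley probability mass function, P(Y = y) = θ²(y + θ + 2) / (θ + 1)^(y + 3), and a fixed zero probability `p_zero`. The function names and parameter values are hypothetical, and the paper's mixed-effects regression structure is not reproduced here.

```python
def poisson_lindley_pmf(y, theta):
    """One-parameter Poisson-Lindley probability of the count y (theta > 0)."""
    return theta ** 2 * (y + theta + 2) / (theta + 1) ** (y + 3)


def hurdle_poisson_lindley_pmf(y, theta, p_zero):
    """Hurdle version: a binary part decides zero versus positive,
    and a zero-truncated Poisson-Lindley part generates the positive counts."""
    if y == 0:
        return p_zero
    truncation = 1.0 - poisson_lindley_pmf(0, theta)
    return (1.0 - p_zero) * poisson_lindley_pmf(y, theta) / truncation


# Illustrative parameter values (assumptions, not estimates from the paper).
theta, p_zero = 1.5, 0.4
total_mass = sum(hurdle_poisson_lindley_pmf(y, theta, p_zero) for y in range(201))
print(total_mass)
```

Choosing `p_zero` above or below the baseline zero probability `poisson_lindley_pmf(0, theta)` gives zero inflation or zero deflation, respectively.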

13.
The application of mixed nucleotide/doublet substitution models has recently received attention in RNA‐based phylogenetics. Within a Bayesian approach, it was shown that mixed models outperformed analyses relying on simple nucleotide models. We analysed an mt RNA data set of dragonflies representing all major lineages of Anisoptera plus outgroups, using a mixed model in a Bayesian and parsimony (MP) approach. We used a published mt 16S rRNA secondary consensus structure model and inferred consensus models for the mt 12S rRNA and tRNA valine. Secondary structure information was used to set data partitions for paired and unpaired sites on which doublet or nucleotide models were applied, respectively. Several different doublet models are currently available of which we chose the most appropriate one by a Bayes factor test. The MP reconstructions relied on recoded data for paired sites in order to account for character covariance and an application of the ratchet strategy to find most parsimonious trees. Bayesian and parsimony reconstructions are partly differently resolved, indicating sensitivity of the reconstructions to model specification. Our analyses depict a tree in which the damselfly family Lestidae is sister group to a monophyletic clade Epiophlebia + Anisoptera, contradicting recent morphological and molecular work. In Bayesian analyses, we found a deep split between Libelluloidea and a clade ‘Aeshnoidea’ within Anisoptera largely congruent with Tillyard’s early ideas of anisopteran evolution, which had been based on evidently plesiomorphic character states. However, parsimony analysis did not support a clade ‘Aeshnoidea’, but instead, placed Gomphidae as sister taxon to Libelluloidea. Monophyly of Libelluloidea is only modestly supported, and many inter‐family relationships within Libelluloidea do not receive substantial support in Bayesian and parsimony analyses. 
We checked whether high Bayesian node support was inflated owing to either: (i) wrong secondary consensus structures; (ii) under‐sampling of the MCMC process, thereby missing other local maxima; or (iii) unrealistic prior assumptions on topologies or branch lengths. We found that different consensus structure models exert strong influence on the reconstruction, which demonstrates the importance of taxon‐specific realistic secondary structure models in RNA phylogenetics.

14.
Several statistical methods have been proposed for estimating the infection prevalence based on pooled samples, but these methods generally presume the application of perfect diagnostic tests, which in practice do not exist. To optimize prevalence estimation based on pooled samples, currently available and new statistical models were described and compared. Three groups were tested: (a) Frequentist models, (b) Monte Carlo Markov‐Chain (MCMC) Bayesian models, and (c) Exact Bayesian Computation (EBC) models. Simulated data allowed the comparison of the models, including testing the performance under complex situations such as imperfect tests with a sensitivity varying according to the pool weight. In addition, all models were applied to data derived from the literature, to demonstrate the influence of the model on real‐prevalence estimates. All models were implemented in the freely available R and OpenBUGS software and are presented in Appendix S1. Bayesian models can flexibly take into account the imperfect sensitivity and specificity of the diagnostic test (as well as the influence of pool‐related or external variables) and are therefore the method of choice for calculating population prevalence based on pooled samples. However, when using such complex models, very precise information on test characteristics is needed, which may in general not be available.
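To make the role of test sensitivity and specificity concrete, the sketch below gives a simple frequentist point estimate for equal-sized pools. It assumes constant sensitivity Se and specificity Sp, so that a pool of size k tests positive with probability Se - (Se + Sp - 1)(1 - p)^k, and it inverts this relation for the prevalence p. The function name and the numbers are hypothetical; this is not the code from the paper's Appendix S1, and it does not cover the Bayesian models.

```python
def pooled_prevalence(positive_pools, total_pools, pool_size,
                      sensitivity=1.0, specificity=1.0):
    """Point estimate of individual-level prevalence from pooled test results."""
    discrimination = sensitivity + specificity - 1.0
    if discrimination <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    observed_positive_rate = positive_pools / total_pools
    # Estimated probability that a pool contains no infected individual.
    pool_negative = (sensitivity - observed_positive_rate) / discrimination
    pool_negative = min(1.0, max(0.0, pool_negative))  # keep within [0, 1]
    return 1.0 - pool_negative ** (1.0 / pool_size)


# Illustrative call: 10 positive pools out of 100 pools of 5, perfect test.
estimate = pooled_prevalence(10, 100, 5)
print(estimate)
```

With a perfect test the estimate reduces to 1 - (1 - positive rate)^(1/k); with an imperfect test the same observed positive rate maps to a different prevalence, which is why the test characteristics must be known accurately.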

15.
MOTIVATION: Bioinformatics clustering tools are useful at all levels of proteomic data analysis. Proteomics studies can provide a wealth of information and rapidly generate large quantities of data from the analysis of biological specimens. The high dimensionality of data generated from these studies requires the development of improved bioinformatics tools for efficient and accurate data analyses. For proteome profiling of a particular system or organism, a number of specialized software tools are needed. Indeed, significant advances in the informatics and software tools necessary to support the analysis and management of these massive amounts of data are needed. Clustering algorithms based on probabilistic and Bayesian models provide an alternative to heuristic algorithms. The number of clusters (diseased and non-diseased groups) is reduced to the choice of the number of components of a mixture of underlying probability. The Bayesian approach is a tool for including information from the data to the analysis. It offers an estimation of the uncertainties of the data and the parameters involved. RESULTS: We present novel algorithms that can organize, cluster and derive meaningful patterns of expression from large-scaled proteomics experiments. We processed raw data using a graphical-based algorithm by transforming it from a real space data-expression to a complex space data-expression using discrete Fourier transformation; then we used a thresholding approach to denoise and reduce the length of each spectrum. Bayesian clustering was applied to the reconstructed data. In comparison with several other algorithms used in this study, including K-means, Kohonen self-organizing map (SOM), and linear discriminant analysis, the Bayesian-Fourier model-based approach consistently displayed superior performance in selecting the correct model and the number of clusters, thus providing a novel approach for accurate diagnosis of the disease.
Using this approach, we were able to successfully denoise proteomic spectra and reach up to a 99% total reduction of the number of peaks compared to the original data. In addition, the Bayesian-based approach generated a better classification rate in comparison with other classification algorithms. This new finding will allow us to apply the Fourier transformation for the selection of the protein profile for each sample, and to develop a novel bioinformatic strategy based on Bayesian clustering for biomarker discovery and optimal diagnosis.
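The transform-then-threshold step can be illustrated with a minimal sketch. It assumes that thresholding means discarding Fourier coefficients whose magnitude falls below a cutoff and reconstructing the spectrum from the rest; the abstract does not specify the rule, so this choice, the function names, and the toy signal are all assumptions.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier transform of a real sequence (direct O(n^2) form)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(coefficients):
    """Inverse discrete Fourier transform."""
    n = len(coefficients)
    return [sum(coefficients[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def threshold_denoise(signal, threshold):
    """Zero out coefficients with magnitude below the threshold, then reconstruct."""
    kept = [c if abs(c) >= threshold else 0j for c in dft(signal)]
    n_kept = sum(1 for c in kept if c != 0)
    reconstructed = [value.real for value in inverse_dft(kept)]
    return reconstructed, n_kept


# Toy "spectrum": one strong component plus a weak high-frequency ripple.
n_points = 16
spectrum = [3.0 * math.cos(2 * math.pi * 2 * t / n_points)
            + 0.1 * math.cos(2 * math.pi * 7 * t / n_points)
            for t in range(n_points)]
clean, n_kept = threshold_denoise(spectrum, threshold=5.0)
print(n_kept)
```

For real spectra a fast Fourier transform routine would replace the direct transform used here, which is written out only to keep the sketch self-contained.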

16.
Modeling protein structures is critical for understanding protein functions in various biological and biotechnological studies. Among representative protein structure modeling approaches, template‐based modeling (TBM) is by far the most reliable and most widely used approach to model protein structures. However, it remains a challenge to select appropriate software programs for pairwise alignments and model building, the two major steps of TBM. In this paper, pairwise alignment methods for TBM are first compared with respect to the quality of structure models built using these methods. This comparative study is conducted using comprehensive datasets, which cover 6185 domain sequences from Structural Classification of Proteins extended for soluble proteins, and 259 Protein Data Bank entries (whole protein sequences) from Orientations of Proteins in Membranes database for membrane proteins. Overall, a profile‐based method, especially PSI‐BLAST, consistently shows high performance across the datasets and model evaluation metrics used. Next, use of two model building programs, MODELLER and SWISS‐MODEL, does not seem to significantly affect the quality of protein structure models built except for the Hard group (a group of relatively less homologous proteins) of membrane proteins. The results presented in this study will be useful for more accurate implementation of TBM.

17.
Introgression in admixed populations can be used to identify candidate loci that might underlie adaptation or reproductive isolation. The Bayesian genomic cline model provides a framework for quantifying variable introgression in admixed populations and identifying regions of the genome with extreme introgression that are potentially associated with variation in fitness. Here we describe the bgc software, which uses Markov chain Monte Carlo to estimate the joint posterior probability distribution of the parameters in the Bayesian genomic cline model and designate outlier loci. This software can be used with next‐generation sequence data, accounts for uncertainty in genotypic state, and can incorporate information from linked loci on a genetic map. Output from the analysis is written to an HDF5 file for efficient storage and manipulation. This software is written in C++. The source code, software manual, compilation instructions and example data sets are available under the GNU Public License at http://sites.google.com/site/bgcsoftware/.

18.
Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles, which extend previously published cluster analyses of these data.
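The similarity measure described here can be sketched by counting how often two genes share a cluster label across posterior samples. The sketch assumes the sampler's output is available as a list of label vectors, one per retained iteration; the function names and the four toy samples are hypothetical, and the DPM sampler itself is not shown.

```python
def coclustering_probabilities(label_samples):
    """Estimate P(genes i and j share a cluster) as the fraction of samples
    in which they carry the same label."""
    n_genes = len(label_samples[0])
    counts = [[0] * n_genes for _ in range(n_genes)]
    for labels in label_samples:
        for i in range(n_genes):
            for j in range(n_genes):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_samples = len(label_samples)
    return [[count / n_samples for count in row] for row in counts]


def to_dissimilarity(similarity):
    """Turn co-clustering probabilities into dissimilarities (1 - probability)."""
    return [[1.0 - p for p in row] for row in similarity]


# Four made-up posterior samples of cluster labels for three genes.
samples = [[0, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1]]
similarity = coclustering_probabilities(samples)
dissimilarity = to_dissimilarity(similarity)
print(similarity[0][1], dissimilarity[0][2])
```

Because only label equality within a sample is used, the measure is unaffected by label switching between samples, and the dissimilarity matrix can be passed to a standard linkage routine for hierarchical clustering.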

19.
Jian Zhang, Faming Liang. Biometrics, 2010, 66(4): 1078-1086
Summary: Clustering is a widely used method for extracting useful information from gene expression data, where unknown correlation structures in genes are believed to persist even after normalization. Such correlation structures pose a great challenge to conventional clustering methods, such as the Gaussian mixture (GM) model, k‐means (KM), and partitioning around medoids (PAM), which are not robust against general dependence within data. Here we use the exponential power mixture model to increase the robustness of clustering against general dependence and nonnormality of the data. An expectation–conditional maximization algorithm is developed to calculate the maximum likelihood estimators (MLEs) of the unknown parameters in these mixtures. The Bayesian information criterion is then employed to determine the number of components of the mixture. The MLEs are shown to be consistent under sparse dependence. Our numerical results indicate that the proposed procedure outperforms GM, KM, and PAM when there are strong correlations or non‐Gaussian components in the data.
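Selecting the number of mixture components with the Bayesian information criterion can be sketched as below, using the common form BIC = -2 log L + k ln n, where smaller is better. The log-likelihoods and parameter counts are made-up placeholders standing in for fitted mixtures; the expectation–conditional maximization fit itself is not shown, and the function names are hypothetical.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion in the 'smaller is better' form."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


def select_number_of_components(fits, n_observations):
    """Return the component count whose fitted model has the lowest BIC.

    `fits` maps a number of components to a pair
    (maximized log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_observations))


# Placeholder fit summaries for 1, 2 and 3 components (illustrative only).
candidate_fits = {1: (-520.0, 2), 2: (-480.0, 5), 3: (-478.5, 8)}
best = select_number_of_components(candidate_fits, n_observations=200)
print(best)
```

In this toy case the third component improves the log-likelihood only slightly, so the penalty term outweighs the gain and two components are selected.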

20.
SUMMARY: It makes intuitive sense to model the natural history of breast cancer, that is, tumor progression from the preclinical screen-detectable phase (PCDP) to clinical disease, as a multistate process. However, the multilevel structure of the available data, which generally come from cluster (family)-based service screening programs, makes the required parameter estimation intractable because there is a correlation between screening rounds in the same individual, and between subjects within clusters (families). There is also residual heterogeneity after adjusting for covariates. We therefore proposed a Bayesian hierarchical multistate Markov model with fixed and random effects and applied it to data from a high-risk group (women with a family history of breast cancer) participating in a family-based screening program for breast cancer. A total of 4867 women attended (representing 4464 families) and by the end of 2002, a total of 130 breast cancer cases were identified. Parameter estimation was carried out using Markov chain Monte Carlo (MCMC) simulation and the Bayesian software package WinBUGS. Models with and without random effects were considered. Our preferred model included a random-effect term for the transition rate from preclinical to clinical disease (sigma(2)(2f)), which was estimated to be 0.50 (95% credible interval = 0.22-1.49). Failure to account for this random effect was shown to lead to bias. The incorporation of covariates into multistate models with random effects not only reduced residual heterogeneity but also improved convergence to the stationary distribution. Our proposed Bayesian hierarchical multistate model is a valuable tool for estimating the rate of transitions between disease states in the natural history of breast cancer (and possibly other conditions). Unlike existing models, it can cope with the correlation structure of multilevel screening data, covariates, and residual (unexplained) variation.
