Similar articles
1.
Haplotype information plays an important role in many genetic analyses. However, the identification of haplotypes based on sequencing methods is both expensive and time consuming. Current sequencing methods can efficiently determine only the conflated data of haplotypes, that is, genotypes. This raises the need to develop computational methods to infer haplotypes from genotypes. Haplotype inference by pure parsimony is an NP-hard problem and remains a challenging task in bioinformatics. In this paper, we propose an efficient ant colony optimization (ACO) heuristic method, named ACOHAP, to solve the problem. The main idea is based on the construction of a binary tree structure through which ants can travel and resolve conflated data of all haplotypes from site to site. Experiments with both small and large data sets show that ACOHAP outperforms other state-of-the-art heuristic methods. ACOHAP is as good as the currently best exact method, RPoly, on small data sets. However, it is much better than RPoly on large data sets. These results demonstrate the efficiency of the ACOHAP algorithm in solving the haplotype inference by pure parsimony problem for both small and large data sets.
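
The pure parsimony objective described above can be illustrated with a brute-force sketch for very small inputs. This is not ACOHAP and not RPoly; it only shows what "the smallest set of haplotypes that explains all genotypes" means. The site encoding (0 and 1 for the two homozygous states, 2 for heterozygous) and the function names are assumptions made for this illustration.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return every unordered haplotype pair that explains one genotype.

    Assumed encoding per site: 0 = homozygous 0, 1 = homozygous 1,
    2 = heterozygous (the two haplotypes differ at that site).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    # Fix the first heterozygous site to 0 in h1 so each pair is listed once.
    n_free = max(len(het_sites) - 1, 0)
    pairs = []
    for bits in product((0, 1), repeat=n_free):
        h1, h2 = list(genotype), list(genotype)
        assignment = (0,) + bits if het_sites else ()
        for site, bit in zip(het_sites, assignment):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def pure_parsimony(genotypes):
    """Exhaustively find a smallest haplotype set explaining all genotypes.

    Exponential in the number of heterozygous sites; toy inputs only.
    """
    best = None
    for choice in product(*(explaining_pairs(g) for g in genotypes)):
        haplotypes = {h for pair in choice for h in pair}
        if best is None or len(haplotypes) < len(best):
            best = haplotypes
    return best


if __name__ == "__main__":
    print(sorted(pure_parsimony([(2, 2), (0, 2), (2, 0)])))
```

The exhaustive search grows exponentially with the number of heterozygous sites, which is why heuristic methods such as the one in this abstract are of interest for large data sets.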

2.

Background

Translating a known metabolic network into a dynamic model requires reasonable guesses of all enzyme parameters. In Bayesian parameter estimation, model parameters are described by a posterior probability distribution, which scores the potential parameter sets, showing how well each of them agrees with the data and with the prior assumptions made.

Results

We compute posterior distributions of kinetic parameters within a Bayesian framework, based on integration of kinetic, thermodynamic, metabolic, and proteomic data. The structure of the metabolic system (i.e., stoichiometries and enzyme regulation) needs to be known, and the reactions are modelled by convenience kinetics with thermodynamically independent parameters. The parameter posterior is computed in two separate steps: a first posterior summarises the available data on enzyme kinetic parameters; an improved second posterior is obtained by integrating metabolic fluxes, concentrations, and enzyme concentrations for one or more steady states. The data can be heterogeneous, incomplete, and uncertain, and the posterior is approximated by a multivariate log-normal distribution. We apply the method to a model of the threonine synthesis pathway: the integration of metabolic data has little effect on the marginal posterior distributions of individual model parameters. Nevertheless, it leads to strong correlations between the parameters in the joint posterior distribution, which greatly improve the model predictions in subsequent Monte Carlo simulations.

Conclusion

We present a standardised method to translate metabolic networks into dynamic models. To determine the model parameters, evidence from various experimental data is combined and weighted using Bayesian parameter estimation. The resulting posterior parameter distribution describes a statistical ensemble of parameter sets; the parameter variances and correlations can account for missing knowledge, measurement uncertainties, or biological variability. The posterior distribution can be used to sample model instances and to obtain probabilistic statements about the model's dynamic behaviour.

3.
Swartz MD  Kimmel M  Mueller P  Amos CI 《Biometrics》2006,62(2):495-503
Mapping the genes for a complex disease, such as diabetes or rheumatoid arthritis (RA), involves finding multiple genetic loci that may contribute to the onset of the disease. Pairwise testing of the loci leads to the problem of multiple testing. Looking at haplotypes, or linear sets of loci, avoids multiple tests but results in a contingency table with sparse counts, especially when using marker loci with multiple alleles. We propose a hierarchical Bayesian model for case-parent triad data that uses a conditional logistic regression likelihood to model the probability of transmission to a diseased child. We define hierarchical prior distributions on the allele main effects to model the genetic dependencies present in the human leukocyte antigen (HLA) region of chromosome 6. First, we add a hierarchical level for model selection that accounts for both locus and allele selection. This allows us to cast the problem of identifying genetic loci relevant to the disease into a problem of Bayesian variable selection. Second, we attempt to include linkage disequilibrium as a covariance structure in the prior for model coefficients. We evaluate the performance of the procedure with some simulated examples and then apply our procedure to identifying genetic markers in the HLA region that influence risk for RA. Our software is available on the website http://www.epigenetic.org/Linkage/ssgs-public/.

4.
Summary. A variety of flexible approaches have been proposed for functional data analysis, allowing both the mean curve and the distribution about the mean to be unknown. Such methods are most useful when there is limited prior information. Motivated by applications to modeling of temperature curves in the menstrual cycle, this article proposes a flexible approach for incorporating prior information in semiparametric Bayesian analyses of hierarchical functional data. The proposed approach is based on specifying the distribution of functions as a mixture of a parametric hierarchical model and a nonparametric contamination. The parametric component is chosen based on prior knowledge, while the contamination is characterized as a functional Dirichlet process. In the motivating application, the contamination component allows unanticipated curve shapes in unhealthy menstrual cycles. Methods are developed for posterior computation, and the approach is applied to data from a European fecundability study.

5.
Bayesian adaptive Markov chain Monte Carlo estimation of genetic parameters   Cited by: 2 (self-citations: 0, citations by others: 2)
Accurate and fast estimation of genetic parameters that underlie quantitative traits using mixed linear models with additive and dominance effects is of great importance in both natural and breeding populations. Here, we propose a new fast adaptive Markov chain Monte Carlo (MCMC) sampling algorithm for the estimation of genetic parameters in the linear mixed model with several random effects. In the learning phase of our algorithm, we use the hybrid Gibbs sampler to learn the covariance structure of the variance components. In the second phase of the algorithm, we use this covariance structure to formulate an effective proposal distribution for a Metropolis-Hastings algorithm, which uses a likelihood function in which the random effects have been integrated out. Compared with the hybrid Gibbs sampler, the new algorithm had better mixing properties and was approximately twice as fast to run. Our new algorithm was able to detect different modes in the posterior distribution. In addition, the posterior mode estimates from the adaptive MCMC method were close to the REML (residual maximum likelihood) estimates. Moreover, our exponential prior for inverse variance components was vague and enabled the estimated mode of the posterior variance to be practically zero, which was in agreement with the support from the likelihood (in the case of no dominance). The method performance is illustrated using simulated data sets with replicates and field data in barley.
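
The two-phase idea in this abstract (learn the spread of the target first, then use it to shape a Metropolis-Hastings proposal) can be sketched on a toy problem. The sketch below is not the authors' algorithm: it swaps the hybrid Gibbs learning phase for a plain random-walk pilot run, and it uses a one-dimensional normal target in place of the marginal likelihood of a mixed model. The target, the function names, and the scaling constant 2.38 (a common rule of thumb for one-dimensional random-walk proposals) are all assumptions.

```python
import math
import random
import statistics


def log_target(x):
    """Toy target: unnormalised log density of Normal(mean=3, sd=2)."""
    return -0.5 * ((x - 3.0) / 2.0) ** 2


def random_walk_metropolis(log_p, start, step, n_samples, rng):
    """Random-walk Metropolis sampler with a Gaussian proposal of scale `step`."""
    x, log_px = start, log_p(start)
    samples = []
    for _ in range(n_samples):
        candidate = x + rng.gauss(0.0, step)
        log_pc = log_p(candidate)
        # Accept with probability min(1, p(candidate) / p(x)).
        if rng.random() < math.exp(min(0.0, log_pc - log_px)):
            x, log_px = candidate, log_pc
        samples.append(x)
    return samples


def two_phase_sampler(log_p, start, n_pilot, n_main, seed=0):
    """Phase 1 learns the target's spread; phase 2 reuses it in the proposal."""
    rng = random.Random(seed)
    pilot = random_walk_metropolis(log_p, start, 1.0, n_pilot, rng)
    learned_sd = statistics.stdev(pilot[n_pilot // 2:])  # drop early pilot draws
    tuned_step = 2.38 * learned_sd
    return random_walk_metropolis(log_p, pilot[-1], tuned_step, n_main, rng)


if __name__ == "__main__":
    draws = two_phase_sampler(log_target, start=0.0, n_pilot=2000, n_main=20000)
    print(statistics.mean(draws), statistics.stdev(draws))
```

In several dimensions the same pattern would use the covariance matrix learned in the first phase rather than a single standard deviation.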

6.
Yi N  George V  Allison DB 《Genetics》2003,164(3):1129-1138
In this article, we utilize stochastic search variable selection methodology to develop a Bayesian method for identifying multiple quantitative trait loci (QTL) for complex traits in experimental designs. The proposed procedure entails embedding multiple regression in a hierarchical normal mixture model, where latent indicators for all markers are used to identify the multiple markers. The markers with significant effects can be identified as those with a higher posterior probability of inclusion in the model. A simple and easy-to-use Gibbs sampler is employed to generate samples from the joint posterior distribution of all unknowns including the latent indicators, genetic effects for all markers, and other model parameters. The proposed method was evaluated using simulated data and illustrated using a real data set. The results demonstrate that the proposed method works well under typical situations of most QTL studies in terms of number of markers and marker density.

7.
Existing methods for analyzing nucleotide diversity require investigators to identify relevant hierarchical levels before beginning the analysis. We describe a method that partitions diversity into hierarchical components while allowing any structure present in the data to emerge naturally. We present an unbiased version of Nei's nucleotide diversity statistics and show that our modification has the same properties as Wright's F(ST). We compare its statistical properties with several other F(ST) estimators, and we describe how to use these statistics to produce a rooted tree of relationships among the sampled populations in which the mean time to coalescence of haplotypes drawn from populations belonging to the same node is smaller than the mean time to coalescence of haplotypes drawn from populations belonging to different nodes. We illustrate the method by applying it to data from a recent survey of restriction site variation in the chloroplast genome of Coreopsis grandiflora.

8.
Recent advances in big data and analytics research have provided a wealth of large data sets that are too big to be analyzed in their entirety, due to restrictions on computer memory or storage size. New Bayesian methods have been developed for data sets that are large only due to large sample sizes. These methods partition big data sets into subsets and perform independent Bayesian Markov chain Monte Carlo analyses on the subsets. The methods then combine the independent subset posterior samples to estimate a posterior density given the full data set. These approaches were shown to be effective for Bayesian models including logistic regression models, Gaussian mixture models and hierarchical models. Here, we introduce the R package parallelMCMCcombine which carries out four of these techniques for combining independent subset posterior samples. We illustrate each of the methods using a Bayesian logistic regression model for simulation data and a Bayesian Gamma model for real data; we also demonstrate features and capabilities of the R package. The package assumes the user has carried out the Bayesian analysis and has produced the independent subposterior samples outside of the package. The methods are primarily suited to models with unknown parameters of fixed dimension that exist in continuous parameter spaces. We envision this tool will allow researchers to explore the various methods for their specific applications and will assist future progress in this rapidly developing field.
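
To make the combination step concrete, here is a minimal sketch of one simple rule for a single scalar parameter: average the draws across subsets, weighting each subset by the inverse of its sample variance. The abstract does not name the four techniques, so this rule is an assumption chosen for illustration and is not presented as the package's API; the sketch is in Python rather than R, and the function name is hypothetical.

```python
import statistics


def combine_by_precision_weighting(subsamples):
    """Combine subset posterior draws of one scalar parameter.

    Each element of `subsamples` is a list of MCMC draws from one data subset.
    The i-th combined draw is the weighted average of the i-th draws, with
    weights proportional to the inverse sample variance of each subset.
    """
    weights = [1.0 / statistics.variance(draws) for draws in subsamples]
    total = sum(weights)
    n_draws = min(len(draws) for draws in subsamples)
    return [
        sum(w * draws[i] for w, draws in zip(weights, subsamples)) / total
        for i in range(n_draws)
    ]


if __name__ == "__main__":
    subset_a = [1.0, 2.0, 3.0]  # sample variance 1.0, so weight 1.0
    subset_b = [2.0, 4.0, 6.0]  # sample variance 4.0, so weight 0.25
    print(combine_by_precision_weighting([subset_a, subset_b]))
```

As in the abstract, the sketch assumes the subset analyses have already been run elsewhere and only the draws are passed in.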

9.
Dunson DB  Perreault SD 《Biometrics》2001,57(1):302-308
This article describes a general class of factor analytic models for the analysis of clustered multivariate data in the presence of informative missingness. We assume that there are distinct sets of cluster-level latent variables related to the primary outcomes and to the censoring process, and we account for dependency between these latent variables through a hierarchical model. A linear model is used to relate covariates and latent variables to the primary outcomes for each subunit. A generalized linear model accounts for covariate and latent variable effects on the probability of censoring for subunits within each cluster. The model accounts for correlation within clusters and within subunits through a flexible factor analytic framework that allows multiple latent variables and covariate effects on the latent variables. The structure of the model facilitates implementation of Markov chain Monte Carlo methods for posterior estimation. Data from a spermatotoxicity study are analyzed to illustrate the proposed approach.

10.
A Bayesian method for fine mapping is presented, which deals with multiallelic markers (with two or more alleles), unknown phase, missing data, multiple causal variants, and both continuous and binary phenotypes. We consider small chromosomal segments spanned by a dense set of closely linked markers and putative genes only at marker points. In the phenotypic model, locus-specific indicator variables are used to control inclusion in or exclusion from marker contributions. To account for covariance between consecutive loci and to control fluctuations in association signals along a candidate region we introduce a joint prior for the indicators that depends on genetic or physical map distances. The potential of the method, including posterior estimation of trait-associated loci, their effects, linkage disequilibrium pattern due to close linkage of loci, and the age of a causal variant (time to most recent common ancestor), is illustrated with the well-known cystic fibrosis and Friedreich ataxia data sets by assuming that haplotypes were not available. In addition, simulation analysis with large genetic distances is shown. Estimation of model parameters is based on Markov chain Monte Carlo (MCMC) sampling and is implemented using WinBUGS. The model specification code is freely available for research purposes from http://www.rni.helsinki.fi/~mjs/.

11.
Kitada S  Kishino H 《Genetics》2004,167(4):2003-2013
We propose a new method for simultaneously detecting linkage disequilibrium and genetic structure in subdivided populations. Taking subpopulation structure into account with a hierarchical model, we estimate the magnitude of genetic differentiation and linkage disequilibrium in a metapopulation on the basis of geographical samples, rather than decompose a population into a finite number of random-mating subpopulations. We assume that Hardy-Weinberg equilibrium is satisfied in each locality, but do not assume independence between marker loci. Linkage states remain unknown. Genetic differentiation and linkage disequilibrium are expressed as hyperparameters describing the prior distribution of genotypes or haplotypes. We estimate related parameters by maximizing marginal-likelihood functions and detect linkage equilibrium or disequilibrium by the Akaike information criterion. Our empirical Bayesian model analyzes genotype and haplotype frequencies regardless of haploid or diploid data, so it can be applied to most commonly used genetic markers. The performance of our procedure is examined via numerical simulations in comparison with classical procedures. Finally, we analyze isozyme data of ayu, a severely exploited fish species, and single-nucleotide polymorphisms in human ALDH2.

12.
Choi SC  Hey J 《Genetics》2011,189(2):561-577
A new approach to assigning individuals to populations using genetic data is described. Most existing methods work by maximizing Hardy-Weinberg and linkage equilibrium within populations, neither of which will apply for many demographic histories. By including a demographic model, within a likelihood framework based on coalescent theory, we can jointly study demographic history and population assignment. Genealogies and population assignments are sampled from a posterior distribution using a general isolation-with-migration model for multiple populations. A measure of partition distance between assignments facilitates not only the summary of a posterior sample of assignments, but also the estimation of the posterior density for the demographic history. It is shown that joint estimates of assignment and demographic history are possible, including estimation of population phylogeny for samples from three populations. The new method is compared to results of a widely used assignment method, using simulated and published empirical data sets.

13.
A new method is presented for inferring evolutionary trees using nucleotide sequence data. The birth-death process is used as a model of speciation and extinction to specify the prior distribution of phylogenies and branching times. Nucleotide substitution is modeled by a continuous-time Markov process. Parameters of the branching model and the substitution model are estimated by maximum likelihood. The posterior probabilities of different phylogenies are calculated and the phylogeny with the highest posterior probability is chosen as the best estimate of the evolutionary relationship among species. We refer to this as the maximum posterior probability (MAP) tree. The posterior probability provides a natural measure of the reliability of the estimated phylogeny. Two example data sets are analyzed to infer the phylogenetic relationship of human, chimpanzee, gorilla, and orangutan. The best trees estimated by the new method are the same as those from the maximum likelihood analysis of separate topologies, but the posterior probabilities are quite different from the bootstrap proportions. The results of the method are found to be insensitive to changes in the rate parameter of the branching process. Correspondence to: Z. Yang

14.
RFLP haplotypes at the alpha-globin gene complex have been examined in 190 individuals from the Niokolo Mandenka population of Senegal: haplotypes were assigned unambiguously for 210 chromosomes. The Mandenka share with other African populations a sample size-independent haplotype diversity that is much greater than that in any non-African population: the number of haplotypes observed in the Mandenka is typically twice that seen in the non-African populations sampled to date. Of these haplotypes, 17.3% had not been observed in any previous surveys, and a further 19.1% have previously been reported only in African populations. The haplotype distribution shows clear differences between African and non-African peoples, but this is on the basis of population-specific haplotypes combined with haplotypes common to all. The relationship of the newly reported haplotypes to those previously recorded suggests that several mutation processes, particularly recombination as homologous exchange or gene conversion, have been involved in their production. A computer program based on the expectation-maximization (EM) algorithm was used to obtain maximum-likelihood estimates of haplotype frequencies for the entire data set: good concordance between the unambiguous and EM-derived sets was seen for the overall haplotype frequencies. Some of the low-frequency haplotypes reported by the estimation algorithm differ greatly, in structure, from those haplotypes known to be present in human populations, and they may not represent haplotypes actually present in the sample.
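
The EM step mentioned in this abstract can be sketched for unphased biallelic genotypes. This is a generic textbook-style EM under Hardy-Weinberg proportions, not the program used in the study; the site encoding (0 and 1 homozygous, 2 heterozygous), the fixed iteration count, and the function names are assumptions.

```python
from itertools import product


def phasings(genotype):
    """List the unordered haplotype pairs compatible with one genotype.

    Assumed encoding per site: 0 or 1 = homozygous, 2 = heterozygous.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    n_free = max(len(het_sites) - 1, 0)
    pairs = []
    for bits in product((0, 1), repeat=n_free):
        h1, h2 = list(genotype), list(genotype)
        assignment = (0,) + bits if het_sites else ()
        for site, bit in zip(het_sites, assignment):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    options = [phasings(g) for g in genotypes]
    haplotypes = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = dict.fromkeys(haplotypes, 0.0)
        for opts in options:
            # E-step: posterior weight of each phasing given current frequencies.
            weights = [freq[a] * freq[b] for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freq


if __name__ == "__main__":
    sample = [(0, 0)] * 3 + [(1, 1)] * 3 + [(2, 2)]
    for haplotype, f in em_haplotype_frequencies(sample).items():
        print(haplotype, round(f, 4))
```

In this toy sample the only ambiguous individual is the double heterozygote, and EM resolves it toward the haplotypes already common among the unambiguous individuals. The abstract's caution applies here too: a low-frequency haplotype with nonzero estimated frequency need not be present in the sample.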

15.
Inferences of population structure and more precisely the identification of genetically homogeneous groups of individuals are essential to the fields of ecology, evolutionary biology and conservation biology. Such population structure inferences are routinely investigated via the program STRUCTURE, which implements a Bayesian algorithm to identify groups of individuals at Hardy–Weinberg and linkage equilibrium. While the method performs relatively well under various population models with even sampling between subpopulations, the robustness of the method to uneven sample size between subpopulations and/or hierarchical levels of population structure has not yet been tested despite being commonly encountered in empirical data sets. In this study, I used simulated and empirical microsatellite data sets to investigate the impact of uneven sample size between subpopulations and/or hierarchical levels of population structure on the detected population structure. The results demonstrated that uneven sampling often leads to incorrect inferences of hierarchical structure and downward-biased estimates of the true number of subpopulations. Distinct subpopulations with reduced sampling tended to be merged together, while at the same time, individuals from extensively sampled subpopulations were generally split, despite belonging to the same panmictic population. Four new supervised methods to detect the number of clusters were developed and tested as part of this study and were found to outperform the existing methods using both evenly and unevenly sampled data sets. Additionally, a subsampling strategy aiming to reduce sampling unevenness between subpopulations is presented and tested. These results altogether demonstrate that when sampling evenness is accounted for, the detection of the correct population structure is greatly improved.

16.
Although many algorithms exist for estimating haplotypes from genotype data, none of them take full account of both the decay of linkage disequilibrium (LD) with distance and the order and spacing of genotyped markers. Here, we describe an algorithm that does take these factors into account, using a flexible model for the decay of LD with distance that can handle both "blocklike" and "nonblocklike" patterns of LD. We compare the accuracy of this approach with a range of other available algorithms in three ways: for reconstruction of randomly paired, molecularly determined male X chromosome haplotypes; for reconstruction of haplotypes obtained from trios in an autosomal region; and for estimation of missing genotypes in 50 autosomal genes that have been completely resequenced in 24 African Americans and 23 individuals of European descent. For the autosomal data sets, our new approach clearly outperforms the best available methods, whereas its accuracy in inferring the X chromosome haplotypes is only slightly superior. For estimation of missing genotypes, our method performed slightly better when the two subsamples were combined than when they were analyzed separately, which illustrates its robustness to population stratification. Our method is implemented in the software package PHASE (v2.1.1), available from the Stephens Lab Web site.

17.
Analysis of DNA Diversity by Spatial Autocorrelation   Cited by: 11 (self-citations: 1, citations by others: 10)
G. Bertorelle  G. Barbujani 《Genetics》1995,140(2):811-819
Two statistics are proposed for summarizing spatial patterns of DNA diversity. These autocorrelation indices for DNA analysis, or AIDAs, can be applied to RFLP and sequence data; the resulting set of autocorrelation coefficients, or correlogram, measures whether, and to what extent, individual DNA sequences or haplotypes resemble the haplotypes sampled at arbitrarily chosen spatial distances. Analyses of computer-generated sets of data, and of RFLP data from two natural populations, show that AIDAs allow one to objectively and simply identify basic patterns in the spatial distribution of haplotypes. These statistics, therefore, seem to be a useful tool both to explore the genetic structure of a population and to suggest hypotheses on the evolutionary processes that shaped the observed patterns.

18.
Estimating species trees using multiple-allele DNA sequence data   Cited by: 3 (self-citations: 0, citations by others: 3)
Several techniques, such as concatenation and consensus methods, are available for combining data from multiple loci to produce a single statement of phylogenetic relationships. However, when multiple alleles are sampled from individual species, it becomes more challenging to estimate relationships at the level of species, either because concatenation becomes inappropriate due to conflicts among individual gene trees, or because the species from which multiple alleles have been sampled may not form monophyletic groups in the estimated tree. We propose a Bayesian hierarchical model to reconstruct species trees from multiple-allele, multilocus sequence data, building on a recently proposed method for estimating species trees from single allele multilocus data. A two-step Markov chain Monte Carlo (MCMC) algorithm is adopted to estimate the posterior distribution of the species tree. The model is applied to estimate the posterior distribution of species trees for two multiple-allele datasets: yeast (Saccharomyces) and birds (Manacus manakins). The estimates of the species trees using our method are consistent with those inferred from other methods and genetic markers, but in contrast to other species tree methods, it provides credible regions for the species tree. The Bayesian approach described here provides a powerful framework for statistical testing and integration of population genetics and phylogenetics.

19.
I introduce the software JML that tests for the presence of hybridization in multispecies sequence data sets by posterior predictive checking following Joly, McLenachan and Lockhart (2009, American Naturalist 174, e54). Although their method could potentially be applied to any data set, the lack of appropriate software made its application difficult. The software JML thus fills a need for an easy application of the method but also includes improvements such as the possibility to incorporate uncertainty in the species tree topology. The JML software uses a posterior distribution of species trees, population sizes and branch lengths to simulate replicate sequence data sets using the coalescent with no migration. A test quantity, defined as the minimum pairwise sequence distance between sequences of two species, is then evaluated on the simulated data sets and compared to the one estimated from the original data. Because the test quantity is a good predictor of hybridization events, departure from the bifurcating species tree model could be interpreted as evidence of hybridization. Software performance in terms of computing time is evaluated for several parameters. I also show an application example of the software for detecting hybridization among native diploid North American roses.
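
The test quantity described in this abstract is simple enough to sketch. The code below is not JML: it computes the minimum pairwise distance between the sequences of two species and compares an observed value with values from replicate data sets that are assumed to have been simulated elsewhere. The use of the uncorrected proportion of differing sites as the distance, and all function names, are assumptions.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def min_interspecies_distance(species_a, species_b):
    """Smallest distance between any sequence of one species and any of the other."""
    return min(p_distance(a, b) for a in species_a for b in species_b)


def posterior_predictive_p_value(observed, simulated):
    """Fraction of simulated test quantities that are at most the observed one.

    A small value means the observed minimum distance is unusually short
    compared with what the no-migration model produces.
    """
    return sum(s <= observed for s in simulated) / len(simulated)


if __name__ == "__main__":
    species_a = ["ACGT", "ACGA"]
    species_b = ["TCGA", "TTGA"]
    observed = min_interspecies_distance(species_a, species_b)
    # Hypothetical minimum distances from four simulated replicate data sets.
    simulated = [0.5, 0.25, 0.75, 1.0]
    print(observed, posterior_predictive_p_value(observed, simulated))
```

The simulation of replicate data sets from the posterior of species trees, which is the computationally heavy part of the method, is left out of this sketch.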

20.
We describe four extensions to existing Bayesian methods for the analysis of genetic structure in populations: (i) use of beta distributions to approximate the posterior distribution of f and theta(B); (ii) use of an entropy statistic to describe the amount of information about a parameter derived from the data; (iii) use of the Deviance Information Criterion (DIC) as a model choice criterion for determining whether there is evidence for inbreeding within populations or genetic differentiation among populations; and (iv) use of samples from the posterior distributions for f and theta(B) derived from different data sets to determine whether the estimates are consistent with one another. We illustrate each of these extensions by applying them to data derived from previous allozyme and random amplified polymorphic DNA surveys of an endangered orchid, Platanthera leucophaea, and we conclude that differences in theta(B) from the two data sets may represent differences in the underlying mutational processes.
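
Extension (i) can be illustrated by fitting a beta distribution to posterior draws of a parameter that lies between 0 and 1. The abstract does not say how the beta parameters are chosen, so the method-of-moments fit below is an assumption, as are the function names.

```python
import statistics


def beta_from_moments(mean, variance):
    """Return (alpha, beta) of the beta distribution with this mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    common = mean * (1.0 - mean) / variance - 1.0
    return mean * common, (1.0 - mean) * common


def beta_from_samples(samples):
    """Moment-match a beta distribution to posterior draws in (0, 1)."""
    return beta_from_moments(statistics.mean(samples), statistics.variance(samples))


if __name__ == "__main__":
    # Beta(1, 3) has mean 0.25 and variance 0.0375, so the fit recovers (1, 3).
    print(beta_from_moments(0.25, 0.0375))
```

A fitted pair of beta parameters gives a compact summary of each posterior, which makes it easier to compare estimates obtained from different data sets, as in extension (iv).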
