首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Beyer J  May B 《Molecular ecology》2003,12(8):2243-2250
We present an algorithm to partition a single generation of individuals into full-sib families using single-locus co-dominant marker data. Pairwise likelihood ratios are used to create a graph that represents the full-sib relationships within the data set. Connected-component and minimum-cut algorithms from the graph theory are then employed to find the full-sib families within the graph. The results of a large-scale simulation study show that the algorithm is able to produce accurate partitions when applied to data sets with eight or more loci. Although the algorithm performs best when the distribution of allele frequencies and family sizes in a data set is uniform, the inclusion of more loci or alleles per locus allows accurate partitions to be created from data sets in which these distributions are highly skewed.  相似文献   

2.
Genotype calling procedures vary from laboratory to laboratory for many microsatellite markers. Even within the same laboratory, application of different experimental protocols often leads to ambiguities. The impact of these ambiguities ranges from irksome to devastating. Resolving the ambiguities can increase effective sample size and preserve evidence in favor of disease-marker associations. Because different data sets may contain different numbers of alleles, merging is unfortunately not a simple process of matching alleles one to one. Merging data sets manually is difficult, time-consuming, and error-prone due to differences in genotyping hardware, binning methods, molecular weight standards, and curve fitting algorithms. Merging is particularly difficult if few or no samples occur in common, or if samples are drawn from ethnic groups with widely varying allele frequencies. It is dangerous to align alleles simply by adding a constant number of base pairs to the alleles of one of the data sets. To address these issues, we have developed a Bayesian model and a Markov chain Monte Carlo (MCMC) algorithm for sampling the posterior distribution under the model. Our computer program, MicroMerge, implements the algorithm and almost always accurately and efficiently finds the most likely correct alignment. Common allele frequencies across laboratories in the same ethnic group are the single most important cue in the model. MicroMerge computes the allelic alignments with the greatest posterior probabilities under several merging options. It also reports when data sets cannot be confidently merged. These features are emphasized in our analysis of simulated and real data.  相似文献   

3.
We study the problem of reconstructing haplotype configurations from genotypes on pedigree data with missing alleles under the Mendelian law of inheritance and the minimum-recombination principle, which is important for the construction of haplotype maps and genetic linkage/association analyses. Our previous results show that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This paper presents an effective integer linear programming (ILP) formulation of the MRHC problem with missing data and a branch-and-bound strategy that utilizes a partial order relationship and some other special relationships among variables to decide the branching order. Nontrivial lower and upper bounds on the optimal number of recombinants are introduced at each branching node to effectively prune the search tree. When multiple solutions exist, a best haplotype configuration is selected based on a maximum likelihood approach. The paper also shows for the first time how to incorporate marker interval distance into a rule-based haplotyping algorithm. Our results on simulated data show that the algorithm could recover haplotypes with 50 loci from a pedigree of size 29 in seconds on a Pentium IV computer. Its accuracy is more than 99.8% for data with no missing alleles and 98.3% for data with 20% missing alleles in terms of correctly recovered phase information at each marker locus. A comparison with a statistical approach SimWalk2 on simulated data shows that the ILP algorithm runs much faster than SimWalk2 and reports better or comparable haplotypes on average than the first and second runs of SimWalk2. As an application of the algorithm to real data, we present some test results on reconstructing haplotypes from a genome-scale SNP dataset consisting of 12 pedigrees that have 0.8% to 14.5% missing alleles.  相似文献   

4.
The evolutionary tree reconstruction algorithm called SEMPHY using structural expectation maximization (SEM) is an efficient approach but has local optimality problem. To improve SEMPHY, a new algorithm named HSEMPHY based on the homotopy continuation principle is proposed in the present study for reconstructing evolutionary trees. The HSEMPHY algorithm computes the condition probability of hidden variables in the structural through maximum entropy principle. It can reduce the influence of the initial value of the final resolution by simulating the process of the homotopy principle and by introducing the homotopy parameter beta. HSEMPHY is tested on real datasets and simulated dataset to compare with SEMPHY and the two most popular reconstruction approaches PHYML and RAXML. Experimental results show that HSEMPHY is at least as good as PHYML and RAXML and is very robust to poor starting trees.  相似文献   

5.
Maximum likelihood haplotyping for general pedigrees   总被引:3,自引:0,他引:3  
Haplotype data is valuable in mapping disease-susceptibility genes in the study of Mendelian and complex diseases. We present algorithms for inferring a most likely haplotype configuration for general pedigrees, implemented in the newest version of the genetic linkage analysis system SUPERLINK. In SUPERLINK, genetic linkage analysis problems are represented internally using Bayesian networks. The use of Bayesian networks enables efficient maximum likelihood haplotyping for more complex pedigrees than was previously possible. Furthermore, to support efficient haplotyping for larger pedigrees, we have also incorporated a novel algorithm for determining a better elimination order for the variables of the Bayesian network. The presented optimization algorithm also improves likelihood computations. We present experimental results for the new algorithms on a variety of real and semiartificial data sets, and use our software to evaluate MCMC approximations for haplotyping.  相似文献   

6.
F Qin  A Auerbach    F Sachs 《Biophysical journal》1996,70(1):264-280
We present here a maximal likelihood algorithm for estimating single-channel kinetic parameters from idealized patch-clamp data. The algorithm takes into account missed events caused by limited time resolution of the recording system. Assuming a fixed dead time, we derive an explicit expression for the corrected transition rate matrix by generalizing the theory of Roux and Sauve (1985, Biophys. J. 48:149-158) to the case of multiple conductance levels. We use a variable metric optimizer with analytical derivatives for rapidly maximizing the likelihood. The algorithm is applicable to data containing substates and multiple identical or nonidentical channels. It allows multiple data sets obtained under different experimental conditions, e.g., concentration, voltage, and force, to be fit simultaneously. It also permits a variety of constraints on rate constants and provides standard errors for all estimates of model parameters. The algorithm has been tested extensively on a variety of kinetic models with both simulated and experimental data. It is very efficient and robust; rate constants for a multistate model can often be extracted in a processing time of approximately 1 min, largely independent of the starting values.  相似文献   

7.
The analysis of nonlinear function-valued characters is very important in genetic studies, especially for growth traits of agricultural and laboratory species. Inference in nonlinear mixed effects models is, however, quite complex and is usually based on likelihood approximations or Bayesian methods. The aim of this paper was to present an efficient stochastic EM procedure, namely the SAEM algorithm, which is much faster to converge than the classical Monte Carlo EM algorithm and Bayesian estimation procedures, does not require specification of prior distributions and is quite robust to the choice of starting values. The key idea is to recycle the simulated values from one iteration to the next in the EM algorithm, which considerably accelerates the convergence. A simulation study is presented which confirms the advantages of this estimation procedure in the case of a genetic analysis. The SAEM algorithm was applied to real data sets on growth measurements in beef cattle and in chickens. The proposed estimation procedure, as the classical Monte Carlo EM algorithm, provides significance tests on the parameters and likelihood based model comparison criteria to compare the nonlinear models with other longitudinal methods.  相似文献   

8.
The generalized Gibbs sampler (GGS) is a recently developed Markov chain Monte Carlo (MCMC) technique that enables Gibbs-like sampling of state spaces that lack a convenient representation in terms of a fixed coordinate system. This paper describes a new sampler, called the tree sampler, which uses the GGS to sample from a state space consisting of phylogenetic trees. The tree sampler is useful for a wide range of phylogenetic applications, including Bayesian, maximum likelihood, and maximum parsimony methods. A fast new algorithm to search for a maximum parsimony phylogeny is presented, using the tree sampler in the context of simulated annealing. The mathematics underlying the algorithm is explained and its time complexity is analyzed. The method is tested on two large data sets consisting of 123 sequences and 500 sequences, respectively. The new algorithm is shown to compare very favorably in terms of speed and accuracy to the program DNAPARS from the PHYLIP package.  相似文献   

9.
Bayesian adaptive Markov chain Monte Carlo estimation of genetic parameters   总被引:2,自引:0,他引:2  
Accurate and fast estimation of genetic parameters that underlie quantitative traits using mixed linear models with additive and dominance effects is of great importance in both natural and breeding populations. Here, we propose a new fast adaptive Markov chain Monte Carlo (MCMC) sampling algorithm for the estimation of genetic parameters in the linear mixed model with several random effects. In the learning phase of our algorithm, we use the hybrid Gibbs sampler to learn the covariance structure of the variance components. In the second phase of the algorithm, we use this covariance structure to formulate an effective proposal distribution for a Metropolis-Hastings algorithm, which uses a likelihood function in which the random effects have been integrated out. Compared with the hybrid Gibbs sampler, the new algorithm had better mixing properties and was approximately twice as fast to run. Our new algorithm was able to detect different modes in the posterior distribution. In addition, the posterior mode estimates from the adaptive MCMC method were close to the REML (residual maximum likelihood) estimates. Moreover, our exponential prior for inverse variance components was vague and enabled the estimated mode of the posterior variance to be practically zero, which was in agreement with the support from the likelihood (in the case of no dominance). The method performance is illustrated using simulated data sets with replicates and field data in barley.  相似文献   

10.
Large sample theory of semiparametric models based on maximum likelihood estimation (MLE) with shape constraint on the nonparametric component is well studied. Relatively less attention has been paid to the computational aspect of semiparametric MLE. The computation of semiparametric MLE based on existing approaches such as the expectation‐maximization (EM) algorithm can be computationally prohibitive when the missing rate is high. In this paper, we propose a computational framework for semiparametric MLE based on an inexact block coordinate ascent (BCA) algorithm. We show theoretically that the proposed algorithm converges. This computational framework can be applied to a wide range of data with different structures, such as panel count data, interval‐censored data, and degradation data, among others. Simulation studies demonstrate favorable performance compared with existing algorithms in terms of accuracy and speed. Two data sets are used to illustrate the proposed computational method. We further implement the proposed computational method in R package BCA1SG , available at CRAN.  相似文献   

11.
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.  相似文献   

12.
We examine the impact of likelihood surface characteristics on phylogenetic inference. Amino acid data sets simulated from topologies with branch length features chosen to represent varying degrees of difficulty for likelihood maximization are analyzed. We present situations where the tree found to achieve the global maximum in likelihood is often not equal to the true tree. We use the program covSEARCH to demonstrate how the use of adaptively sized pools of candidate trees that are updated using confidence tests results in solution sets that are highly likely to contain the true tree. This approach requires more computation than traditional maximum likelihood methods, hence covSEARCH is best suited to small to medium-sized alignments or large alignments with some constrained nodes. The majority rule consensus tree computed from the confidence sets also proves to be different from the generating topology. Although low phylogenetic signal in the input alignment can result in large confidence sets of trees, some biological information can still be obtained based on nodes that exhibit high support within the confidence set. Two real data examples are analyzed: mammal mitochondrial proteins and a small tubulin alignment. We conclude that the technique of confidence set optimization can significantly improve the robustness of phylogenetic inference at a reasonable computational cost. Additionally, when either very short internal branches or very long terminal branches are present, confident resolution of specific bipartitions or subtrees, rather than whole-tree phylogenies, may be the most realistic goal for phylogenetic methods. [Reviewing Editor: Dr. Nicolas Galtier]  相似文献   

13.
Stephens and Donnelly have introduced a simple yet powerful importance sampling scheme for computing the likelihood in population genetic models. Fundamental to the method is an approximation to the conditional probability of the allelic type of an additional gene, given those currently in the sample. As noted by Li and Stephens, the product of these conditional probabilities for a sequence of draws that gives the frequency of allelic types in a sample is an approximation to the likelihood, and can be used directly in inference. The aim of this note is to demonstrate the high level of accuracy of "product of approximate conditionals" (PAC) likelihood when used with microsatellite data. Results obtained on simulated microsatellite data show that this strategy leads to a negligible bias over a wide range of the scaled mutation parameter theta. Furthermore, the sampling variance of likelihood estimates as well as the computation time are lower than that obtained with importance sampling on the whole range of theta. It follows that this approach represents an efficient substitute to IS algorithms in computer intensive (e.g. MCMC) inference methods in population genetics.  相似文献   

14.
We consider the problem of reconstructing near-perfect phylogenetic trees using binary character states (referred to as BNPP). A perfect phylogeny assumes that every character mutates at most once in the evolutionary tree, yielding an algorithm for binary character states that is computationally efficient but not robust to imperfections in real data. A near-perfect phylogeny relaxes the perfect phylogeny assumption by allowing at most a constant number of additional mutations. We develop two algorithms for constructing optimal near-perfect phylogenies and provide empirical evidence of their performance. The first simple algorithm is fixed parameter tractable when the number of additional mutations and the number of characters that share four gametes with some other character are constants. The second, more involved algorithm for the problem is fixed parameter tractable when only the number of additional mutations is fixed. We have implemented both algorithms and shown them to be extremely efficient in practice on biologically significant data sets. This work proves the BNPP problem fixed parameter tractable and provides the first practical phylogenetic tree reconstruction algorithms that find guaranteed optimal solutions while being easily implemented and computationally feasible for data sets of biologically meaningful size and complexity.  相似文献   

15.
Vos RA 《Systematic biology》2003,52(3):368-373
The existence of multiple likelihood maxima necessitates algorithms that explore a large part of the tree space. However, because of computational constraints, stepwise addition-based tree-searching methods do not allow for this exploration in reasonable time. Here, I present an algorithm that increases the speed at which the likelihood landscape can be explored. The iterative algorithm combines the computational speed of distance-based tree construction methods to arrive at approximations of the global optimum with the accuracy of optimality criterion based branch-swapping methods to improve on the result of the starting tree. The algorithm moves between local optima by iteratively perturbing the tree landscape through a process of reweighting randomly drawn samples of the underlying sequence data set. Tests on simulated and real data sets demonstrated that the optimal solution obtained using stepwise addition-based heuristic searches was found faster using the algorithm presented here. Tests on a previously published data set that established the presence of tree islands under maximum likelihood demonstrated that the algorithm identifies the same tree islands in a shorter amount of time than that needed using stepwise addition. The algorithm can be readily applied using standard software for phylogenetic inference.  相似文献   

16.
Bayesian statistical methods based on simulation techniques have recently been shown to provide powerful tools for the analysis of genetic population structure. We have previously developed a Markov chain Monte Carlo (MCMC) algorithm for characterizing genetically divergent groups based on molecular markers and geographical sampling design of the dataset. However, for large-scale datasets such algorithms may get stuck to local maxima in the parameter space. Therefore, we have modified our earlier algorithm to support multiple parallel MCMC chains, with enhanced features that enable considerably faster and more reliable estimation compared to the earlier version of the algorithm. We consider also a hierarchical tree representation, from which a Bayesian model-averaged structure estimate can be extracted. The algorithm is implemented in a computer program that features a user-friendly interface and built-in graphics. The enhanced features are illustrated by analyses of simulated data and an extensive human molecular dataset. AVAILABILITY: Freely available at http://www.rni.helsinki.fi/~jic/bapspage.html.  相似文献   

17.
Molecular motors, such as kinesin, myosin, or dynein, convert chemical energy into mechanical energy by hydrolyzing ATP. The mechanical energy is used for moving in discrete steps along the cytoskeleton and carrying a molecular load. High resolution single molecule recordings of motor steps appear as a stochastic sequence of dwells, resembling a staircase. Staircase data can also be obtained from other molecular machines such as F1 -ATPase, RNA polymerase, or topoisomerase. We developed a maximum likelihood algorithm that estimates the rate constants between different conformational states of the protein, including motor steps. We model the motor with a periodic Markov model that reflects the repetitive chemistry of the motor step. We estimated the kinetics from the idealized dwell-sequence by numerical maximization of the likelihood function for discrete-time Markov models. This approach eliminates the need for missed event correction. The algorithm can fit kinetic models of arbitrary complexity, such as uniform or alternating step chemistry, reversible or irreversible kinetics, ATP concentration and mechanical force-dependent rates, etc. The method allows global fitting across stationary and nonstationary experimental conditions, and user-defined a priori constraints on rate constants. The algorithm was tested with simulated data, and implemented in the free QuB software.  相似文献   

18.
Summary .  We compare two Monte Carlo (MC) procedures, sequential importance sampling (SIS) and Markov chain Monte Carlo (MCMC), for making Bayesian inferences about the unknown states and parameters of state–space models for animal populations. The procedures were applied to both simulated and real pup count data for the British grey seal metapopulation, as well as to simulated data for a Chinook salmon population. The MCMC implementation was based on tailor-made proposal distributions combined with analytical integration of some of the states and parameters. SIS was implemented in a more generic fashion. For the same computing time MCMC tended to yield posterior distributions with less MC variation across different runs of the algorithm than the SIS implementation with the exception in the seal model of some states and one of the parameters that mixed quite slowly. The efficiency of the SIS sampler greatly increased by analytically integrating out unknown parameters in the observation model. We consider that a careful implementation of MCMC for cases where data are informative relative to the priors sets the gold standard, but that SIS samplers are a viable alternative that can be programmed more quickly. Our SIS implementation is particularly competitive in situations where the data are relatively uninformative; in other cases, SIS may require substantially more computer power than an efficient implementation of MCMC to achieve the same level of MC error.  相似文献   

19.
It is a challenging issue to map Quantitative Trait Loci (QTL) underlying complex discrete traits,which usually show discontinuous distribution and less information,using conventional statisti-cal methods. Bayesian-Markov chain Monte Carlo (Bayesian-MCMC) approach is the key procedure in mapping QTL for complex binary traits,which provides a complete posterior distribution for QTL parameters using all prior information. As a consequence,Bayesian estimates of all interested vari-ables can be obtained straightforwardly basing on their posterior samples simulated by the MCMC algorithm. In our study,utilities of Bayesian-MCMC are demonstrated using simulated several ani-mal outbred full-sib families with different family structures for a complex binary trait underlied by both a QTL and polygene. Under the Identity-by-Descent-Based variance component random model,three samplers basing on MCMC,including Gibbs sampling,Metropolis algorithm and reversible jump MCMC,were implemented to generate the joint posterior distribution of all unknowns so that the QTL parameters were obtained by Bayesian statistical inferring. The results showed that Bayesian-MCMC approach could work well and robust under different family structures and QTL effects. As family size increases and the number of family decreases,the accuracy of the parameter estimates will be im-proved. When the true QTL has a small effect,using outbred population experiment design with large family size is the optimal mapping strategy.  相似文献   

20.
It is a challenging issue to map Quantitative Trait Loci (QTL) underlying complex discrete traits, which usually show discontinuous distribution and less information, using conventional statistical methods. Bayesian-Markov chain Monte Carlo (Bayesian-MCMC) approach is the key procedure in mapping QTL for complex binary traits, which provides a complete posterior distribution for QTL parameters using all prior information. As a consequence, Bayesian estimates of all interested variables can be obtained straightforwardly basing on their posterior samples simulated by the MCMC algorithm. In our study, utilities of Bayesian-MCMC are demonstrated using simulated several animal outbred full-sib families with different family structures for a complex binary trait underlied by both a QTL and polygene. Under the Identity-by-Descent-Based variance component random model, three samplers basing on MCMC, including Gibbs sampling, Metropolis algorithm and reversible jump MCMC, were implemented to generate the joint posterior distribution of all unknowns so that the QTL parameters were obtained by Bayesian statistical inferring. The results showed that Bayesian-MCMC approach could work well and robust under different family structures and QTL effects. As family size increases and the number of family decreases, the accuracy of the parameter estimates will be improved. When the true QTL has a small effect, using outbred population experiment design with large family size is the optimal mapping strategy.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号