Similar Articles (20 results)
1.
Coalescent likelihood is the probability of observing the given population sequences under the coalescent model. Computation of coalescent likelihood under the infinite sites model is a classic problem in coalescent theory. Existing methods are based on either importance sampling or Markov chain Monte Carlo and are inexact. In this paper, we develop a simple method that can compute the exact coalescent likelihood for many data sets of moderate size, including real biological data whose likelihood was previously thought to be difficult to compute exactly. Our method works for both panmictic and subdivided populations. Simulations demonstrate that the practical range of exact coalescent likelihood computation for panmictic populations is significantly larger than was previously believed. We investigate the application of our method in estimating mutation rates by maximum likelihood. A main application of the exact method is comparing the accuracy of approximate methods. To demonstrate the usefulness of the exact method, we evaluate the accuracy of the program Genetree in computing the likelihood for subdivided populations.

2.
The problem of exact conditional inference for discrete multivariate case-control data has two forms. The first is grouped case-control data, where Monte Carlo computations can be done using the importance sampling method of Booth and Butler (1999, Biometrika 86, 321-332), or a proposed alternative sequential importance sampling method. The second form is matched case-control data. For this analysis we propose a new exact sampling method based on the conditional-Poisson distribution for conditional testing with one binary and one integral ordered covariate. This method makes computations on data sets with large numbers of matched sets fast and accurate. We provide a detailed derivation of the constraints and conditional distributions for conditional inference on grouped and matched data. The methods are illustrated on several new and old data sets.

3.
The Exact Test for Cytonuclear Disequilibria
C. J. Basten and M. A. Asmussen, Genetics 146(3):1165-1171 (1997)
We extend the analysis of the statistical properties of cytonuclear disequilibria in two major ways. First, we develop the asymptotic sampling theory for the nonrandom associations between the alleles at a haploid cytoplasmic locus and the alleles and genotypes at a diploid nuclear locus, when there are an arbitrary number of alleles at each marker. This includes the derivation of the maximum likelihood estimators and their sampling variances for each disequilibrium measure, together with simple tests of the null hypothesis of no disequilibrium. In addition to these new asymptotic tests, we provide the first implementation of Fisher's exact test for the genotypic cytonuclear disequilibria and some approximations of the exact test. We also outline an exact test for allelic cytonuclear disequilibria in multiallelic systems. An exact test should be used for data sets when either the marginal frequencies are extreme or the sample size is small. The utility of this new sampling theory is illustrated through applications to recent nuclear-mtDNA and nuclear-cpDNA data sets. The results also apply to population surveys of nuclear loci in conjunction with markers in cytoplasmically inherited microorganisms.
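The core computation behind an exact test of this kind can be sketched as the two-sided Fisher exact test for a 2x2 table: sum the hypergeometric probabilities of all tables with the same margins whose probability does not exceed that of the observed table. This is a minimal illustration of the principle, not the authors' multiallelic implementation.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].
    Enumerates all tables with the observed margins and sums the
    hypergeometric probabilities no larger than the observed table's."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def hyp(x):
        # probability of a table with x in the top-left cell, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = hyp(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # small multiplicative tolerance guards against ties lost to rounding
    return sum(hyp(x) for x in range(lo, hi + 1) if hyp(x) <= p_obs * (1 + 1e-9))
```

For multiallelic cytonuclear tables the same idea applies, but the enumeration is over larger contingency tables and quickly requires network algorithms rather than brute force.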

4.
Codon-based substitution models have been widely used to identify amino acid sites under positive selection in comparative analysis of protein-coding DNA sequences. The nonsynonymous-synonymous substitution rate ratio (dN/dS, denoted ω) is used as a measure of selective pressure at the protein level, with ω > 1 indicating positive selection. Statistical distributions are used to model the variation in ω among sites, allowing a subset of sites to have ω > 1 while the rest of the sequence may be under purifying selection with ω < 1. An empirical Bayes (EB) approach is then used to calculate posterior probabilities that a site comes from the site class with ω > 1. Current implementations, however, use the naive EB (NEB) approach and fail to account for sampling errors in maximum likelihood estimates of model parameters, such as the proportions and ω ratios for the site classes. In small data sets lacking information, this approach may lead to unreliable posterior probability calculations. In this paper, we develop a Bayes empirical Bayes (BEB) approach to the problem, which assigns a prior to the model parameters and integrates over their uncertainties. We compare the new and old methods on real and simulated data sets. The results suggest that in small data sets the new BEB method does not generate false positives as did the old NEB approach, while in large data sets it retains the good power of the NEB approach for inferring positively selected sites.
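The NEB step that the paper criticizes is just Bayes' rule at each site with the ML parameter estimates plugged in as if known exactly. A minimal sketch, assuming a two-class model and per-site log-likelihoods as inputs (the function name and input layout are illustrative, not from the paper's software):

```python
import math

def neb_site_posteriors(site_lnL, proportions):
    """Naive empirical Bayes: for each site, the posterior probability that
    it belongs to the positively selected class (omega > 1), plugging in
    point estimates of the class proportions with no allowance for their
    sampling error -- the step BEB replaces by integrating over a prior.
    site_lnL: list of (lnL_neutral, lnL_positive) pairs, one per site."""
    p0, p1 = proportions
    posteriors = []
    for l0, l1 in site_lnL:
        # work on the log scale to avoid underflow for very negative lnL
        m = max(l0, l1)
        w0 = p0 * math.exp(l0 - m)
        w1 = p1 * math.exp(l1 - m)
        posteriors.append(w1 / (w0 + w1))
    return posteriors
```

BEB would replace the fixed `proportions` with a discrete grid prior over the model parameters and average these posteriors over the grid, weighted by the data's support for each grid point.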

5.
The Poincaré plot is a popular two-dimensional time series analysis tool because of its intuitive display of dynamic system behavior. Poincaré plots have been used to visualize heart rate and respiratory pattern variabilities. However, conventional quantitative analysis relies primarily on statistical measurements of the cumulative distribution of points, making it difficult to interpret irregular or complex plots. Moreover, the plots are constructed to reflect highly correlated regions of the time series, reducing the amount of nonlinear information that is presented and thereby hiding potentially relevant features. We propose temporal Poincaré variability (TPV), a novel analysis methodology that uses standard techniques to quantify the temporal distribution of points and to detect nonlinear sources responsible for physiological variability. In addition, the analysis is applied across multiple time delays, yielding a richer insight into system dynamics than the traditional circle return plot. The method is applied to data sets of R-R intervals and to synthetic point process data extracted from the Lorenz time series. The results demonstrate that TPV complements the traditional analysis: it applies more generally (for example, to Poincaré plots with multiple clusters), behaves more consistently than the conventional measures, and can address questions regarding potential structure underlying the variability of a data set.

6.
Inferring speciation times under an episodic molecular clock
We extend our recently developed Markov chain Monte Carlo algorithm for Bayesian estimation of species divergence times to allow variable evolutionary rates among lineages. The method can use heterogeneous data from multiple gene loci and accommodate multiple fossil calibrations. Uncertainties in fossil calibrations are described using flexible statistical distributions. The prior for divergence times for nodes lacking fossil calibrations is specified by use of a birth-death process with species sampling. The prior for lineage-specific substitution rates is specified using either a model with autocorrelated rates among adjacent lineages (based on a geometric Brownian motion model of rate drift) or a model with independent rates among lineages specified by a log-normal probability distribution. We develop an infinite-sites theory, which predicts that when the amount of sequence data approaches infinity, the width of the posterior credibility interval and the posterior mean of divergence times form a perfect linear relationship, with the slope indicating uncertainties in time estimates that cannot be reduced by sequence data alone. Simulations are used to study the influence of among-lineage rate variation and the number of loci sampled on the uncertainty of divergence time estimates. The analysis suggests that posterior time estimates typically involve considerable uncertainties even with an infinite amount of sequence data, and that the reliability and precision of fossil calibrations are critically important to divergence time estimation. We apply our new algorithms to two empirical data sets and compare the results with those obtained in previous Bayesian and likelihood analyses. The results demonstrate the utility of our new algorithms.

7.
The use of parameter-rich substitution models in molecular phylogenetics has been criticized on the basis that these models can cause a reduction both in accuracy and in the ability to discriminate among competing topologies. We have explored the relationship between nucleotide substitution model complexity and nonparametric bootstrap support under maximum likelihood (ML) for six data sets for which the true relationships are known with a high degree of certainty. We also performed equally weighted maximum parsimony analyses in order to assess the effects of ignoring branch length information during tree selection. We observed that maximum parsimony gave the lowest mean estimate of bootstrap support for the correct set of nodes relative to the ML models for every data set except one. For several data sets, we established that the exact distribution used to model among-site rate variation was critical for a successful phylogenetic analysis. Site-specific rate models were shown to perform very poorly relative to gamma and invariable sites models for several of the data sets most likely because of the gross underestimation of branch lengths. The invariable sites model also performed poorly for several data sets where this model had a poor fit to the data, suggesting that addition of the gamma distribution can be critical. Estimates of bootstrap support for the correct nodes often increased under gamma and invariable sites models relative to equal rates models. Our observations are contrary to the prediction that such models cause reduced confidence in phylogenetic hypotheses. Our results raise several issues regarding the process of model selection, and we briefly discuss model selection uncertainty and the role of sensitivity analyses in molecular phylogenetics.
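The nonparametric bootstrap step underlying these support values is simple to state in code: resample alignment columns (sites) with replacement, then re-estimate the tree from each pseudo-replicate. A minimal sketch of the resampling step only (the function name and alignment-as-string layout are illustrative):

```python
import random

def bootstrap_alignment(alignment, seed=0):
    """One nonparametric bootstrap pseudo-replicate of an alignment:
    columns (sites) are drawn with replacement, rows (taxa) kept intact.
    Each replicate would then be re-analysed under the chosen model and
    the frequency of each node across replicates taken as its support."""
    rng = random.Random(seed)
    ncol = len(alignment[0])
    idx = [rng.randrange(ncol) for _ in range(ncol)]
    return [''.join(seq[i] for i in idx) for seq in alignment]
```

Because every replicate reuses only observed columns, the procedure inherits whatever among-site rate signal the original data carry, which is why the substitution model used in the re-analysis matters so much for the resulting support values.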

8.
B. Wu and J. D. Müller, Biophysical Journal 89(4):2721-2735 (2005)
We introduce a new analysis technique for fluorescence fluctuation data. Time-integrated fluorescence cumulant analysis (TIFCA) extracts information from the cumulants of the integrated fluorescence intensity. TIFCA builds on our earlier FCA theory, but in contrast to FCA or photon counting histogram (PCH) analysis is valid for arbitrary sampling times. The motivation for long sampling times lies in the improvement of the signal/noise ratio of the data. Because FCA and PCH theory are not valid in this regime, we first derive a theoretical model of cumulant functions for arbitrary sampling times. TIFCA is the first exact theory that describes the effects of sampling time on fluorescence fluctuation experiments. We calculate factorial cumulants of the photon counts for various sampling times by rebinning of the original data. Fits of the data to models determine the brightness, the occupation number, and the diffusion time of each species. To provide the tools for a rigorous error analysis of TIFCA, expressions for the variance of cumulants are developed and tested. We demonstrate that over a limited range rebinning reduces the relative error of higher order cumulants, and therefore improves the signal/noise ratio. The first four cumulant functions are explicitly calculated and are applied to simple dye systems to test the validity of TIFCA and demonstrate its ability to resolve species.
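The two basic operations the abstract describes, computing factorial cumulants of photon counts and rebinning to longer sampling times, can be sketched directly. This is a generic illustration of the definitions (first two factorial cumulants only), not the TIFCA model functions themselves:

```python
def factorial_cumulants(counts):
    """First two factorial cumulants of a photon-count record.
    For an ideal Poisson source the second factorial cumulant is zero;
    brightness fluctuations from diffusing molecules make it positive."""
    n = len(counts)
    f1 = sum(counts) / n                        # first factorial moment
    f2 = sum(k * (k - 1) for k in counts) / n   # second factorial moment
    return f1, f2 - f1 ** 2

def rebin(counts, m):
    """Sum consecutive groups of m counts: a longer effective sampling
    time, trading time resolution for signal/noise as in TIFCA."""
    return [sum(counts[i:i + m]) for i in range(0, len(counts) - m + 1, m)]
```

In the paper's framework, fitting the measured factorial cumulant functions across a range of rebinned sampling times is what yields brightness, occupation number, and diffusion time per species.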

9.
To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of stopping-time resampling and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data both from infinite sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually diminish the estimate. The method imposes no additional computational burden and is robust to the choice of parameters.

10.
In this paper, we develop a physiological oscillator model of which the output mimics the shape of the R-R interval Poincaré plot. To validate the model, simulations of various nervous conditions are compared with heart rate variability (HRV) data obtained from subjects under each prescribed condition. For a variety of sympathovagal balances, our model generates Poincaré plots that undergo alterations strongly resembling those of actual R-R intervals. By exploiting the oscillator basis of our model, we detail the way that low- and high-frequency modulation of the sinus node translates into R-R interval Poincaré plot shape by way of simulations and analytic results. With the use of our model, we establish that the length and width of a Poincaré plot are a weighted combination of low- and high-frequency power. This provides a theoretical link between frequency-domain spectral analysis techniques and time-domain Poincaré plot analysis. We ascertain the degree to which these principles apply to real R-R intervals by testing the mathematical relationships on a set of data and establish that the principles are clearly evident in actual HRV records.
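The "length" and "width" of the Poincaré plot referred to here are conventionally quantified as SD2 and SD1: the dispersion of the points along and across the line of identity. A minimal sketch of these standard descriptors (this is the common definition, not the paper's oscillator model):

```python
import math
import statistics

def poincare_sd1_sd2(rr):
    """Width (SD1) and length (SD2) of the Poincare plot of successive
    R-R intervals: standard deviations of the points projected across
    and along the line of identity, respectively."""
    pairs = list(zip(rr[:-1], rr[1:]))
    across = [(y - x) / math.sqrt(2) for x, y in pairs]  # short-term variability
    along = [(y + x) / math.sqrt(2) for x, y in pairs]   # long-term variability
    return statistics.pstdev(across), statistics.pstdev(along)
```

In the paper's terms, SD1 and SD2 are each a weighted combination of low- and high-frequency spectral power, which is what links this time-domain picture to frequency-domain HRV analysis.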

11.
A geostatistical perspective on spatial genetic structure may explain methodological issues of quantifying spatial genetic structure and suggest new approaches to addressing them. We use a variogram approach to (i) derive a spatial partitioning of molecular variance, gene diversity, and genotypic diversity for microsatellite data under the infinite allele model (IAM) and the stepwise mutation model (SMM), (ii) develop a weighting of sampling units to reflect ploidy levels or multiple sampling of genets, and (iii) show how variograms summarize the spatial genetic structure within a population under isolation-by-distance. The methods are illustrated with data from a population of the epiphytic lichen Lobaria pulmonaria, using six microsatellite markers. Variogram-based analysis not only avoids bias due to the underestimation of population variance in the presence of spatial autocorrelation, but also provides estimates of population genetic diversity and the degree and extent of spatial genetic structure accounting for autocorrelation.
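The empirical variogram at the heart of this approach is half the mean squared difference between sample values, binned by spatial separation. A minimal one-dimensional sketch of a single lag class (the genetic-distance weighting schemes of the paper are not reproduced here):

```python
def semivariance(coords, values, lag, tol):
    """Empirical semivariance gamma(h) for one lag class: half the mean
    squared difference over all sample pairs whose separation falls
    within tol of the lag (1-D coordinates for simplicity)."""
    total, npairs = 0.0, 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if abs(abs(coords[i] - coords[j]) - lag) <= tol:
                total += (values[i] - values[j]) ** 2
                npairs += 1
    return total / (2 * npairs) if npairs else float('nan')
```

Plotting semivariance against lag shows the degree and spatial extent of structure: under isolation-by-distance the curve rises with lag and levels off at the sill, which estimates the population variance free of the autocorrelation bias mentioned in the abstract.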

12.
Recent developments in marginal likelihood estimation for model selection in the field of Bayesian phylogenetics and molecular evolution have emphasized the poor performance of the harmonic mean estimator (HME). Although these studies have shown the merits of new approaches applied to standard normally distributed examples and small real-world data sets, not much is currently known concerning the performance and computational issues of these methods when fitting complex evolutionary and population genetic models to empirical real-world data sets. Further, these approaches have not yet seen widespread application in the field due to the lack of implementations of these computationally demanding techniques in commonly used phylogenetic packages. We here investigate the performance of some of these new marginal likelihood estimators, specifically, path sampling (PS) and stepping-stone (SS) sampling for comparing models of demographic change and relaxed molecular clocks, using synthetic data and real-world examples for which unexpected inferences were made using the HME. Given the drastically increased computational demands of PS and SS sampling, we also investigate a posterior simulation-based analogue of Akaike's information criterion (AIC) through Markov chain Monte Carlo (MCMC), known as the AICM, a model comparison approach that shares with the HME the appealing feature of having a low computational overhead over the original MCMC analysis. We confirm that the HME systematically overestimates the marginal likelihood and fails to yield reliable model classification, and show that the AICM performs better and may be a useful initial evaluation of model choice but that it is also, to a lesser degree, unreliable. We show that PS and SS sampling substantially outperform these estimators and adjust the conclusions made concerning previous analyses for the three real-world data sets that we reanalyzed.
The methods used in this article are now available in BEAST, a powerful user-friendly software package to perform Bayesian evolutionary analyses.
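Stepping-stone sampling estimates the marginal likelihood as a telescoping product of ratios of power-posterior normalizing constants. A hedged toy sketch on a conjugate normal model where the exact answer is known and each power posterior can be sampled directly (so no MCMC is needed; the cubic temperature schedule and sample sizes are illustrative choices, not BEAST defaults):

```python
import math
import random

def log_lik(theta, x=0.0):
    # likelihood of a single observation x ~ N(theta, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2

def stepping_stone(n=4000, K=10, seed=1):
    """Stepping-stone estimate of the log marginal likelihood of x = 0
    under the prior theta ~ N(0, 1). The power posterior at inverse
    temperature beta is N(0, 1/(1 + beta)) in this conjugate toy, so we
    sample it directly. Each step estimates one ratio Z_{b1}/Z_{b0}."""
    rng = random.Random(seed)
    betas = [(k / K) ** 3 for k in range(K + 1)]  # concentrate steps near the prior
    log_z = 0.0
    for b0, b1 in zip(betas, betas[1:]):
        sd = math.sqrt(1.0 / (1.0 + b0))
        terms = [(b1 - b0) * log_lik(rng.gauss(0.0, sd)) for _ in range(n)]
        m = max(terms)  # log-sum-exp for numerical stability
        log_z += m + math.log(sum(math.exp(t - m) for t in terms) / n)
    return log_z
```

The exact value here is log N(0; 0, 2) = -0.5 log(4π) ≈ -1.2655. The HME applied to the same toy, averaging reciprocal likelihoods over posterior draws, has a much heavier-tailed estimator and illustrates the systematic overestimation the abstract describes.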

13.
14.
In this paper, we establish an upper bound for time to convergence to stationarity for the discrete time infinite alleles Moran model. If M is the population size and μ is the mutation rate, this bound gives a cutoff time of log(M μ)/μ generations. The stationary distribution for this process in the case of sampling without replacement is the Ewens sampling formula. We show that the bound for the total variation distance from the generation t distribution to the Ewens sampling formula is well approximated by one of the extreme value distributions, namely, a standard Gumbel distribution. Beginning with the card shuffling examples of Aldous and Diaconis and extending the ideas of Donnelly and Rodrigues for the two allele model, this model adds to the list of Markov chains that show evidence for the cutoff phenomenon. Because of the broad use of infinite alleles models, this cutoff sets the time scale of applicability for statistical tests based on the Ewens sampling formula and other tests of neutrality in a number of population genetic studies.
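The stationary law in question, the Ewens sampling formula, is easy to evaluate directly. A sketch of the standard formula (this is the textbook expression, not code from the paper): for a sample configuration in which a[j-1] allelic types are each seen j times, with scaled mutation rate θ,

```python
import math

def ewens_probability(a, theta):
    """Ewens sampling formula: probability of the allele configuration in
    which a[j-1] allelic types are each represented j times in the sample.
    P(a) = n! / (theta)_n * prod_j theta**a_j / (j**a_j * a_j!),
    where (theta)_n is the rising factorial theta (theta+1)...(theta+n-1)."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    p = math.factorial(n) / rising
    for j, aj in enumerate(a, start=1):
        p *= theta ** aj / (j ** aj * math.factorial(aj))
    return p
```

The cutoff result says that tests built on these probabilities are only justified once the process has run on the order of log(M μ)/μ generations; before that, the generation-t distribution is still far from this formula in total variation.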

15.
This note considers sampling theory for a selectively neutral locus where it is supposed that the data provide nucleotide sequences for the genes sampled. It thus anticipates that technical advances will soon provide data of this form in volume approaching that currently obtained from electrophoresis. The assumption made on the nature of the data will require us to use, in the terminology of Kimura (Theor. Pop. Biol. 2, 174–208 (1971)), the "infinite sites" model of Karlin and McGregor (Proc. Fifth Berkeley Symp. Math. Statist. Prob. 4, 415–438 (1967)) rather than the "infinite alleles" model of Kimura and Crow (Genetics 49, 725–738 (1964)). We emphasize that these two models refer not to two different real-world circumstances, but rather to two different assumptions concerning our capacity to investigate the real world. We compare our results where appropriate with the corresponding sampling theory of Ewens (Theor. Pop. Biol. 3, 87–112 (1972)) for the "infinite alleles" model. Note finally that some of our results depend on an assumption of independence of behavior at individual sites; a parallel paper by Watterson (submitted for publication (1974)) assumes no recombination between sites. Real-world behavior will lie between these two assumptions, closer to the situation assumed by Watterson than in this note. Our analysis provides upper bounds for increased efficiency in using complete nucleotide sequences.
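The best-known practical product of this line of infinite-sites sampling theory is the moment estimator of θ = 4Nu from the number of segregating sites, usually credited to Watterson's companion work. A sketch of that now-standard estimator (offered as context for the abstract, not as this note's own result):

```python
def watterson_theta(segregating_sites, n):
    """Moment estimator of theta = 4*N*u under the infinite sites model:
    E[S] = theta * (1 + 1/2 + ... + 1/(n-1)) for S segregating sites
    observed in a sample of n sequences, so theta-hat = S / a_n."""
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites / a_n
```

The harmonic-sum denominator grows only logarithmically in n, which is the quantitative form of the note's point that sequencing more genes yields sharply diminishing returns compared with sequencing at all.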

16.
The problem of ascertainment in segregation analysis arises when families are selected for study through ascertainment of affected individuals. In this case, ascertainment must be corrected for in data analysis. However, methods for ascertainment correction are not available for many common sampling schemes, e.g., sequential sampling of extended pedigrees (except in the case of "single" selection). Concerns about whether ascertainment correction is even required for large pedigrees, about whether and how multiple probands in the same pedigree can be taken into account properly, and about how to apply sequential sampling strategies have occupied many investigators in recent years. We address these concerns by reconsidering a central issue, namely, how to handle pedigree structure (including size). We introduce a new distinction, between sampling in such a way that observed pedigree structure does not depend on which pedigree members are probands (proband-independent [PI] sampling) and sampling in such a way that observed pedigree structure does depend on who are the probands (proband-dependent [PD] sampling). This distinction corresponds roughly (but not exactly) to the distinction between fixed-structure and sequential sampling. We show that conditioning on observed pedigree structure in ascertained data sets obtained under PD sampling is not in general correct (with the exception of "single" selection), while PI sampling of pedigree structures larger than simple sibships is generally not possible. Yet, in practice one has little choice but to condition on observed pedigree structure. We conclude that the problem of genetic modeling in ascertained data sets is, in most situations, literally intractable. We recommend that future efforts focus on the development of robust approximate approaches to the problem.

17.
Ecosystem research benefits enormously from the fact that comprehensive, high-quality data sets covering long time periods are now increasingly available. However, facing apparently complex interdependencies between numerous ecosystem components, there is an urgent need to rethink our approaches in ecosystem research and to apply new tools of data analysis. The concept presented in this paper rests on two pillars. First, it postulates that ecosystems are multiple feedback systems and thus are highly constrained; consequently, the effective dimensionality of multivariate ecosystem data sets is expected to be rather low compared to the number of observables. Second, it assumes that ecosystems are characterized by continuity in time and space, as well as between entities which are often treated as distinct units. Implementing this concept in ecosystem research requires new tools for analysing large multivariate data sets. This study presents some of them, applied to a comprehensive water quality data set from a long-term monitoring program in the Uckermark region of Northeast Germany, one of the LTER-D (Long Term Ecological Research network, Germany) sites. The effective dimensionality was assessed by the Correlation Dimension approach as well as by a Principal Component Analysis, and was in fact substantially lower than the number of observables. Continuity in time, space and between different types of water bodies was studied by combining Self-Organizing Maps with Sammon's Mapping. Groundwater, kettle hole and stream water samples exhibited some overlap, confirming continuity between different types of water bodies. Clear long-term shifts were found at the stream sampling sites. There was strong evidence that the intensity of single processes had changed at these sites rather than that new processes had developed; thus the more recent data did not occupy new subregions of the phase space of observations. Short-term variability of the kettle hole water samples differed substantially from that of the stream water samples, suggesting that different processes generate the dynamics in these two types of water bodies. However, again, this seemed to be due to differing intensities of single processes rather than to completely different processes. We feel that research aiming at elucidating apparently complex interactions in ecosystems could make much more efficient use of now-available large monitoring data sets by implementing the suggested concept and using corresponding innovative tools of system analysis.

18.
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. This type of conserved regions, called structured motifs, is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological data sets shows its ability to extract relevant consensi.
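To pin down what a "structured motif" is, here is a brute-force sketch of the extraction problem for exact two-box motifs (box, spacer length, box) that occur in at least a quorum of sequences. RISO solves this with box-links and suffix-tree machinery far more efficiently; this naive enumeration only states the object being searched for:

```python
from collections import defaultdict

def structured_motifs(seqs, k, dmin, dmax, quorum):
    """Brute-force extraction of exact two-box structured motifs:
    (box1, spacer length d, box2) with boxes of length k and
    dmin <= d <= dmax, occurring in at least `quorum` sequences."""
    occurs = defaultdict(set)  # motif -> set of sequence indices containing it
    for si, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            box1 = s[i:i + k]
            for d in range(dmin, dmax + 1):
                j = i + k + d
                if j + k <= len(s):
                    occurs[(box1, d, s[j:j + k])].add(si)
    return {m for m, hits in occurs.items() if len(hits) >= quorum}
```

The spacer loop is where the naive cost blows up as dmax - dmin grows, which is exactly the factor in which the paper reports an exponential gain for its box-link structure.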

19.
20.
Computer fitting of binding data is discussed and it is concluded that the main problem is the choice of starting estimates and internal scaling parameters, not the optimization software. Solving linear overdetermined systems of equations for starting estimates is investigated. A function, Q, is introduced to study model discrimination with binding isotherms, and the behaviour of Q as a function of model parameters is calculated for the case of 2 and 3 sites. The power function of the F test is estimated for models with 2 to 5 binding sites, and necessary constraints on parameters for correct model discrimination are given. The sampling distribution of F test statistics is compared to an exact F distribution using the Chi-squared and Kolmogorov-Smirnov tests. For low order models (n < 3) the F test statistics are approximately F distributed, but for higher order models the test statistics are skewed to the left of the F distribution. The parameter covariance matrix obtained by inverting the Hessian matrix of the objective function is shown to be a good approximation to the estimate obtained by Monte Carlo sampling for low order models (n < 3). It is concluded that analysis of up to 2 or 3 binding sites presents few problems and linear, normal statistical results are valid. To identify 4 sites correctly is much more difficult, requiring very precise data and extreme parameter values. Discrimination of 5 from 4 sites is an upper limit to the usefulness of the F test.
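The F test used to discriminate an n-site from an (n+1)-site model is the standard extra sum-of-squares statistic for nested fits. A minimal sketch of that statistic (the generic formula, not the paper's fitting software; the names are illustrative):

```python
def extra_ss_f(ss_small, p_small, ss_big, p_big, n_points):
    """Extra sum-of-squares F statistic for nested binding models:
    ss_* are residual sums of squares, p_* are parameter counts
    (p_big > p_small, smaller model nested in the larger), n_points is
    the number of data points. Under the null (extra sites unnecessary)
    this is referred to F(p_big - p_small, n_points - p_big)."""
    numerator = (ss_small - ss_big) / (p_big - p_small)
    denominator = ss_big / (n_points - p_big)
    return numerator / denominator
```

The abstract's caveat applies directly: for higher-order models the statistic's sampling distribution deviates from the reference F distribution, so nominal p-values for discriminating 4 or 5 sites should be treated with caution.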
