Similar Documents
20 similar documents found (search time: 31 ms).
1.
A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge lengths. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.
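The annealing-flavored escape from local optima can be sketched generically: accept a worse candidate with a probability that shrinks as a temperature parameter cools. The sketch below is a hedged illustration with hypothetical stand-ins (`neighbors` for topology moves, `loglik` for the M-step score); it is not the paper's algorithm.

```python
import math
import random

def annealed_search(init, neighbors, loglik, t0=1.0, cooling=0.95, steps=300, seed=0):
    """Hill-climbing with an annealed acceptance rule: a worse candidate is
    accepted with probability exp(delta / temperature), which shrinks as the
    temperature cools, allowing early escapes from local optima."""
    rng = random.Random(seed)
    cur, cur_ll, temp = init, loglik(init), t0
    best, best_ll = cur, cur_ll
    for _ in range(steps):
        cand = rng.choice(neighbors(cur))
        ll = loglik(cand)
        if ll >= cur_ll or rng.random() < math.exp((ll - cur_ll) / temp):
            cur, cur_ll = cand, ll
            if ll > best_ll:
                best, best_ll = cand, ll
        temp *= cooling
    return best, best_ll

# Toy usage: "topologies" are integers and the score peaks at 42; the search
# typically reaches the peak despite the purely local moves.
best, ll = annealed_search(0, lambda k: [k - 1, k + 1], lambda k: -abs(k - 42))
print(best, ll)
```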

2.
In large cohort studies, it is common that a subset of the regressors may be missing for some study subjects by design or happenstance. In this article, we apply multiple data augmentation techniques to semiparametric models for epidemiologic data when a subset of the regressors is missing for some subjects, under the assumption that the data are missing at random in the sense of Rubin (2004) and that the missingness probabilities depend jointly on the observable subset of regressors, on a set of observable extraneous variables, and on the outcome. Computational algorithms for the Poor Man's and the Asymptotic Normal data augmentations are investigated. Simulation studies show that the data augmentation approach generates satisfactory estimates and is computationally affordable. Under certain simulation scenarios, the proposed approach can achieve asymptotic efficiency similar to that of the maximum likelihood approach. We apply the proposed technique to the Multi-Ethnic Study of Atherosclerosis (MESA) data and the South Wales Nickel Worker Study data.
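As a point of reference for how augmentation-style methods handle a missing regressor, here is a minimal multiple-imputation sketch with Rubin's rules for pooling. It is a generic stand-in, not the Poor Man's or Asymptotic Normal algorithms studied in the article, and the model choices (linear outcome, normal imputation model) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated cohort: outcome y depends on x1 and x2; x2 is missing at random
# with probability depending only on observables (x1 and y).
n = 2000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.6 * x2 + rng.normal(size=n)
miss = rng.random(n) < 1 / (1 + np.exp(1.0 - 0.5 * x1 - 0.3 * y))

def ols(X, yv):
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    sigma2 = resid @ resid / (len(yv) - X.shape[1])
    return beta, sigma2 * np.linalg.inv(X.T @ X)

# Imputation model for x2 given (x1, y), fit on complete cases.
obs = ~miss
Z = np.column_stack([np.ones(n), x1, y])
gamma, gcov = ols(Z[obs], x2[obs])
s = np.sqrt(np.mean((x2[obs] - Z[obs] @ gamma) ** 2))

m, ests, vars_ = 20, [], []
for _ in range(m):
    # Draw imputations from the fitted conditional (parameter draw + noise).
    g = rng.multivariate_normal(gamma, gcov)
    x2_imp = x2.copy()
    x2_imp[miss] = Z[miss] @ g + rng.normal(scale=s, size=miss.sum())
    b, cov = ols(np.column_stack([np.ones(n), x1, x2_imp]), y)
    ests.append(b); vars_.append(np.diag(cov))

ests, vars_ = np.array(ests), np.array(vars_)
qbar = ests.mean(0)                                     # Rubin's rules: pooled estimate
T = vars_.mean(0) + (1 + 1 / m) * ests.var(0, ddof=1)   # total variance
print(qbar, np.sqrt(T))
```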

3.
Generalized estimating equations (GEE) are widely adopted for regression modeling of longitudinal data, taking account of potential correlations within the same subjects. Although the standard GEE assumes common regression coefficients among all subjects, such an assumption may not be realistic when there is potential heterogeneity in regression coefficients among subjects. In this paper, we develop a flexible and interpretable approach, called grouped GEE analysis, for modeling longitudinal data while allowing heterogeneity in regression coefficients. The proposed method assumes that the subjects are divided into a finite number of groups and that subjects within the same group share the same regression coefficients. We provide a simple algorithm for grouping subjects and estimating the regression coefficients simultaneously, and we show the asymptotic properties of the proposed estimator. The number of groups can be determined by cross-validation with averaging. We demonstrate the proposed method through simulation studies and an application to a real data set.
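A crude way to convey the grouping idea: fit subject-wise least-squares coefficients, cluster subjects in coefficient space, and refit within groups. This is an assumption-laden simplification (independent working correlation, number of groups fixed by hand), not the paper's simultaneous grouped-GEE algorithm.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Toy longitudinal data: 60 subjects, 8 repeated measures, two latent groups
# with opposite slopes.
n_sub, n_rep = 60, 8
group_true = rng.integers(0, 2, n_sub)
slope = np.where(group_true == 0, 1.0, -1.0)
t = np.tile(np.linspace(0, 1, n_rep), (n_sub, 1))
y = 0.5 + slope[:, None] * t + rng.normal(0, 0.3, (n_sub, n_rep))

def fit(ts, ys):
    """Least-squares intercept and slope."""
    X = np.column_stack([np.ones_like(ts), ts])
    return np.linalg.lstsq(X, ys, rcond=None)[0]

# Step 1: subject-wise coefficient estimates.
betas = np.array([fit(t[i], y[i]) for i in range(n_sub)])

# Step 2: cluster subjects in coefficient space (k is fixed by hand here;
# the paper chooses the number of groups by cross-validation with averaging).
_, labels = kmeans2(betas, 2, minit="++", seed=1)

# Step 3: pooled re-fit within each estimated group.
for g in range(2):
    idx = labels == g
    print("group", g, fit(t[idx].ravel(), y[idx].ravel()))
```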

4.
Liu M, Lu W, Shao Y. Biometrics 2006, 62(4):1053-1061
Interval mapping using normal mixture models has been an important tool for analyzing quantitative traits in experimental organisms. When the primary phenotype is time-to-event, it is natural to use survival models such as Cox's proportional hazards model instead of normal mixtures to model the phenotype distribution. An extra challenge for modeling time-to-event data is that the underlying population may consist of susceptible and nonsusceptible subjects. In this article, we propose a semiparametric proportional hazards mixture cure model which allows missing covariates. We discuss applications to quantitative trait loci (QTL) mapping when the primary trait is time-to-event from a population of mixed susceptibility. This model can be used to characterize QTL effects on both susceptibility and time-to-event distribution, and to estimate QTL location. The model can naturally incorporate covariate effects of other risk factors. Maximum likelihood estimates for the parameters in the model as well as their corresponding variance estimates can be obtained numerically using an EM-type algorithm. The proposed methods are assessed by simulations under practical settings and illustrated using a real data set containing survival times of mice after infection with Listeria monocytogenes. An extension to multiple intervals is also discussed.
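The EM logic of a cure model is easiest to see in a stripped-down parametric version (no covariates, exponential latency): censored subjects carry a posterior weight of being susceptible in the E-step, and the M-step has closed-form updates. A hedged toy sketch, far simpler than the article's semiparametric PH cure model with missing covariates:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy cure-model data: with probability pi a subject is susceptible and has
# an exponential event time; everyone is administratively censored at c.
n, pi_true, lam_true, c = 5000, 0.6, 1.0, 3.0
suscept = rng.random(n) < pi_true
t_event = rng.exponential(1 / lam_true, n)
time = np.where(suscept, np.minimum(t_event, c), c)
delta = suscept & (t_event < c)            # True = event observed

pi, lam = 0.5, 0.5                          # initial values
for _ in range(200):
    # E-step: posterior probability of susceptibility for censored subjects
    # (subjects with an observed event are susceptible with certainty).
    s = np.exp(-lam * time)
    w = np.where(delta, 1.0, pi * s / (pi * s + 1 - pi))
    # M-step: closed-form updates for this toy model.
    pi = w.mean()
    lam = delta.sum() / (w * time).sum()

print(pi, lam)   # should be near 0.6 and 1.0
```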

5.
A commonly used tool in disease association studies is the search for discrepancies between the haplotype distribution in the case and control populations. In order to find this discrepancy, the haplotype frequencies in each of the populations are estimated from the genotypes. We present a new method, HAPLOFREQ, to estimate haplotype frequencies over a short genomic region given the genotypes or haplotypes with missing data or sequencing errors. Our approach incorporates a maximum likelihood model based on a simple random generative model which assumes that the genotypes are independently sampled from the population. We first show that if the phased haplotypes are given, possibly with missing data, we can estimate the frequency of the haplotypes in the population by finding the global optimum of the likelihood function in polynomial time. If the haplotypes are not phased, finding the maximum value of the likelihood function is NP-hard. In this case, we define an alternative likelihood function which can be thought of as a relaxed likelihood function. We show that the maximum relaxed likelihood can be found in polynomial time and that the optimal solution of the relaxed likelihood converges asymptotically to the haplotype frequencies in the population. In contrast to previous approaches, our algorithms are guaranteed to converge in polynomial time to a global maximum of the different likelihood functions. We compared the performance of our algorithm to the widely used program PHASE, and we found that our estimates are at least 10% more accurate and about ten times faster than PHASE. Our techniques involve new algorithms in convex optimization, which may be of independent interest; in particular, they may be helpful in other maximum likelihood problems arising from survey sampling.
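For contrast, the baseline the paper improves on is the classic EM for haplotype frequencies, shown below for the smallest interesting case (two SNPs, where only double heterozygotes are phase-ambiguous). This toy EM is only an illustration of the estimation task: unlike HAPLOFREQ's relaxed likelihood, plain EM carries no polynomial-time global-optimality guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)

haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
idx = {h: k for k, h in enumerate(haps)}
p_true = np.array([0.4, 0.1, 0.2, 0.3])
n = 4000
H = np.array(haps)
# Unphased genotypes: sum of two haplotypes drawn from the population.
geno = H[rng.choice(4, n, p=p_true)] + H[rng.choice(4, n, p=p_true)]

amb = (geno[:, 0] == 1) & (geno[:, 1] == 1)    # double heterozygotes
n_amb = amb.sum()
counts0 = np.zeros(4)                           # unambiguous haplotype counts
for g0, g1 in geno[~amb]:
    counts0[idx[(int(g0 >= 1), int(g1 >= 1))]] += 1
    counts0[idx[(int(g0 == 2), int(g1 == 2))]] += 1

p = np.full(4, 0.25)
for _ in range(100):
    # E-step: expected phase of double heterozygotes, (00,11) vs (01,10).
    w = p[0] * p[3] / (p[0] * p[3] + p[1] * p[2])
    counts = counts0 + n_amb * np.array([w, 1 - w, 1 - w, w])
    # M-step: renormalise expected haplotype counts.
    p = counts / counts.sum()

print(p)   # should be close to p_true
```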

6.
Liu W, Wu L. Biometrics 2007, 63(2):342-350
Semiparametric nonlinear mixed-effects (NLME) models are flexible for modeling complex longitudinal data. Covariates are usually introduced in the models to partially explain interindividual variations. Some covariates, however, may be measured with substantial errors. Moreover, the responses may be missing and the missingness may be nonignorable. We propose two approximate likelihood methods for semiparametric NLME models with covariate measurement errors and nonignorable missing responses. The methods are illustrated in a real data example. Simulation results show that both methods perform well and are much better than the commonly used naive method.

7.
It has been recognized that genetic mutations in specific nucleotides may give rise to cancer via the alteration of signaling pathways. Thus, the detection of those cancer-causing mutations has received considerable interest in cancer genetic research. Here, we propose a statistical model for characterizing genes that lead to cancer through point mutations using genome-wide single nucleotide polymorphism (SNP) data. The basic idea of the model is that mutated genes may be in high association with their nearby SNPs because of evolutionary forces. By genotyping SNPs in both normal and cancer cells, we formulate a polynomial likelihood to estimate the population genetic parameters related to cancer, such as allele frequencies of cancer-causing alleles, mutation rates of alleles derived from maternal or paternal parents, and zygotic linkage disequilibria between different loci after the mutation occurs. We implement the EM algorithm to estimate some of these parameters because of the missing information in the likelihood construction. The model enables elegant tests of significant associations between mutated cancer genes and genome-wide SNPs, thus providing a way to predict the occurrence and formation of cancer from genetic information. The model, validated through computer simulation, may help cancer geneticists design efficient experiments and formulate hypotheses for cancer gene identification.

8.
Understanding the transmission dynamics of infectious diseases is important for both biological research and public health applications. It has been widely demonstrated that statistical modeling provides a firm basis for inferring relevant epidemiological quantities from incidence and molecular data. However, the complexity of transmission dynamic models presents two challenges: (1) the likelihood function of the models is generally not computable, and computationally intensive simulation-based inference methods need to be employed, and (2) the model may not be fully identifiable from the available data. While the first difficulty can be tackled by computational and algorithmic advances, the second obstacle is more fundamental. Identifiability issues may lead to inferences that are driven more by prior assumptions than by the data themselves. We consider a popular and relatively simple yet analytically intractable model for the spread of tuberculosis based on classical IS6110 fingerprinting data. We report on the identifiability of the model, also presenting some methodological advances regarding the inference. Using likelihood approximations, we show that the reproductive value cannot be identified from the data available and that the posterior distributions obtained in previous work have likely been substantially dominated by the assumed prior distribution. Further, we show that the inferences are influenced by the assumed infectious population size, which generally has been kept fixed in previous work. We demonstrate that the infectious population size can be inferred if the remaining epidemiological parameters are already known with sufficient precision.
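Approximate Bayesian computation (ABC) rejection is one of the simplest simulation-based schemes of the kind referred to here. The sketch below applies it to a hypothetical toy branching-process model of outbreaks, chosen only to make the idea concrete; it is not the IS6110 tuberculosis model of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def outbreak_size(r, cap=10_000):
    """Total size of a branching-process outbreak with Poisson(r) offspring."""
    size = active = 1
    while active and size < cap:
        active = rng.poisson(r, active).sum()
        size += active
    return size

# "Observed" data: mean outbreak size from 30 outbreaks at the true r.
r_true = 0.8
s_obs = np.mean([outbreak_size(r_true) for _ in range(30)])

accepted = []
for _ in range(5_000):
    r = rng.uniform(0, 1)                           # prior on the offspring mean
    s_sim = np.mean([outbreak_size(r) for _ in range(30)])
    if abs(s_sim - s_obs) < 0.5:                    # tolerance on the summary statistic
        accepted.append(r)

post = np.array(accepted)
print(len(post), post.mean(), np.percentile(post, [2.5, 97.5]))
```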

9.
Risk assessment for quantitative responses using a mixture model
Razzaghi M, Kodell RL. Biometrics 2000, 56(2):519-527
A problem that frequently occurs in biological experiments with laboratory animals is that some subjects are less susceptible to the treatment than others. A mixture model has traditionally been proposed to describe the distribution of responses in treatment groups for such experiments. Using a mixture dose-response model, we derive an upper confidence limit on additional risk, defined as the excess risk over the background risk due to an added dose. Our focus will be on experiments with continuous responses for which risk is the probability of an adverse effect defined as an event that is extremely rare in controls. The asymptotic distribution of the likelihood ratio statistic is used to obtain the upper confidence limit on additional risk. The method can also be used to derive a benchmark dose corresponding to a specified level of increased risk. The EM algorithm is utilized to find the maximum likelihood estimates of model parameters and an extension of the algorithm is proposed to derive the estimates when the model is subject to a specified level of added risk. An example is used to demonstrate the results, and it is shown that by using the mixture model a more accurate measure of added risk is obtained.
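The mixture ingredient here is a two-component normal model, one component for susceptible subjects whose mean shifts with dose and one staying at background. The EM sketch below fits such a mixture, assuming a common variance for brevity; the risk and benchmark-dose machinery (LR-based upper confidence limits) built on top of it is not shown.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated responses: 70% "susceptible" subjects shifted to mean 3,
# 30% at the background mean 0, common unit variance.
n, w_true = 400, 0.7
z = rng.random(n) < w_true
y = np.where(z, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

def normpdf(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

w, mu1, mu2, sd = 0.5, y.max(), y.min(), y.std()
for _ in range(300):
    # E-step: posterior probability of membership in component 1.
    r1 = w * normpdf(y, mu1, sd)
    r2 = (1 - w) * normpdf(y, mu2, sd)
    g = r1 / (r1 + r2)
    # M-step: weighted means, common variance, mixing weight.
    w = g.mean()
    mu1 = (g * y).sum() / g.sum()
    mu2 = ((1 - g) * y).sum() / (1 - g).sum()
    sd = np.sqrt((g * (y - mu1) ** 2 + (1 - g) * (y - mu2) ** 2).mean())

print(w, mu1, mu2, sd)   # roughly 0.7, 3, 0, 1
```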

10.
Whether the balance between integration and segregation of information in the brain is damaged in Mild Cognitive Impairment (MCI) subjects is still a matter of debate. Here we characterize the functional network architecture of MCI subjects by means of complex networks analysis. Magnetoencephalogram (MEG) time series obtained during a memory task were evaluated with synchronization likelihood (SL), to quantify the statistical dependence between MEG signals and to obtain the functional networks. Graphs from MCI subjects show an enhancement of the strength of connections, together with an increase in the outreach parameter, suggesting that memory processing in MCI subjects is associated with higher energy expenditure and a tendency toward random structure, which breaks the balance between integration and segregation. All features are reproduced by an evolutionary network model that simulates the degenerative process of a healthy functional network to that associated with MCI. Due to the high rate of conversion from MCI to Alzheimer's disease (AD), these results show that the analysis of functional networks could be an appropriate tool for the early detection of both MCI and AD.
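A hedged sketch of the pipeline: pairwise dependence between channel time series, thresholded into a weighted graph, then per-node strength. Plain correlation stands in for synchronization likelihood here, and the paper's outreach parameter (which also uses sensor distances) is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy "recordings": 16 channels weakly coupled through a common signal.
n_ch, n_t = 16, 1000
common = rng.normal(size=n_t)
X = 0.4 * common + rng.normal(size=(n_ch, n_t))

W = np.abs(np.corrcoef(X))        # dependence matrix (stand-in for SL)
np.fill_diagonal(W, 0.0)          # no self-connections
W[W < 0.2] = 0.0                  # keep only the stronger links

strength = W.sum(axis=1)          # weighted degree per channel
print(strength.round(2))
```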

11.
Estimation in a Cox proportional hazards cure model
Sy JP, Taylor JM. Biometrics 2000, 56(1):227-236
Some failure time data come from a population that consists of some subjects who are susceptible to, and others who are nonsusceptible to, the event of interest. The data typically have heavy censoring at the end of the follow-up period, and a standard survival analysis would not always be appropriate. In such situations, where there is good scientific or empirical evidence of a nonsusceptible population, the mixture or cure model can be used (Farewell, 1982, Biometrics 38, 1041-1046). It assumes a binary distribution to model the incidence probability and a parametric failure time distribution to model the latency. Kuk and Chen (1992, Biometrika 79, 531-541) extended the model by using Cox's proportional hazards regression for the latency. We develop maximum likelihood techniques for the joint estimation of the incidence and latency regression parameters in this model using the nonparametric form of the likelihood and an EM algorithm. A zero-tail constraint is used to reduce the near nonidentifiability of the problem. The inverse of the observed information matrix is used to compute the standard errors. A simulation study shows that the methods are competitive with the parametric methods under ideal conditions and are generally better when censoring from loss to follow-up is heavy. The methods are applied to a data set of tonsil cancer patients treated with radiation therapy.

12.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.
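The dynamic-programming half of the DP-EM idea is standard and easy to sketch: an O(Kn^2) recursion finds the least-squares segmentation of a signal into K segments. The mixture/EM half, which assigns the detected segments a shared biological status, is omitted from this minimal sketch.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy profile with three mean levels (changepoints at 40 and 70).
y = np.concatenate([rng.normal(0, 0.3, 40), rng.normal(1.5, 0.3, 30),
                    rng.normal(-1, 0.3, 50)])
n, K = len(y), 3
s1 = np.concatenate([[0], y.cumsum()])
s2 = np.concatenate([[0], (y ** 2).cumsum()])

def sse(i, j):
    """Within-segment sum of squares for y[i:j], from prefix sums."""
    return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

D = np.full((K + 1, n + 1), np.inf)     # D[k, j]: best cost of k segments on y[:j]
arg = np.zeros((K + 1, n + 1), dtype=int)
D[0, 0] = 0.0
for k in range(1, K + 1):
    for j in range(k, n + 1):
        costs = [D[k - 1, i] + sse(i, j) for i in range(k - 1, j)]
        i0 = int(np.argmin(costs))
        D[k, j] = costs[i0]
        arg[k, j] = i0 + (k - 1)

# Backtrack the optimal segment starts.
bps, j = [], n
for k in range(K, 0, -1):
    j = arg[k, j]
    bps.append(j)
print(sorted(bps))    # should be near [0, 40, 70]
```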

13.
Since the seminal work of Prentice and Pyke, the prospective logistic likelihood has become the standard method of analysis for retrospectively collected case-control data, in particular for testing the association between a single genetic marker and a disease outcome in genetic case-control studies. In the study of multiple genetic markers with relatively small effects, especially those with rare variants, various aggregated approaches based on the same prospective likelihood have been developed to integrate subtle association evidence among all the markers considered. Many of the commonly used tests are derived from the prospective likelihood under a common-random-effect assumption, which assumes a common random effect for all subjects. We develop the locally most powerful aggregation test based on the retrospective likelihood under an independent-random-effect assumption, which allows the genetic effect to vary among subjects. In contrast to the fact that disease prevalence information cannot be used to improve efficiency for the estimation of odds ratio parameters in logistic regression models, we show that it can be utilized to enhance the testing power in genetic association studies. Extensive simulations demonstrate the advantages of the proposed method over the existing ones. A real genome-wide association study is analyzed for illustration.

14.
Health researchers are often interested in assessing the direct effect of a treatment or exposure on an outcome variable, as well as its indirect (or mediation) effect through an intermediate variable (or mediator). For an outcome following a nonlinear model, the mediation formula may be used to estimate causally interpretable mediation effects. This method, like others, assumes that the mediator is observed. However, as is common in structural equations modeling, we may wish to consider a latent (unobserved) mediator. We follow a potential outcomes framework and assume a generalized structural equations model (GSEM). We provide maximum-likelihood estimation of GSEM parameters using an approximate Monte Carlo EM algorithm, coupled with a mediation formula approach to estimate natural direct and indirect effects. The method relies on an untestable sequential ignorability assumption; we assess robustness to this assumption by adapting a recently proposed method for sensitivity analysis. Simulation studies show good properties of the proposed estimators in plausible scenarios. Our method is applied to a study of the effect of mother education on occurrence of adolescent dental caries, in which we examine possible mediation through latent oral health behavior.
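The mediation formula itself is straightforward to evaluate by Monte Carlo once the component models are in hand. The sketch below does so for natural direct and indirect effects with a fully observed mediator and known parameters, both simplifying assumptions; in the article the mediator is latent and the parameters come from the Monte Carlo EM fit of the GSEM.

```python
import numpy as np

rng = np.random.default_rng(8)

def mediator(a, size):
    """Draws from the mediator model M | A=a ~ Normal(0.2 + 0.8*a, 1)."""
    return rng.normal(0.2 + 0.8 * a, 1.0, size)

def p_outcome(a, m):
    """P(Y=1 | A=a, M=m) under a logistic outcome model."""
    return 1 / (1 + np.exp(-(-1.0 + 0.5 * a + 0.7 * m)))

B = 200_000
m0, m1 = mediator(0, B), mediator(1, B)
y_0m0 = p_outcome(0, m0).mean()     # E[Y(0, M(0))]
y_1m0 = p_outcome(1, m0).mean()     # E[Y(1, M(0))]
y_1m1 = p_outcome(1, m1).mean()     # E[Y(1, M(1))]

print("NDE:", y_1m0 - y_0m0)        # direct effect, mediator held at M(0)
print("NIE:", y_1m1 - y_1m0)        # indirect effect through the mediator
```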

15.
Approximate likelihood ratios for general estimating functions
The method of estimating functions (Godambe, 1991) is commonly used when one desires to conduct inference about some parameters of interest but the full distribution of the observations is unknown. However, this approach may have limited utility, due to multiple roots for the estimating function, a poorly behaved Wald test, or lack of a goodness-of-fit test. This paper presents approximate likelihood ratios that can be used along with estimating functions when any of these three problems occurs. We show that the approximate likelihood ratio provides correct large sample inference under very general circumstances, including clustered data and misspecified weights in the estimating function. Two methods of constructing the approximate likelihood ratio, one based on the quasi-likelihood approach and the other based on the linear projection approach, are compared and shown to be closely related. In particular, we show that quasi-likelihood is the limit of the projection approach. We illustrate the technique with two applications.

16.
In this paper, we consider the problem of estimating the size N of a finite and closed population, using data obtained from capture-recapture experiments. By defining an appropriate model, we investigate the maximum of the likelihood, of the profile likelihood, and of an orthogonally adjusted profile likelihood (Cox and Reid, 1987). We show that all of them may return infinity as the maximum likelihood estimate of N; this seems to be a characteristic of the likelihood approach to this problem. We therefore present a Bayesian approach with minimum prior information as a way of countering this difficulty. Exact analytical expressions for the posterior modes are also obtained.
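The infinite-MLE phenomenon is easy to reproduce in the simplest closed-population model (M0: t capture occasions, common capture probability p). The toy profile below uses hypothetical data in which no animal is ever recaptured, the textbook situation where the profile likelihood keeps rising with N and the MLE of N is infinite.

```python
import math

# Hypothetical data: 5 occasions, 20 distinct animals, 20 total captures,
# i.e. no recaptures at all.
t, r, tot = 5, 20, 20

def profile_loglik(N):
    """M0 log-likelihood in N, with p profiled out at its maximiser tot/(N*t)."""
    p = tot / (N * t)
    return (math.lgamma(N + 1) - math.lgamma(N - r + 1)
            + tot * math.log(p) + (N * t - tot) * math.log(1 - p))

for N in [20, 50, 100, 1000, 10_000, 100_000]:
    print(N, round(profile_loglik(N), 3))   # monotonically increasing in N
```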

17.
18.
On estimation and prediction for spatial generalized linear mixed models
Zhang H. Biometrics 2002, 58(1):129-136
We use spatial generalized linear mixed models (GLMM) to model non-Gaussian spatial variables that are observed at sampling locations in a continuous area. In many applications, prediction of random effects in a spatial GLMM is of great practical interest. We show that the minimum mean-squared error (MMSE) prediction can be done in a linear fashion in spatial GLMMs analogous to linear kriging. We develop a Monte Carlo version of the EM gradient algorithm for maximum likelihood estimation of model parameters. A by-product of this approach is that it also produces the MMSE estimates for the realized random effects at the sampled sites. This method is illustrated through a simulation study and is also applied to a real data set on plant root diseases to obtain a map of disease severity that can facilitate the practice of precision agriculture.
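The linear form of the MMSE predictor is the familiar kriging computation. The sketch below shows it for a plain Gaussian model with an assumed exponential covariance; per the abstract, the same linear form applies to the random effects in the spatial GLMM setting.

```python
import numpy as np

rng = np.random.default_rng(9)

def exp_cov(a, b, sill=1.0, scale=0.3):
    """Exponential covariance between two sets of 2-D locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-d / scale)

n, tau = 30, 0.2
sites = rng.random((n, 2))                  # sampled locations in the unit square
new = np.array([[0.5, 0.5]])                # prediction location
K = exp_cov(sites, sites)
b = np.linalg.cholesky(K + 1e-10 * np.eye(n)) @ rng.normal(size=n)
y = b + rng.normal(0, tau, n)               # noisy observations at the sites

# Linear MMSE predictor: E[b(new) | y] = k' (K + tau^2 I)^{-1} y
k = exp_cov(new, sites)
pred = (k @ np.linalg.solve(K + tau ** 2 * np.eye(n), y)).item()
print("predicted random effect at (0.5, 0.5):", round(pred, 3))
```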

19.
Using a four-taxon example under a simple model of evolution, we show that the methods of maximum likelihood and maximum posterior probability (which is a Bayesian method of inference) may not arrive at the same optimal tree topology. Some patterns that are separately uninformative under the maximum likelihood method are separately informative under the Bayesian method. We also show that this difference has impact on the bootstrap frequencies and the posterior probabilities of topologies, which therefore are not necessarily approximately equal. Efron et al. (Proc. Natl. Acad. Sci. USA 93:13429-13434, 1996) stated that bootstrap frequencies can, under certain circumstances, be interpreted as posterior probabilities. This is true only if one includes a non-informative prior distribution of the possible data patterns, and most often the prior distributions are instead specified in terms of topology and branch lengths. [Bayesian inference; maximum likelihood method; Phylogeny; support.]

20.
Schafer DW. Biometrics 2001, 57(1):53-61
This paper presents an EM algorithm for semiparametric likelihood analysis of linear, generalized linear, and nonlinear regression models with measurement errors in explanatory variables. A structural model is used in which probability distributions are specified for (a) the response and (b) the measurement error. A distribution is also assumed for the true explanatory variable but is left unspecified and is estimated by nonparametric maximum likelihood. For various types of extra information about the measurement error distribution, the proposed algorithm makes use of available routines that would be appropriate for likelihood analysis of (a) and (b) if the true x were available. Simulations suggest that the semiparametric maximum likelihood estimator retains a high degree of efficiency relative to the structural maximum likelihood estimator based on correct distributional assumptions and can outperform maximum likelihood based on an incorrect distributional assumption. The approach is illustrated on three examples with a variety of structures and types of extra information about the measurement error distribution.
