Similar Articles
A total of 20 similar articles were retrieved.
1.
We propose a statistical model for estimating gene expression using data from multiple laser scans of hybridized microarrays taken at different scanner settings. A functional regression model is used, based on a non-linear relationship with both additive and multiplicative error terms. The function is derived as the expected value of a pixel given that values are censored at 65 535, the maximum intensity representable by the 16-bit scanning software. Maximum likelihood estimation based on a Cauchy distribution is used to fit the model, which can estimate gene expression while accounting for outliers and for the systematic bias caused by signal censoring of highly expressed genes. We have applied the method to experimental data. Simulation studies suggest that the model can estimate the true gene expression with negligible bias. AVAILABILITY: FORTRAN 90 code for implementing the method can be obtained from the authors.
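As a rough illustration of the censoring-aware, Cauchy-likelihood fitting described above, the following sketch fits a saturating calibration curve between two scans by maximizing a Cauchy log-likelihood, with the fitted mean capped at the 16-bit saturation level. The linear mean function, simulated data, and variable names are illustrative assumptions, not the authors' FORTRAN 90 implementation.

```python
# A minimal sketch (not the authors' implementation): robust Cauchy-likelihood
# fitting of a scan-to-scan calibration curve, with the fitted mean capped at
# the 16-bit saturation level 65535. Data and mean function are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import cauchy

SATURATION = 65535.0

def neg_loglik(params, x, y):
    """Negative Cauchy log-likelihood for y ~ min(b0 + b1*x, SATURATION)."""
    b0, b1, log_scale = params
    mu = np.minimum(b0 + b1 * x, SATURATION)   # censored expected intensity
    return -np.sum(cauchy.logpdf(y, loc=mu, scale=np.exp(log_scale)))

rng = np.random.default_rng(0)
x = rng.uniform(0, 60000, size=500)                           # low-sensitivity scan
y = np.minimum(1.2 * x + 200 * rng.standard_cauchy(500), SATURATION)

fit = minimize(neg_loglik, x0=[0.0, 1.0, np.log(100.0)], args=(x, y),
               method="Nelder-Mead")
print(fit.x)   # [intercept, slope, log-scale]
```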

2.
Hokeun Sun  Hongzhe Li 《Biometrics》2012,68(4):1197-1206
Gaussian graphical models have been widely used as an effective method for studying the conditional independence structure among genes and for constructing genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution. Such outliers in gene expression data can lead to wrong inference about the dependency structure among the genes. We propose an l1-penalized estimation procedure for sparse Gaussian graphical models that is robustified against possible outliers. The likelihood function is weighted according to how much each observation deviates, with the deviation measured by the observation's own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified likelihood, where nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that the proposed robust method performs much better than the graphical lasso for Gaussian graphical models in terms of both graph structure selection and estimation when outliers are present. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has better biological interpretation than that obtained from the graphical lasso.
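The likelihood-weighting idea can be mimicked with off-the-shelf tools. The sketch below down-weights low-likelihood observations and passes a weighted covariance to the ordinary graphical lasso; it is only a rough stand-in for the authors' coordinate gradient descent algorithm, and the weighting scheme, penalty level, and iteration count are assumptions.

```python
# Rough sketch: observations are weighted by their likelihood under the
# current Gaussian fit, so outliers contribute less to the covariance handed
# to the graphical lasso. Not the paper's algorithm; purely illustrative.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def robust_glasso(X, alpha=0.1, n_iter=5):
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        logf = multivariate_normal(mu, cov, allow_singular=True).logpdf(X)
        w = np.exp(logf - logf.max())
        w /= w.sum()                               # likelihood-based weights
        mu = w @ X
        Xc = X - mu
        emp_cov = (Xc * w[:, None]).T @ Xc         # weighted covariance
        cov, prec = graphical_lasso(emp_cov, alpha=alpha)
    return prec                                    # zeros = missing edges

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
X[:5] += 10.0                                      # a few gross outliers
print(np.round(robust_glasso(X), 2))
```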

3.
Han F  Pan W 《Biometrics》2012,68(1):307-315
Many statistical tests have been proposed for case-control data to detect disease association with multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium. The main reason for the existence of so many tests is that each test aims to detect one or two aspects of many possible distributional differences between cases and controls, largely due to the lack of a general and yet simple model for discrete genotype data. Here we propose a latent variable model to represent SNP data: the observed SNP data are assumed to be obtained by discretizing a latent multivariate Gaussian variate. Because the latent variate is multivariate Gaussian, its distribution is completely characterized by its mean vector and covariance matrix, in contrast to the much more complex form of a general distribution for discrete multivariate SNP data. We propose a composite likelihood approach for parameter estimation. A direct application of this latent variable model is association testing with multiple SNPs in a candidate gene or region. In contrast to many existing tests that aim to detect only one or two aspects of the many possible distributional differences in discrete SNP data, we can focus exclusively on testing the mean and covariance parameters of the latent Gaussian distributions for cases and controls. Our simulation results demonstrate potential power gains of the proposed approach over some existing methods.
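The following toy sketch illustrates the latent-variable construction only (not the full composite-likelihood estimation): genotypes 0/1/2 arise by cutting a latent bivariate normal at thresholds, and the implied 3×3 genotype-pair probabilities follow from rectangle probabilities. Cutpoints and the latent correlation are arbitrary illustrative values.

```python
# Illustrative only: genotype cell probabilities for a SNP pair implied by
# thresholding a latent standard bivariate normal with correlation rho.
import numpy as np
from scipy.stats import multivariate_normal, norm

def cell_probs(cuts1, cuts2, rho):
    """3x3 genotype probabilities for two SNPs with latent correlation rho."""
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # thresholds (-inf, c1, c2, +inf) approximated by large finite bounds
    b1 = np.concatenate(([-10.0], cuts1, [10.0]))
    b2 = np.concatenate(([-10.0], cuts2, [10.0]))
    F = lambda u, v: bvn.cdf([u, v])
    P = np.empty((3, 3))
    for i in range(3):
        for j in range(3):
            P[i, j] = (F(b1[i + 1], b2[j + 1]) - F(b1[i], b2[j + 1])
                       - F(b1[i + 1], b2[j]) + F(b1[i], b2[j]))
    return P

print(cell_probs(cuts1=[-0.5, 1.0], cuts2=[0.0, 1.2], rho=0.4).round(3))
print(norm.cdf([-0.5, 1.0]))   # implied marginal cumulative frequencies, SNP 1
```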

4.
An estimation method for the semiparametric mixed effects model
Tao H  Palta M  Yandell BS  Newton MA 《Biometrics》1999,55(1):102-110
A semiparametric mixed effects regression model is proposed for the analysis of clustered or longitudinal data with continuous, ordinal, or binary outcomes. The common assumption of Gaussian random effects is relaxed by using a predictive recursion method (Newton and Zhang, 1999) to provide a nonparametric smooth density estimate. A new strategy is introduced to accelerate the algorithm. Parameter estimates are obtained by maximizing the marginal profile likelihood with Powell's conjugate direction search method. Monte Carlo results are presented to show that the method can reduce the mean squared error of the fixed effects estimators when the random effects distribution is not Gaussian. The usefulness of visualizing the random effects density itself is illustrated in the analysis of data from the Wisconsin Sleep Survey. The proposed estimation procedure is computationally feasible for quite large data sets.
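A bare-bones sketch of the predictive recursion update (Newton and Zhang, 1999) on a grid is given below; the kernel, grid, and weight sequence are illustrative choices rather than those used in the article.

```python
# Predictive recursion on a grid: the nonparametric density estimate is
# updated one observation at a time. Kernel and weights are illustrative.
import numpy as np
from scipy.stats import norm

def predictive_recursion(x, grid, sigma=1.0):
    f = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # flat starting density
    du = grid[1] - grid[0]
    for i, xi in enumerate(x, start=1):
        w = 1.0 / (i + 1.0)                        # decaying update weight
        k = norm.pdf(xi, loc=grid, scale=sigma)    # kernel k(x_i | u) on the grid
        m = np.sum(k * f) * du                     # marginal density of x_i
        f = (1.0 - w) * f + w * k * f / m
    return f

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])  # bimodal effects
grid = np.linspace(-8, 8, 400)
fhat = predictive_recursion(x, grid)
print(np.trapz(fhat, grid))   # should be close to 1
```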

5.
In longitudinal studies, measurements of the same individuals are taken repeatedly through time. Often, the primary goal is to characterize the change in response over time and the factors that influence change. Factors can affect not only the location but, more generally, the shape of the distribution of the response over time. To make inferences about the shape of a population distribution, the widely used mixed-effects regression, for example, is inadequate if the distribution is not approximately Gaussian. We propose a novel linear model for quantile regression (QR) that includes random effects in order to account for the dependence between serial observations on the same subject. The notion of QR is synonymous with robust analysis of the conditional distribution of the response variable. We present a likelihood-based approach to the estimation of the regression quantiles that uses the asymmetric Laplace density. In a simulation study, the proposed method had an advantage in terms of the mean squared error of the QR estimator when compared with an approach based on penalized fixed effects. Under our strategy, a nearly optimal degree of shrinkage of the individual effects is selected automatically by the data and their likelihood. Our model also appears to be a robust alternative to mean regression with random effects when the location parameter of the conditional distribution of the response is of interest. We apply the model to a real data set of self-reported labor pain measurements taken repeatedly over time on women; the distribution of these measurements is characterized by skewness, and the significance of the parameters is evaluated by the likelihood ratio statistic.
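The asymmetric-Laplace likelihood connection can be seen in a few lines: for a fixed scale, maximizing the asymmetric Laplace likelihood is the same as minimizing the check (pinball) loss. The sketch below fits fixed effects only; the article's random effects and shrinkage are omitted, and the simulated data are illustrative.

```python
# Minimal sketch of likelihood-based quantile regression via the asymmetric
# Laplace density (fixed effects only; random effects omitted for brevity).
import numpy as np
from scipy.optimize import minimize

def ald_negloglik(beta, X, y, tau, sigma=1.0):
    u = y - X @ beta
    rho = u * (tau - (u < 0))                  # check (pinball) loss
    return len(y) * np.log(sigma) + np.sum(rho) / sigma

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = 1.0 + 0.5 * X[:, 1] + rng.standard_t(df=3, size=n)   # heavy-tailed errors

tau = 0.75
fit = minimize(ald_negloglik, x0=np.zeros(2), args=(X, y, tau),
               method="Nelder-Mead")
print(fit.x)   # estimated 75th-percentile regression coefficients
```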

6.
7.
We propose a statistical method for uncovering gene pathways that characterize cancer heterogeneity. To incorporate knowledge of the pathways into the model, we define a set of pathway activities from microarray gene expression data based on sparse probabilistic principal component analysis (SPPCA). A pathway-activity logistic regression model is then formulated for the cancer phenotype. To select pathway activities related to binary cancer phenotypes, we use the elastic net for parameter estimation and derive a model selection criterion for choosing the tuning parameters involved in the estimation. The proposed method can also reverse-engineer gene networks based on the identified pathways, which enables us to discover novel gene-gene associations related to the cancer phenotypes. We illustrate the whole procedure through an analysis of breast cancer gene expression data.
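A minimal end-to-end illustration of the pipeline is sketched below, with ordinary PCA standing in for sparse probabilistic PCA and made-up pathway gene sets; it is meant only to show how pathway activities can feed an elastic-net logistic regression, not to reproduce the proposed estimation and selection criteria.

```python
# Illustrative pipeline: per-pathway activities as first principal-component
# scores, then elastic-net logistic regression on a binary phenotype.
# Pathway gene sets and data are hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
expr = rng.standard_normal((100, 60))          # 100 samples x 60 genes
phenotype = rng.integers(0, 2, size=100)       # binary cancer phenotype
pathways = {"pathway_A": range(0, 20),         # hypothetical gene sets
            "pathway_B": range(20, 45),
            "pathway_C": range(45, 60)}

activity = np.column_stack([
    PCA(n_components=1).fit_transform(expr[:, list(genes)]).ravel()
    for genes in pathways.values()
])

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(activity, phenotype)
print(dict(zip(pathways, clf.coef_.ravel())))  # nonzero = selected pathways
```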

8.
1. When the growth curves of many individuals are collected, orderly variation in the curves is often observed rather than a completely random mixture of curves. Small individuals may show similar growth curves that differ from those of large individuals, with the curves varying gradually from small to large individuals. It has been recognized that if all curves coincide after standardization by their asymptotes (an anamorphic growth curve set), the set can be estimated from nonchronological data, whereas if they do not coincide after such standardization (a polymorphic growth curve set), this estimation has been considered infeasible. However, because a given set of growth curves determines the variation in the observed data, it may be possible to estimate polymorphic growth curve sets from nonchronological data as well.
2. In this study, we developed an estimation method by deriving the likelihood function for polymorphic growth curve sets. The method involves simple maximum likelihood estimation. Weighted nonlinear regression and the least-squares method after log-transformation of anamorphic growth curve sets are included as special cases.
3. The growth curve sets for the height of cypress (Chamaecyparis obtusa) and larch (Larix kaempferi) trees were estimated. Using model selection based on the AIC and likelihood ratio tests, the growth curve set for cypress was found to be polymorphic, whereas that for larch was anamorphic. The improved fit of the polymorphic model for cypress is due to its resolving underdispersion (less dispersion in the real data than the model predicts).
4. The likelihood function for model estimation depends not only on the distributional form of the asymptotes but also on the definition of the growth curve set. These factors may need to be considered even when environmental explanatory variables and random effects are introduced.

9.
Longitudinal data usually consist of a number of short time series. One or more groups of subjects are followed over time, and observations are often taken at unequally spaced time points that may differ across subjects. When the errors and random effects are Gaussian, the likelihood of these unbalanced linear mixed models can be calculated directly, and nonlinear optimization used to obtain maximum likelihood estimates of the fixed regression coefficients and of the variance component parameters. For binary longitudinal data, a two-state, non-homogeneous continuous-time Markov process approach is used to model serial correlation within subjects. Formulating the model as a continuous-time Markov process allows the observations to be equally or unequally spaced. Fixed and time-varying covariates can be included in the model, and the continuous-time formulation allows estimation of the odds ratio for an exposure variable based on the steady-state distribution. Exact likelihoods can be calculated. The initial probability distribution for the first observation on each subject is estimated using a logistic regression that can involve covariates, and this estimation is embedded in the overall estimation. These models are applied to an intervention study designed to reduce children's sun exposure.
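A compact sketch of the two-state continuous-time Markov chain likelihood for irregularly spaced binary observations is shown below (homogeneous rates, no covariates; the time-varying covariates and the embedded logistic model for the initial state are omitted). The rates and visit times are illustrative.

```python
# Two-state CTMC log-likelihood for one subject observed at irregular times;
# transition probabilities come from the matrix exponential of the generator.
import numpy as np
from scipy.linalg import expm

def ctmc_loglik(log_rates, times, states, p0=0.5):
    """states: 0/1 sequence for one subject; p0 = P(state 1 at first visit)."""
    lam01, lam10 = np.exp(log_rates)
    Q = np.array([[-lam01, lam01], [lam10, -lam10]])
    ll = np.log(p0 if states[0] == 1 else 1.0 - p0)   # initial distribution
    for k in range(1, len(times)):
        P = expm(Q * (times[k] - times[k - 1]))       # transition matrix P(dt)
        ll += np.log(P[states[k - 1], states[k]])
    return ll

times = np.array([0.0, 0.4, 1.1, 2.5, 2.9])           # irregular visit times
states = np.array([0, 0, 1, 1, 0])
print(ctmc_loglik(np.log([0.8, 0.5]), times, states))

# steady-state probability of state 1 under these rates: lam01 / (lam01 + lam10)
print(0.8 / (0.8 + 0.5))
```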

10.
Meta-analysis is a statistical methodology for combining information from diverse sources so that a more reliable and efficient conclusion can be reached. It can be conducted either by synthesizing study-level summary statistics or by drawing inference from an overarching model for individual participant data (IPD) if available. The latter is often viewed as the “gold standard.” For random-effects models, however, it is not fully understood whether the use of IPD indeed gains efficiency over summary statistics. In this paper, we examine the relative efficiency of the two methods under a general likelihood inference setting. We show theoretically and numerically that summary-statistics-based analysis is at most as efficient as IPD analysis, provided that the random effects follow the Gaussian distribution and maximum likelihood estimation is used to obtain the summary statistics. More specifically, (i) the two methods are equivalent in an asymptotic sense; and (ii) summary-statistics-based inference can incur an appreciable loss of efficiency if the sample sizes are not sufficiently large. Our results are established under the assumption that the between-study heterogeneity parameter remains constant regardless of the sample sizes, which differs from a previous study. Our findings are confirmed by analyses of simulated data sets and a real-world study of alcohol interventions.
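For the summary-statistics route, the Gaussian random-effects likelihood can be maximized directly, as in the small sketch below; the study estimates and variances are made-up numbers, not data from the alcohol-intervention study.

```python
# Summary-statistics random-effects meta-analysis by maximum likelihood:
# study estimates y_i, within-study variances v_i, between-study variance tau^2.
import numpy as np
from scipy.optimize import minimize

y = np.array([0.32, 0.10, 0.45, 0.22, 0.05])   # study-level effect estimates
v = np.array([0.02, 0.05, 0.03, 0.01, 0.04])   # their squared standard errors

def negloglik(params):
    mu, log_tau2 = params
    s2 = v + np.exp(log_tau2)                   # total variance per study
    return 0.5 * np.sum(np.log(s2) + (y - mu) ** 2 / s2)

fit = minimize(negloglik, x0=[y.mean(), np.log(0.01)], method="Nelder-Mead")
mu_hat, tau2_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, tau2_hat)     # pooled effect and between-study heterogeneity
```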

11.
12.
Motivated by the spatial modeling of aberrant crypt foci (ACF) in colon carcinogenesis, we consider binary data with probabilities modeled as the sum of a nonparametric mean plus a latent Gaussian spatial process that accounts for short-range dependencies. The mean is modeled in a general way using regression splines. The mean function can be viewed as a fixed effect and is estimated with a penalty for regularization. With the latent process viewed as another random effect, the model becomes a generalized linear mixed model. In our motivating data set and other applications, the sample size is too large to easily accommodate maximum likelihood or restricted maximum likelihood (REML) estimation, so pairwise likelihood, a special case of composite likelihood, is used instead. We develop an asymptotic theory for models that are sufficiently general to be used in a wide variety of applications, including, but not limited to, the problem that motivated this work. The splines have penalty parameters that must converge to zero asymptotically; we derive theory for this along with a data-driven method for selecting the penalty parameter, a method that is shown in simulations to improve greatly upon standard devices such as likelihood cross-validation. Finally, we apply the methods to the data from our ACF experiment and discover an unexpected location for peak ACF formation.

13.
Xiang L  Yau KK  Van Hui Y  Lee AH 《Biometrics》2008,64(2):508-518
The k-component Poisson regression mixture with random effects is an effective model for describing heterogeneity in clustered count data arising from several latent subpopulations. However, residual maximum likelihood (REML) estimation of the regression coefficients and variance component parameters tends to be unstable and may result in misleading inferences in the presence of outliers or extreme contamination. In the literature, minimum Hellinger distance (MHD) estimation has been investigated as a way to obtain robust estimates for finite Poisson mixtures. This article develops a robust MHD estimation approach for k-component Poisson mixtures with normally distributed random effects. By applying the Gaussian quadrature technique to approximate the integrals involved in the marginal distribution, the marginal probability function of the k-component Poisson mixture with random effects can be approximated by a summation over a set of finite Poisson mixtures. A simulation study shows that the MHD estimates perform satisfactorily for data without outlying observations and outperform the REML estimates when the data are contaminated. An application to a data set of recurrent urinary tract infections (UTI) with random institution effects demonstrates the practical use of the robust MHD estimation method.
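A toy sketch of minimum Hellinger distance estimation for a two-component Poisson mixture without random effects is given below (the Gaussian-quadrature extension to random institution effects is not shown); the parameterization and simulated contamination are illustrative.

```python
# MHD estimation for a two-component Poisson mixture: minimize the squared
# Hellinger distance between the empirical and model pmfs. Illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def mhd_objective(params, counts):
    pi = 1.0 / (1.0 + np.exp(-params[0]))          # mixing proportion in (0,1)
    lam1, lam2 = np.exp(params[1:])                # component means > 0
    support = np.arange(counts.max() + 20)
    f_model = pi * poisson.pmf(support, lam1) + (1 - pi) * poisson.pmf(support, lam2)
    f_emp = np.bincount(counts, minlength=len(support)) / len(counts)
    # squared Hellinger distance between empirical and model pmfs
    return np.sum((np.sqrt(f_emp) - np.sqrt(f_model)) ** 2)

rng = np.random.default_rng(5)
counts = np.concatenate([rng.poisson(1.0, 700), rng.poisson(6.0, 300)])
counts[:5] = 60                                    # a handful of gross outliers

fit = minimize(mhd_objective, x0=[0.0, 0.0, 1.0], args=(counts,),
               method="Nelder-Mead")
pi_hat = 1.0 / (1.0 + np.exp(-fit.x[0]))
print(pi_hat, np.exp(fit.x[1:]))                   # estimates barely affected by outliers
```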

14.
In many observational studies, individuals are measured repeatedly over time, although not necessarily at a set of pre-specified occasions. Instead, individuals may be measured at irregular intervals, with those having a history of poorer health outcomes being measured with somewhat greater frequency and regularity. In this paper, we consider likelihood-based estimation of the regression parameters in marginal models for longitudinal binary data when the follow-up times are not fixed by design, but can depend on previous outcomes. In particular, we consider assumptions regarding the follow-up time process that result in the likelihood function separating into two components: one for the follow-up time process, the other for the outcome measurement process. The practical implication of this separation is that the follow-up time process can be ignored when making likelihood-based inferences about the marginal regression model parameters. That is, maximum likelihood (ML) estimation of the regression parameters relating the probability of success at a given time to covariates does not require that a model for the distribution of follow-up times be specified. However, to obtain consistent parameter estimates, the multinomial distribution for the vector of repeated binary outcomes must be correctly specified. In general, ML estimation requires specification of all higher-order moments and the likelihood for a marginal model can be intractable except in cases where the number of repeated measurements is relatively small. To circumvent these difficulties, we propose a pseudolikelihood for estimation of the marginal model parameters. The pseudolikelihood uses a linear approximation for the conditional distribution of the response at any occasion, given the history of previous responses. The appeal of this approximation is that the conditional distributions are functions of the first two moments of the binary responses only. When the follow-up times depend only on the previous outcome, the pseudolikelihood requires correct specification of the conditional distribution of the current outcome given the outcome at the previous occasion only. Results from a simulation study and a study of asymptotic bias are presented. Finally, we illustrate the main results using data from a longitudinal observational study that explored the cardiotoxic effects of doxorubicin chemotherapy for the treatment of acute lymphoblastic leukemia in children.

15.
In many observational studies, individuals are measured repeatedly over time, although not necessarily at a set of prespecified occasions. Instead, individuals may be measured at irregular intervals, with those having a history of poorer health outcomes being measured with somewhat greater frequency and regularity; i.e., those individuals with poorer health outcomes may have more frequent follow-up measurements and the intervals between their repeated measurements may be shorter. In this article, we consider estimation of regression parameters in models for longitudinal data where the follow-up times are not fixed by design but can depend on previous outcomes. In particular, we focus on general linear models for longitudinal data where the repeated measures are assumed to have a multivariate Gaussian distribution. We consider assumptions regarding the follow-up time process that result in the likelihood function separating into two components: one for the follow-up time process, the other for the outcome process. The practical implication of this separation is that the former process can be ignored when making likelihood-based inferences about the latter; i.e., maximum likelihood (ML) estimation of the regression parameters relating the mean of the longitudinal outcomes to covariates does not require that a model for the distribution of follow-up times be specified. As a result, standard statistical software, e.g., SAS PROC MIXED (Littell et al., 1996, SAS System for Mixed Models), can be used to analyze the data. However, we also demonstrate that misspecification of the model for the covariance among the repeated measures will, in general, result in regression parameter estimates that are biased. Furthermore, results of a simulation study indicate that the potential bias due to misspecification of the covariance can be quite considerable in this setting. Finally, we illustrate these results using data from a longitudinal observational study (Lipshultz et al., 1995, New England Journal of Medicine 332, 1738-1743) that explored the cardiotoxic effects of doxorubicin chemotherapy for the treatment of acute lymphoblastic leukemia in children.
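As a hedged Python analogue of the SAS PROC MIXED analysis mentioned above, the sketch below fits a Gaussian linear mixed model by REML while ignoring the follow-up time process; the simulated data and the random-intercept covariance structure are illustrative assumptions, and, as the article stresses, misspecifying the covariance can bias the regression estimates.

```python
# Linear mixed model fit by REML with irregular visit times; the follow-up
# time process is ignored. Random-intercept structure and data are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
rows = []
for subj in range(100):
    b = rng.normal(0, 1)                                      # subject random intercept
    t = np.sort(rng.uniform(0, 5, size=rng.integers(3, 8)))   # irregular visit times
    y = 2.0 + 0.6 * t + b + rng.normal(0, 0.5, size=t.size)
    rows += [{"id": subj, "time": ti, "y": yi} for ti, yi in zip(t, y)]
data = pd.DataFrame(rows)

model = smf.mixedlm("y ~ time", data, groups=data["id"])      # random intercepts
print(model.fit(reml=True).summary())
```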

16.
Pepe MS  Cai T  Longton G 《Biometrics》2006,62(1):221-229
No single biomarker for cancer is considered adequately sensitive and specific for cancer screening. It is expected that the results of multiple markers will need to be combined in order to yield adequately accurate classification. Typically, the objective function that is optimized for combining markers is the likelihood function. In this article, we consider an alternative objective function: the area under the empirical receiver operating characteristic curve (AUC). We note that it yields consistent estimates of the parameters in a generalized linear model for the risk score but does not require specifying the link function. Like logistic regression, it yields consistent estimation with case-control or cohort data. Simulation studies suggest that AUC-based classification scores perform comparably with logistic likelihood-based scores when the logistic regression model holds. Analysis of data from a proteomics biomarker study shows that performance can be far superior to logistic-regression-derived scores when the logistic regression model does not hold. Model fitting by maximizing the AUC rather than the likelihood should be considered when the goal is to derive a marker combination score for classification or prediction.
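The AUC-maximization idea can be prototyped by smoothing the indicator in the empirical AUC with a sigmoid and optimizing the linear combination directly, as in the sketch below; fixing the first coefficient at 1 for identifiability, the bandwidth, and the simulated markers are assumptions, not the authors' exact procedure.

```python
# Fit a linear marker combination by maximizing a smoothed empirical AUC
# (a sigmoid stands in for the indicator). Simulated two-marker data.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from sklearn.metrics import roc_auc_score

def neg_smooth_auc(theta, Xd, Xn, h=0.1):
    beta = np.concatenate(([1.0], theta))          # first coefficient fixed at 1
    sd, sn = Xd @ beta, Xn @ beta                  # case and control risk scores
    # smoothed AUC: mean over case-control pairs of sigmoid((case - control) / h)
    return -np.mean(expit((sd[:, None] - sn[None, :]) / h))

rng = np.random.default_rng(7)
Xn = rng.multivariate_normal([0.0, 0.0], [[1, 0.3], [0.3, 1]], size=300)    # controls
Xd = rng.multivariate_normal([0.8, 0.5], [[1, 0.3], [0.3, 1]], size=200)    # cases

fit = minimize(neg_smooth_auc, x0=[0.0], args=(Xd, Xn), method="Nelder-Mead")
beta_hat = np.concatenate(([1.0], fit.x))
scores = np.concatenate([Xd @ beta_hat, Xn @ beta_hat])
labels = np.concatenate([np.ones(len(Xd)), np.zeros(len(Xn))])
print(beta_hat, roc_auc_score(labels, scores))     # combination and its empirical AUC
```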

17.
Joint regression analysis of correlated data using Gaussian copulas
Song PX  Li M  Yuan Y 《Biometrics》2009,65(1):60-68
This article concerns a new joint modeling approach for correlated data analysis. Utilizing Gaussian copulas, we present unified and flexible machinery to integrate separate one-dimensional generalized linear models (GLMs) into a joint regression analysis of continuous, discrete, and mixed correlated outcomes. This essentially leads to a multivariate analogue of univariate GLM theory and hence an efficiency gain in the estimation of regression coefficients. The availability of joint probability models enables us to develop full maximum likelihood inference. Numerical illustrations focus on regression models for discrete correlated data, including multidimensional logistic regression models and a joint model for mixed normal and binary outcomes. In the simulation studies, the proposed copula-based joint model is compared with the popular generalized estimating equations approach, a moment-based estimating equation method for joining univariate GLMs. Two real-world data examples are used for illustration.
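A small simulation sketch of the Gaussian-copula construction is given below: a continuous and a binary outcome are joined through a latent bivariate normal so that each margin keeps its own model while the copula carries the dependence. Only the data-generating side is shown; the article's maximum likelihood machinery is not reproduced, and the marginal parameters are illustrative.

```python
# Simulate a continuous and a binary outcome linked by a Gaussian copula:
# each margin keeps its own distribution, the latent correlation carries
# the dependence. Marginal parameters are illustrative.
import numpy as np
from scipy.stats import norm

def simulate_mixed(n, rho, mu=1.0, sigma=2.0, p=0.3, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([0.0, 0.0], [[1, rho], [rho, 1]], size=n)
    y_cont = mu + sigma * z[:, 0]                  # Gaussian margin
    y_bin = (norm.cdf(z[:, 1]) < p).astype(int)    # Bernoulli(p) margin
    return y_cont, y_bin

y_cont, y_bin = simulate_mixed(n=20000, rho=0.6)
print(y_cont.mean(), y_bin.mean())                 # margins close to (1.0, 0.3)
print(np.corrcoef(y_cont, y_bin)[0, 1])            # dependence induced by the copula
```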

18.
Zero-truncated data arise in various disciplines where counts are observed but the zero-count category cannot be observed during sampling. Maximum likelihood estimation can be used to model these data; however, owing to the nonstandard form of the truncated likelihood, it cannot be easily implemented using well-known software packages, and additional programming is often required. Motivated by the Rao–Blackwell theorem, we develop a weighted partial likelihood approach to estimate model parameters for zero-truncated binomial and Poisson data. The resulting estimating function is equivalent to a weighted score function for standard count data models, which allows readily available software to be applied. We evaluate the efficiency of this new approach and show that it performs almost as well as maximum likelihood estimation. The weighted partial likelihood approach is then extended to regression modelling and variable selection. We examine the performance of the proposed methods through simulations and present two case studies using real data.
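For reference, straight maximum likelihood for zero-truncated Poisson counts (the nonstandard likelihood that the weighted partial likelihood is designed to avoid programming by hand) takes only a few lines in the intercept-only case; the rate and simulated data below are illustrative.

```python
# Maximum likelihood for zero-truncated Poisson counts, intercept-only case:
# log P(X = x | X > 0) = log pmf(x; lam) - log(1 - exp(-lam)).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

def zt_negloglik(log_lam, x):
    lam = np.exp(log_lam)
    return -np.sum(poisson.logpmf(x, lam) - np.log1p(-np.exp(-lam)))

rng = np.random.default_rng(8)
x = rng.poisson(2.5, size=5000)
x = x[x > 0]                                       # zeros unobservable by design

fit = minimize_scalar(lambda t: zt_negloglik(t, x), bounds=(-3, 3), method="bounded")
print(np.exp(fit.x))                               # close to the true rate 2.5
```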

19.
Gu X 《Genetics》2004,167(1):531-542
Microarray technology has produced massive expression data that are invaluable for investigating the genome-wide evolutionary pattern of gene expression. To this end, phylogenetic expression analysis is highly desirable. On the basis of the Brownian process, we developed a statistical framework (called the E(0) model) that assumes expression evolution is independent between lineages. Several evolutionary mechanisms are integrated to characterize the pattern of expression diversity after gene duplications, including gradual drift and dramatic shift (punctuated equilibrium). When the phylogeny of a gene family is given, we show that the likelihood is that of a multivariate normal distribution whose variance-covariance matrix is determined by the phylogenetic topology and the evolutionary parameters. Maximum-likelihood methods for multiple microarray experiments are developed, and likelihood-ratio tests are designed for testing the evolutionary pattern of gene expression. To reconstruct the evolutionary trace of expression diversity after gene (or genome) duplications, we developed a Bayesian-based method that uses the posterior mean as the predictor. Potential applications in evolutionary genomics are discussed.
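A skeletal sketch of the Brownian-motion likelihood on a known phylogeny appears below: the covariance between two genes is the Brownian rate times their shared branch length from the root. The three-gene tree, expression values, and omission of the shift (punctuated) component are illustrative simplifications of the E(0) framework.

```python
# Brownian-motion likelihood on a fixed tree: covariance = sigma^2 * shared
# branch length from the root; fit sigma^2 by maximum likelihood.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import minimize_scalar

# shared branch lengths for a 3-gene tree ((g1, g2), g3); total depth = 1
shared = np.array([[1.0, 0.6, 0.0],
                   [0.6, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

def negloglik(log_sigma2, x, root):
    cov = np.exp(log_sigma2) * shared
    return -multivariate_normal(mean=np.full(3, root), cov=cov).logpdf(x).sum()

x = np.array([[0.4, 0.5, -0.8],      # expression values at the tips,
              [1.1, 0.9, 0.2]])      # one row per microarray experiment
fit = minimize_scalar(lambda t: negloglik(t, x, root=0.0),
                      bounds=(-5, 5), method="bounded")
print(np.exp(fit.x))                 # ML estimate of the Brownian rate sigma^2
```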

20.
Peng Y  Dear KB 《Biometrics》2000,56(1):237-243
Nonparametric methods have attracted less attention than their parametric counterparts for cure rate analysis. In this paper, we study a general nonparametric mixture model. The proportional hazards assumption is employed in modeling the effect of covariates on the failure time of patients who are not cured. The EM algorithm, the marginal likelihood approach, and multiple imputations are employed to estimate parameters of interest in the model. This model extends models and improves estimation methods proposed by other researchers. It also extends Cox's proportional hazards regression model by allowing a proportion of event-free patients and investigating covariate effects on that proportion. The model and its estimation method are investigated by simulations. An application to breast cancer data, including comparisons with previous analyses using a parametric model and an existing nonparametric model by other researchers, confirms the conclusions from the parametric model but not those from the existing nonparametric model.
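For orientation, the sketch below fits the simplest fully parametric mixture cure model (exponential latency, no covariates) by maximum likelihood with right censoring; this is a stand-in for intuition only, not the article's semiparametric EM and marginal likelihood approach, and the simulated data are illustrative.

```python
# Parametric mixture cure model: S_pop(t) = pi + (1 - pi) * exp(-lam * t),
# fit by maximum likelihood with right-censored data. Illustrative only.
import numpy as np
from scipy.optimize import minimize

def negloglik(params, t, event):
    pi = 1.0 / (1.0 + np.exp(-params[0]))          # cured fraction
    lam = np.exp(params[1])                        # hazard for the uncured
    f = (1 - pi) * lam * np.exp(-lam * t)          # density contribution (events)
    S = pi + (1 - pi) * np.exp(-lam * t)           # population survival (censored)
    return -np.sum(event * np.log(f) + (1 - event) * np.log(S))

rng = np.random.default_rng(9)
n = 1000
cured = rng.random(n) < 0.3
t_event = np.where(cured, np.inf, rng.exponential(1 / 0.5, n))
t_cens = rng.uniform(0, 8, n)
t = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

fit = minimize(negloglik, x0=[0.0, 0.0], args=(t, event), method="Nelder-Mead")
print(1 / (1 + np.exp(-fit.x[0])), np.exp(fit.x[1]))   # cure fraction, hazard
```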
