Similar literature
Found 20 similar documents (search time: 0 ms)
1.
Microbiome data are characterized by several aspects that make them challenging to analyse statistically: they are compositional, high dimensional and rich in zeros. A large array of statistical methods exists to analyse these data. Some are borrowed from other fields, such as ecology or RNA-sequencing, while others are custom-made for microbiome data. The large and continuously expanding range of available methods means that researchers have to invest considerable effort in choosing which method(s) to apply. In this paper we list 14 statistical methods or approaches that we think should be generally avoided. In several cases this is because we believe the assumptions behind the method are unlikely to be met for microbiome data. In other cases we see methods that are used in ways they are not intended to be used. We believe researchers would be helped by more critical evaluations of existing methods, as not all methods in use are suitable or have been sufficiently reviewed. We hope this paper contributes to a critical discussion on what methods are appropriate to use in the analysis of microbiome data.

2.
Sufficient dimension reduction (SDR) that effectively reduces the predictor dimension in regression has been popular in high‐dimensional data analysis. In the presence of censoring, however, most existing SDR methods suffer. In this article, we propose a new algorithm to perform SDR with censored responses based on the quantile‐slicing scheme recently proposed by Kim et al. First, we estimate the conditional quantile function of the true survival time via the censored kernel quantile regression (Shin et al.) and then slice the data based on the estimated censored regression quantiles instead of the responses. Both simulated and real data analyses demonstrate promising performance of the proposed method.

3.
4.
The analysis of global gene expression data from microarrays is breaking new ground in genetics research, while confronting modelers and statisticians with many critical issues. In this paper, we consider data sets in which a categorical or continuous response is recorded, along with gene expression, on a given number of experimental samples. Data of this type are usually employed to create a prediction mechanism for the response based on gene expression, and to identify a subset of relevant genes. This defines a regression setting characterized by a dramatic under-resolution with respect to the predictors (genes), whose number exceeds by orders of magnitude the number of available observations (samples). We present a dimension reduction strategy that, under appropriate assumptions, allows us to restrict attention to a few linear combinations of the original expression profiles, and thus to overcome under-resolution. These linear combinations can then be used to build and validate a regression model with standard techniques. Moreover, they can be used to rank original predictors, and ultimately to select a subset of them through comparison with a background 'chance scenario' based on a number of independent randomizations. We apply this strategy to publicly available data on leukemia classification.

5.
Conventional canonical correlation analysis (CCA) measures the association between two datasets and identifies relevant contributors. However, it encounters issues with execution and interpretation when the sample size is smaller than the number of variables or there are more than two datasets. Our motivating example is a stroke-related clinical study on pigs. The data are multimodal and consist of measurements taken at multiple time points and have many more variables than observations. This study aims to uncover important biomarkers and stroke recovery patterns based on physiological changes. To address the issues in the data, we develop two sparse CCA methods for multiple datasets. Various simulated examples are used to illustrate and contrast the performance of the proposed methods with that of the existing methods. In analyzing the pig stroke data, we apply the proposed sparse CCA methods along with dimension reduction techniques, interpret the recovery patterns, and identify influential variables in recovery.

6.
Horton NJ, Laird NM. Biometrics 2001, 57(1):34-42
This article presents a new method for maximum likelihood estimation of logistic regression models with incomplete covariate data where auxiliary information is available. This auxiliary information is extraneous to the regression model of interest but predictive of the covariate with missing data. Ibrahim (1990, Journal of the American Statistical Association 85, 765-769) provides a general method for estimating generalized linear regression models with missing covariates using the EM algorithm that is easily implemented when there is no auxiliary data. Vach (1997, Statistics in Medicine 16, 57-72) describes how the method can be extended when the outcome and auxiliary data are conditionally independent given the covariates in the model. The proposed method allows the incorporation of auxiliary data without making the conditional independence assumption. We suggest tests of conditional independence and compare the performance of several estimators in an example concerning mental health service utilization in children. Using an artificial dataset, we compare the performance of several estimators when auxiliary data are available.

7.
8.
Varying‐coefficient models have become a common tool to determine whether and how the association between an exposure and an outcome changes over a continuous measure. These models are complicated when the exposure itself is time‐varying and subjected to measurement error. For example, it is well known that longitudinal physical fitness has an impact on cardiovascular disease (CVD) mortality. It is not known, however, how the effect of longitudinal physical fitness on CVD mortality varies with age. In this paper, we propose a varying‐coefficient generalized odds rate model that allows flexible estimation of age‐modified effects of longitudinal physical fitness on CVD mortality. In our model, the longitudinal physical fitness is measured with error and modeled using a mixed‐effects model, and its associated age‐varying coefficient function is represented by cubic B‐splines. An expectation‐maximization algorithm is developed to estimate the parameters in the joint models of longitudinal physical fitness and CVD mortality. A modified pseudoadaptive Gaussian‐Hermite quadrature method is adopted to compute the integrals with respect to random effects involved in the E‐step. The performance of the proposed method is evaluated through extensive simulation studies and is further illustrated with an application to cohort data from the Aerobic Center Longitudinal Study.

9.
Nematodes play an important role in ecosystem processes, yet the relevance of nematode species diversity to ecology is unknown. Because nematode identification of all individuals at the species level using standard techniques is difficult and time-consuming, nematode communities are not resolved down to the species level, leaving ecological analysis ambiguous. We assessed the suitability of massively parallel sequencing for analysis of nematode diversity from metagenomic samples. We set up four artificial metagenomic samples involving 41 diverse reference nematodes in known abundances. Two samples came from pooling polymerase chain reaction products amplified from single nematode species. Two additional metagenomic samples consisted of amplified products of DNA extracted from pooled nematode species. Amplified products involved two rapidly evolving ~400-bp sections coding for the small and large subunit of rRNA. The total number of reads ranged from 4159 to 14771 per metagenomic sample. Of these, 82% were > 199 bp in length. Among the reads > 199 bp, 86% matched the referenced species with less than three nucleotide differences from a reference sequence. Although neither rDNA section recovered all nematode species, the use of both loci improved the detection level of nematode species from 90 to 97%. Overall, results support the suitability of massively parallel sequencing for identification of nematodes. In contrast, the frequency of reads representing individual species did not correlate with the number of individuals in the metagenomic samples, suggesting that further methodological work is necessary before it will be justified for inferring the relative abundances of species within a nematode community.

10.
The discovery of rare genetic variants through next generation sequencing is a very challenging issue in the field of human genetics. We propose a novel region‐based statistical approach based on a Bayes Factor (BF) to assess evidence of association between a set of rare variants (RVs) located on the same genomic region and a disease outcome in the context of case‐control design. Marginal likelihoods are computed under the null and alternative hypotheses assuming a binomial distribution for the RV count in the region and a beta or mixture of Dirac and beta prior distribution for the probability of RV. We derive the theoretical null distribution of the BF under our prior setting and show that a Bayesian control of the false discovery rate can be obtained for genome‐wide inference. Informative priors are introduced using prior evidence of association from a Kolmogorov‐Smirnov test statistic. We use our simulation program, sim1000G, to generate RV data similar to the 1000 Genomes sequencing project. Our simulation studies showed that the new BF statistic outperforms standard methods (SKAT, SKAT‐O, Burden test) in case‐control studies with moderate sample sizes and is equivalent to them under large sample size scenarios. Our real data application to a lung cancer case‐control study found enrichment for RVs in known and novel cancer genes. It also suggests that using the BF with informative prior improves the overall gene discovery compared to the BF with noninformative prior.
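
As a rough illustration of the kind of computation this abstract describes, the sketch below computes a Bayes factor for a region-level count in cases versus controls under a binomial likelihood with a conjugate Beta(a, b) prior. It is a generic two-sample beta-binomial Bayes factor, not the authors' statistic: it leaves out the Dirac and beta mixture prior and the informative priors, and the function name `log_bf_beta_binomial`, the argument names and the default hyperparameters are assumptions made for this example.

```python
from math import lgamma


def log_beta(a: float, b: float) -> float:
    """Log of the beta function B(a, b), via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bf_beta_binomial(x_case, n_case, x_ctrl, n_ctrl, a=1.0, b=1.0):
    """Log Bayes factor of H1 (separate rates) against H0 (one shared rate).

    x_* is the count of rare-variant events out of n_* opportunities in each
    group (hypothetical inputs). Both hypotheses use a Beta(a, b) prior, so
    each marginal likelihood is beta-binomial; the binomial coefficients are
    identical under H0 and H1 and cancel in the ratio.
    """
    if not (0 <= x_case <= n_case and 0 <= x_ctrl <= n_ctrl):
        raise ValueError("counts must satisfy 0 <= x <= n in each group")
    if a <= 0 or b <= 0:
        raise ValueError("beta hyperparameters must be positive")

    # H1: cases and controls each get an independent rate.
    log_m1 = (
        log_beta(a + x_case, b + n_case - x_case)
        + log_beta(a + x_ctrl, b + n_ctrl - x_ctrl)
        - 2.0 * log_beta(a, b)
    )
    # H0: a single rate is shared by the pooled sample.
    x_all, n_all = x_case + x_ctrl, n_case + n_ctrl
    log_m0 = log_beta(a + x_all, b + n_all - x_all) - log_beta(a, b)
    return log_m1 - log_m0


if __name__ == "__main__":
    # Similar counts in both groups favour H0 (negative log BF);
    # a strong excess in cases favours H1 (positive log BF).
    print(log_bf_beta_binomial(5, 100, 5, 100))
    print(log_bf_beta_binomial(30, 100, 2, 100))
```

Working on the log scale avoids overflow in the beta functions when counts are large.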

11.
This paper presents procedures for implementing the EM algorithm to compute REML estimates of variance-covariance components in Gaussian mixed models for longitudinal data analysis. The class of models considered includes random coefficient factors, stationary time processes and measurement errors. The EM algorithm allows separation of the computations pertaining to parameters involved in the random coefficient factors from those pertaining to the time processes and errors. The procedures are illustrated with Potthoff and Roy's data example on growth measurements taken on 11 girls and 16 boys at four ages. Several variants and extensions are discussed.

12.
13.
14.
Clustered interval‐censored data commonly arise in many studies of biomedical research where the failure time of interest is subject to interval‐censoring and subjects are correlated for being in the same cluster. A new semiparametric frailty probit regression model is proposed to study covariate effects on the failure time by accounting for the intracluster dependence. Under the proposed normal frailty probit model, the marginal distribution of the failure time is a semiparametric probit model, the regression parameters can be interpreted as both the conditional covariate effects given frailty and the marginal covariate effects up to a multiplicative constant, and the intracluster association can be summarized by two nonparametric measures in simple and explicit form. A fully Bayesian estimation approach is developed based on the use of monotone splines for the unknown nondecreasing function and a data augmentation using normal latent variables. The proposed Gibbs sampler is straightforward to implement since all unknowns have standard form in their full conditional distributions. The proposed method performs very well in estimating the regression parameters as well as the intracluster association, and the method is robust to frailty distribution misspecifications as shown in our simulation studies. Two real‐life data sets are analyzed for illustration.

15.
Wang L, Zhou J, Qu A. Biometrics 2012, 68(2):353-360
We consider the penalized generalized estimating equations (GEEs) for analyzing longitudinal data with high-dimensional covariates, which often arise in microarray experiments and large-scale health studies. Existing high-dimensional regression procedures often assume independent data and rely on the likelihood function. Construction of a feasible joint likelihood function for high-dimensional longitudinal data is challenging, particularly for correlated discrete outcome data. The penalized GEE procedure only requires specifying the first two marginal moments and a working correlation structure. We establish the asymptotic theory in a high-dimensional framework where the number of covariates p(n) increases as the number of clusters n increases, and p(n) can reach the same order as n. One important feature of the new procedure is that the consistency of model selection holds even if the working correlation structure is misspecified. We evaluate the performance of the proposed method using Monte Carlo simulations and demonstrate its application using a yeast cell-cycle gene expression data set.

16.
Alzheimer's disease (AD) is a common and complex neurodegenerative disease. Age at onset (AAO) of AD is an important component phenotype with a genetic basis, and identification of genes in which variation affects AAO would contribute to identification of factors that affect timing of onset. Increase in AAO through prevention or therapeutic measures would have enormous benefits by delaying AD and its associated morbidities. In this paper, we performed a family‐based genome‐wide association study for AAO of late‐onset AD in whole exome sequence data generated in multigenerational families with multiple AD cases. We conducted single marker and gene‐based burden tests for common and rare variants, respectively. We combined association analyses with variance component linkage analysis, and with reference to prior studies, in order to enhance evidence of the identified genes. For variants and genes implicated by the association study, we performed a gene‐set enrichment analysis to identify potential novel pathways associated with AAO of AD. We found statistically significant association with AAO for three genes (WRN, NTN4 and LAMC3) with common associated variants, and for four genes (SLC8A3, SLC19A3, MADD and LRRK2) with multiple rare‐associated variants that have a plausible biological function related to AD. The genes we have identified are in pathways that are strong candidates for involvement in the development of AD pathology and may lead to a better understanding of AD pathogenesis.

17.
18.
Yi GY, He W. Biometrics 2009, 65(2):618-625
Recently, median regression models have received increasing attention. When continuous responses follow a distribution that is quite different from a normal distribution, usual mean regression models may fail to produce efficient estimators whereas median regression models may perform satisfactorily. In this article, we discuss using median regression models to deal with longitudinal data with dropouts. Weighted estimating equations are proposed to estimate the median regression parameters for incomplete longitudinal data, where the weights are determined by modeling the dropout process. Consistency and the asymptotic distribution of the resultant estimators are established. The proposed method is used to analyze a longitudinal data set arising from a controlled trial of HIV disease (Volberding et al., 1990, The New England Journal of Medicine 322, 941–949). Simulation studies are conducted to assess the performance of the proposed method under various situations. An extension to estimation of the association parameters is outlined.

19.
20.
Motivated by investigating the relationship between progesterone and the days in a menstrual cycle in a longitudinal study, we propose a multikink quantile regression model for longitudinal data analysis. It relaxes the linearity condition and assumes different regression forms in different regions of the domain of the threshold covariate. In this paper, we first propose a multikink quantile regression for longitudinal data. Two estimation procedures are proposed to estimate the regression coefficients and the kink point locations: one is a computationally efficient profile estimator under the working independence framework while the other one considers the within-subject correlations by using the unbiased generalized estimating equation approach. The selection consistency of the number of kink points and the asymptotic normality of the two proposed estimators are established. Second, we construct a rank score test based on partial subgradients for the existence of the kink effect in longitudinal studies. Both the null distribution and the local alternative distribution of the test statistic have been derived. Simulation studies show that the proposed methods have excellent finite sample performance. In the application to the longitudinal progesterone data, we identify two kink points in the progesterone curves over different quantiles and observe that the progesterone level remains stable before the day of ovulation, increases quickly over the 5 to 6 days after ovulation, and then stabilizes again or drops slightly.
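
To make the model shape concrete, the sketch below evaluates a continuous piecewise-linear ("multikink") curve and the quantile check loss that a quantile regression fit would minimize. It assumes the common parameterization in which the slope changes by a fixed amount at each kink point; the abstract does not give the authors' exact specification, and this example omits additional covariates, within-subject correlation and the estimation of the kink locations. All function names and numeric values are hypothetical.

```python
def check_loss(u: float, tau: float) -> float:
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def multikink_predict(x, intercept, slope, kinks, slope_changes):
    """Continuous piecewise-linear curve in x.

    The slope starts at `slope` and changes by slope_changes[k] once x
    passes kinks[k], so the curve bends at each kink but never jumps.
    """
    if len(kinks) != len(slope_changes):
        raise ValueError("need exactly one slope change per kink point")
    value = intercept + slope * x
    for kink, change in zip(kinks, slope_changes):
        value += change * max(x - kink, 0.0)
    return value


def mean_check_loss(xs, ys, tau, intercept, slope, kinks, slope_changes):
    """Average check loss of the multikink curve at quantile level tau."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = multikink_predict(x, intercept, slope, kinks, slope_changes)
        total += check_loss(y - fitted, tau)
    return total / len(xs)


if __name__ == "__main__":
    # Flat, then rising, then flat again: two kinks with opposite slope changes.
    kinks, changes = [14.0, 20.0], [2.0, -2.0]
    for day in (10.0, 17.0, 25.0):
        print(day, multikink_predict(day, 1.0, 0.0, kinks, changes))
```

With two kinks and slope changes of opposite sign, the curve reproduces the stable, rising, stable pattern the abstract reports for progesterone levels.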

