Similar literature
 20 similar articles found; search took 250 ms
1.
Summary. In surveys of natural populations of animals, a sampling protocol is often spatially replicated to collect a representative sample of the population. In these surveys, differences in abundance of animals among sample locations may induce spatial heterogeneity in the counts associated with a particular sampling protocol. For some species, the sources of heterogeneity in abundance may be unknown or unmeasurable, leading one to specify the variation in abundance among sample locations stochastically. However, choosing a parametric model for the distribution of unmeasured heterogeneity is potentially subject to error and can have profound effects on predictions of abundance at unsampled locations. In this article, we develop an alternative approach wherein a Dirichlet process prior is assumed for the distribution of latent abundances. This approach allows for uncertainty in model specification and for natural clustering in the distribution of abundances in a data-adaptive way. We apply this approach in an analysis of counts based on removal samples of an endangered fish species, the Okaloosa darter. Results of our data analysis and simulation studies suggest that our implementation of the Dirichlet process prior has several attractive features not shared by conventional, fully parametric alternatives.

2.
Advances in sequencing technologies and bioinformatics tools have vastly improved our ability to collect and analyze data from complex microbial communities. A major goal of microbiome studies is to correlate the overall microbiome composition with clinical or environmental variables. La Rosa et al. recently proposed a parametric test for comparing microbiome populations between two or more groups of subjects. However, this method is not applicable for testing the association between the community composition and a continuous variable. Although multivariate nonparametric methods based on permutations are widely used in ecology studies, they lack interpretability and can be inefficient for analyzing microbiome data. We consider the problem of testing for independence between the microbial community composition and a continuous or many-valued variable. By partitioning the range of the variable into a few slices, we formulate the problem as a problem of comparing multiple groups of microbiome samples, with each group indexed by a slice. To model multivariate and over-dispersed count data, we use the Dirichlet-multinomial distribution. We propose an adaptive likelihood-ratio test by learning a good partition or slicing scheme from the data. A dynamic programming algorithm is developed for numerical optimization. We demonstrate the superiority of the proposed test by numerically comparing it with that of La Rosa et al. and other popular approaches on the same topic including PERMANOVA, the distance covariance test, and the microbiome regression-based kernel association test. We further apply it to test the association of gut microbiome with age in three geographically distinct populations and show how the learned partition facilitates differential abundance analysis.
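
As a rough illustration of the Dirichlet-multinomial distribution mentioned in this abstract, the sketch below evaluates its log-probability for one vector of taxon counts using the standard closed form. The function names and the example parameters are hypothetical and are not taken from the paper; the adaptive slicing and likelihood-ratio machinery is not reproduced here.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive concentration parameters, one per taxon.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Normalising term: n! * Gamma(a) / Gamma(n + a)
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-taxon term: Gamma(x + alpha) / (x! * Gamma(alpha))
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(x + 1) - lgamma(al)
    return log_p


def dirichlet_multinomial_pmf(counts, alpha):
    """Probability (not log) of a count vector; convenient for small examples."""
    return exp(dirichlet_multinomial_logpmf(counts, alpha))
```

Small concentration parameters give heavier overdispersion relative to a plain multinomial, which is why this family is a common choice for microbiome counts.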

3.
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.

4.
A random sample is drawn from a distribution which admits a minimal sufficient statistic for the parameters. The Gibbs sampler is proposed to generate samples, called conditionally sufficient or co-sufficient samples, from the conditional distribution of the sample given its value of the sufficient statistic. The procedure is illustrated for the gamma distribution. Co-sufficient samples may be used to give exact tests of fit; for the gamma distribution these are compared for size and power with approximate tests based on the parametric bootstrap.

5.
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also raises the question of whether it is the presence rather than the abundance of particular taxa that is relevant for discrimination purposes, an aspect that has so far been overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa that is important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies.
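
To make the two kinds of input feature contrasted in this abstract concrete, the sketch below derives a relative-abundance profile and a presence/absence profile from the same vector of taxon counts. The function names and the detection threshold are illustrative assumptions; the abstract does not say how the authors binarised their data.

```python
def relative_abundance(counts):
    """Scale raw taxon counts so that they sum to 1 (all zeros stay zeros)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


def presence_absence(counts, threshold=0):
    """Binarise counts: 1 if a taxon is observed above `threshold`, else 0.

    threshold=0 is an assumed default; a study may prefer a higher cut-off
    to guard against spurious low-count detections.
    """
    return [1 if c > threshold else 0 for c in counts]
```

Either profile can then be passed to the same classifier, which is what makes a like-for-like comparison of the two representations possible.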

6.
Fisher's logseries is widely used to characterize species abundance patterns, and some previous studies used it to predict species richness. However, this model, derived from the negative binomial model, degenerates at the zero‐abundance point (i.e., its probability mass fully concentrates at zero abundance, leading to an odd situation that no species can occur in the studied sample). Moreover, it is not directly related to the sampling area size. In this sense, the original Fisher's alpha (correspondingly, species richness) is incomparable among ecological communities with varying area sizes. To overcome these limitations, we developed a novel area‐based logseries model that can account for the compounding effect of the sampling area. The new model can be used to conduct area‐based rarefaction and extrapolation of species richness, with the advantage of accurately predicting species richness in a large region whose area is hundreds or thousands of times larger than that of a locally observed sample, provided that data follow the proposed model. The power of our proposed model has been validated by extensive numerical simulations and empirically tested through tree species richness extrapolation and interpolation in Brazilian Atlantic forests. Our parametric model is data parsimonious as it is still applicable when only the information on species number, community size, or the numbers of singleton and doubleton species in the local sample is available. Notably, in comparison with the original Fisher's method, our area‐based model can provide asymptotically unbiased variance estimation (and therefore a correct 95% confidence interval) for species richness. In conclusion, the proposed area‐based Fisher's logseries model can be broadly applied and has a clear and proper statistical background. In particular, it is well suited to hyperdiverse ecological assemblages in which nonparametric richness estimators were found to greatly underestimate species richness.
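
For orientation, the original Fisher's logseries (not the area-based model proposed in this abstract) links the number of species S and the number of individuals N through S = α ln(1 + N/α). The sketch below solves that classical relation for α by bisection; the function names and tolerance are illustrative choices.

```python
import math


def expected_richness(alpha, n_individuals):
    """Expected number of species under the classical logseries relation."""
    return alpha * math.log(1.0 + n_individuals / alpha)


def fisher_alpha(n_species, n_individuals, tol=1e-10):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection.

    A solution exists only when 0 < S < N, because the right-hand side
    increases from 0 towards N as alpha grows.
    """
    if not 0 < n_species < n_individuals:
        raise ValueError("need 0 < n_species < n_individuals")
    lo, hi = 1e-9, 1.0
    # Expand the upper bracket until it overshoots the observed richness.
    while expected_richness(hi, n_individuals) < n_species:
        hi *= 2.0
    # Standard bisection on a monotonically increasing function.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_richness(mid, n_individuals) < n_species:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because this relation contains no area term, an α estimated this way is tied to the sampled plot, which is the comparability problem the abstract sets out to address.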

7.
Publication bias and related types of small-study effects threaten the validity of systematic reviews. The existence of small-study effects has been demonstrated in empirical studies. Small-study effects are graphically diagnosed by inspection of the funnel plot. Though observed funnel plot asymmetry cannot be easily linked to a specific reason, tests based on funnel plot asymmetry have been proposed. Beyond a vast range of funnel plot tests, there exist several methods for adjusting treatment effect estimates for these biases. In this article, we consider the trim-and-fill method, the Copas selection model, and more recent regression-based approaches. The methods are exemplified using a meta-analysis from the literature and compared in a simulation study, based on binary response data. They are also applied to a large set of meta-analyses. Some fundamental differences between the approaches are discussed. An assumption common to the trim-and-fill method and the Copas selection model is that the small-study effect is caused by selection. The trim-and-fill method corresponds to an unknown implicit model generated by the symmetry assumption, whereas the Copas selection model is a parametric statistical model. However, it requires a sensitivity analysis. Regression-based approaches are easier to implement and not based on a specific selection model. Both simulations and applications suggest that in the presence of strong selection both the trim-and-fill method and the Copas selection model may not fully eliminate bias, while regression-based approaches seem to be a promising alternative.

8.
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
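
The two normalizations criticised in this abstract can be sketched in a few lines. The code below is a minimal illustration, assuming rarefying means subsampling reads without replacement to a common depth and dropping samples that fall short of it; the function names are hypothetical, and it does not reproduce the phyloseq, edgeR, or DESeq implementations.

```python
import random


def proportions(counts):
    """Simple proportions: divide each taxon count by the library size."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample.

    Returns None when the library is smaller than `depth`, mirroring the
    way rarefying discards shallow samples entirely.
    """
    if depth > sum(counts):
        return None
    # Expand the counts into one entry per read, labelled by taxon index.
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied
```

Both steps throw information away: proportions drop the library size that governs each count's variance, and rarefying drops reads and whole samples.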

9.

Background

Using suitable error models for gene expression measurements is essential in the statistical analysis of microarray data. However, the true probabilistic model underlying gene expression intensity readings is generally not known. Instead, in currently used approaches some simple parametric model is assumed (usually a transformed normal distribution) or the empirical distribution is estimated. However, both these strategies may not be optimal for gene expression data, as the non-parametric approach ignores known structural information whereas the fully parametric models run the risk of misspecification. A further related problem is the choice of a suitable scale for the model (e.g. observed vs. log-scale).

Results

Here a simple semi-parametric model for gene expression measurement error is presented. In this approach inference is based on an approximate likelihood function (the extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In case of gene expression this information is available in the form of the postulated (e.g. quadratic) variance structure of the data. As the quasi-likelihood behaves (almost) like a proper likelihood, it allows for the estimation of calibration and variance parameters, and it is also straightforward to obtain corresponding approximate confidence intervals. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. both on the original linear scale as well as on a transformed scale. It can also be employed in regression approaches to model systematic (e.g. array or dye) effects.

Conclusions

The quasi-likelihood framework provides a simple and versatile approach to analyze gene expression data that does not make any strong distributional assumptions about the underlying error model. For several simulated as well as real data sets it provides a better fit to the data than competing models. In an example it also improved the power of tests to identify differential expression.

10.
Community assembly processes determine patterns of species distribution and abundance which are central to the ecology of microbiomes. When studying plant root microbiome assembly, it is typical to sample at the whole plant root system scale. However, sampling at these relatively large spatial scales may hinder the observability of intermediate processes. To study the relative importance of these processes, we employed millimetre-scale sampling of the cell elongation zone of individual roots. Both the rhizosphere and rhizoplane microbiomes were examined in fibrous and taproot model systems, represented by wheat and faba bean, respectively. Like others, we found that the plant root microbiome assembly is mainly driven by plant selection. However, based on variability between replicate millimetre-scale samples and comparisons with randomized null models, we infer that either priority effects during early root colonization or variable selection among replicate plant roots also determines root microbiome assembly.
Subject terms: Soil microbiology, Microbial ecology

11.
The human microbiome plays critical roles in human health and has been linked to many diseases. While advanced sequencing technologies can characterize the composition of the microbiome in unprecedented detail, it remains challenging to disentangle the complex interplay between human microbiome and disease risk factors due to the complicated nature of microbiome data. Excessive numbers of zero values, high dimensionality, the hierarchical phylogenetic tree and compositional structure are compounded and consequently make existing methods inadequate to appropriately address these issues. We propose a multivariate two-part zero-inflated logistic-normal model to analyze the association of disease risk factors with individual microbial taxa and overall microbial community composition. This approach can naturally handle excessive numbers of zeros and the compositional data structure with the discrete part and the logistic-normal part of the model. For parameter estimation, an estimating equations approach is employed that enables us to address the complex inter-taxa correlation structure induced by the hierarchical phylogenetic tree structure and the compositional data structure. This model is able to incorporate standard regularization approaches to deal with high dimensionality. Simulation shows that our model outperforms existing methods. Our approach is also compared to others using the analysis of real microbiome data.

12.
Mathematical Biosciences, 1986, 78(2): 247-263
Some clinical investigations of irreversible chronic diseases begin with a sample of healthy subjects and follow them through the process of disease onset and death. Researchers often ask if the presence of the disease diminishes a person's life expectancy. When parametric models are used for the analysis, hypotheses about the death rates for those people with and without the disease are usually tested by means of standard normal tests on the difference in the parameter estimates or by likelihood-ratio chi-square tests. This paper uses Monte Carlo simulation techniques to examine the precision of parameter estimates and the power of these tests for a model used for a previous study of senile dementia, Alzheimer's type.

13.

Background

The complex microbiome of the ceca of chickens plays an important role in nutrient utilization, growth and well-being of these animals. Since we have a very limited understanding of the capabilities of most species present in the cecum, we investigated the role of the microbiome by comparative analyses of both the microbial community structure and functional gene content using random sample pyrosequencing. The overall goal of this study was to characterize the chicken cecal microbiome using a pathogen-free chicken and one that had been challenged with Campylobacter jejuni.

Methodology/Principal Findings

Comparative metagenomic pyrosequencing was used to generate 55,364,266 bases of random sampled pyrosequence data from two chicken cecal samples. SSU rDNA gene tags and environmental gene tags (EGTs) were identified using SEED subsystems-based annotations. The distribution of phylotypes and EGTs detected within each cecal sample were primarily from the Firmicutes, Bacteroidetes and Proteobacteria, consistent with previous SSU rDNA libraries of the chicken cecum. Carbohydrate metabolism and virulence genes are major components of the EGT content of both of these microbiomes. A comparison of the twelve major pathways in the SEED Virulence Subsystem (metavirulome) represented in the chicken cecum, mouse cecum and human fecal microbiomes showed that the metavirulomes differed between these microbiomes and the metavirulomes clustered by host environment. The chicken cecum microbiomes had the broadest range of EGTs within the SEED Conjugative Transposon Subsystem, however the mouse cecum microbiomes showed a greater abundance of EGTs in this subsystem. Gene assemblies (32 contigs) from one microbiome sample were predominately from the Bacteroidetes, and seven of these showed sequence similarity to transposases, whereas the remaining sequences were most similar to those from catabolic gene families.

Conclusion/Significance

This analysis has demonstrated that mobile DNA elements are a major functional component of cecal microbiomes, thus contributing to horizontal gene transfer and functional microbiome evolution. Moreover, the metavirulomes of these microbiomes appear to associate by host environment. These data have implications for defining core and variable microbiome content in a host species. Furthermore, this suggests that the evolution of host specific metavirulomes is a contributing factor in disease resistance to zoonotic pathogens.

14.
We compare two models for the analysis of repeated ordinal categorical data: the classical parametric model for means of scores assigned to the categories of the response variable and a nonparametric model based on relative effects derived from the marginal distribution functions of the response. An example in the field of Dentistry is used to illustrate and to compare the models. We also consider a simulation study to evaluate the type‐I error rates and the power of tests under both models in a balanced design setup. The simulation results suggest that both approaches behave similarly for equally spaced scores but may perform differently otherwise. (© 2004 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

15.
Overdispersion is a common phenomenon in Poisson modeling, and the negative binomial (NB) model is frequently used to account for overdispersion. Testing approaches (Wald test, likelihood ratio test (LRT), and score test) for overdispersion in the Poisson regression versus the NB model are available. Because the generalized Poisson (GP) model is similar to the NB model, we consider the former as an alternate model for overdispersed count data. The score test has an advantage over the LRT and the Wald test in that the score test only requires that the parameter of interest be estimated under the null hypothesis. This paper proposes a score test for overdispersion based on the GP model and compares the power of the test with the LRT and Wald tests. A simulation study indicates that the score test based on the asymptotic standard Normal distribution is more appropriate in practical applications because of its higher empirical power; however, it underestimates the nominal significance level, especially in small sample situations. Examples illustrate the results of comparing the candidate tests between the Poisson and GP models. A bootstrap test is also proposed to adjust the underestimation of nominal level in the score statistic when the sample size is small. The simulation study indicates the bootstrap test has significance level closer to nominal size and has uniformly greater power than the score test based on asymptotic standard Normal distribution. From a practical perspective, if the score test gives even a weak indication that the Poisson model is inappropriate, say at the 0.10 significance level, we advise using the more accurate bootstrap procedure to test whether the GP model is more appropriate than the Poisson model. Finally, the Vuong test is illustrated to choose between GP and NB2 models for the same dataset.
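
The practical appeal of a score test, namely that only the null Poisson model has to be fitted, can be seen in a small sketch. The code below is not the GP-based statistic proposed in this abstract. It assumes an intercept-only Poisson model and uses the classical score statistic for a negative-binomial-type alternative, T = Σ[(y_i − μ̂)² − y_i] / sqrt(2 Σ μ̂²), with a one-sided asymptotic Normal p-value. The function name is hypothetical.

```python
import math


def overdispersion_score_test(y):
    """Score test for overdispersion in an intercept-only Poisson model.

    Returns (statistic, one_sided_p_value). Only the null-model estimate
    (the sample mean) is needed, which is the point of a score test.
    """
    n = len(y)
    mu = sum(y) / n  # Poisson maximum-likelihood estimate under the null
    if mu <= 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    numerator = sum((yi - mu) ** 2 - yi for yi in y)
    denominator = math.sqrt(2.0 * n * mu ** 2)
    statistic = numerator / denominator
    # Upper-tail probability of a standard Normal variate.
    p_value = 0.5 * math.erfc(statistic / math.sqrt(2.0))
    return statistic, p_value
```

The asymptotic p-value is the part the abstract reports as unreliable in small samples; a parametric bootstrap would instead recompute the statistic on data simulated from the fitted Poisson model.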

16.

With the increasing availability of microbiome 16S data, network estimation has become a useful approach to studying the interactions between microbial taxa. Network estimation on a set of variables is frequently explored using graphical models, in which the relationship between two variables is modeled via their conditional dependency given the other variables. Various methods for sparse inverse covariance estimation have been proposed to estimate graphical models in the high-dimensional setting, including graphical lasso. However, current methods do not address the compositional count nature of microbiome data, where abundances of microbial taxa are not directly measured, but are reflected by the observed counts in an error-prone manner. Adding to the challenge is that the sum of the counts within each sample, termed “sequencing depth,” is an experimental technicality that carries no biological information but can vary drastically across samples. To address these issues, we develop a new approach to network estimation, called BC-GLASSO (bias-corrected graphical lasso), which models the microbiome data using a logistic normal multinomial distribution with the sequencing depths explicitly incorporated, corrects the bias of the naive empirical covariance estimator arising from the heterogeneity in sequencing depths, and builds the inverse covariance estimator via graphical lasso. We demonstrate the advantage of BC-GLASSO over current approaches to microbial interaction network estimation under a variety of simulation scenarios. We also illustrate the efficacy of our method in an application to a human microbiome data set.


17.
We present two tests for seasonal trend in monthly incidence data. The first approach uses a penalized likelihood to choose the number of harmonic terms to include in a parametric harmonic model (which includes time trends and autoregression as well as seasonal harmonic terms) and then tests for seasonality using a parametric bootstrap test. The second approach uses a semiparametric regression model to test for seasonal trend. In the semiparametric model, the seasonal pattern is modeled nonparametrically, parametric terms are included for autoregressive effects and a linear time trend, and a parametric bootstrap test is used to test for seasonality. For both procedures, a null distribution is generated under a null Poisson model with time trends and autoregression parameters. We apply the methods to skin melanoma incidence rates collected by the surveillance, epidemiology, and end results (SEER) program of the National Cancer Institute, and perform simulation studies to evaluate the type I error rate and power for the two procedures. These simulations suggest that both procedures are alpha-level procedures. In addition, the harmonic model/bootstrap test had similar or larger power than the semiparametric model/bootstrap test for a wide range of alternatives, and the harmonic model/bootstrap test is much easier to implement. Thus, we recommend the harmonic model/bootstrap test for the analysis of seasonal incidence data.
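
The seasonal harmonic terms in such a model are usually sine and cosine pairs with a 12-month period. The sketch below builds one row of a design matrix with an intercept, a linear time trend, and a chosen number of harmonics. The function names and layout are illustrative assumptions; the autoregressive terms, the penalized choice of the number of harmonics, and the bootstrap test are not shown.

```python
import math


def harmonic_terms(month_index, n_harmonics, period=12):
    """Sine/cosine pairs for harmonics 1..n_harmonics at one time point."""
    row = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * math.pi * k * month_index / period
        row.extend([math.sin(angle), math.cos(angle)])
    return row


def design_row(month_index, n_harmonics):
    """Intercept, linear time trend, then the seasonal harmonic terms."""
    return [1.0, float(month_index)] + harmonic_terms(month_index, n_harmonics)
```

Testing for seasonality then amounts to asking whether the coefficients on the harmonic columns are jointly zero.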

18.

Thanks to advances in high-throughput sequencing technologies, the importance of microbiome to human health and disease has been increasingly recognized. Analyzing microbiome data from sequencing experiments is challenging due to their unique features such as compositional data, excessive zero observations, overdispersion, and complex relations among microbial taxa. Clustered microbiome data have become prevalent in recent years from designs such as longitudinal studies, family studies, and matched case–control studies. The within-cluster dependence compounds the challenge of the microbiome data analysis. Methods that properly accommodate intra-cluster correlation and features of the microbiome data are needed. We develop robust and powerful differential composition tests for clustered microbiome data. The methods do not rely on any distributional assumptions on the microbial compositions, which provides flexibility to model various correlation structures among taxa and among samples within a cluster. By leveraging the adjusted sandwich covariance estimate, the methods properly accommodate sample dependence within a cluster. The two-part version of the test can further improve power in the presence of excessive zero observations. Different types of confounding variables can be easily adjusted for in the methods. We perform extensive simulation studies under commonly adopted clustered data designs to evaluate the methods. We demonstrate that the methods properly control the type I error under all designs and are more powerful than existing methods in many scenarios. The usefulness of the proposed methods is further demonstrated with two real datasets from longitudinal microbiome studies on pregnant women and inflammatory bowel disease patients. The methods have been incorporated into the R package “miLineage” publicly available at https://tangzheng1.github.io/tanglab/software.html.


19.
The human microbiome, which includes the collective microbes residing in or on the human body, has a profound influence on human health. DNA sequencing technology has made large-scale human microbiome studies possible by using shotgun metagenomic sequencing. One important aspect of analyzing such metagenomic data is to quantify the bacterial abundances based on the metagenomic sequencing data. Existing methods almost always quantify such abundances one sample at a time, which ignores certain systematic differences in read coverage along the genomes due to GC content, copy number variation and the bacterial origin of replication. In order to account for such differences in read counts, we propose a multi-sample Poisson model to quantify microbial abundances based on read counts that are assigned to species-specific taxonomic markers. Our model takes into account the marker-specific effects when normalizing the sequencing count data in order to obtain more accurate quantification of the species abundances. Compared to currently available methods on simulated data and real data sets, our method has demonstrated improved accuracy in bacterial abundance quantification, which leads to more biologically interesting results from downstream data analysis.

20.