Similar articles
20 similar articles found.
1.
2.
3.
4.
5.
6.
16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions (from metabolic and immunological health in mammals to ecological stability in soils and oceans), identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from amplicon-based datasets are compositional: counts are normalized to the total number of counts in the sample. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation) for the detection of OTU-OTU relationships can lead to spurious results. Second, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU association networks is severely under-powered, and additional information (or assumptions) is required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological networks from amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark in the absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a set of computational tools to generate OTU count data from a set of diverse underlying network topologies. SPIEC-EASI outperforms state-of-the-art methods to recover edges and network properties on synthetic data under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbial associations using data from the American Gut project.
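
The abstract above mentions data transformations from compositional data analysis without naming one. A common choice in that field is the centered log-ratio (clr) transform, sketched below for a single sample. This is an illustration only, not code from SPIEC-EASI: the function name clr and the pseudocount added to avoid log(0) are assumptions, and the sparse inverse covariance step is not shown.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's OTU counts.

    `pseudocount` is an assumed convenience for zero counts, not a value
    taken from the abstract.
    """
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)        # log of the geometric mean
    return [v - mean_log for v in logs]     # log(value / geometric mean)

sample = [120, 30, 0, 850]                  # hypothetical OTU counts
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

The transformed values sum to zero, and without a pseudocount the result does not change when every count in the sample is multiplied by the same factor, which is why the transform suits data normalized to a sample total.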

7.
8.
Continuous proportional data are common in biomedical research, e.g., the pre‐post therapy percent change in certain physiological and molecular variables such as glomerular filtration rate, the expression level of a particular gene, or telomere length. As shown by Song and Tan (2000), such data require methods beyond the common generalised linear models. However, the original marginal simplex model of Song and Tan (2000) for such longitudinal continuous proportional data assumes a constant dispersion parameter. This assumption of dispersion homogeneity is imposed mainly for mathematical convenience and may be violated in some situations. For example, the dispersion may vary across drug treatment cohorts or follow‐up times. This paper extends their original model so that the heterogeneity of the dispersion parameter can be assessed and accounted for in order to conduct proper statistical inference for the model parameters. A simulation study demonstrates that statistical inference can be seriously affected by mistakenly assuming a varying dispersion parameter to be constant when applying the available GEE method. In addition, residual analysis is developed for checking various assumptions made in the modelling process, e.g., assumptions on the error distribution. The methods are illustrated with the same eye surgery data as in Song and Tan (2000) for ease of comparison. (© 2004 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

9.
10.
Array-based gene expression studies frequently serve to identify genes that are expressed differently under two or more conditions. The actual analysis of the data, however, may be hampered by a number of technical and statistical problems. Possible remedies on the level of computational analysis lie in appropriate preprocessing steps, proper normalization of the data and application of statistical testing procedures in the derivation of differentially expressed genes. This review summarizes methods that are available for these purposes and provides a brief overview of the available software tools.

11.
Most ecologists use statistical methods as their main analytical tools when analyzing data to identify relationships between a response and a set of predictors; thus, they treat all analyses as hypothesis tests or exercises in parameter estimation. However, little or no prior knowledge about a system can lead to the creation of a statistical model or models that do not accurately describe major sources of variation in the response variable. We suggest that under such circumstances data mining is more appropriate for analysis. In this paper we 1) present the distinctions between data-mining (usually exploratory) analyses and parametric statistical (confirmatory) analyses, 2) illustrate 3 strengths of data-mining tools for generating hypotheses from data, and 3) suggest useful ways in which data mining and statistical analyses can be integrated into a thorough analysis of data to facilitate rapid creation of accurate models and to guide further research.

12.
An enormous amount of microarray data has been collected and accumulated in public repositories. Although some of the depositions include both raw and processed data, a significant proportion include processed data only. If multiple datasets are to be combined for a specific purpose, the data should be adjusted prior to use to remove bias between the datasets. We focused on a GeneChip platform and a pre-processing method, RMA, and examined simple quantile correction as the post-processing method for integration. Integration of the data pre-processed by RMA was evaluated using artificial spike-in datasets and real microarray datasets of atopic dermatitis and lung cancer. Studies using the spike-in datasets show that quantile correction for data integration reduces data quality to some extent, but the quality remains at an acceptable level. Studies using the real datasets show that quantile correction significantly reduces the bias. These results show that quantile correction is useful for the integration of multiple datasets processed by RMA, and they encourage effective use of public microarray data.
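
The abstract above does not spell out its "simple quantile correction". As a rough illustration, the sketch below assumes it resembles standard quantile normalization, in which every array is mapped onto a shared reference distribution built from the rank-wise means. The function name quantile_normalize and the example values are hypothetical, and ties are not treated specially.

```python
def quantile_normalize(columns):
    """Map every column (one array or dataset) onto a shared reference distribution.

    `columns` is a list of equal-length lists of expression values.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across columns.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices in rank order
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]                  # replace value by reference quantile
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]             # two hypothetical arrays, three probes
result = quantile_normalize(arrays)
print(result)
```

After the correction each array has exactly the same set of values, so distributional differences between datasets are removed while the within-array ranking of probes is preserved.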

13.
14.
15.
Multiple-interval mapping for ordinal traits (cited 3 times: 0 self-citations, 3 by others)
Li J, Wang S, Zeng ZB. Genetics, 2006, 173(3): 1649-1663
Many statistical methods have been developed to map multiple quantitative trait loci (QTL) in experimental cross populations. Among these methods, multiple-interval mapping (MIM) can map QTL with epistasis simultaneously. However, the previous implementation of MIM is for continuously distributed traits. In this study we extend MIM to ordinal traits on the basis of a threshold model. The method inherits the properties and advantages of MIM and can fit a model of multiple QTL effects and epistasis on the underlying liability score. We study a number of statistical issues associated with the method, such as the efficiency and stability of maximization and model selection. We also use computer simulation to study the performance of the method and compare it to other alternative approaches. The method has been implemented in QTL Cartographer to facilitate its general usage for QTL mapping data analysis on binary and ordinal traits.
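
A threshold model of the kind mentioned above links an ordinal trait to a latent liability score. The sketch below shows the general idea under assumptions of my own (a probit link, unit residual variance, hypothetical effect sizes and cut points). It is not the QTL Cartographer implementation.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probabilities(liability_mean, thresholds):
    """Probability of each ordinal category under a probit threshold model.

    The latent liability is assumed to be Normal(liability_mean, 1);
    `thresholds` are the increasing cut points separating the categories.
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [normal_cdf(c - liability_mean) for c in cuts]
    return [cdf[k + 1] - cdf[k] for k in range(len(cuts) - 1)]

# Hypothetical liability: two QTL main effects plus one epistatic term.
x1, x2 = 1, -1                       # coded genotypes at two putative QTL
a1, a2, w12 = 0.4, 0.3, 0.2          # assumed effect sizes, for illustration only
mean = a1 * x1 + a2 * x2 + w12 * x1 * x2
probs = category_probabilities(mean, thresholds=[-0.5, 0.8])
print([round(p, 4) for p in probs])
```

Each individual's likelihood contribution is the probability of its observed category, so QTL effects and epistasis enter only through the liability mean.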

16.
17.
Phylogenetic regression is frequently used in macroevolutionary studies, and its statistical properties have been thoroughly investigated. By contrast, phylogenetic ANOVA has received relatively less attention, and the conditions leading to incorrect statistical and biological inferences when comparing multivariate phenotypes among groups remain underexplored. Here, we propose a refined method of randomizing residuals in a permutation procedure (RRPP) for evaluating phenotypic differences among groups while conditioning the data on the phylogeny. We show that RRPP displays appropriate statistical properties for both phylogenetic ANOVA and regression models, and for univariate and multivariate datasets. For ANOVA, we find that RRPP exhibits higher statistical power than methods utilizing phylogenetic simulation. Additionally, we investigate how group dispersion across the phylogeny affects inferences, and reveal that highly aggregated groups generate strong and significant correlations with the phylogeny, which reduce statistical power and subsequently affect biological interpretations. We discuss the broader implications of this phylogenetic group aggregation, and its relation to challenges encountered with other comparative methods where one or a few transitions in discrete traits are observed on the phylogeny. Finally, we recommend that phylogenetic comparative studies of continuous trait data use RRPP for assessing the significance of indicator variables as sources of trait variation.
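
The core idea of randomizing residuals in a permutation procedure can be shown in a deliberately simplified setting. The sketch below is a univariate, ordinary least squares version with an intercept-only reduced model. It omits the conditioning on the phylogeny that the abstract describes, and all function names and data are hypothetical.

```python
import random

def group_f_statistic(values, groups):
    """One-way ANOVA F statistic for `values` split by `groups` labels."""
    labels = sorted(set(groups))
    grand = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for lab in labels:
        members = [v for v, g in zip(values, groups) if g == lab]
        mean = sum(members) / len(members)
        ss_between += len(members) * (mean - grand) ** 2
        ss_within += sum((v - mean) ** 2 for v in members)
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

def rrpp_p_value(values, groups, n_perm=999, seed=0):
    """Permute reduced-model residuals, rebuild pseudo-data, refit the full model."""
    rng = random.Random(seed)
    fitted = sum(values) / len(values)            # reduced model: intercept only
    residuals = [v - fitted for v in values]
    observed = group_f_statistic(values, groups)
    hits = 1                                      # the observed data count as one permutation
    for _ in range(n_perm):
        shuffled = residuals[:]
        rng.shuffle(shuffled)
        pseudo = [fitted + r for r in shuffled]   # fitted values plus permuted residuals
        if group_f_statistic(pseudo, groups) >= observed:
            hits += 1
    return hits / (n_perm + 1)

values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # hypothetical trait values
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
p_value = rrpp_p_value(values, groups)
print(p_value)
```

With richer reduced models (for example, covariates fitted before the group term) the residuals come from that reduced fit, which is what distinguishes the approach from shuffling raw observations.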

18.
This paper develops mathematical and computational methods for fitting, by the method of maximum likelihood (ML), the two-parameter, right-truncated Weibull distribution (RTWD) to life-test or survival data. Some important statistical properties of the RTWD are derived and ML estimating equations for the scale and shape parameters of the RTWD are developed. The ML equations are used to express the scale parameter as an analytic function of the shape parameter and to establish a computationally useful lower bound on the ML estimate of the shape parameter. This bound is a function only of the sample observations and the (known) truncation point T. The ML equations are reducible to a single nonlinear, transcendental equation in the shape parameter, and a computationally efficient algorithm is described for solving this equation. The practical use of the methods is illustrated in two numerical examples.
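
For orientation, the sketch below writes down a right-truncated Weibull density and its log-likelihood. It assumes the usual scale and shape parametrization, with the untruncated density renormalized by the probability of falling at or below T. The paper's own notation, its analytic expression for the scale parameter, and its solution algorithm are not reproduced here, and the function names and data are hypothetical.

```python
import math

def rtwd_pdf(x, scale, shape, T):
    """Density of a Weibull(scale, shape) variable truncated to 0 < x <= T."""
    if not 0.0 < x <= T:
        return 0.0
    z = (x / scale) ** shape
    mass_below_T = 1.0 - math.exp(-((T / scale) ** shape))   # Weibull CDF at T
    return (shape / scale) * (x / scale) ** (shape - 1.0) * math.exp(-z) / mass_below_T

def rtwd_log_likelihood(data, scale, shape, T):
    """Log-likelihood of independent observations, all assumed to lie in (0, T]."""
    return sum(math.log(rtwd_pdf(x, scale, shape, T)) for x in data)

lifetimes = [0.4, 1.1, 1.7, 2.2, 2.9]      # hypothetical life-test data
T = 3.0                                    # known truncation point
print(rtwd_log_likelihood(lifetimes, scale=2.0, shape=1.5, T=T))
```

A generic optimizer could maximize this function over both parameters, whereas the abstract describes reducing the problem to a single equation in the shape parameter.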

19.
Technologies that have emerged from the genome project have dramatically increased our ability to generate data on the way in which organisms respond to their environments, how they execute their programmes of development and growth, and how these are altered in the development of disease states. However, our ability to analyse these large datasets has not kept pace with our ability to generate them and consequently new strategies must be developed to address the issues associated with their analysis. One approach that we have employed quite successfully is to look at data from microarrays (or proteomics or metabolomics experiments) not as independent datasets, but rather as elements of a much larger body of biological information across various scales that must be integrated with, and interpreted within, the context of such ancillary data. Here we outline the general approach and provide three examples from published studies of the way in which we have applied this strategy.

20.

Background

In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.

Results

In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.
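
To make the idea of imputing by subjects concrete, the sketch below fills each missing value with the mean of that variable over the k most similar subjects. It is a minimal illustration for continuous variables only, not the algorithm in the phenomeImpute package: the function name, the distance over shared observed variables, and the example data are all assumptions, and mixed data types are not handled.

```python
import math

def knn_impute_by_subjects(rows, k=2):
    """Fill None entries with the mean of that variable over the k nearest subjects."""
    def distance(a, b):
        # Root mean squared difference over variables observed in both subjects.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [row[:] for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other subjects with variable j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:                               # leave the gap if no donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

subjects = [
    [1.0, 2.0, None],     # hypothetical subject with one missing variable
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
completed = knn_impute_by_subjects(subjects, k=2)
print(completed[0])
```

Imputing by variables would apply the same logic to the transposed matrix, borrowing from the most similar variables rather than the most similar subjects.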

Conclusions

Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers although no method universally performed the best. Imputation of missing values with low imputability measures increased imputation errors greatly and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author’s publication website.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.
