Similar Articles
20 similar articles found.
1.
It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference from high-throughput genomic analyses. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data affected only by artifacts and (ii) estimating the artifacts with principal components or singular vectors of that subset of the data matrix. The resulting artifact estimates can then be used as adjustment factors in subsequent analyses. Here I describe a version of the sva approach, based on an appropriate data transformation, created specifically for count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva), which uses control probes to identify the part of the genomic data affected only by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data, and FPKM-based data. These updates are available through the sva Bioconductor package, and fully reproducible analyses using these methods are available from: https://github.com/jtleek/svaseq.
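A minimal sketch of this workflow with the sva Bioconductor package, assuming simulated counts with a hidden batch effect; the simulation and the downstream limma call are illustrative, not the paper's analysis.

```r
# Estimate surrogate variables from count data with svaseq and use them as
# adjustment factors in a downstream regression (illustrative simulation).
library(sva)
library(limma)

set.seed(1)
group <- factor(rep(c("control", "treated"), each = 10))  # primary variable
batch <- rep(c(0, 1), 10)                                 # hidden artifact
lam    <- exp(4 + 0.6 * batch)                            # batch shifts intensity
counts <- matrix(rpois(1000 * 20, rep(lam, each = 1000)), nrow = 1000)

mod  <- model.matrix(~ group)                        # full model
mod0 <- model.matrix(~ 1, data = data.frame(group))  # null model
sv   <- svaseq(counts, mod, mod0)                    # surrogate variables for counts

design <- cbind(mod, sv$sv)                          # add SVs as adjustment factors
fit <- eBayes(lmFit(log2(counts + 1), design))
topTable(fit, coef = 2)                              # group effect, artifact-adjusted
```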

2.

Background

Surrogate variable analysis (SVA) is a powerful method to identify, estimate, and utilize the components of gene expression heterogeneity due to unknown and/or unmeasured technical, genetic, environmental, or demographic factors. These sources of heterogeneity are common in gene expression studies, and failing to incorporate them into the analysis can obscure results. Using SVA increases the biological accuracy and reproducibility of gene expression studies by identifying these sources of heterogeneity and correctly accounting for them in the analysis.

Results

Here we have developed a web application called SVAw (Surrogate Variable Analysis Web app) that provides a user-friendly interface for SVA analyses of genome-wide expression studies. The software is built on the open-source Bioconductor sva package. We have extended the SVA program's functionality in three respects: (i) SVAw performs a fully automated, user-friendly analysis workflow; (ii) it calculates probe/gene statistics both before and after SVA adjustment and provides a table of results for the regression of gene expression on the primary variable of interest before and after correcting for surrogate variables; and (iii) it generates a comprehensive report file for the user, including a graphical comparison of the outcomes.

Conclusions

SVAw is a freely accessible web-based solution for surrogate variable analysis of high-throughput datasets and facilitates the removal of unwanted and unknown sources of variation. It is freely available for use at http://psychiatry.igm.jhmi.edu/sva. The executable packages for both the web and standalone applications, together with installation instructions, can be downloaded from our website.
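SVAw itself is a web application, but the before/after regression it reports can be sketched directly with the underlying sva package; `edata` (an expression matrix) and `pheno` (a phenotype table with a primary variable `disease`) are assumed, illustrative inputs.

```r
# Gene statistics before and after adjusting for surrogate variables (sketch).
library(sva)
library(limma)

mod  <- model.matrix(~ disease, data = pheno)  # primary variable of interest
mod0 <- model.matrix(~ 1, data = pheno)
svobj <- sva(edata, mod, mod0)                 # estimate surrogate variables

fit_pre  <- eBayes(lmFit(edata, mod))                   # before SVA correction
fit_post <- eBayes(lmFit(edata, cbind(mod, svobj$sv)))  # after SVA correction

# Side-by-side p-values for the primary variable, as in SVAw's results table
head(data.frame(p_pre = fit_pre$p.value[, 2], p_post = fit_post$p.value[, 2]))
```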

3.
4.
Why environmental scientists are becoming Bayesians
Advances in computational statistics provide a general framework for the high-dimensional models typically needed for ecological inference and prediction. Hierarchical Bayes (HB) represents a modelling structure with the capacity to exploit diverse sources of information, to accommodate influences that are unknown (or unknowable), and to draw inference on large numbers of latent variables and parameters that describe complex relationships. Here I summarize the structure of HB and provide examples for common spatiotemporal problems. The flexible framework means that parameters, variables, and latent variables can represent broader classes of model elements than are treated in traditional models. Inference and prediction depend on two types of stochasticity: (1) uncertainty, which describes our knowledge of fixed quantities; it applies to all 'unobservables' (latent variables and parameters) and declines asymptotically with sample size; and (2) variability, which applies to fluctuations that are not explained by deterministic processes and does not decline asymptotically with sample size. Examples demonstrate how different sources of stochasticity impact inference and prediction and how allowance for stochastic influences can guide research.
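The structure summarized above is conventionally written as a three-stage factorization; the notation below is a generic illustration (symbols chosen here), not taken from the article.

```latex
% Data model x process model x parameter model:
% y = observations, z = latent variables, \theta = parameters, \phi = hyperparameters.
\[
  p(z, \theta, \phi \mid y) \propto
  \underbrace{p(y \mid z, \theta)}_{\text{data model}}\,
  \underbrace{p(z \mid \theta)}_{\text{process model}}\,
  \underbrace{p(\theta \mid \phi)\, p(\phi)}_{\text{parameter models}} .
\]
```

Uncertainty then attaches to the unobservables (z, theta, phi), while variability lives in the stochastic terms of the process model.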

5.
Burgette LF, Reiter JP. Biometrics 2012;68(1):92-100.
We describe a Bayesian quantile regression model that uses a confirmatory factor structure for part of the design matrix. This model is appropriate when the covariates are indicators of scientifically determined latent factors, and it is these latent factors that analysts seek to include as predictors in the quantile regression. We apply the model to a study of birth weights in which the effects of latent variables representing psychosocial health and actual tobacco usage on the lower quantiles of the response distribution are of interest. The models can be fit using an R package called factorQR.
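For orientation, the two ingredients the abstract combines can be sketched as follows; the symbols are illustrative, and the exact parameterization in factorQR may differ.

```latex
% Check loss defining the tau-th quantile, and a confirmatory factor structure
% in which indicators x_i load on latent factors eta_i that enter the
% conditional quantile of the response y_i.
\[
  \rho_\tau(u) = u\,\{\tau - \mathbf{1}(u < 0)\}, \qquad
  x_i = \Lambda \eta_i + \epsilon_i, \qquad
  Q_{y_i}(\tau \mid \eta_i) = \eta_i^{\top}\beta_\tau .
\]
```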

6.
7.
Advancements in mass spectrometry-based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much-needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step-by-step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets each encompassing hundreds of samples and consisting of multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.
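As a concrete illustration of one normalization step such a protocol covers, the sketch below median-centers log intensities per batch; this is a generic technique with hypothetical inputs, not proBatch's actual API.

```r
# Per-feature, per-batch median centering of a log-intensity matrix
# (features x samples); tolerates the missing values common in proteomics.
median_center_batches <- function(log_mat, batch) {
  overall <- apply(log_mat, 1, median, na.rm = TRUE)   # per-feature grand median
  for (b in unique(batch)) {
    idx <- which(batch == b)
    med <- apply(log_mat[, idx, drop = FALSE], 1, median, na.rm = TRUE)
    log_mat[, idx] <- log_mat[, idx] - med + overall   # align batch medians
  }
  log_mat
}
```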

8.
Nuclear morphometry is used to address subtleties of carcinogenesis; it has been proposed for evaluating chemoprevention. An important issue for morphometry is controlling for extraneous sources of variation: fixation, slide cutting, and staining. A common strategy has been to standardize the morphometric measures. Morphometric variables, such as mean nuclear size and staining intensity, are often combined into multivariate indices. In this paper, we consider these variables one by one; any index depends to a significant degree on the individual indicators. This paper considers the extent to which statistical adjustment adds to the informational utility of individual indicators. We consider 14 features of 934 prostatic nuclei diagnosed by a single pathologist (Rodolfo Montironi) within a region of either normal tissue or high-grade prostatic intraepithelial neoplasia (HGPIN). HGPIN, a precursor to prostate cancer (PC), has been suggested as a target for PC chemoprevention. We consider a range of adjustment methods: transforming variables into deviations from means or from expected values generated by regression analysis. Our major test of standardization utility is the ability of the variables to deemphasize interindividual differences within diagnostic categories while still distinguishing between diagnostic categories.
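A small sketch of the two adjustment methods named above, deviations from means and deviations from regression-based expected values; the simulated feature and covariate names are illustrative.

```r
# Standardize a morphometric feature (1) as a deviation from the mean and
# (2) as a residual from a regression on a nuisance covariate.
set.seed(1)
features <- data.frame(nuclear_size    = rnorm(934, mean = 50, sd = 8),
                       stain_intensity = rnorm(934, mean = 1, sd = 0.2))

z_adjust   <- function(x) (x - mean(x)) / sd(x)   # deviation from mean, scaled
reg_adjust <- function(x, w) resid(lm(x ~ w))     # deviation from regression expectation

features$size_z   <- z_adjust(features$nuclear_size)
features$size_reg <- reg_adjust(features$nuclear_size, features$stain_intensity)
```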

9.
There are many sources of systematic variation in cDNA microarray experiments which affect the measured gene expression levels (e.g. differences in labeling efficiency between the two fluorescent dyes). The term normalization refers to the process of removing such variation. A constant adjustment is often used to force the distribution of the intensity log ratios to have a median of zero for each slide. However, such global normalization approaches are not adequate in situations where dye biases can depend on spot overall intensity and/or spatial location within the array. This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. The selection of appropriate controls for normalization is discussed and a novel set of controls (microarray sample pool, MSP) is introduced to aid in intensity-dependent normalization. Lastly, to allow for comparisons of expression levels across slides, a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides.
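A minimal sketch of intensity-dependent normalization with robust local regression, using base R's lowess on simulated log ratios; the simulation stands in for one slide's data.

```r
# Remove an intensity-dependent dye bias by regressing M (log ratio) on
# A (average log intensity) and subtracting the fitted trend.
set.seed(1)
A <- runif(5000, 6, 16)                        # spot average log intensities
M <- 0.08 * (A - 11) + rnorm(5000, sd = 0.3)   # intensity-dependent dye bias + noise

fit    <- lowess(A, M, f = 0.4)                 # robust local regression of M on A
M_norm <- M - approx(fit$x, fit$y, xout = A)$y  # normalized log ratios, median ~ 0
```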

10.
Gene Ontology and other forms of gene-category analysis play a major role in the evaluation of high-throughput experiments in molecular biology. Single-category enrichment analysis procedures such as Fisher's exact test tend to flag large numbers of redundant categories as significant, which can complicate interpretation. We have recently developed an approach called model-based gene set analysis (MGSA) that substantially reduces the number of redundant categories returned by gene-category analysis. In this work, we present the Bioconductor package mgsa, which makes the MGSA algorithm available to users of the R language. Our package provides a simple and flexible application programming interface for applying the approach. AVAILABILITY: The mgsa package has been made available as part of Bioconductor 2.8. It is released under the conditions of the Artistic License 2.0. CONTACT: peter.robinson@charite.de; julien.gagneur@embl.de.
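A minimal sketch of the package's interface on toy gene sets; the set and gene names are illustrative, and the `setsResults` accessor for per-set posteriors is an assumption to verify against the package documentation.

```r
# Run model-based gene set analysis on a toy study set (illustrative names).
library(mgsa)

sets <- list(setA = c("g1", "g2", "g3"),
             setB = c("g3", "g4", "g5"),
             setC = c("g6", "g7"))
hits <- c("g1", "g2", "g4")    # observed genes, e.g. from a screen

res <- mgsa(hits, sets)        # joint Bayesian model over all sets at once
setsResults(res)               # posterior probability that each set is active
```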

11.
Missing data are ubiquitous in clinical and social research, and multiple imputation (MI) is increasingly the methodology of choice for practitioners. Two principal strategies for imputation have been proposed in the literature: joint modelling multiple imputation (JM-MI) and full conditional specification multiple imputation (FCS-MI). While JM-MI is arguably the preferable approach, because it involves specification of an explicit imputation model, FCS-MI is pragmatically appealing because of its flexibility in handling different types of variables. JM-MI developed from the multivariate normal model, and latent normal variables have been proposed as a natural way to extend this model to handle categorical variables. In this article, we evaluate the latent normal model through an extensive simulation study and an application to data from the German Breast Cancer Study Group, comparing the results with FCS-MI. We divide our investigation into four sections, focusing on (i) binary, (ii) categorical, (iii) ordinal, and (iv) count data. Using data simulated from both the latent normal model and the general location model, we find that in all but one extreme general location model setting JM-MI works very well, and sometimes outperforms FCS-MI. We conclude that the latent normal model, implemented in the R package jomo, can be used with confidence by researchers, both for single-level and multilevel multiple imputation.
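A minimal sketch of latent-normal imputation with jomo on a toy data frame mixing variable types; the data and MCMC settings are illustrative.

```r
# Impute mixed-type missing data via the latent normal joint model.
library(jomo)

set.seed(1)
dat <- data.frame(x_cont = rnorm(200),
                  x_bin  = factor(rbinom(200, 1, 0.4)),
                  x_cat  = factor(sample(c("a", "b", "c"), 200, replace = TRUE)))
dat$x_bin[sample(200, 30)] <- NA   # introduce missingness
dat$x_cat[sample(200, 30)] <- NA

imp <- jomo(Y = dat, nimp = 5, nburn = 500, nbetween = 500)
head(subset(imp, Imputation == 1)) # first completed dataset; pool with Rubin's rules
```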

12.
This work presents a sequential data analysis path that was successfully applied to identify important patterns (fingerprints) in mammalian cell culture process data regarding process variables, time evolution, and process response. The data set incorporates 116 fed-batch cultivation experiments for the production of an Fc-fusion protein. After an initial univariate characterization of how the investigated variables and manipulated parameters evolve, principal component analysis (PCA) and partial least squares regression (PLSR) are used for further investigation. The first major objective is to capture and understand the interaction structure and dynamic behavior of the process variables and the titer (process response) using different models. The second major objective is to evaluate those models with respect to their capability to characterize and predict titer production. Moreover, the effects of data unfolding, imputation of missing data, phase separation, and variable transformation on the performance of the models are evaluated. © 2015 American Institute of Chemical Engineers, Biotechnol. Prog., 31:1633–1644, 2015
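The two multivariate steps can be sketched with prcomp and the pls package; the simulated unfolded matrix and titer are illustrative stand-ins for the 116-run data set.

```r
# PCA for interaction structure, PLSR for titer prediction (toy data).
library(pls)

set.seed(1)
X <- matrix(rnorm(116 * 30), nrow = 116)              # 116 runs x 30 unfolded variables
titer <- X[, 1] - 0.5 * X[, 2] + rnorm(116, sd = 0.5) # process response

pca <- prcomp(X, center = TRUE, scale. = TRUE)        # interaction structure
summary(pca)                                          # variance explained per PC

fit <- plsr(titer ~ X, ncomp = 5, validation = "CV")  # predictive model
RMSEP(fit)                                            # cross-validated error
```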

13.
MatchMiner is a freely available program package for batch navigation among gene and gene product identifier types commonly encountered in microarray studies and other forms of 'omic' research. The user inputs a list of gene identifiers and then uses the Merge function to find the overlap with a second list of identifiers of either the same or a different type or uses the LookUp function to find corresponding identifiers.

14.
In batch manufacturing processes, the total process variation is generally decomposed into batch-by-batch variation and within-batch variation. Since different variation components may be caused by different sources, separation, testing, and estimation of each variance component are essential to process improvement. Most previous SPC research emphasized reducing variation due to assignable causes by implementing control charts for process monitoring. In contrast, this article aims to analyze and reduce inherent natural process variation by applying the ANOVA method. The key issue in using the ANOVA method is how to develop appropriate statistical models for all variation components of interest. The article provides a generic framework for the decomposition of three typical variation components in batch manufacturing processes. For the purpose of diagnosing variation root causes, the corresponding linear contrasts are defined to represent the possible site variation patterns, and the statistical nested-effect models are developed accordingly. The article shows that the use of a full factor decomposition model can expedite the determination of the number of nested-effect models and the model structure. Finally, an example is given of variation reduction in the screen printing of conductive gridlines for solar battery fabrication.
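The decomposition can be sketched as a nested random-effects model; here the variance components are estimated with lme4 on simulated data, an alternative formulation to the article's nested-effect models.

```r
# Separate batch-by-batch, site-within-batch, and residual variation.
library(lme4)

set.seed(1)
dat <- expand.grid(batch = factor(1:10), site = factor(1:4), rep = 1:5)
b_eff <- rnorm(10, sd = 1.0)    # batch-by-batch variation
s_eff <- rnorm(40, sd = 0.5)    # site variation nested within batch
dat$y <- b_eff[dat$batch] +
         s_eff[as.integer(interaction(dat$batch, dat$site))] +
         rnorm(nrow(dat), sd = 0.3)   # within-batch residual

fit <- lmer(y ~ 1 + (1 | batch) + (1 | batch:site), data = dat)
VarCorr(fit)   # estimated variance components per source
```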

15.
The elimination of the latent human immunodeficiency virus (HIV) reservoir, which resides mainly in resting CD4+ T cells, has become the major obstacle to curing HIV-1 infection, and there is an urgent need for a high-throughput, reliable, and highly sensitive method to quantify the true size of the viral latent reservoir. This article reviews the current quantitative assays for measuring the HIV latent reservoir.

16.
A Bayesian procedure for analyzing longitudinal binary responses using a periodic cosine function was developed. It was assumed that, after adjustment for "seasonal" effects, the oscillation of the underlying latent variables for longitudinal binary responses was a stationary series. Based on this assumption, a single dimension sinusoidal analysis of longitudinal binary responses using the Gibbs sampling and Metropolis algorithms was implemented in a study of clinical mastitis records of Norwegian Red cows taken over five lactations.
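The latent-variable formulation described above can be written compactly; the parameterization below is an illustration with generic symbols, not the paper's exact model.

```latex
% Binary response y_it is 1 when a latent variable crosses zero; a periodic
% cosine term of period T captures the "seasonal" effect, u_i is an animal
% effect, and the adjusted latent series is assumed stationary.
\[
  y_{it} = \mathbf{1}(\ell_{it} > 0), \qquad
  \ell_{it} = \mu + \alpha \cos\!\Big(\frac{2\pi t}{T}\Big)
            + \beta \sin\!\Big(\frac{2\pi t}{T}\Big) + u_i + e_{it},
  \qquad e_{it} \sim N(0, 1).
\]
```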

17.
MOTIVATION: Determining gene function is an important challenge arising from the availability of whole-genome sequences. Until recently, approaches based on sequence homology were the only high-throughput method for predicting gene function. Use of high-throughput experimental data sets for determining gene function has been limited for several reasons. RESULTS: Here a new approach is presented for the integration of high-throughput data sets, leading to predictions of function based on relationships supported by multiple types and sources of data. This is achieved with a database containing 125 different high-throughput data sets describing phenotypes, cellular localizations, protein interactions, and mRNA expression levels from Saccharomyces cerevisiae, using a bit-vector representation and information-content-based ranking. The approach takes characteristic and qualitative differences between the data sets into account and is highly flexible, efficient, and scalable. Database queries result in predictions for 543 uncharacterized genes, based on multiple functional relationships, each supported by at least three types of experimental data. Some of these have been experimentally verified, further demonstrating their reliability. The results also generate insights into the relative merits of different data types and provide a coherent framework for functional genomic data mining. AVAILABILITY: Freely available over the Internet. CONTACT: f.c.p.holstege@med.uu.nl SUPPLEMENTARY INFORMATION: http://www.genomics.med.uu.nl/pub/pk/comb_gen_network.
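To make the bit-vector idea concrete, the toy sketch below scores a gene pair by the information content of the high-throughput features they share; this illustrates the general technique, not the authors' implementation.

```r
# Genes x datasets logical matrix; rarer features carry more information, and
# a gene pair is ranked by the summed information content of shared features.
set.seed(1)
bits <- matrix(runif(6 * 10) > 0.5, nrow = 6,
               dimnames = list(paste0("g", 1:6), paste0("ds", 1:10)))

ic <- -log2(colMeans(bits))                              # information content per feature
pair_score <- function(a, b) sum(ic[bits[a, ] & bits[b, ]])
pair_score("g1", "g2")                                   # higher = stronger shared evidence
```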

18.
Climate influences tree-ring density, and ring-density variables extracted from X-ray images have been widely used for climate reconstructions. The R package xRing was developed to identify and measure tree rings on X-ray microdensity profiles automatically. The package is freely available and offers functions to visualize and calibrate X-ray images, to detect tree-ring borders, and to identify the earlywood-latewood transition using wood density variations at the inter- and intra-ring scales. The most important functions are calibrateFilm, detectRings, correctRings, detectEwLw, and getDensity. Outputs of these functions are S3 objects, for which specific methods are provided, including plot and print. The non-linear relationship between the optical density of the film and wood density is defined by the function calibrateFilm. The function detectRings detects tree rings using wood density profiles as input; it uses the difference between local maximum and minimum values to identify tree-ring borders automatically. The correctRings function calls a graphical user interface (GUI) to visualize and manually correct tree-ring borders. After correcting tree-ring borders, the detectEwLw function computes earlywood and latewood widths by dividing rings according to relative intra-ring density changes. The getDensity function computes, for each tree ring, the minimum (maximum) density and the mean earlywood, latewood, and whole-ring density. Finally, a list of data frames with tree-ring width and density variables can be obtained using the function getRwls. One of the major advantages of the xRing package is that it requires little knowledge of the R language, while at the same time it can easily be changed or adapted by experienced users.
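The functions named above chain into a natural workflow; the sketch below assembles them in order, but the input objects (`film_image`, `profile`) and exact signatures are assumptions to check against the package documentation.

```r
# From X-ray film to ring-width and density series with xRing (sketch).
library(xRing)

density <- calibrateFilm(film_image)  # non-linear film-to-wood-density calibration
rings   <- detectRings(profile)       # borders from local density maxima/minima
rings   <- correctRings(rings)        # GUI: inspect and fix borders manually
rings   <- detectEwLw(rings)          # earlywood/latewood split by intra-ring density
rings   <- getDensity(rings)          # per-ring min/max and mean EW/LW/ring density
rwls    <- getRwls(rings)             # data frames of width and density variables
```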

19.
High-throughput analyses that are central to microbial systems biology and ecophysiology research benefit from highly homogeneous and physiologically well-defined cell cultures. While attention has focused on the technical variation associated with high-throughput technologies, biological variation introduced as a function of cell cultivation methods has been largely overlooked. This study evaluated the impact of cultivation methods, controlled batch or continuous culture in bioreactors versus shake flasks, on the reproducibility of global proteome measurements in Shewanella oneidensis MR-1. Variability in dissolved oxygen concentration and consumption rate, metabolite profiles, and proteome was greater in shake flask than controlled batch or chemostat cultures. Proteins indicative of suboxic and anaerobic growth (e.g., fumarate reductase and decaheme c-type cytochromes) were more abundant in cells from shake flasks compared to bioreactor cultures, a finding consistent with data demonstrating that “aerobic” flask cultures were O2 deficient due to poor mass transfer kinetics. The work described herein establishes the necessity of controlled cultivation for ensuring highly reproducible and homogenous microbial cultures. By decreasing cell to cell variability, higher quality samples will allow for the interpretive accuracy necessary for drawing conclusions relevant to microbial systems biology research.

20.
Baker's yeast is an important additive used to improve bread quality. At present, different countries produce baker's yeast mainly by batch, fed-batch, or continuous cultivation. Saccharomyces cerevisiae is the ideal microorganism for fermenting the starch in dough; besides enhancing flavor and improving texture, this fermentation produces a variety of vitamins and proteins. The principal ingredients used to produce yeast biomass are carbon sources such as beet molasses and cane molasses. Because beet molasses can instead be used for high-yield ethanol production, and because of the environmental pollution and wastewater-treatment problems it causes, alternative sugars for baker's yeast production need to be considered. One such alternative sugar source is the date. For various reasons, Iran wastes a large quantity of dates every year. This study investigated the feasibility of using dates as the carbon source of the culture medium: waste dates were pressed into juice, and the yeast yield and growth rate were then examined. The results showed that at pH 3.4, a temperature of 30 °C, an aeration rate of 1.4 vvm, and a fermenter agitation speed of 500 r/min, the yeast yield on substrate approached 50%.
