Similar Literature
20 similar documents found.
1.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. In its standard form, however, it does not take into account any error measures associated with the data points beyond standard spherical noise. This indiscriminate treatment of noise is one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene- and experiment-specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variance associated with each gene in each experiment. We develop an efficient EM algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is obtained automatically.
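The authors' exact model is not reproduced here, but the core idea admits a compact sketch. Below is a minimal, hypothetical Python implementation of an EM algorithm for a PCA-like latent variable model in which every matrix element carries its own known noise variance (the role played by the probe-level credibility intervals); the function names, the fixed iteration count, and the omission of a shared residual variance term are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def heteroscedastic_ppca(X, V, n_components, n_iter=100):
    """EM for the model x_n = W z_n + mu + eps_n, with z_n ~ N(0, I) and
    eps_n ~ N(0, diag(V[n])), where V holds a KNOWN noise variance for
    each gene in each experiment.
    X: (N, D) matrix (N experiments, D genes); V: (N, D) variances."""
    N, D = X.shape
    K = n_components
    mu = X.mean(axis=0)
    Xc = X - mu
    W = np.random.default_rng(0).normal(scale=0.1, size=(D, K))
    for _ in range(n_iter):
        # E-step: posterior mean/covariance of each latent vector z_n;
        # the noise covariance differs per sample, so loop over samples.
        M = np.zeros((N, K))
        S = np.zeros((N, K, K))
        for n in range(N):
            P = W / V[n][:, None]                    # Psi_n^{-1} W
            S[n] = np.linalg.inv(np.eye(K) + W.T @ P)
            M[n] = S[n] @ (P.T @ Xc[n])
        # M-step: with element-wise noise the rows of W decouple,
        # so solve a small KxK linear system per gene.
        for d in range(D):
            A = np.zeros((K, K))
            b = np.zeros(K)
            for n in range(N):
                Ezz = S[n] + np.outer(M[n], M[n])    # E[z z^T | x_n]
                A += Ezz / V[n, d]
                b += Xc[n, d] * M[n] / V[n, d]
            W[d] = np.linalg.solve(A, b)
    return W, mu, M

# Denoised profiles can then be reconstructed as mu + M @ W.T.
```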

2.
We have developed a program for microarray data analysis which features false discovery rate estimation for testing statistical significance and principal component analysis, via the singular value decomposition method, for detecting global trends in gene-expression patterns. Additional features include analysis of variance with multiple methods for error variance adjustment, correction of cross-channel correlation for two-color microarrays, identification of genes specific to each cluster of tissue samples, biplots of tissues and corresponding tissue-specific genes, clustering of genes correlated with each principal component (PC), three-dimensional graphics based on the virtual reality modeling language, and sharing of PCs between different experiments. The software also supports parameter adjustment, gene search and graphical output of results. The software is implemented as a web tool, so the speed of analysis does not depend on the power of the client computer. AVAILABILITY: The tool can be used on-line or downloaded at http://lgsun.grc.nia.nih.gov/ANOVA/
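Neither the tool's source nor its exact statistics are shown in the abstract; the following is a minimal Python sketch, under the assumption of a samples-by-genes matrix, of its two headline features: PCA computed through the singular value decomposition, and Benjamini-Hochberg false discovery rate adjustment of per-gene p-values.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD. X: (samples, genes). Returns sample scores,
    gene loadings (rows of Vt), and the variance fraction per PC."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                       # coordinates of samples on the PCs
    explained = s**2 / np.sum(s**2)      # proportion of variance per PC
    return scores, Vt, explained

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values for FDR control."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    q = np.minimum.accumulate(scaled[::-1])[::-1]    # enforce monotonicity
    adjusted = np.empty_like(q)
    adjusted[order] = np.minimum(q, 1.0)
    return adjusted
```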

3.
In the first part of this work we introduce new developments in Principal Component Analysis (PCA), and in the second part a new method for selecting variables (genes, in our application). Our focus is on problems where the values taken by each variable do not all have the same importance, and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate for this kind of problem. In this context, we propose a new correlation coefficient as an alternative to Pearson's, which leads to a so-called weighted PCA (WPCA). To illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of analyzing gene expression datasets. In the second part of this work, we propose a new PCA-based algorithm that iteratively selects the most important genes in a microarray dataset. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.
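The paper's specific robust correlation coefficient is not given in the abstract, so the sketch below substitutes an ordinary weighted Pearson correlation to show the mechanics of a weighted PCA: observations judged less reliable (noisy or outlying) receive lower weight, and the eigendecomposition is taken over the resulting weighted correlation matrix. All names and the choice of weighting are illustrative assumptions.

```python
import numpy as np

def weighted_corr_matrix(X, w):
    """Weighted Pearson correlation of the columns of X.
    X: (n_obs, n_vars); w: (n_obs,) nonnegative weights summing to 1."""
    m = w @ X                                  # weighted column means
    Xc = X - m
    cov = (Xc * w[:, None]).T @ Xc             # weighted covariance matrix
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def wpca(X, w, n_components):
    """PCA on the weighted correlation matrix; returns the leading
    eigenvalues and loading vectors."""
    R = weighted_corr_matrix(X, w)
    vals, vecs = np.linalg.eigh(R)             # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    return vals[idx], vecs[:, idx]
```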

4.
The use of principal component analysis (PCA) as a multivariate statistical approach to reduce complex biomechanical datasets is growing. With its increased application in biomechanics, there has been a concurrent divergence in the criteria used to determine how much the data are reduced (i.e. how many principal factors are retained). This short communication presents power equations to support the use of a parallel analysis (PA) criterion as a quantitative and transparent method for determining how many factors to retain when conducting a PCA. Monte Carlo simulation was used to carry out PCA on random datasets of varying dimension, mimicking the PA procedure that would be required to determine principal component (PC) retention for any independent study whose dataset dimensions fall within the range tested here. A surface was plotted for each of the first eight PCs, expressing the expected outcome of a PA as a function of the dimensions of a dataset, and a power relationship was used to fit each surface, enabling that outcome to be predicted directly. The coefficients used to fit the surfaces are reported. These equations allow the PA to be freely adopted as a criterion to inform PC retention; a transparent and quantifiable criterion for how many PCs to retain will enhance the ability to compare and contrast between studies.
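The published power equations replace the need to rerun simulations, but the underlying PA procedure they encode is straightforward. A minimal Python sketch of Horn's parallel analysis, with an assumed 95th-percentile retention rule, is given below; the paper's fitted coefficients are not reproduced.

```python
import numpy as np

def parallel_analysis(X, n_sims=100, quantile=0.95, seed=0):
    """Retain the PCs of X whose correlation-matrix eigenvalues exceed
    the chosen quantile of eigenvalues from random normal datasets of
    identical dimension. X: (observations, variables)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        Z = rng.standard_normal((n, p))        # random data, same dimensions
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.quantile(sims, quantile, axis=0)
    return int(np.sum(obs > threshold))        # number of PCs to retain
```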

5.
Fundamentals of cDNA microarray data analysis
Microarray technology is a powerful approach for genomics research. The multi-step, data-intensive nature of this technology has created an unprecedented informatics and analytical challenge, and it is important to understand the crucial steps that can affect the outcome of the analysis. In this review, we provide an overview of contemporary approaches to the main steps of the microarray data analysis process, including experimental design, data standardization, image acquisition and analysis, normalization, statistical significance inference, exploratory data analysis, class prediction and pathway analysis, as well as various considerations relevant to their implementation.

6.
Landgrebe J, Wurst W, Welzl G. Genome Biology 2002, 3(4): research0019.1-research0019.11

Background  

In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure.
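The paper's permutation scheme is not detailed in the abstract; the sketch below shows one generic way to permutation-validate a PCA result in Python, by permuting each gene's values independently across samples (destroying the between-gene correlation structure) and comparing the observed leading eigenvalue against the resulting null distribution. The statistic and permutation scheme are assumptions for illustration.

```python
import numpy as np

def permutation_validated_pca(X, n_perm=200, seed=0):
    """Test the leading PC of X (samples x genes) against a null in which
    every gene's values are independently shuffled across samples."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    obs = np.linalg.svd(Xc, compute_uv=False)[0] ** 2   # leading eigenvalue
    null = np.empty(n_perm)
    for i in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in Xc.T])
        null[i] = np.linalg.svd(Xp, compute_uv=False)[0] ** 2
    pval = (np.sum(null >= obs) + 1) / (n_perm + 1)     # add-one estimator
    return obs, null, pval
```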

7.
8.
MOTIVATION: Gene set analysis allows formal testing of subtle but coordinated changes in a group of genes, such as those defined by the Gene Ontology (GO) or KEGG Pathway databases. We propose a new method for gene set analysis that is based on principal component analysis (PCA) of the expression values of the genes in the set. PCA is an effective method for reducing high dimensionality and capturing variation in gene expression values, but one limitation is that the latent variable identified by the first PC may be unrelated to the outcome. RESULTS: In the proposed supervised PCA (SPCA) model for gene set analysis, the PCs are estimated from a subset of genes selected for their association with the outcome. Because outcome information is used in the gene selection step, the method is supervised, hence the name Supervised PCA. Owing to this gene selection step, the test statistic in the SPCA model can no longer be approximated well by a t-distribution, so we propose a two-component mixture of Gumbel extreme value distributions to account for the selection. We show that the proposed method compares favorably to currently available gene set analysis methods on simulated and real microarray data. SOFTWARE: The R code for the analyses used in this article is available upon request; we are currently working on implementing the proposed method in an R package.
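A minimal sketch of the supervised step, assuming a binary outcome and a simple per-gene t-test as the screening rule, is shown below in Python; it returns the first-PC score of the selected genes as the gene-set summary, and deliberately omits the Gumbel mixture null used for inference in the paper.

```python
import numpy as np
from scipy import stats

def supervised_pca_score(X, y, gene_set, alpha=0.05):
    """X: (samples, genes); y: binary outcome (0/1); gene_set: column
    indices of the gene set. Select outcome-associated genes, then
    summarize them by their first principal component."""
    Xs = X[:, gene_set]
    t, p = stats.ttest_ind(Xs[y == 1], Xs[y == 0])   # per-gene screening
    keep = p < alpha
    if not keep.any():
        keep[:] = True                               # fall back to all genes
    Xk = Xs[:, keep] - Xs[:, keep].mean(axis=0)
    U, s, Vt = np.linalg.svd(Xk, full_matrices=False)
    return U[:, 0] * s[0]                            # first-PC sample scores
```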

9.

Background  

Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.

10.
11.
Gabriel KR. The biplot graphic display of matrices with application to principal component analysis. Biometrika 1971, 58(3): 453-467

12.
This paper examines the selection of an appropriate representation of chromatogram data prior to using principal component analysis (PCA), a multivariate statistical technique, for the diagnosis of chromatogram datasets. Four process variables were investigated (flow rate, temperature, loading concentration and loading volume) for a size exclusion chromatography system used to separate three components (monomer, dimer and trimer). The study showed that the major positional shifts in the elution peaks that arise when the separation is run at different flow rates mask the effects of the other variables if the PCA is performed with elapsed time as the comparative basis. Two alternative representations of the chromatogram data are therefore proposed: in the first, the data are converted to a volumetric basis prior to performing the PCA; in the second, after this transformation, the data are further adjusted to account for the total material loaded during each separation. Two datasets were analysed to demonstrate the approaches. The results show that appropriate selection of the basis prior to the analysis yields significantly greater process insight from the PCA, and they demonstrate the importance of pre-processing prior to such analysis.
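As a sketch of the proposed pre-processing, the hypothetical Python function below converts a chromatogram from an elapsed-time basis to a volumetric basis (volume = time x flow rate), optionally scales by the amount loaded, and resamples onto a common volume grid so that runs at different flow rates can be stacked into one matrix for PCA. The function and argument names are illustrative.

```python
import numpy as np

def rebase_chromatogram(signal, times, flow_rate, load_mass=None, grid=None):
    """Re-express one chromatogram on a volumetric basis and, if
    load_mass is given, normalize by the total material loaded."""
    volume = times * flow_rate                  # elapsed time -> eluted volume
    y = signal / load_mass if load_mass else signal
    if grid is None:
        grid = np.linspace(volume[0], volume[-1], signal.size)
    return grid, np.interp(grid, volume, y)     # align onto the shared grid

# Stacking the aligned traces row-wise then gives a matrix suitable for PCA.
```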

13.

Background  

To cancel out experimental variation, microarray data must be normalized prior to analysis. Where an appropriate statistical model of the data distribution is available, a parametric method can normalize a group of datasets that share a common distribution. Although such models have been proposed for microarray data, they have not always fitted the distribution of real data and have thus been inappropriate for normalization. Consequently, microarray data have in most cases been normalized with non-parametric methods that adjust data in a pairwise manner. However, this has made data analysis and the integration of the resulting knowledge across experiments difficult, since such normalization concepts lack a universal standard.

14.
Normalization of cDNA microarray data
Normalization means adjusting microarray data for effects that arise from variation in the technology rather than from biological differences between the RNA samples or between the printed probes. This paper describes normalization methods based on the fact that dye balance typically varies with spot intensity and with spatial position on the array. Print-tip loess normalization provides a well-tested general-purpose normalization method that has given good results on a wide range of arrays. The method may be refined by using quality weights for individual spots, and is best combined with diagnostic plots of the data which display the spatial and intensity trends. When diagnostic plots show that biases remain in the data after normalization, further steps such as plate-order normalization or scale normalization between the arrays may be undertaken. Composite normalization may be used when control spots known not to be differentially expressed are available. Variations on loess normalization include global loess normalization and two-dimensional normalization. Detailed commands are given to implement the normalization techniques using freely available software.
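The paper's own commands target freely available (R-based) software; as a language-neutral illustration, here is a hypothetical Python sketch of print-tip loess normalization using the lowess smoother from statsmodels: the M-versus-A trend is fitted separately within each print-tip group and subtracted from M.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def print_tip_loess(R, G, tip_groups, frac=0.4):
    """Normalize one two-color array. R, G: red/green intensities;
    tip_groups: print-tip index for every spot."""
    M = np.log2(R) - np.log2(G)             # log-ratio per spot
    A = 0.5 * (np.log2(R) + np.log2(G))     # average log-intensity per spot
    M_norm = np.empty_like(M)
    for g in np.unique(tip_groups):
        idx = tip_groups == g
        trend = lowess(M[idx], A[idx], frac=frac, return_sorted=False)
        M_norm[idx] = M[idx] - trend        # remove intensity-dependent bias
    return M_norm, A
```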

15.
Bergemann TL. Journal of Genetics and Genomics 2010, 37(4): 265-279
This research provides a new way to measure error in microarray data in order to improve gene expression analysis. Microarray data contain many sources of error, and in order to glean information about mRNA expression levels, the true signal must first be segregated from the noise. This research focuses on the variation that can be captured at the spot level in cDNA microarray images; variation at other levels, due to differences at the array, dye and block levels, can be corrected by a variety of existing normalization procedures. Two signal quality estimates that capture the reliability of each spot printed on a microarray are described. A parametric estimate of within-spot variance, referred to here as σ²_spot, assumes that pixels follow a normal distribution and are spatially correlated. A non-parametric estimate of error, the mean square prediction error (MSPE), assumes that spots of high quality possess pixels that are similar to their neighbors. This paper provides a framework for using either spot quality measure in downstream analysis, specifically as weights in regression models. Using these spot quality estimates as weights can result in greater efficiency, in the statistical sense, when modeling microarray data.
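The MSPE estimate, at least, is simple enough to sketch. The hypothetical Python function below scores one spot by how well each interior pixel is predicted by the mean of its eight neighbors; low values indicate a smooth, reliable spot, and the reciprocal of the MSPE could then serve as a weight in, for example, a weighted least squares fit (statsmodels' WLS accepts such weights). The neighborhood definition is an assumption.

```python
import numpy as np

def spot_mspe(pixels):
    """Mean square prediction error for one spot: each interior pixel is
    predicted by the average of its 8 neighbors.
    pixels: 2-D array of foreground intensities for the spot."""
    p = pixels.astype(float)
    nbr = (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:] +
           p[1:-1, :-2]               + p[1:-1, 2:] +
           p[2:,  :-2] + p[2:,  1:-1] + p[2:,  2:]) / 8.0
    resid = p[1:-1, 1:-1] - nbr            # prediction errors, interior only
    return float(np.mean(resid ** 2))
```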

16.
Crescenzi M, Giuliani A. FEBS Letters 2001, 507(1): 114-118
By using principal component analysis (PCA) we demonstrate here that the information relevant to tumor line classification carried by the activity of 1375 genes expressed in 60 tumor cell lines can be reproduced by only five independent components. These components can be interpreted in terms of cell motility and migration, cellular trafficking and endo/exocytosis, and epithelial character. PCA, unlike the cluster analysis methods routinely used in microarray analysis, allows individual genes to participate in multiple biochemical pathways, while assigning to each cell line a quantitative score reflecting fundamental biological functions.

17.
Xiao L, Wang K, Teng Y, Zhang J. FEBS Letters 2003, 540(1-3): 117-124
Wheat gliadin and other cereal prolamins have been implicated in the pathogenic damage of the small intestine in celiac disease via the apoptosis of epithelial cells. In the present work we investigated the mechanisms underlying the pro-apoptotic activity exerted by gliadin-derived peptides in Caco-2 intestinal cells, a cell line which retains many morphological and enzymatic features typical of normal human enterocytes. We found that digested peptides from wheat gliadins (i) induce apoptosis through the CD95/Fas apoptotic pathway, (ii) increase Fas and FasL mRNA levels, (iii) increase FasL release into the medium, and (iv) that gliadin digest-induced apoptosis can be blocked by agents that block the Fas cascade, i.e. targeted neutralizing antibodies. This favors the hypothesis that gliadin activates an autocrine/paracrine Fas-mediated cell death pathway. Finally, we found that (v) a small peptide (1157 Da) from durum wheat, previously proposed for clinical use, exerted a powerful protective activity against the cytotoxicity of the gliadin digest.

18.
19.
Extracting features from high-dimensional data is a critically important task for pattern recognition and machine learning applications. High-dimensional data typically have many more variables than observations, and contain significant noise, missing components, or outliers. Features extracted from high-dimensional data need to be discriminative and sparse, and should capture the essential characteristics of the data. In this paper, we present a way of constructing multivariate features and then classifying the data into the proper classes. The resulting small subset of features is nearly the best in the sense of Greenshtein's persistence; however, the estimated feature weights may be biased, and we take a systematic approach to correcting these biases. We use conjugate gradient-based primal-dual interior-point techniques for large-scale problems. We apply our procedure to microarray gene analysis, and the effectiveness of our method is confirmed by experimental results.
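The paper's interior-point machinery is not reproducible from the abstract; the Python sketch below substitutes a standard two-stage scheme in the same spirit: an L1-penalized logistic regression selects a small discriminative feature subset, and an unpenalized refit on that subset reduces the shrinkage bias of the selected weights. This is a generic stand-in, not the authors' algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def sparse_select_then_refit(X, y, C=0.1):
    """Stage 1: sparse selection via L1 penalty. Stage 2: unpenalized
    refit on the selected features to correct the shrinkage bias."""
    Xs = StandardScaler().fit_transform(X)
    l1 = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(Xs, y)
    selected = np.flatnonzero(l1.coef_.ravel())      # surviving features
    refit = LogisticRegression(penalty=None, solver="lbfgs", max_iter=1000)
    refit.fit(Xs[:, selected], y)
    return selected, refit
```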

20.
Hörnquist M, Hertz J, Wahde M. Biosystems 2002, 65(2-3): 147-156
Large-scale expression data are today measured for thousands of genes simultaneously, a development that has been followed by an exploration of theoretical tools for extracting as much information from these data as possible. One line of work attempts to extract the underlying regulatory network. The models used thus far, however, contain many parameters, and careful investigation is necessary in order not to over-fit them. We employ principal component analysis to show how, in the context of linear additive models, one can obtain a rough estimate of the effective dimensionality (the number of information-carrying dimensions) of large-scale gene expression datasets. We treat both the lack of independence of different measurements in a time series and the fact that measurements are subject to some level of noise, both of which reduce the effective dimensionality and thereby constrain the complexity of models that can be built from the data.
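A crude version of such an estimate is easy to state: count the PCA eigenvalues that rise above the measurement-noise floor. The Python sketch below does exactly that and nothing more; unlike the paper, it ignores the temporal correlation between successive measurements, and the noise-variance input is assumed to come from replicate measurements.

```python
import numpy as np

def effective_dimension(X, noise_var):
    """Rough effective dimensionality of a time-series expression matrix.
    X: (timepoints, genes); noise_var: estimated measurement noise variance.
    Returns the number of covariance eigenvalues above the noise floor."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values
    eig = s**2 / (X.shape[0] - 1)               # nonzero covariance eigenvalues
    return int(np.sum(eig > noise_var)), eig
```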
