Similar Articles
1.
Hörnquist M, Hertz J, Wahde M. Biosystems 2002, 65(2-3):147-156.
Large-scale expression data are today measured for thousands of genes simultaneously. This development has been followed by an exploration of theoretical tools to extract as much information from these data as possible. One line of work attempts to recover the underlying regulatory network. The models used thus far, however, contain many parameters, and a careful investigation is necessary in order not to over-fit them. We employ principal component analysis to show how, in the context of linear additive models, one can obtain a rough estimate of the effective dimensionality (the number of information-carrying dimensions) of large-scale gene expression datasets. We treat both the lack of independence of different measurements in a time series and the fact that measurements are subject to some level of noise, both of which reduce the effective dimensionality and thereby constrain the complexity of models that can be built from the data.
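A minimal sketch of this kind of dimensionality estimate, assuming a genes x time-points expression matrix and a known measurement-noise level (both hypothetical here): components whose singular values do not exceed what noise alone would produce are discounted.

```python
import numpy as np

def effective_dimensionality(X, noise_std, n_trials=100, seed=0):
    """Count the PCA dimensions whose singular values exceed what
    pure measurement noise of the same shape would produce."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                        # center each column
    data_sv = np.linalg.svd(Xc, compute_uv=False)
    # Top singular value expected from noise alone, averaged over draws.
    noise_top = np.mean([
        np.linalg.svd(rng.normal(0.0, noise_std, X.shape),
                      compute_uv=False)[0]
        for _ in range(n_trials)])
    return int(np.sum(data_sv > noise_top))

# Toy data: 50 genes, 10 time points, 3 information-carrying dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 10))
X += rng.normal(0.0, 0.1, X.shape)
print(effective_dimensionality(X, noise_std=0.1))   # typically 3
```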

2.

Background  

Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.

3.
This paper examines the selection of an appropriate representation of chromatogram data prior to using principal component analysis (PCA), a multivariate statistical technique, for the diagnosis of chromatogram data sets. The effects of four process variables (flow rate, temperature, loading concentration and loading volume) were investigated for a size exclusion chromatography system used to separate three components (monomer, dimer, trimer). The study showed that major positional shifts in the elution peaks, which result when running the separation at different flow rates, caused the effects of other variables to be masked if the PCA is performed using elapsed time as the comparative basis. Two alternative methods of representing the data in chromatograms are proposed. In the first, the data were converted to a volumetric basis prior to performing the PCA; in the second, having made this transformation, the data were further adjusted to account for the total material loaded during each separation. Two datasets were analysed to demonstrate the approaches. The results show that appropriate selection of the basis prior to the analysis yields significantly greater process insight from the PCA, demonstrating the importance of pre-processing prior to such analysis.
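A sketch of the first proposed representation, under assumed names and uniform sampling: elapsed time is converted to eluted volume via each run's flow rate, all runs are interpolated onto a shared volume axis, and PCA is applied to the aligned traces.

```python
import numpy as np
from sklearn.decomposition import PCA

def to_volume_basis(traces, times, flow_rates, n_points=500):
    """Re-express chromatograms on a common eluted-volume axis.

    traces     : list of 1-D absorbance arrays, one per run
    times      : matching elapsed-time arrays (min)
    flow_rates : flow rate of each run (mL/min)
    """
    volumes = [t * f for t, f in zip(times, flow_rates)]
    grid = np.linspace(0.0, min(v[-1] for v in volumes), n_points)
    return grid, np.vstack([np.interp(grid, v, y)
                            for v, y in zip(volumes, traces)])

# Two hypothetical runs at different flow rates: in elapsed time the
# peak positions differ, on the volume basis they coincide.
t = np.linspace(0.0, 60.0, 600)                  # min
flows = [0.5, 1.0]                               # mL/min
peak = lambda v: np.exp(-0.5 * ((v - 12.0) / 1.0) ** 2)
traces = [peak(t * f) for f in flows]
grid, X = to_volume_basis(traces, [t, t], flows)
scores = PCA(n_components=2).fit_transform(X)    # PCA on aligned traces
```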

4.
5.
MOTIVATION: Detailed comparison and analysis of the output of DNA gene expression arrays from multiple samples require global normalization of the measured individual gene intensities from the different hybridizations. This is needed to account for variations in array preparation and sample hybridization conditions. RESULTS: Here, we present a simple, robust and accurate procedure for the global normalization of datasets generated with single-channel DNA arrays, based on principal component analysis. The procedure makes minimal assumptions about the data and performs well in cases where other standard procedures produce biased estimates. It is also insensitive to data transformation, filtering (thresholding) and pre-screening.
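The abstract does not spell out the procedure, so the following is only one plausible reading, not the paper's method: treat the first principal component across arrays as a pseudo-reference profile and rescale each array by its coefficient on that component. All names are illustrative.

```python
import numpy as np

def pca_global_normalize(X):
    """Sketch of one plausible PCA-based global normalization: the
    first principal component across arrays acts as a pseudo-reference
    profile, and each array is divided by its (mean-one) coefficient
    on that component.

    X : genes x arrays matrix of raw intensities
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    w = Vt[0] * np.sign(Vt[0].mean())   # per-array scale factors
    return X / (w / w.mean())           # rescale each array (column)

# Hypothetical check: arrays differing only by a global scale factor
# end up with near-equal mean intensities after normalization.
rng = np.random.default_rng(0)
profile = rng.gamma(2.0, 100.0, size=1000)
X = np.outer(profile, [1.0, 1.7, 0.6]) * rng.normal(1.0, 0.05, (1000, 3))
print(pca_global_normalize(X).mean(axis=0))
```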

6.
The authors tested a new procedure for the discrimination of evoked potentials (EPs) obtained in different stimulus situations. In contrast with principal component analysis (PCA), used so far for the purpose of data compression, the method referred to as canonical component analysis (CCA) is optimal for the purpose of discrimination. To illustrate this, the authors performed both PCA and CCA on the same material, then, after carrying out discriminant analysis (SDWA) on the data transformed in this way, compared the performance of the two procedures in discrimination. In view of both theoretical and practical considerations, the authors recommend that future EP studies use CCA instead of PCA for data reduction carried out for discrimination.
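To make the contrast concrete, here is a sketch on synthetic trial data; sklearn's LinearDiscriminantAnalysis stands in for the canonical (discriminant) step, since for class-labeled data its canonical variates play the role described above, while PCA ignores the labels entirely. The data and shapes are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical EP feature matrix: 50 trials per stimulus condition,
# 30 time samples each, with a small mean shift between conditions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 30)),
               rng.normal(0.5, 1.0, (50, 30))])
y = np.repeat([0, 1], 50)

Z_pca = PCA(n_components=5).fit_transform(X)      # variance-oriented
Z_can = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
# Z_can maximizes between-class separation; Z_pca ignores the labels,
# so discriminative information can be spread over late components.
```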

7.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond a standard spherical noise. This indiscriminate nature provides one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM-algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.
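The paper's model attaches a variance to each gene in each experiment; as a baseline for reference, here is a compact EM for standard probabilistic PCA with isotropic noise (in the style of Tipping and Bishop), which the gene- and experiment-specific extension generalizes. Dimensions and names are illustrative.

```python
import numpy as np

def ppca_em(Y, q, n_iter=200, seed=0):
    """EM for probabilistic PCA with isotropic noise: y ~ N(W z, s2 I).
    Y is (n_samples, d); returns the loadings W (d, q) and s2."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    Yc = Y - Y.mean(axis=0)
    W, s2 = rng.normal(size=(d, q)), 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables.
        Minv = np.linalg.inv(W.T @ W + s2 * np.eye(q))
        Ez = Yc @ W @ Minv                        # (n, q)
        Ezz = n * s2 * Minv + Ez.T @ Ez           # summed over samples
        # M-step: re-estimate loadings and noise variance.
        W = Yc.T @ Ez @ np.linalg.inv(Ezz)
        s2 = (np.sum(Yc ** 2) - 2 * np.sum(Ez * (Yc @ W))
              + np.trace(Ezz @ W.T @ W)) / (n * d)
    return W, s2

# Toy run: 100 samples in 5-D with 2 latent dimensions.
rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) \
    + rng.normal(0.0, 0.1, (100, 5))
W, s2 = ppca_em(Y, q=2)
print(s2)   # roughly the true noise variance, 0.01
```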

8.
Hörnquist M, Hertz J, Wahde M. Biosystems 2003, 71(3):311-317.
Large-scale expression data are today measured for thousands of genes simultaneously. This development has been followed by an exploration of theoretical tools to get as much information out of these data as possible. Several groups have used principal component analysis (PCA) for this task. However, since this approach is data-driven, care must be taken in order not to analyze the noise instead of the data. As a strong warning against uncritical use of the output from a PCA, we employ a newly developed procedure to judge the effective dimensionality of a specific data set. Although this data set is obtained during the development of the rat central nervous system, our finding is a general property of noisy time series data. Based on knowledge of the noise level of the data, we find that the effective number of dimensions that are meaningful to use in a PCA is much lower than what could be expected from the number of measurements. We attribute this fact both to effects of noise and to the lack of independence of the expression levels. Finally, we explore the possibility of increasing the dimensionality by performing more measurements within one time series, and conclude that this is not a fruitful approach.

9.
Spatiotemporal patterns of drought in Guangxi Province based on principal component analysis
Guangxi Province lies in a karst landform region where soil water retention is poor and annual precipitation is unevenly distributed in space and time, so studying the spatiotemporal distribution of drought is particularly important. Based on observations from 20 meteorological stations in Guangxi Province for 1981-2010 and HadGEM2-ES model simulations for 2011-2100, the Standardized Precipitation Index (SPI) and the Standardized Precipitation Evapotranspiration Index (SPEI) were used to analyze the spatiotemporal variation of drought in Guangxi Province. Principal component analysis (PCA) was applied to SPI and SPEI at different time scales to determine the spatial patterns of drought, revealing three spatially well-defined regions: northeastern Guangxi (PC1), northwestern Guangxi (PC2) and southern Guangxi (PC3). The spatiotemporal variation and frequency distribution of drought differed markedly among the regions. SPEI-12 showed a negative trend in PC1 and PC3 and a positive trend in PC2, and SPI-12 exceeded SPEI-12 in PC1, PC2 and PC3. At the annual scale (SPEI-12), drought frequencies in PC1 and PC3 were higher than in PC2, at 34.24% and 35.83%, respectively. SPI and SPEI at spatial and temporal scales...
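For readers unfamiliar with the indices: SPI is conventionally computed by fitting a distribution (commonly gamma) to accumulated precipitation and mapping its CDF to standard normal quantiles. A hedged sketch of that convention (zero-precipitation handling and per-calendar-month fitting omitted):

```python
import numpy as np
from scipy import stats

def spi(precip, scale=12):
    """Standardized Precipitation Index at a given time scale.

    precip : 1-D array of monthly precipitation totals
    scale  : accumulation window in months (12 = annual scale)
    """
    acc = np.convolve(precip, np.ones(scale), mode="valid")
    # Fit a gamma distribution, then map its CDF to normal quantiles.
    a, loc, b = stats.gamma.fit(acc, floc=0)
    cdf = stats.gamma.cdf(acc, a, loc=loc, scale=b)
    return stats.norm.ppf(np.clip(cdf, 1e-6, 1 - 1e-6))

# Toy series: 30 years of synthetic monthly rainfall.
rng = np.random.default_rng(0)
monthly = rng.gamma(2.0, 50.0, size=360)
spi12 = spi(monthly, scale=12)     # negative values indicate drought
```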

10.
Metagenomics is an emerging field in which the power of genomic analysis is applied to an entire microbial community, bypassing the need to isolate and culture individual microbial species. Assembly of metagenomic DNA fragments is much like the overlap-layout-consensus procedure for assembling isolated genomes, but is augmented by an additional binning step to assign scaffolds, contigs and unassembled reads to taxonomic groups. In this paper, we employed n-mer oligonucleotide frequencies as features and developed a hierarchical classifier (PCAHIER) for binning short (≤1,000 bp) metagenomic fragments. Principal component analysis was used to reduce the high dimensionality of the feature space. The hierarchical classifier consists of four layers of local classifiers implemented with linear discriminant analysis. These local classifiers are responsible for binning prokaryotic DNA fragments into superkingdoms, fragments of the same superkingdom into phyla, of the same phylum into genera, and of the same genus into species, respectively. We evaluated the performance of PCAHIER using our own simulated data sets as well as the widely used simHC synthetic metagenome data set from the IMG/M system. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and Phylopythia).
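A minimal sketch of one layer of such a pipeline, with made-up fragments and labels: n-mer frequency features, PCA for dimensionality reduction, then a linear discriminant classifier. Real training data would of course be genomic, not random.

```python
import numpy as np
from itertools import product
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4^k DNA n-mers in one fragment."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:            # skip n-mers with ambiguous bases
            counts[j] += 1
    return counts / max(counts.sum(), 1)

# Hypothetical training fragments with labels for one layer of the
# hierarchy (e.g. superkingdom).
rng = np.random.default_rng(0)
frags = ["".join(rng.choice(list("ACGT"), 1000)) for _ in range(40)]
labels = [i % 2 for i in range(40)]
X = np.array([kmer_frequencies(f) for f in frags])
clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
clf.fit(X, labels)
```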

11.
The measurement of five gait parameters, namely joint angular displacement of the lower extremities, floor reaction forces, the trajectory of the point of force application, a temporal factor and a distance factor, was performed with ease and at high speed using minicomputer on-line real-time processing. Gait data from 211 patients with hip diseases were normalized, quantified and summarized by principal component analysis. A 'gait evaluation plane' was formed from the results of the principal component analysis. Gait evaluation using the plane was compared with the clinical condition of patients, and it was evident that this system can evaluate the recovery of gait under treatment.

12.
Principal component analysis (PCA) is probably one of the most used methods for exploratory data analysis. However, it may not always be effective when there are multiple influential factors. In this paper, the use of multiblock PCA for analysing such data is demonstrated through a real metabolomics study combined with a series of simulated data sets containing two underlying influential factors with different types of interactions, based on 2 × 2 experimental designs. The performance of multiblock PCA is compared with those of PCA and of ANOVA-PCA, another PCA extension developed to solve similar problems. The results demonstrate that multiblock PCA is highly efficient at analysing data containing multiple influential factors. These models give the most comprehensive view of the data compared to the other two methods. The combination of super scores and block scores shows not only the general trends of change caused by each influential factor but also the subtle changes within each combination of the factors and their levels. Multiblock PCA is also highly resistant to the addition of 'irrelevant' competing information, the first PC remaining the most discriminant one, which neither of the other two methods achieved. The reason for this property was demonstrated by employing a 2 × 3 experimental design. Finally, the validity of the results given by multiblock PCA was tested using permutation tests, and the results suggest that the inherent risk of over-fitting with this type of approach is low.
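A sketch of the super-score/block-score idea under one common convention (consensus-PCA style; the study's exact algorithm may differ): scale each block, run a joint SVD on the concatenated blocks for super scores, then project each block onto its own slice of the loadings for block scores.

```python
import numpy as np

def multiblock_pca(blocks, n_components=2):
    """Consensus-PCA-style multiblock PCA sketch.

    blocks : list of (n_samples, p_k) matrices sharing the sample mode
    Returns super scores and one block-score matrix per block.
    """
    # Block scaling: center, then give every block equal total variance.
    scaled = []
    for B in blocks:
        Bc = B - B.mean(axis=0)
        scaled.append(Bc / np.sqrt(np.sum(Bc ** 2)))
    X = np.hstack(scaled)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:n_components].T                   # concatenated loadings
    super_scores = X @ P
    # Block scores: each block projected onto its own loading slice.
    block_scores, start = [], 0
    for Bs in scaled:
        stop = start + Bs.shape[1]
        block_scores.append(Bs @ P[start:stop])
        start = stop
    return super_scores, block_scores
```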

13.

14.

Background  

In contemporary biology, complex biological processes are increasingly studied by collecting and analyzing measurements of the same entities with different analytical platforms. Such data comprise a number of data blocks coupled via a common mode. The goal of collecting this type of data is to discover the biological mechanisms that underlie the behavior of the variables in the different data blocks. The simultaneous component analysis (SCA) family of data analysis methods is suited for this task. However, an SCA may be hampered by the data blocks being subject to different amounts of measurement error, or noise. To unveil the true mechanisms underlying the data, it can be fruitful to take noise heterogeneity into consideration in the analysis. Maximum likelihood based SCA (MxLSCA-P) was developed for this purpose. In a previous simulation study it outperformed normal SCA-P. That study, however, did not mimic in many respects typical functional genomics data sets: data blocks coupled via the experimental mode, more variables than experimental units, and medium to high correlations between variables. Here, we present a new simulation study in which the usefulness of MxLSCA-P compared to ordinary SCA-P is evaluated within a typical functional genomics setting. Subsequently, the performance of the two methods is evaluated by analysis of a real life Escherichia coli metabolomics data set.
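A crude sketch of the noise-weighting idea only (not the actual maximum likelihood estimator): each block, coupled to the others via the shared experimental mode, is divided by an estimate of its measurement-error standard deviation before a joint SVD. Names and inputs are illustrative.

```python
import numpy as np

def weighted_sca(blocks, noise_sds, n_components=2):
    """Noise-weighted simultaneous component sketch: blocks coupled
    via the shared experimental mode are divided by an estimate of
    their measurement-error standard deviation before a joint SVD,
    a crude stand-in for the maximum likelihood weighting."""
    X = np.hstack([(B - B.mean(axis=0)) / sd
                   for B, sd in zip(blocks, noise_sds)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]   # common scores
```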

15.
We have developed a program for microarray data analysis which features the false discovery rate for testing statistical significance and principal component analysis, using the singular value decomposition method, for detecting the global trends of gene-expression patterns. Additional features include analysis of variance with multiple methods for error variance adjustment, correction of cross-channel correlation for two-color microarrays, identification of genes specific to each cluster of tissue samples, biplots of tissues and corresponding tissue-specific genes, clustering of genes correlated with each principal component (PC), three-dimensional graphics based on virtual reality modeling language, and sharing of PCs between different experiments. The software also supports parameter adjustment, gene search and graphical output of results. The software is implemented as a web tool, so the speed of analysis does not depend on the power of the client computer. AVAILABILITY: The tool can be used on-line or downloaded at http://lgsun.grc.nia.nih.gov/ANOVA/
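The abstract does not name the FDR method; microarray tools of this era most often use the Benjamini-Hochberg step-up procedure (an assumption on my part for this particular program). For reference, a self-contained version:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure. Returns a boolean mask
    of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()     # largest passing rank
        reject[order[:k + 1]] = True
    return reject
```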

16.
We develop a new technique for analysing microarray data which uses a combination of principal components analysis and consensus ensemble k-clustering to find robust clusters and gene markers in the data. We apply our method to a public microarray breast cancer dataset which has expression levels of genes in normal samples as well as in three pathological stages of disease, namely atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC). Our method averages over clustering techniques and data perturbation to find stable, robust clusters and gene markers. We identify the clusters and their pathways with distinct subtypes of breast cancer (Luminal, Basal and Her2+). We confirm that the cancer phenotype develops early (in the early hyperplasia or ADH stage) and find from our analysis that each subtype progresses from ADH to DCIS to IDC along its own specific pathway, as if each were a distinct disease.
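A sketch of the consensus-ensemble idea under common conventions (the paper's exact perturbation and voting scheme may differ): repeated k-means runs on perturbed data vote into a co-association matrix, which is then clustered once more for stable labels.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_kmeans(X, k, n_runs=50, seed=0):
    """Repeated k-means on perturbed data votes into a co-association
    matrix, which is then clustered once more for robust labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for r in range(n_runs):
        Xp = X + rng.normal(0.0, 0.05 * X.std(), X.shape)   # perturb
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=r).fit_predict(Xp)
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_runs
    final = AgglomerativeClustering(n_clusters=k, linkage="average",
                                    metric="precomputed")
    return final.fit_predict(1.0 - coassoc)    # co-association distance
```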

17.
Comprehensive evaluation of the soil fertility of ancient trees based on principal component analysis
Taking the 40 registered ancient trees of Haizhu District, Guangzhou as the study objects, we surveyed their growth status, sampled the soil and measured 11 physicochemical indicators: pH, EC, bulk density, aeration, organic matter, total N, total P, total K, hydrolyzable N, available P and available K content. Principal component analysis and cluster analysis were used to comprehensively evaluate the soil fertility of the ancient trees. The results showed that the soil EC of most ancient trees in Haizhu District is low (<0.35 mS·cm-1) and strongly variable; soil organic matter...
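Such evaluations typically standardize the indicators, run PCA, and combine the component scores weighted by explained variance; a sketch of that standard construction (the study's exact weighting is not given in the abstract):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fertility_score(X, var_threshold=0.85):
    """Composite fertility score: principal component scores weighted
    by their share of explained variance.

    X : samples x indicators (pH, EC, organic matter, total N, ...)
    """
    Z = StandardScaler().fit_transform(X)
    pca = PCA().fit(Z)
    ratios = pca.explained_variance_ratio_
    k = int(np.searchsorted(np.cumsum(ratios), var_threshold)) + 1
    weights = ratios[:k] / ratios[:k].sum()
    return pca.transform(Z)[:, :k] @ weights   # one score per sample
```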

18.
Shannon entropy is used to provide an estimate of the number of interpretable components in a principal component analysis. In addition, several ad hoc stopping rules for dimension determination are reviewed and a modification of the broken stick model is presented. The modification incorporates a test for the presence of an "effective degeneracy" among the subspaces spanned by the eigenvectors of the correlation matrix of the data set, then allocates the total variance among subspaces. A summary of the performance of the methods applied to both published microarray data sets and to simulated data is given.
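Both kinds of rules are short to state in code. A sketch on the eigenvalues of the correlation matrix, reading the entropy estimate as an "effective number" of components, exp(H) (one common convention, assumed here), alongside the classic broken stick rule that the paper modifies:

```python
import numpy as np

def entropy_dimension(eigvals):
    """Effective number of components, exp(H), from the Shannon
    entropy H of the normalized eigenvalue spectrum."""
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

def broken_stick(eigvals):
    """Classic broken stick rule: keep components whose variance
    share exceeds the expected share of a randomly broken stick."""
    m = eigvals.size
    expected = np.array([np.sum(1.0 / np.arange(k, m + 1)) / m
                         for k in range(1, m + 1)])
    return int(np.sum(eigvals / eigvals.sum() > expected))
```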

19.
In this work we introduce, in the first part, new developments in principal component analysis (PCA) and, in the second part, a new method to select variables (genes in our application). Our focus is on problems where the values taken by each variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate for dealing with this kind of problem. In this context, we propose the use of a new correlation coefficient as an alternative to Pearson's. This leads to a so-called weighted PCA (WPCA). To illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of analyzing gene expression data sets. In the second part of this work, we propose a new PCA-based algorithm to iteratively select the most important genes in a microarray data set. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, by using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.
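The paper's correlation coefficient is its own; as a generic stand-in, here is a weighted-correlation PCA sketch in which plain observation weights down-weight noisy or outlying measurements:

```python
import numpy as np

def weighted_pca(X, w, n_components=2):
    """Weighted-correlation PCA sketch.

    X : observations x variables
    w : nonnegative observation weights (down-weight noisy rows)
    """
    w = w / w.sum()
    mu = w @ X                                    # weighted mean
    Xc = (X - mu) * np.sqrt(w)[:, None]
    C = Xc.T @ Xc                                 # weighted covariance
    d = np.sqrt(np.diag(C))
    R = C / np.outer(d, d)                        # weighted correlation
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]
    return ((X - mu) / d) @ eigvecs[:, order]     # component scores
```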

20.
A novel approach to fatigue assessment during dynamic contractions was proposed which projects multiple surface myoelectric parameters onto the vector connecting the temporal start and end points in feature space, in order to extract the long-term trend information. The proposed end-to-end (ETE) projection was compared to traditional principal component analysis (PCA) as well as neural-network implementations of linear (LPCA) and non-linear PCA (NLPCA). Nine healthy participants completed two repetitions of fatigue tests during isometric, cyclic and random fatiguing contractions of the biceps brachii. The fatigue assessments were evaluated in terms of a modified sensitivity to variability ratio (SVR), and each method used a set of time-domain and frequency-domain features which maximized the SVR. It was shown that there was no statistical difference among ETE, PCA and LPCA (p > 0.99) and that all three outperformed NLPCA (p < 0.0022). Future work will include a broader comparison of these methods to other new and established fatigue indices.
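The ETE projection is simple enough to state directly from the description above; a sketch with a hypothetical feature trajectory:

```python
import numpy as np

def ete_projection(F):
    """End-to-end projection: project a multi-feature time course onto
    the unit vector from its first to its last point in feature space,
    keeping only the long-term (fatigue) trend.

    F : (n_time_windows, n_features) myoelectric feature trajectory
    """
    direction = F[-1] - F[0]
    direction = direction / np.linalg.norm(direction)
    return (F - F[0]) @ direction    # scalar trend per time window

# Hypothetical trajectory: two features drifting with fatigue + noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)[:, None]
F = np.hstack([5.0 - 2.0 * t, 1.0 + 3.0 * t]) + rng.normal(0, 0.1, (200, 2))
trend = ete_projection(F)            # increases over the contraction
```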
