Similar Documents
20 similar documents found.
1.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form it does not take into account any error measures associated with the data points beyond standard spherical noise. This indiscriminate treatment of noise is one of its main weaknesses when it is applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM algorithm to estimate the parameters of the new model. The model provides significantly better results than standard PCA while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is obtained automatically.
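No implementation accompanies the abstract; the sketch below is only a crude illustration of letting per-measurement variances influence a PCA, by scaling each centred expression value by its inverse standard error before an ordinary PCA. The array names `X` and `V` are placeholders, and the paper's actual model, which propagates the variances through a dedicated EM algorithm, is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: X holds expression levels (genes x experiments),
# V holds matching per-entry variances from probe-level analysis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
V = rng.uniform(0.1, 2.0, size=X.shape)

# Crude surrogate for variance-aware PCA: down-weight noisy entries
# by their standard errors before an ordinary PCA (not the paper's EM model).
Xw = (X - X.mean(axis=0)) / np.sqrt(V)

pca = PCA(n_components=5)
scores = pca.fit_transform(Xw)
print(pca.explained_variance_ratio_)
```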

2.
MOTIVATION: Multilayer perceptrons (MLPs) are among the most widely used and effective machine learning methods currently applied to diagnostic classification based on high-dimensional genomic data. Since the dimensionality of existing genomic data often exceeds the available sample size by orders of magnitude, MLP performance may degrade owing to the curse of dimensionality and over-fitting, and may not provide acceptable prediction accuracy. RESULTS: Based on Fisher linear discriminant analysis, we designed and implemented an optimization scheme for a two-layer MLP that effectively optimizes the initialization of the MLP parameters and the MLP architecture. The optimized MLP consistently demonstrated its ability to ease the curse of dimensionality in large microarray datasets. In comparison with a conventional MLP using random initialization, we obtained significant improvements in major performance measures, including Bayes classification accuracy, convergence properties and area under the receiver operating characteristic curve (Az). SUPPLEMENTARY INFORMATION: Supplementary information is available at http://www.cbil.ece.vt.edu/publications.htm
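The paper's exact initialization scheme is not given here; the sketch below only illustrates the general idea of seeding a two-layer MLP with a Fisher discriminant direction, using scikit-learn's LDA and a small PyTorch network on placeholder data. Layer sizes, learning rate and which hidden unit is seeded are all assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class, high-dimensional data (placeholder for a microarray matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200)).astype(np.float32)
y = rng.integers(0, 2, size=60)

# Fisher direction from LDA (binary case: a single discriminant vector).
lda = LinearDiscriminantAnalysis().fit(X, y)
w_fisher = torch.tensor(lda.coef_[0], dtype=torch.float32)

# Two-layer MLP; seed the first hidden unit with the Fisher direction,
# leave the remaining units randomly initialized.
mlp = nn.Sequential(nn.Linear(200, 8), nn.Tanh(), nn.Linear(8, 2))
with torch.no_grad():
    mlp[0].weight[0] = w_fisher / w_fisher.norm()

opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
Xt, yt = torch.tensor(X), torch.tensor(y, dtype=torch.long)
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(mlp(Xt), yt)
    loss.backward()
    opt.step()
```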

3.
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, traditional PCA has an obvious disadvantage when it is applied to data where interpretability is important. In applications where the features have physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity into the loading vectors of the principal components. Sparse PCA can be formulated as an ℓ1-regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance-reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition that allows a broader class of objectives. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
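Cvx-SPCA itself is not reproduced here. As a minimal illustration of the ℓ1/proximal idea the abstract refers to, the sketch below sparsifies the leading loading vector with a soft-thresholding (proximal) step inside a plain power iteration; the threshold `lam` and iteration count are arbitrary assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_leading_pc(X, lam=0.05, n_iter=200):
    """Soft-thresholded power iteration for one sparse loading vector."""
    S = np.cov(X, rowvar=False)
    v = np.random.default_rng(0).normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = soft_threshold(S @ v, lam)
        n = np.linalg.norm(v)
        if n == 0:          # threshold too aggressive, everything zeroed out
            break
        v /= n
    return v

X = np.random.default_rng(1).normal(size=(300, 50))
v = sparse_leading_pc(X)
print("non-zero loadings:", np.count_nonzero(v))
```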

4.
The recent explosion in the procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation on the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', which occurs when the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent nonlinear structure underlying most biomedical data. The motivation behind this work is to identify the appropriate DR methods for the analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear DR schemes (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
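A scikit-learn sketch of computing the six embeddings the study compares, on placeholder data, might look as follows; the neighborhood sizes and target dimensions are assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Placeholder expression matrix and class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "LDA": LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y),
    "MDS": MDS(n_components=2).fit_transform(X),
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X),
    "Laplacian Eigenmaps": SpectralEmbedding(n_neighbors=10, n_components=2).fit_transform(X),
}
for name, Z in embeddings.items():
    print(name, Z.shape)
```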

5.
In recent years, the intrinsic low-rank structure of some datasets has been extensively exploited to reduce dimensionality, remove noise and complete missing entries. As a well-known technique for dimensionality reduction and data compression, Generalized Low Rank Approximations of Matrices (GLRAM) claims superiority over the SVD in computation time and compression ratio. However, GLRAM is very sensitive to sparse large noise or outliers, and its robust version has not yet been explored or solved. To address this problem, this paper proposes a robust method for GLRAM, named Robust GLRAM (RGLRAM). We first formulate RGLRAM as an ℓ1-norm optimization problem that minimizes the ℓ1-norm of the approximation errors. Secondly, we apply the technique of Augmented Lagrange Multipliers (ALM) to solve this ℓ1-norm minimization problem and derive a corresponding iterative scheme. We then discuss the weak convergence of the proposed algorithm under mild conditions. Next, we investigate a special case of RGLRAM and extend RGLRAM to the general tensor case. Finally, extensive experiments on synthetic data show that RGLRAM can exactly recover both the low-rank and the sparse components where previous state-of-the-art algorithms may struggle. We also discuss three issues concerning RGLRAM: sensitivity to initialization, generalization ability, and the relationship between the running time and the size/number of matrices. Moreover, experimental results on images of faces with large corruptions illustrate that RGLRAM achieves better denoising and compression performance than other methods.
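The robust ℓ1/ALM algorithm proposed in the paper is not reproduced here. The sketch below implements only the plain GLRAM alternating scheme that RGLRAM builds on, for a list of matrices `As` and assumed target dimensions `r1`, `r2`.

```python
import numpy as np

def glram(As, r1, r2, n_iter=30):
    """Plain (non-robust) GLRAM: find projections L, R with orthonormal
    columns maximizing sum_i ||L^T A_i R||_F^2 by alternating eigenproblems."""
    m, n = As[0].shape
    R = np.eye(n, r2)
    for _ in range(n_iter):
        ML = sum(A @ R @ R.T @ A.T for A in As)
        L = np.linalg.eigh(ML)[1][:, -r1:]      # top-r1 eigenvectors
        MR = sum(A.T @ L @ L.T @ A for A in As)
        R = np.linalg.eigh(MR)[1][:, -r2:]      # top-r2 eigenvectors
    Ms = [L.T @ A @ R for A in As]              # compressed core matrices
    return L, R, Ms

As = [np.random.default_rng(i).normal(size=(40, 30)) for i in range(10)]
L, R, Ms = glram(As, r1=5, r2=5)
print(Ms[0].shape)   # (5, 5)
```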

6.
Nguyen PH. Proteins. 2006;65(4):898-913.
Employing the recently developed hierarchical nonlinear principal component analysis (NLPCA) method of Saegusa et al. (Neurocomputing 2004;61:57-70 and IEICE Trans Inf Syst 2005;E88-D:2242-2248), the complexities of the free energy landscapes of several peptides, including triglycine, hexaalanine, and the C-terminal beta-hairpin of protein G, were studied. First, the performance of this NLPCA method was compared with standard linear principal component analysis (PCA). In particular, the two methods were compared with respect to (1) their ability to reduce dimensionality and (2) the efficiency with which they represent peptide conformations in low-dimensional spaces spanned by the first few principal components. The study revealed that NLPCA reduces the dimensionality of the considered systems much better than PCA does. For example, to achieve a similar error in representing the original beta-hairpin data in a low-dimensional space, one needs 4 principal components with NLPCA but 21 with PCA. Second, by representing the free energy landscapes of the considered systems as a function of the first two principal components obtained from PCA, we obtained relatively well-structured free energy landscapes. In contrast, the free energy landscapes from NLPCA are much more complicated, exhibiting many states that are hidden in the PCA maps, especially in the unfolded regions. Furthermore, the study also showed that many states in the PCA maps mix several peptide conformations, while those in the NLPCA maps are purer. This finding suggests that NLPCA should be used to capture the essential features of these systems.
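Saegusa's hierarchical NLPCA is not reproduced here; as a rough stand-in, the sketch below trains a generic autoencoder with a two-unit bottleneck in PyTorch and compares its reconstruction error with a two-component linear PCA on placeholder conformational features. Network sizes and training settings are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Placeholder for dihedral-angle features of peptide conformations.
X = np.random.default_rng(0).normal(size=(2000, 30)).astype(np.float32)

# Generic autoencoder with a 2-unit bottleneck (a stand-in for NLPCA;
# not the hierarchical network of Saegusa et al.).
net = nn.Sequential(
    nn.Linear(30, 16), nn.Tanh(), nn.Linear(16, 2),   # encoder
    nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 30),   # decoder
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
Xt = torch.tensor(X)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(Xt) - Xt) ** 2).mean()
    loss.backward()
    opt.step()

# Compare reconstruction error with a 2-component linear PCA.
pca = PCA(n_components=2).fit(X)
lin_err = ((pca.inverse_transform(pca.transform(X)) - X) ** 2).mean()
print(float(loss), float(lin_err))
```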

7.
The analysis of high-dimensional protein spectral data for cancer has long been hampered by the data's high dimensionality. To address the problems encountered when reducing the dimensionality of such data, a feature-extraction method based on wavelet analysis and principal component analysis is proposed for high-dimensional protein spectral cancer data, with a support vector machine used for classification after feature extraction. Applying a two-level wavelet decomposition to the 8-7-02 dataset with the db1, db3, db4, db6, db8, db10 and haar wavelet bases and classifying with a support vector machine yields accuracies of 98.18%, 98.35%, 98.04%, 98.36%, 97.89%, 97.96% and 98.20%, respectively. The method further improves classification accuracy while also improving time efficiency.
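A minimal sketch of the described pipeline with PyWavelets and scikit-learn is shown below on synthetic placeholder spectra; the 8-7-02 dataset itself is not loaded, and keeping only the level-2 approximation coefficients as features is an assumption about the feature-extraction step.

```python
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder spectra (rows = samples) and two-class labels.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 1024))
labels = rng.integers(0, 2, size=200)

def wavelet_features(x, wavelet="db4", level=2):
    """Two-level wavelet decomposition; keep the level-2 approximation
    coefficients as a reduced feature vector."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    return coeffs[0]

features = np.array([wavelet_features(s) for s in spectra])
print("reduced dimensionality:", features.shape[1])
print(cross_val_score(SVC(kernel="rbf"), features, labels, cv=5).mean())
```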

8.
High dimensionality, small sample sizes and the inherent risk of overfitting pose great challenges for constructing efficient classifiers in microarray data classification. A feature selection technique should therefore be applied prior to classification to enhance prediction performance. In general, filter methods can be considered as a principal or auxiliary selection mechanism because of their simplicity, scalability, and low computational complexity. However, simple examples show that filter methods give less accurate performance because they ignore dependencies among features. Although a few publications have used multivariate methods to reveal relationships among features, these methods describe such relationships only linearly, and simple linear combinations restrict the achievable improvement in performance. In this paper, we use a kernel method to discover inherent nonlinear correlations among features, as well as between features and the target. Moreover, the number of orthogonal components is determined by kernel Fisher linear discriminant analysis (FLDA) in a self-adaptive manner rather than by manual parameter settings. To demonstrate the effectiveness of our method, we performed several experiments comparing it with other competitive multivariate feature selectors. In our comparison, we used two classifiers (support vector machine and k-nearest neighbor) on two groups of datasets, namely two-class and multi-class datasets. Experimental results demonstrate that the performance of our method is better than that of the others, especially on three hard-to-classify datasets, namely Wang's Breast Cancer, Gordon's Lung Adenocarcinoma and Pomeroy's Medulloblastoma.
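The paper's kernel-FLDA component selection is not reproduced here. As one illustration of scoring nonlinear feature-target dependence with a kernel, the sketch below ranks features by an empirical HSIC statistic and classifies on the top-ranked subset; the kernel width, subset size and the use of HSIC itself are assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def hsic(x, y, gamma=1.0):
    """Empirical HSIC between a single feature x and the target y."""
    n = len(x)
    K = rbf_kernel(x.reshape(-1, 1), gamma=gamma)
    L = rbf_kernel(y.astype(float).reshape(-1, 1), gamma=gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))        # placeholder expression matrix
y = rng.integers(0, 2, size=100)

scores = np.array([hsic(X[:, j], y) for j in range(X.shape[1])])
top = np.argsort(scores)[-50:]          # keep the 50 most relevant features
print(cross_val_score(KNeighborsClassifier(), X[:, top], y, cv=5).mean())
```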

9.
For an adequate analysis of pathological speech signals, a sizeable number of parameters is required, such as those related to jitter, shimmer and noise content. Often this kind of high-dimensional signal representation is difficult to understand, even for expert voice therapists and physicians. Visualization of a high-dimensional dataset can provide a useful first step in its exploratory data analysis, facilitating an understanding of its underlying structure. In the present paper, eight dimensionality reduction techniques, both classical and recent, are compared on speech data containing normal and pathological speech. A qualitative analysis of their dimensionality reduction capabilities is presented. The transformed data are also quantitatively evaluated using classifiers, and it is found that it may be advantageous to perform the classification on the transformed data rather than on the original data. These qualitative and quantitative analyses allow us to conclude that a nonlinear, supervised method, kernel local Fisher discriminant analysis, is superior for dimensionality reduction in this context.

10.
Graphical techniques have become powerful tools for the visualization and analysis of complicated biological systems. However, such a graphical representation cannot be given in 2D/3D space when the data have more than three dimensions. The proposed method, a combination dimensionality reduction approach (CDR), consists of two parts: (i) principal component analysis (PCA) with a newly defined parameter ρ and (ii) locally linear embedding (LLE) with a proposed graphical selection procedure for its optional parameter k. The CDR approach with ρ and k not only avoids loss of principal information, but also preserves the global high-dimensional structure sufficiently well in a low-dimensional space such as 2D or 3D. Applications of CDR to the analysis of characteristics at different codon positions in genomes show that the method is a useful tool with which biologists can uncover biological knowledge.
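A minimal sketch of the two-stage CDR idea (PCA followed by LLE) is given below on placeholder data; the paper's ρ criterion for choosing the intermediate dimension and its graphical selection of k are not reproduced, so both values are fixed arbitrarily here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# Placeholder high-dimensional codon-usage features.
X = np.random.default_rng(0).normal(size=(300, 64))

# Stage 1: PCA to an intermediate dimension (the paper selects this via
# its rho parameter; 20 is an arbitrary assumption here).
X_pca = PCA(n_components=20).fit_transform(X)

# Stage 2: LLE down to 2-D for plotting (k is chosen graphically in the
# paper; fixed to 12 here).
X_2d = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X_pca)
print(X_2d.shape)
```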

11.
Three-dimensional gait analysis (3D-GA) is commonly used to answer clinical questions of the form “which joints and which variables are most affected, and when”. When studying high-dimensional datasets, traditional dimension reduction methods (e.g. principal components analysis) require “data flattening”, which may make the ensuing solutions difficult to interpret. The aim of the present study is to present a case study of how a multi-dimensional dimension reduction technique, Parallel Factor 2 (PARAFAC2), provides a clinically interpretable set of solutions for typical biomechanical datasets in which different variables are collected during walking and running. Three-dimensional kinematic and kinetic data used for the present analyses came from two publicly available datasets on walking (n = 33) and running (n = 28). For each dataset, a four-dimensional array was constructed as follows: mode A was time-normalized cycle points; mode B was the number of participants multiplied by the number of speed conditions tested; mode C was the number of joint degrees of freedom; and mode D was the variable type (angle, velocity, moment, power). Five factors for walking and four factors for running were extracted, explaining 79.23% and 84.64% of the respective dataset's variance. The factor explaining the greatest variance was swing-phase sagittal-plane knee kinematics (walking) and kinematics and kinetics (running). Qualitatively, all extracted factors increased in magnitude with greater speed in both walking and running. This study is a proof of concept that PARAFAC2 is useful for performing dimension reduction and producing clinically interpretable solutions to guide clinical decision making.

12.
Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies identified learning mechanisms based on reward and punishment, such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state space is sufficiently small or its structure is sufficiently simple. The question therefore arises of how the problem of learning from high-dimensional data is solved in the brain. In this article, we propose a biologically plausible, generic two-stage learning system that can be applied directly to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time comparable to that needed when an explicit, highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
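The full hierarchical SFA network and the reward-driven stage are beyond a short sketch; the code below shows only the core linear SFA step (whiten, then pick the directions whose temporal derivatives have the smallest variance) on a toy signal, as an assumed simplification of the preprocessing stage.

```python
import numpy as np

def linear_sfa(X, n_out):
    """Linear slow feature analysis: whiten the signal, then find the
    directions whose temporal derivative has the smallest variance."""
    X = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W_white = Vt.T * (np.sqrt(len(X)) / s)      # whitening matrix
    Z = X @ W_white
    dZ = np.diff(Z, axis=0)                     # temporal differences
    evals, evecs = np.linalg.eigh(dZ.T @ dZ / len(dZ))
    return W_white @ evecs[:, :n_out]           # slowest directions first

# Toy input stream: a slow sinusoid hidden among faster channels, then mixed.
t = np.linspace(0, 20 * np.pi, 5000)
X = np.column_stack([np.sin(0.05 * t)] + [np.sin((3 + i) * t) for i in range(9)])
X = X @ np.random.default_rng(0).normal(size=(10, 10))
W = linear_sfa(X, n_out=2)
slow_signals = (X - X.mean(axis=0)) @ W
```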

13.
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers accuracy identical to existing tools in extracting the top principal components, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
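flashpca itself is a standalone tool; the sketch below only illustrates the underlying randomized-SVD idea with scikit-learn on a placeholder genotype matrix, and the standardization and component count are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder genotype dosage matrix (individuals x SNPs, values 0/1/2).
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(1000, 20000)).astype(np.float32)

# Standardize SNP columns, then extract the top PCs with a randomized solver,
# which avoids the full eigendecomposition that makes classical PCA slow here.
G -= G.mean(axis=0)
G /= G.std(axis=0) + 1e-8
pcs = PCA(n_components=10, svd_solver="randomized", random_state=0).fit_transform(G)
print(pcs.shape)   # (1000, 10)
```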

14.
The analysis of polychoric correlations via principal component analysis and exploratory factor analysis is a well-known approach to determining the dimensionality of ordered categorical items. However, the application of these approaches has been considered critical because of the possible indefiniteness of the polychoric correlation matrix. A possible solution to this problem is the application of smoothing algorithms. By means of a simulation study, this study compared the effects on the accuracy of various variations of parallel analysis of three smoothing algorithms: one based on the Frobenius norm, one on adaptation of the eigenvalues and eigenvectors, and one on minimum-trace factor analysis. We simulated datasets that varied with respect to the size of the respondent sample, the size of the item set, the underlying factor model, the skewness of the response distributions and the number of response categories per item. We found that parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations led to the most accurate results in detecting the number of major factors in the simulated datasets, compared with the other methods we investigated. Of the methods used for smoothing polychoric correlation matrices, we recommend the algorithm based on minimum-trace factor analysis.
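The authors' recommended minimum-trace smoother is not reproduced here. The sketch below shows the simpler eigenvalue-adaptation family of smoothers mentioned in the abstract: clip the eigenvalues of an indefinite correlation matrix and rescale back to unit diagonal. The eigenvalue floor is an arbitrary assumption.

```python
import numpy as np

def smooth_by_eigenvalues(R, floor=1e-4):
    """Make an indefinite correlation matrix positive definite by clipping
    its eigenvalues and rescaling back to unit diagonal (the 'eigenvalue
    adaptation' family of smoothers; not minimum-trace factor analysis)."""
    vals, vecs = np.linalg.eigh(R)
    vals = np.clip(vals, floor, None)
    S = vecs @ np.diag(vals) @ vecs.T
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# A deliberately indefinite "polychoric-like" matrix for illustration.
R = np.array([[1.0, 0.95, 0.10],
              [0.95, 1.0, 0.90],
              [0.10, 0.90, 1.0]])
print(np.linalg.eigvalsh(R))                        # one negative eigenvalue
print(np.linalg.eigvalsh(smooth_by_eigenvalues(R))) # all positive
```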

15.

Background

The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets.

Methodology/Findings

A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects.

Conclusions/Significance

We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets.

16.
Principal component analysis is a powerful tool in biomechanics for reducing complex multivariate datasets to a subset of important parameters. However, interpreting the biomechanical meaning of these parameters can be a subjective process. Biomechanical interpretations based on visual inspection of extreme 5th and 95th percentile waveforms may be confounded when the extreme waveforms express more than one biomechanical feature. This study compares the interpretation of principal components using representative extremes with a recently developed method, single component reconstruction, which provides an uncontaminated visualization of each individual biomechanical feature. Example datasets from knee joint moments, lateral gastrocnemius EMG, and lumbar spine kinematics are used to demonstrate that the representative extremes method and single component reconstruction can yield equivalent interpretations of principal components. However, interpretation via single component reconstruction cannot be contaminated by other components, which may enhance the use and understanding of principal component analysis within the biomechanics community.
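A minimal numpy/scikit-learn sketch of the single component reconstruction idea is shown below on placeholder waveforms: each waveform is rebuilt from one principal component at a time, so the visualized variation cannot be contaminated by the other components. The data shapes and the number of components retained are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder waveforms: 50 subjects x 101 time-normalized points.
rng = np.random.default_rng(0)
waveforms = rng.normal(size=(50, 101)).cumsum(axis=1)

pca = PCA(n_components=3).fit(waveforms)
scores = pca.transform(waveforms)

def single_component_reconstruction(j):
    """Rebuild every waveform using only component j, so the visualized
    variation reflects that single biomechanical feature."""
    return pca.mean_ + np.outer(scores[:, j], pca.components_[j])

pc1_only = single_component_reconstruction(0)
print(pc1_only.shape)   # (50, 101)
```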

17.
18.
Biodiversity can be represented by different dimensions. While many diversity metrics try to capture the variation in these dimensions, they also lead to a 'fragmentation' of the concept of biodiversity itself. Developing a unified measure that integrates all the dimensions of biodiversity is a theoretical solution to this problem; however, it remains operationally impossible. Alternatively, understanding which dimensions best represent the biodiversity of a set of communities can be a reliable way to integrate the different diversity metrics. Therefore, to achieve a holistic understanding of biological diversity, we explore the concept of dimensionality. We define the dimensionality of diversity as the number of complementary components of biodiversity, represented by diversity metrics, needed to describe biodiversity in an unambiguous and effective way. We provide a solution that joins two components of dimensionality, correlation and variation, operationalized through two metrics: evenness of eigenvalues (EE) and importance values (IV). Through simulation we show that considering EE and IV together can provide information that is neglected when only EE is considered. We demonstrate how to apply this framework by investigating the dimensionality of South American small mammal communities. Our example showed that, for some representations of biological diversity, more attention is needed in the choice of the diversity metrics used to characterize biodiversity effectively. We conclude by highlighting that this integrated framework provides a better understanding of dimensionality than considering only the correlation component.

19.
Chen L, Liu Y. 生物信息学. 2011;9(3):229-234.
Gene chip technology is an important research tool in genomics. Gene chip (microarray) data are typically high-dimensional, which makes dimensionality reduction a necessary step in microarray data analysis. This paper analyzes the lung cancer microarray data provided by G. J. Gordon et al. of Harvard Medical School. Feature attributes are extracted from the microarray data with a t-test and with the Wilcoxon rank-sum test, respectively; a classification tree is then grown extensively from the extracted features with the CART (Classification and Regression Tree) algorithm, using the Gini diversity index as the error function, and subsequently pruned to find the tree of optimal size, so as to improve its generalization to new data. Experiments show that the method classifies the lung cancer microarray data with an accuracy above 96%, does so stably, and yields easily understood classification rules and key discriminating genes.
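A sketch of the described pipeline with SciPy and scikit-learn on placeholder data follows; the Gordon lung-cancer dataset is not downloaded here, the number of genes retained is arbitrary, and scikit-learn's cost-complexity pruning (`ccp_alpha`) merely stands in for the optimal-subtree search described in the abstract.

```python
import numpy as np
from scipy.stats import ttest_ind, ranksums
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Placeholder microarray matrix (samples x genes) and two-class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 12533))
y = rng.integers(0, 2, size=180)

# Feature extraction: rank genes by t-test p-value (Wilcoxon rank-sum is
# the alternative nonparametric ranking mentioned in the abstract).
g0, g1 = X[y == 0], X[y == 1]
p_t = ttest_ind(g0, g1, axis=0).pvalue
p_w = np.array([ranksums(g0[:, j], g1[:, j]).pvalue for j in range(X.shape[1])])
top = np.argsort(p_t)[:50]          # keep the 50 most significant genes

# CART with the Gini index, followed by cost-complexity pruning
# (ccp_alpha is an arbitrary stand-in for the optimal-size tree search).
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
print(cross_val_score(tree, X[:, top], y, cv=5).mean())
```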

20.
Memory-based learning schemes are characterized by their memorization of observed events. A memory-based learning scheme either memorizes the collected data directly or reorganizes the information and stores it distributively in a tabular memory. For the tabular type, the system requires a training process to determine the contents of the associative memory; this training process helps filter out zero-mean noise. Since the stored data are associated with pre-assigned input locations, memory management and data retrieval are easier and more efficient. Despite these merits, a drawback of tabular schemes is the difficulty of applying them to high-dimensional problems, owing to the curse of dimensionality. As the input dimensionality increases, the number of quantized elements in the input space increases at an exponential rate, which creates a huge demand for memory. In this paper, a dynamic tabular structure is proposed to relax this demand. The memorized data are organized as part of a k-d tree. Nodes in the tree, called vertices, correspond to regularly assigned input points. Memory resources are allocated only at locations where they are needed. Being able to compute the vertex positions easily helps reduce the search cost in data retrieval. In addition, the learning process is able to expand the tree structure into one covering the problem domain. With memory allocated on demand, memory consumption becomes closely related to task complexity rather than input dimensionality, and is minimized. Algorithms for structure construction and training are presented in this paper. Simulation results demonstrate that the memory can be utilized efficiently. The developed scheme offers a solution for high-dimensional learning problems with an effective input domain of manageable size.
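The paper's k-d-tree organization and training algorithms are not reproduced here; the sketch below only illustrates the allocate-on-demand idea with a dictionary-keyed quantized table trained by a simple delta rule, with an arbitrary resolution and learning rate.

```python
import numpy as np

class DynamicTable:
    """Quantized associative memory that allocates cells only when an input
    actually lands in them (allocate-on-demand; a dict stands in for the
    paper's k-d-tree organization of vertices)."""

    def __init__(self, resolution=0.1, lr=0.2):
        self.resolution = resolution
        self.lr = lr
        self.cells = {}                       # key: quantized input tuple

    def _key(self, x):
        return tuple(np.floor(np.asarray(x) / self.resolution).astype(int))

    def predict(self, x):
        return self.cells.get(self._key(x), 0.0)

    def train(self, x, target):
        k = self._key(x)
        value = self.cells.get(k, 0.0)        # cell is created on first visit
        self.cells[k] = value + self.lr * (target - value)   # delta rule

# Learn a noisy 3-D function; memory grows with the states actually visited,
# not with the full size of the quantized input space.
table = DynamicTable()
rng = np.random.default_rng(0)
for _ in range(5000):
    x = rng.uniform(0, 1, size=3)
    table.train(x, np.sin(x.sum()) + rng.normal(scale=0.05))
print("allocated cells:", len(table.cells))
```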
