Similar Articles
20 similar articles found (search time: 921 ms).
1.
GWAS has greatly facilitated the discovery of risk SNPs associated with complex diseases. Traditional methods analyze each SNP individually and are limited by low power and poor reproducibility, since correction for multiple comparisons is necessary. Several methods have been proposed that group SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the linear kernel machine based test (LKM) and the principal components analysis based approach (PCA) using simulated datasets under scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with a linear or identity-by-state (IBS) kernel are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or if the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. The simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.
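As an illustration of the PCA-based SNP-set test described above, here is a minimal sketch on simulated genotypes (all data and parameter choices are hypothetical, not taken from the study): the phenotype is regressed on the top principal components of the SNP set, and the joint significance of the PCs is assessed with an F-test against an intercept-only model.

```python
# Minimal sketch of a PCA-based SNP-set association test (hypothetical data).
import numpy as np
from sklearn.decomposition import PCA
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_subjects, n_snps = 500, 30
G = rng.binomial(2, 0.3, size=(n_subjects, n_snps)).astype(float)  # SNP set, coded 0/1/2
y = 0.4 * G[:, 5] + rng.normal(size=n_subjects)                    # phenotype with one causal SNP

n_pcs = 5                                            # number of PCs retained from the SNP set
pcs = PCA(n_components=n_pcs).fit_transform(G)

full = sm.OLS(y, sm.add_constant(pcs)).fit()         # phenotype ~ PCs
null = sm.OLS(y, np.ones((n_subjects, 1))).fit()     # intercept-only model
f_stat, p_value, _ = full.compare_f_test(null)       # joint test of all PCs
print("F = %.2f, p = %.3g" % (f_stat, p_value))
```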

2.
H Gao  T Zhang  Y Wu  Y Wu  L Jiang  J Zhan  J Li  R Yang 《Heredity》2014,113(6):526-532
Given the drawbacks of implementing multivariate analysis for mapping multiple traits in genome-wide association studies (GWAS), principal component analysis (PCA) has been widely used to generate independent 'super traits' from the original multivariate phenotypic traits for univariate analysis. However, parameter estimates in this framework may not be the same as those from the joint analysis of all traits, leading to spurious linkage results. In this paper, we propose to perform PCA on the residual covariance matrix instead of the phenotypic covariance matrix, based on which the multiple traits are transformed into a group of pseudo principal components. PCA of the residual covariance matrix allows each pseudo principal component to be analyzed separately. In addition, all parameter estimates are equivalent to those obtained from the joint multivariate analysis under a linear transformation. Moreover, a fast least absolute shrinkage and selection operator (LASSO) for estimating the sparse oversaturated genetic model greatly reduces the computational cost of the procedure. Extensive simulations show the statistical and computational efficiency of the proposed method. We illustrate the method in a GWAS of 20 slaughtering and meat quality traits in beef cattle.
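A minimal sketch of the residual-covariance PCA step on simulated data (the fixed-effect design, trait values, and any downstream LASSO scan are placeholders, not the authors' pipeline):

```python
# Minimal sketch: transform multiple traits into pseudo principal components
# via PCA of the residual covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
n, n_traits = 300, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])            # fixed effects (hypothetical)
Y = rng.normal(size=(n, n_traits)) @ rng.normal(size=(n_traits, n_traits))  # correlated traits

# 1. Regress each trait on the fixed effects and keep the residuals.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
R = Y - X @ B

# 2. Eigen-decompose the residual covariance matrix (PCA on residuals).
cov_R = np.cov(R, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov_R)
V = eigvec[:, np.argsort(eigval)[::-1]]                          # loading matrix

# 3. Linearly transform the original traits into pseudo principal components;
#    each column could then be scanned separately (e.g. with a LASSO-based GWAS).
Y_pseudo = Y @ V
print(Y_pseudo.shape)
```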

3.
The purpose of many microarray studies is to find the association between gene expression and sample characteristics such as treatment type or sample phenotype. There has been a surge of effort in developing methods for delineating this association. Aside from the high dimensionality of microarray data, one well-recognized challenge is that genes can be intricately interrelated, making many statistical methods inappropriate to apply directly to the expression data. Multivariate methods such as principal component analysis (PCA) and clustering are often used as part of the effort to capture the gene correlation, and the derived components or clusters are used to describe the association between gene expression and sample phenotype. We propose a method for patient population dichotomization using maximally selected test statistics in combination with PCA, which shows favorable results. The proposed method is compared with a currently well-recognized method.

4.
5.
We compared the accuracies of four genomic-selection prediction methods as affected by marker density, level of linkage disequilibrium (LD), quantitative trait locus (QTL) number, sample size, and level of replication in populations generated from multiple inbred lines. Marker data on 42 two-row spring barley inbred lines were used to simulate high and low LD populations from multiple inbred line crosses: the first included many small full-sib families and the second was derived from five generations of random mating. True breeding values (TBV) were simulated on the basis of 20 or 80 additive QTL. Methods used to derive genomic estimated breeding values (GEBV) were random regression best linear unbiased prediction (RR–BLUP), Bayes-B, a Bayesian shrinkage regression method, and BLUP from a mixed model analysis using a relationship matrix calculated from marker data. Using the best methods, accuracies of GEBV were comparable to accuracies from phenotype for predicting TBV without requiring the time and expense of field evaluation. We identified a trade-off between a method's ability to capture marker-QTL LD vs. marker-based relatedness of individuals. The Bayesian shrinkage regression method primarily captured LD, the BLUP methods captured relationships, while Bayes-B captured both. Under most of the study scenarios, mixed-model analysis using a marker-derived relationship matrix (BLUP) was more accurate than methods that directly estimated marker effects, suggesting that relationship information was more valuable than LD information. When markers were in strong LD with large-effect QTL, or when predictions were made on individuals several generations removed from the training data set, however, the ranking of method performance was reversed and BLUP had the lowest accuracy.
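For orientation, a minimal RR-BLUP-style sketch on simulated marker data (the simulation settings and the shrinkage parameter are illustrative assumptions, not those of the study): marker effects are shrunk by ridge regression and summed into GEBVs, whose accuracy is measured as the correlation with the simulated TBV.

```python
# Minimal sketch of ridge-regression (RR-BLUP-like) genomic prediction on simulated data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_train, n_test, n_markers, n_qtl = 200, 50, 500, 20
M = rng.binomial(2, 0.5, size=(n_train + n_test, n_markers)) - 1.0   # markers coded -1/0/1
qtl = rng.choice(n_markers, n_qtl, replace=False)
effects = rng.normal(size=n_qtl)
tbv = M[:, qtl] @ effects                               # true breeding values
y = tbv + rng.normal(scale=tbv.std(), size=len(tbv))    # phenotypes, heritability ~ 0.5

model = Ridge(alpha=float(n_markers))    # alpha plays the role of the shrinkage ratio (assumption)
model.fit(M[:n_train], y[:n_train])
gebv = model.predict(M[n_train:])        # genomic estimated breeding values for new individuals

accuracy = np.corrcoef(gebv, tbv[n_train:])[0, 1]
print("prediction accuracy r(GEBV, TBV) = %.2f" % accuracy)
```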

6.
A new approach to nonlinear modeling and adaptive monitoring using fuzzy principal component regression (FPCR) is proposed and applied to a real wastewater treatment plant (WWTP) data set. First, principal component analysis (PCA) is used to reduce the dimensionality of the data and to remove collinearity. Second, an adaptive credibilistic fuzzy c-means method is used to monitor diverse operating conditions based on the PCA score values, and a new adaptive discrimination monitoring method is proposed to distinguish between a large process change and a simple fault. Third, an FPCR method is proposed in which a Takagi-Sugeno-Kang (TSK) fuzzy model relates the PCA score values to the target output, avoiding the over-fitting problem associated with the original variables. Here, the rule bases, centers, and widths of the TSK fuzzy model are found by heuristic methods. The proposed FPCR method is applied to predict the output variable, the reduction of chemical oxygen demand, in the full-scale WWTP. The results show that the method can model the nonlinear process under multiple operating conditions, identify the various operating regions, and discriminate between a sustained fault and a simple fault (or abnormality) occurring within the process data.

7.
Revisiting the problem of intron-exon identification, we use principal component analysis (PCA) to classify DNA sequences and present initial results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
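A minimal sketch of the word-content representation (on hypothetical random sequences, not the curated intron/exon sets of the paper): each sequence is mapped to a vector of hexamer frequencies and projected with PCA, which would then feed a classifier.

```python
# Minimal sketch: hexamer word-content vectors for DNA sequences, projected with PCA.
import numpy as np
from itertools import product
from sklearn.decomposition import PCA

def word_vector(seq, k=6):
    """Count overlapping k-mers of a sequence into a fixed-length frequency vector."""
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    v = np.zeros(len(words))
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in index:                 # skip words containing ambiguous bases such as 'N'
            v[index[w]] += 1
    return v / max(v.sum(), 1)         # relative word frequencies

rng = np.random.default_rng(3)
seqs = ["".join(rng.choice(list("ACGT"), size=400)) for _ in range(20)]   # hypothetical sequences
X = np.vstack([word_vector(s) for s in seqs])
scores = PCA(n_components=2).fit_transform(X)   # document-vector space projected onto 2 PCs
print(scores.shape)
```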

8.
Integrative analysis of genomic and anatomical imaging data, which is still not well developed, provides invaluable information for holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes that cannot be identified when the data types are analyzed separately. A key issue for the success of imaging and genomic data analysis is how to reduce the dimensionality of both data types. Most previous methods for imaging information extraction and RNA-seq data reduction do not exploit spatial information in the images and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which the functional principal component scores of the images are taken as multiple quantitative traits and the RNA-seq profile across a gene is taken as a functional predictor for assessing the association of gene expression with the images. The method has been applied to image and RNA-seq data from ovarian cancer and kidney renal clear cell carcinoma (KIRC) studies. We identified 24 and 84 genes whose expression was associated with imaging variation in the ovarian cancer and KIRC studies, respectively. Our results showed that many genes significantly associated with the images were not differentially expressed but did reveal morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expression.

9.
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, there is an obvious disadvantage of traditional PCA when it is applied to analyze data where interpretability is important. In applications where the features have physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of the principal components. Sparse PCA can be formulated as an ℓ1-regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance-reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
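As a point of reference, the following minimal sketch contrasts dense PCA with an ℓ1-penalized sparse PCA on a toy matrix using scikit-learn's SparsePCA; this is a generic ℓ1-regularized formulation, not the Cvx-SPCA solver proposed in the paper, and the data and penalty value are hypothetical.

```python
# Minimal sketch: dense PCA vs. l1-penalized sparse PCA on a toy data matrix.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))
X[:, :5] += 3.0 * rng.normal(size=(200, 1))   # a few correlated, informative features

dense = PCA(n_components=3).fit(X)
sparse = SparsePCA(n_components=3, alpha=2.0, random_state=0).fit(X)   # alpha = l1 penalty

# Sparse loadings zero out most features, which keeps the components interpretable.
print("non-zero loadings per component (dense): ", (np.abs(dense.components_) > 1e-8).sum(axis=1))
print("non-zero loadings per component (sparse):", (np.abs(sparse.components_) > 1e-8).sum(axis=1))
```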

10.
Load carriage is a very common daily activity at home and in the workplace. Generally, the load is an external load carried by the individual, but it can also be the excess body mass carried by an overweight individual. To quantify the effects of carrying extra weight, whether in the form of an external load or excess body mass, motion capture data were generated for a diverse subject set: twenty-three subjects generated one hundred fifteen trials for each loading condition. This study applied principal component analysis (PCA) to the motion capture data in order to analyze lower body gait patterns for four loading conditions: normal weight unloaded, normal weight loaded, overweight unloaded, and overweight loaded. PCA has been shown to be a powerful tool for analyzing complex gait data. In this analysis, it is shown that only the first principal component (PC1) is needed to quantify the effects of external load and/or excess body mass for both normal weight and overweight subjects. For the work in this paper, PCs were generated from lower body joint angle data. The PC1 of the hip angle and the PC1 of the ankle angle are shown to be indicators of external load and BMI effects on temporal gait data.
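A minimal sketch of the idea on simulated waveforms (not the study's motion-capture data): each trial is represented by a time-normalized hip-angle curve, and PCA over the curves yields a PC1 score per trial that separates the loading conditions.

```python
# Minimal sketch: PCA of simulated hip-angle gait curves, comparing loading conditions via PC1.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 101)                       # percent of gait cycle
base = 30 * np.sin(2 * np.pi * t)                # nominal hip-angle pattern in degrees (assumption)

def trials(offset, n=115):
    """Generate n noisy trials of the base curve, shifted by a loading offset."""
    return base + offset + 2.0 * rng.normal(size=(n, t.size))

unloaded, loaded = trials(0.0), trials(5.0)      # loaded condition shifts the curve (assumption)
X = np.vstack([unloaded, loaded])

pc1 = PCA(n_components=1).fit_transform(X).ravel()
print("mean PC1 unloaded: %.1f, loaded: %.1f" % (pc1[:115].mean(), pc1[115:].mean()))
```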

11.
12.
《Process Biochemistry》2007,42(7):1124-1134
2D spectrofluorometry produces a large volume of spectral data during fermentation processes with recombinant E. coli, which can be analyzed using chemometric methods such as principal component analysis (PCA), principal component regression (PCR) and partial least squares regression (PLS). Analysis of the spectral data by PCA yields scores and loadings that are not only visualized in score-loading plots but are also used to monitor the fermentation processes on-line. The score plots provided useful qualitative information on four fermentation processes for the production of extracellular 5-aminolevulinic acid (ALA). Two chemometric models (PCR and PLS) were used to examine the correlation between the 2D fluorescence spectra and several parameters of the fermentation processes. The results showed that PLS had slightly better calibration and prediction performance than PCR.
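A minimal sketch comparing PCR and PLS calibration on simulated spectra (hypothetical data, not the 2D fluorescence measurements from the ALA fermentations):

```python
# Minimal sketch: PCR vs. PLS calibration of a process variable from simulated spectra.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_samples, n_channels = 60, 300
spectra = rng.normal(size=(n_samples, n_channels)).cumsum(axis=1)   # smooth-ish synthetic spectra
y = spectra[:, 40] - 0.5 * spectra[:, 200] + rng.normal(scale=0.5, size=n_samples)

# PCR: regress the target on the first few PCA scores of the spectra.
scores = PCA(n_components=5).fit_transform(spectra)
pcr_r2 = cross_val_score(LinearRegression(), scores, y, cv=5).mean()

# PLS: latent variables are chosen to maximize covariance with the target.
pls_r2 = cross_val_score(PLSRegression(n_components=5), spectra, y, cv=5).mean()
print("cross-validated R^2  PCR: %.2f  PLS: %.2f" % (pcr_r2, pls_r2))
```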

13.
Nguyen PH 《Proteins》2006,65(4):898-913
Employing the recently developed hierarchical nonlinear principal component analysis (NLPCA) method of Saegusa et al. (Neurocomputing 2004;61:57-70 and IEICE Trans Inf Syst 2005;E88-D:2242-2248), the complexities of the free energy landscapes of several peptides, including triglycine, hexaalanine, and the C-terminal beta-hairpin of protein G, were studied. First, the performance of this NLPCA method was compared with that of standard linear principal component analysis (PCA). In particular, the two methods were compared according to (1) their ability to reduce dimensionality and (2) how efficiently they represent peptide conformations in low-dimensional spaces spanned by the first few principal components. The study revealed that NLPCA reduces the dimensionality of the considered systems much better than PCA does. For example, to achieve a similar error in representing the original beta-hairpin data in a low-dimensional space, one needs 4 principal components with NLPCA but 21 with PCA. Second, by representing the free energy landscapes of the considered systems as a function of the first two principal components obtained from PCA, we obtained relatively well-structured free energy landscapes. In contrast, the free energy landscapes from NLPCA are much more complicated, exhibiting many states that are hidden in the PCA maps, especially in the unfolded regions. Furthermore, the study showed that many states in the PCA maps mix several peptide conformations, whereas those in the NLPCA maps are purer. These findings suggest that NLPCA should be used to capture the essential features of these systems.
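For the linear side of the comparison, a minimal sketch of how reconstruction error shrinks with the number of PCA components, on simulated correlated coordinates (hierarchical NLPCA requires a neural-network autoencoder and is not reproduced here):

```python
# Minimal sketch: PCA reconstruction error as a function of the number of components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 20))
X[:, 10:] = 0.8 * X[:, :10] + 0.2 * rng.normal(size=(1000, 10))   # correlated coordinates

for k in (1, 2, 4, 8):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))               # low-dimensional reconstruction
    err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
    print("k = %d  relative reconstruction error = %.3f" % (k, err))
```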

14.
This paper illustrates an application of principal component analysis (PCA), partial least squares regression (PLS) and generalized procrustes analysis (GPA) to evaluate the ability of a trained group of assessors to perceive rancidity in foods. PCA and PLS regression were used to determine to what extent sensory attributes capture the information perceived by a trained sensory panel, and whether this can be developed into a predictive model for rancidity in sausages. The data were submitted to GPA to obtain a map of the products for each assessor, compared with a consensus product map. Assessor plots for the sensory attributes were also obtained to reveal the dissimilarities between panelists and to explore clustering.

15.
Access to real-time process information is desirable for consistent and efficient operation of bioprocesses. Near-infrared spectroscopy (NIRS) is known to have potential for providing real-time information on the quantitative levels of important bioprocess variables. However, given that a typical NIR spectrum encompasses information on almost all the constituents of the sample matrix, few case studies have investigated the spectral details for applications in bioprocess quality assessment or qualitative bioprocess monitoring. Such information would be invaluable in providing operator-level assistance on the progress of a bioprocess in industrial-scale production. We investigated this aspect and report the results here. Near-infrared spectral information derived from scanning unprocessed culture fluid (broth) samples from a complex antibiotic production process was assessed for a data set that incorporated bioprocess variations. Principal component analysis (PCA) was applied to the spectral data and the loadings and scores of the principal components were studied. Changes in the spectral information that corresponded to variations in the bioprocess could be deciphered. Despite the complexity of the matrix, near-infrared spectra of the culture broth are shown to contain valuable information that can be deconvoluted with the help of factor analysis techniques such as PCA. Although complex to interpret, the loadings and score plots are shown to offer potential for process diagnosis that could be of value in the rapid assessment of process quality and in data assessment prior to quantitative model development.

16.
Genomewide association studies (GWAS) aim to identify genetic markers strongly associated with quantitative traits by utilizing linkage disequilibrium (LD) between candidate genes and markers. However, because of LD between nearby genetic markers, the standard GWAS approaches typically detect a number of correlated SNPs covering long genomic regions, making corrections for multiple testing overly conservative. Additionally, the high dimensionality of modern GWAS data poses considerable challenges for GWAS procedures such as permutation tests, which are computationally intensive. We propose a cluster-based GWAS approach that first divides the genome into many large nonoverlapping windows and uses linkage disequilibrium network analysis in combination with principal component (PC) analysis as dimensional reduction tools to summarize the SNP data to independent PCs within clusters of loci connected by high LD. We then introduce single- and multilocus models that can efficiently conduct the association tests on such high-dimensional data. The methods can be adapted to different model structures and used to analyse samples collected from the wild or from biparental F2 populations, which are commonly used in ecological genetics mapping studies. We demonstrate the performance of our approaches with two publicly available data sets from a plant (Arabidopsis thaliana) and a fish (Pungitius pungitius), as well as with simulated data.
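A minimal sketch of the cluster-then-PCA idea on simulated genotypes (the clustering rule, thresholds, and the single-locus test below are simplifications, not the authors' LD-network procedure):

```python
# Minimal sketch: group SNPs in a window by LD, summarize each cluster by its first PC,
# and test the PC against the phenotype.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
n, n_snps = 400, 40
G = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)
G[:, 1::2] = np.clip(G[:, ::2] + rng.binomial(1, 0.1, size=(n, 20)), 0, 2)  # induce pairwise LD
y = 0.5 * G[:, 4] + rng.normal(size=n)                                      # one causal SNP

r = np.corrcoef(G, rowvar=False)
dist = squareform(1 - np.abs(r), checks=False)            # LD-based distance between SNPs
labels = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")

for c in np.unique(labels):
    pc1 = PCA(n_components=1).fit_transform(G[:, labels == c]).ravel()
    res = stats.linregress(pc1, y)                         # simple single-PC association test
    print("cluster %d (%d SNPs): p = %.3g" % (c, int((labels == c).sum()), res.pvalue))
```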

17.
The relationships between gas chromatographic (GC) profiles and sensory data of 72 purely fermented soy sauce samples were analyzed by multiple regression analysis and principal component analysis (PCA). Prior to the analysis, the GC data were transformed into 7 different modes in order to compare their fit to a hypothetical linear model. The logarithmically transformed ratio of each peak to the sum of all peaks gave the most precise prediction of the sensory score (R = 0.978). In the PCA, the eigenvalues of 10 PCs were larger than 1.0, but the 5 major PCs accounted for 66% of the total variance of the 39 GC peaks. The first and second PCs were the most important for aroma quality, and similarity or dissimilarity in the extracted PC profiles followed the same trend as the quality differences evaluated by sensory tests. These results show the importance of a harmonious balance among the aroma compounds in creating a desirable soy sauce aroma.

18.
The paper presents an application of principal component analysis (PCA) to ECG processing. For this purpose, the ECG beats are time-aligned and stored in the columns of an auxiliary matrix. The matrix, considered as a set of multidimensional variables, undergoes PCA. Reconstruction of the respective columns on the basis of a low-dimensional principal subspace leads to enhancement of the stored ECG beats. A few modifications of this classical approach to ECG signal filtering by means of multivariate analysis are introduced. The first is based on replacing classical PCA with its robust extension. The second consists of replacing the analysis of whole synchronized beats with the analysis of shorter signal segments. This creates the background for the third modification, which introduces the concept of variable dimensions of the subspaces corresponding to different parts of the ECG beats. The experiments performed show that these modifications significantly improve the classical approach to ECG processing by principal component analysis.
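A minimal sketch of the basic, non-robust variant on simulated beats (rows hold the beats here, whereas the paper stores them in columns): PCA is fitted to the aligned beats, and each beat is reconstructed from a few leading components, which suppresses noise.

```python
# Minimal sketch: denoising time-aligned ECG-like beats by low-rank PCA reconstruction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
t = np.linspace(0, 1, 200)
template = np.exp(-((t - 0.5) ** 2) / 0.002)               # crude QRS-like shape (assumption)
beats = template + 0.05 * rng.normal(size=(100, t.size))   # 100 noisy, time-aligned beats

pca = PCA(n_components=3).fit(beats)                       # low-dimensional principal subspace
denoised = pca.inverse_transform(pca.transform(beats))     # reconstruct each beat from 3 PCs

noise_before = np.std(beats - template)
noise_after = np.std(denoised - template)
print("residual noise before: %.3f  after: %.3f" % (noise_before, noise_after))
```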

19.
Data analysis based on the BIOLOG method in microbial ecology research
As a convenient and rapid microbiological assay technique, the BIOLOG microplate method has been widely applied in environmental microbial detection and microbial ecology research and is playing an increasingly important role. The method yields a large amount of data on the carbon-source utilization capacity of microbial communities, reflecting rich information on microbial activity. However, the large volume of data also poses challenges for interpretation and analysis. This paper reviews the statistical methods applied to BIOLOG data, examining in depth the commonly used approaches, including average well color development (AWCD) calculation, diversity index calculation, principal component analysis (PCA), cluster analysis, and correlation and regression analysis, and describes their functions, limitations, and the problems that commonly arise in their application. Less common methods, such as non-parametric (permutation-based) MANOVA, kinetic parameter analysis, multivariate regression trees, and canonical correspondence analysis, are also discussed. By analyzing the aims and principles of the different methods, their respective strengths and weaknesses are laid out, providing a reference for choosing and applying data analysis methods for BIOLOG-based microbial research.

20.
A quickly growing number of characteristics reflecting various aspects of gene function and evolution can be either measured experimentally or computed from DNA and protein sequences. The study of pairwise correlations between such quantitative genomic variables, as well as collective analysis of their interrelations by multidimensional methods, has delivered crucial insights into the processes of molecular evolution. Here, we present a principal component analysis (PCA) of 16 genomic variables from Saccharomyces cerevisiae, the largest data set analyzed so far. Because many missing values and potential outliers hinder the direct calculation of principal components, we introduce the application of Bayesian PCA. We confirm some of the previously established correlations, such as evolutionary rate versus protein expression, and reveal new correlations such as those between translational efficiency, phosphorylation density, and protein age. Although the first principal component primarily contrasts genomic change and protein expression, the second component separates variables related to gene existence and expressed protein functions. Enrichment analysis on genes affecting variable correlations unveils classes of influential genes. For example, although ribosomal and nuclear transport genes make important contributions to the correlation between protein isoelectric point and molecular weight, protein synthesis and amino acid metabolism genes help cause the lack of significant correlation between propensity for gene loss and protein age. We present the novel Quagmire database (Quantitative Genomics Resource), which allows exploring relationships between more genomic variables in three model organisms: Escherichia coli, S. cerevisiae, and Homo sapiens (http://webclu.bio.wzw.tum.de:18080/quagmire).
