Similar Articles
1.
Many human traits are highly correlated. This correlation can be leveraged to improve the power of genetic association tests to identify markers associated with one or more of the traits. Principal component analysis (PCA) is a useful tool that has been widely used for the multivariate analysis of correlated variables. PCA is usually applied as a dimension reduction method: the few top principal components (PCs) explaining most of total trait variance are tested for association with a predictor of interest, and the remaining components are not analyzed. In this study we review the theoretical basis of PCA and describe the behavior of PCA when testing for association between a SNP and correlated traits. We then use simulation to compare the power of various PCA-based strategies when analyzing up to 100 correlated traits. We show that contrary to widespread practice, testing only the top PCs often has low power, whereas combining signal across all PCs can have greater power. This power gain is primarily due to increased power to detect genetic variants with opposite effects on positively correlated traits and variants that are exclusively associated with a single trait. Relative to other methods, the combined-PC approach has close to optimal power in all scenarios considered while offering more flexibility and more robustness to potential confounders. Finally, we apply the proposed PCA strategy to the genome-wide association study of five correlated coagulation traits where we identify two candidate SNPs that were not found by the standard approach.
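A minimal sketch of the combined-PC idea under stated assumptions: each PC is regressed on the SNP and, because PC scores are mutually uncorrelated, the squared z-statistics sum to a single chi-square test across all components. The simulated traits and this simple combination rule are illustrative, not the authors' exact procedure.

```python
# Combined-PC association test: sum per-PC chi-square signal (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 500, 10                      # samples, correlated traits
snp = rng.binomial(2, 0.3, size=n)  # additive genotype coding 0/1/2

# Traits correlated through a shared factor, plus a SNP effect on one trait
shared = rng.normal(size=n)
traits = 0.8 * shared[:, None] + rng.normal(size=(n, k))
traits[:, 0] += 0.3 * snp           # SNP exclusively affects trait 1

# PCA via SVD of the centered trait matrix
X = traits - traits.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * s                         # principal component scores, n x k

# Test each PC against the SNP; PC scores are mutually uncorrelated, so
# the squared z-statistics can be summed into one chi-square statistic.
chi2 = 0.0
for j in range(k):
    slope, _, r, p, se = stats.linregress(snp, pcs[:, j])
    chi2 += (slope / se) ** 2
p_combined = stats.chi2.sf(chi2, df=k)
print(f"combined-PC test: chi2={chi2:.1f}, p={p_combined:.3g}")
```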

2.
The paper deals with the optimal Bayes discriminant rule for qualitative variables. The performance of variable selection is investigated under strong assumptions: the variables are restricted to dichotomous ones, assumed to be either independent or dependent with a fixed dependence structure, and all parameters are known. Differences in comparison with normal variables in linear discriminant analysis can be shown. This is a further reason for applying special methods of discriminant analysis in the case of qualitative variables.
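For concreteness, a minimal sketch of the optimal Bayes rule in the idealized setting the paper studies: independent dichotomous variables with all parameters known. The class priors and Bernoulli probabilities below are invented for illustration.

```python
# Optimal Bayes rule for independent dichotomous variables, parameters known.
import numpy as np

priors = np.array([0.5, 0.5])              # class priors, assumed known
p = np.array([[0.8, 0.6, 0.3],             # P(x_j = 1 | class 0)
              [0.2, 0.5, 0.7]])            # P(x_j = 1 | class 1)

def bayes_classify(x):
    """Assign binary vector x to the class with maximal log-posterior."""
    log_post = (np.log(priors)
                + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1))
    return int(np.argmax(log_post))

print(bayes_classify(np.array([1, 1, 0])))  # -> class 0
print(bayes_classify(np.array([0, 0, 1])))  # -> class 1
```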

3.
We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
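A minimal sketch of the unifying matrix-factorization view: truncated SVD of the centered genotype matrix gives the PCA factorization into two lower-rank matrices; imposing sparsity on the loadings instead would move toward SFA. The simulated genotypes are illustrative only.

```python
# Rank-K factorization of a genotype matrix (individuals x SNPs) via SVD.
import numpy as np

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(100, 1000)).astype(float)  # 0/1/2 genotypes
Gc = G - G.mean(axis=0)                                   # center each SNP

K = 2
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
scores = U[:, :K] * s[:K]     # lower-rank matrix 1: individual coordinates
loadings = Vt[:K]             # lower-rank matrix 2: SNP loadings

approx = scores @ loadings    # rank-K approximation of Gc
err = np.linalg.norm(Gc - approx) / np.linalg.norm(Gc)
print(f"relative error of rank-{K} factorization: {err:.3f}")
```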

4.
Selecting an appropriate variable subset in linear multivariate methods is an important methodological issue for ecologists. Interest often exists in obtaining general predictive capacity or in finding causal inferences from predictor variables. Because of a lack of solid knowledge on a studied phenomenon, scientists explore predictor variables in order to find the most meaningful (i.e. discriminating) ones. As an example, we modelled the response of the amphibious softwater plant Eleocharis multicaulis using canonical discriminant function analysis. We asked how variables can be selected through comparison of several methods: univariate Pearson chi-square screening, principal components analysis (PCA) and step-wise analysis, as well as combinations of some methods. We expected PCA to perform best. The selected methods were evaluated through fit and stability of the resulting discriminant functions and through correlations between these functions and the predictor variables. The chi-square subset, at P < 0.05, followed by a step-wise sub-selection, gave the best results. In contrast to expectations, PCA performed poorly, as did step-wise analysis. The different chi-square subset methods all yielded ecologically meaningful variables, while probable noise variables were also selected by PCA and step-wise analysis. We advise against the simple use of PCA or step-wise discriminant analysis to obtain an ecologically meaningful variable subset; the former because it does not take into account the response variable, the latter because noise variables are likely to be selected. We suggest that univariate screening techniques are a worthwhile alternative for variable selection in ecology.
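A minimal sketch of the univariate chi-square screening step at P < 0.05, assuming a binary presence/absence response and dichotomous predictors; the data and variable names are invented.

```python
# Chi-square screening of predictors against a binary response.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n = 200
presence = rng.integers(0, 2, n)                 # species present/absent
predictors = {
    "informative": ((presence + rng.integers(0, 2, n)) > 0).astype(int),
    "noise_1": rng.integers(0, 2, n),
    "noise_2": rng.integers(0, 2, n),
}

selected = []
for name, x in predictors.items():
    table = np.array([[np.sum((x == i) & (presence == j))
                       for j in (0, 1)] for i in (0, 1)])
    chi2, p, dof, _ = chi2_contingency(table)
    if p < 0.05:
        selected.append(name)
    print(f"{name}: p = {p:.3g}")
print("retained for discriminant analysis:", selected)
```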

5.
The work reported in this paper examines the use of principal component analysis (PCA), a multivariate statistical technique, to facilitate the extraction of meaningful diagnostic information from a data set of chromatographic traces. Two data sets mimicking archived production records were analysed using PCA. In the first, a full-factorial experimental design approach was used to generate the data. In the second, the chromatograms were generated by adjusting just one of the process variables at a time. Database mining was achieved through the generation of both gross and disjoint principal component (PC) models. PCA provided easily interpretable 2-dimensional diagnostic plots revealing clusters of chromatograms obtained under similar operating conditions. PCA methods can be used to detect and diagnose changes in process conditions; however, results show that a PCA model may require recalibration if an equipment change is made. We conclude that PCA methods may be useful for the diagnosis of subtle deviations from process specification not readily distinguishable to the operator.
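A minimal sketch of the gross vs. disjoint PC model idea, interpreted here in the SIMCA spirit: one PCA per operating-condition group, with new traces assigned by residual distance to each group model. The synthetic single-peak "chromatograms" are illustrative only.

```python
# Disjoint PC models: one PCA per condition group, classification by residual.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 200)
def trace(center):                       # a Gaussian elution peak + noise
    return np.exp(-(t - center) ** 2) + 0.02 * rng.normal(size=t.size)

groups = {"condition_A": [trace(4.0) for _ in range(10)],
          "condition_B": [trace(5.0) for _ in range(10)]}

def fit_pca(X, k=2):
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def residual(x, model):                  # distance to a disjoint PC model
    mu, P = model
    d = x - mu
    return np.linalg.norm(d - P.T @ (P @ d))

models = {g: fit_pca(np.array(X)) for g, X in groups.items()}
new_trace = trace(4.05)                  # unseen chromatogram
dists = {g: residual(new_trace, m) for g, m in models.items()}
print(min(dists, key=dists.get))         # -> condition_A
```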

6.
Correspondence analysis (CA) is frequently used in the interpretation of palaeontological data, but little is known about the minimum requirements for a result to be valid. Rather than a fundamental mathematical study of CA, this paper presents a tool that may serve to evaluate results obtained in (palaeontological) practice. We created matrices of random data, grouped by matrix size and varying percentages of zero cells. Each matrix was submitted to CA. Per matrix group, the minimum, mean and maximum percentages of total inertia were calculated for the first four axes. We compared these results with several real cases in vertebrate paleontology. Valid conclusions based on CA can only be drawn from axis percentages that are considerably higher than those obtained from random matrices.
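A minimal sketch of the random-baseline procedure: CA is run on random matrices of a given size and zero-cell percentage, and the inertia percentage of the first axis is recorded as a threshold against which real-data axes can be judged. The matrix dimensions and zero fraction are illustrative.

```python
# Inertia percentages of CA axes on random matrices (baseline for validity).
import numpy as np

def ca_inertia_percent(N):
    """Percent of total inertia per CA axis, via SVD of standardized residuals."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    inertia = sv ** 2
    return 100 * inertia / inertia.sum()

rng = np.random.default_rng(4)
rows, cols, zero_frac = 20, 15, 0.5
first_axis = []
for _ in range(100):
    N = rng.integers(1, 10, size=(rows, cols)).astype(float)
    N[rng.random((rows, cols)) < zero_frac] = 0.0   # impose zero cells
    first_axis.append(ca_inertia_percent(N)[0])
print(f"random baseline, axis 1: mean {np.mean(first_axis):.1f}%, "
      f"max {np.max(first_axis):.1f}%")
```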

7.
This paper briefly presents the aims, requirements and results of partial least squares regression analysis (PLSR), and its potential utility in ecological studies. This statistical technique is particularly well suited to analyzing a large array of related predictor variables (i.e. not truly independent), with a sample size that is small relative to the number of predictors, and in cases where a complex phenomenon or syndrome must be defined as a combination of several independently obtained variables. A simulation experiment is carried out to compare this technique with multiple regression (MR) and with a combination of principal component analysis and multiple regression (PCA+MR), varying the number of predictor variables and the sample size. PLSR models explained a similar amount of variance to those obtained by MR and PCA+MR. However, PLSR was more reliable than the other techniques when identifying relevant variables and their magnitudes of influence, especially in cases of small sample size and low tolerance. Finally, we present one example of PLSR to illustrate its application and interpretation in ecology.
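A minimal sketch of the comparison on collinear predictors with a small sample, using scikit-learn's PLSRegression in place of the authors' implementation; the simulated data are illustrative.

```python
# PLSR vs. multiple regression on few samples with collinear predictors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n, p = 30, 15                          # small sample, many related predictors
latent = rng.normal(size=(n, 2))       # two underlying gradients
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.5 * rng.normal(size=n)

pls = PLSRegression(n_components=2).fit(X, y)
mr = LinearRegression().fit(X, y)

print("PLSR R^2:", round(pls.score(X, y), 3))
print("MR   R^2:", round(mr.score(X, y), 3))
# The PLS x-weights indicate which predictors drive the latent components,
# which is where PLSR tends to beat raw MR coefficients under collinearity.
print("first PLS x-weights:", np.round(pls.x_weights_[:, 0], 2))
```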

8.
Numerous ecological studies use Principal Components Analysis (PCA) for exploratory analysis and data reduction. Determination of the number of components to retain is the most crucial problem confronting the researcher when using PCA. An incorrect choice may lead to the underextraction of components, but commonly results in overextraction. Of several methods proposed to determine the significance of principal components, Parallel Analysis (PA) has proven consistently accurate in determining the threshold for significant components, variable loadings, and analytical statistics when decomposing a correlation matrix. In this procedure, eigenvalues from a data set prior to rotation are compared with those from a matrix of random values of the same dimensionality (p variables and n samples). PCA eigenvalues from the data greater than PA eigenvalues from the corresponding random data can be retained. All components with eigenvalues below this threshold value should be considered spurious. We illustrate Parallel Analysis on an environmental data set. We reviewed all articles utilizing PCA or Factor Analysis (FA) from 1987 to 1993 in Ecology, Ecological Monographs, Journal of Vegetation Science and Journal of Ecology. Analyses were first separated into those PCA which decomposed a correlation matrix and those which decomposed a covariance matrix. Parallel Analysis was applied to each PCA/FA found in the literature. Of 39 analyses (in 22 articles), 29 (74.4%) considered no threshold rule, presumably retaining interpretable components. According to the PA results, 26 (66.7%) overextracted components. This overextraction may have resulted in potentially misleading interpretation of spurious components. It is suggested that the routine use of PA in multivariate ordination will increase confidence in the results and reduce the subjective interpretation of supposedly objective methods.
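A minimal sketch of Horn's Parallel Analysis as described: eigenvalues of the data correlation matrix are compared with mean eigenvalues from random data of the same dimensionality (some variants use the 95th percentile instead of the mean), and components above the random threshold are retained.

```python
# Parallel Analysis: retain components whose eigenvalues beat random data.
import numpy as np

def parallel_analysis(X, n_iter=200, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    data_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    rand_eig = np.empty((n_iter, p))
    for i in range(n_iter):
        R = rng.normal(size=(n, p))          # random data, same n and p
        rand_eig[i] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    threshold = rand_eig.mean(axis=0)
    return int(np.sum(data_eig > threshold)), data_eig, threshold

rng = np.random.default_rng(6)
n = 100
f = rng.normal(size=(n, 2))                  # two real underlying factors
X = np.hstack([f @ rng.normal(size=(2, 6)), rng.normal(size=(n, 4))])
X += 0.5 * rng.normal(size=X.shape)
k, data_eig, thr = parallel_analysis(X)
print("components to retain:", k)            # expected: about 2
```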

9.
This paper examines the selection of the appropriate representation of chromatogram data prior to using principal component analysis (PCA), a multivariate statistical technique, for the diagnosis of chromatogram data sets. The effects of four process variables were investigated: flow rate, temperature, loading concentration and loading volume, for a size exclusion chromatography system used to separate three components (monomer, dimer, trimer). The study showed that the major positional shifts in the elution peaks that result when running the separation at different flow rates mask the effects of the other variables if the PCA is performed using elapsed time as the comparative basis. Two alternative methods of representing the data in chromatograms are proposed. In the first, the data were converted to a volumetric basis prior to performing the PCA; in the second, having made this transformation, the data were adjusted to account for the total material loaded during each separation. Two datasets were analysed to demonstrate the approaches. The results show that by appropriate selection of the basis prior to the analysis, significantly greater process insight can be gained from the PCA, demonstrating the importance of pre-processing prior to such analysis.
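A minimal sketch of the first proposed representation: re-basing each trace from elapsed time to elution volume (time x flow rate) on a common volume grid before PCA, so that flow-rate-induced peak shifts no longer dominate. The flow rates and peak shape are illustrative.

```python
# Convert chromatograms from a time basis to a common volume basis.
import numpy as np

t = np.linspace(0, 20, 400)                       # elapsed time, minutes
flow_rates = [0.8, 1.0, 1.2]                      # mL/min, one per run
runs = [np.exp(-(t * q - 10.0) ** 2) for q in flow_rates]  # peak at 10 mL

vol_grid = np.linspace(0, 16, 300)                # common volume axis, mL
rebased = np.array([np.interp(vol_grid, t * q, run)
                    for q, run in zip(flow_rates, runs)])

# On the time basis the peaks sit at different positions; on the volume
# basis they coincide, so PCA now reflects process variables, not flow rate.
print("peak positions (volume basis, mL):",
      [round(vol_grid[np.argmax(r)], 2) for r in rebased])
```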

10.
In this paper, a structural analysis is performed to gain insights on the synergistic mechanical amplification effect that Campaniform sensilla have when combined in an array configuration. In order to simplify the analysis performed in this preliminary investigation, an array of four holes in a single orthotropic lamina is considered. Firstly, a Finite Element Method (FEM) analysis is performed to discretely assess the influence that different geometrical parameters have on the mechanical amplification properties of the array. Secondly, an artificial neural network is used to obtain an approximated multi-dimensional continuous function, which models the relationship between the geometrical parameters and the amplification properties of the array. Thirdly, an optimization is performed to identify the geometrical parameters yielding the maximum mechanical amplification. Finally, results are validated with an additional FEM simulation performed by varying geometrical parameters in the neighborhood of the identified optimal parameters. The method proposed in this paper can be fully automated and used to solve a wide range of optimization problems aimed at identifying optimal configurations of strain sensors inspired by Campaniform sensilla.

11.
In order to assess heavy metal pollution of sediment samples collected from the Shuangtaizi estuary, the contamination factor (CF) and multivariate statistical analyses, including principal component analysis (PCA), cluster analysis (CA) and correlation analysis, are carried out in this paper. The CF confirms that Pb, Cu and Hg concentrations are very low and all fall within the background range, while Zn and Cd demonstrate moderate contamination (at sites A10, A13, A15, A16 and A17) and very high contamination (at sites A10, A13, A15 and A17), respectively. The PCA extracts four significant principal components (PCs), explaining 88.959% of the total variance, and suggests that Pb, Cu and Zn are mainly associated with organic carbon (OC). The result from CA is consistent with that obtained from PCA, grouping the heavy metals into two clusters that derive from different sources. Correlation analysis supports the conclusions from PCA and CA, elucidating the relationships between heavy metals and the particle sizes of the sediments.
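A minimal sketch of the contamination factor, CF = C_sample / C_background, with the commonly used class boundaries (CF < 1 low; 1-3 moderate; 3-6 considerable; >= 6 very high). The concentrations and background values below are invented, not the study's measurements.

```python
# Contamination factor per metal, with standard contamination classes.
concentration = {"Pb": 18.0, "Cu": 20.0, "Zn": 160.0, "Cd": 1.5, "Hg": 0.03}
background    = {"Pb": 20.0, "Cu": 22.0, "Zn":  65.0, "Cd": 0.2, "Hg": 0.05}

def cf_class(cf):
    if cf < 1:
        return "low"
    if cf < 3:
        return "moderate"
    if cf < 6:
        return "considerable"
    return "very high"

for metal, c in concentration.items():
    cf = c / background[metal]
    print(f"{metal}: CF = {cf:.2f} ({cf_class(cf)})")
```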

12.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond standard spherical noise. This indiscriminate nature is one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.
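A minimal sketch of the EM algorithm for standard probabilistic PCA (Tipping and Bishop), the model the paper generalizes by replacing the single spherical noise variance with gene- and experiment-specific variances; this is the baseline, not the authors' extended model.

```python
# EM for standard probabilistic PCA with a single spherical noise variance.
import numpy as np

rng = np.random.default_rng(7)
N, d, q = 300, 8, 2
Z = rng.normal(size=(N, q))
W_true = rng.normal(size=(d, q))
X = Z @ W_true.T + 0.3 * rng.normal(size=(N, d))
X -= X.mean(axis=0)

W = rng.normal(size=(d, q))
sigma2 = 1.0
for _ in range(100):
    # E-step: posterior moments of the latent variables
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    Ez = X @ W @ Minv                       # N x q posterior means
    EzzT = N * sigma2 * Minv + Ez.T @ Ez    # sum of E[z z^T] over samples
    # M-step: update loadings, then noise variance
    W = X.T @ Ez @ np.linalg.inv(EzzT)
    sigma2 = (np.sum(X ** 2) - np.trace(Ez.T @ X @ W)) / (N * d)
print("estimated noise sd:", round(float(np.sqrt(sigma2)), 3))  # should be ~0.3
```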

13.
Metabolomics and other omics tools are generally characterized by large data sets with many variables obtained under different environmental conditions. Clustering methods, and more specifically two-mode clustering methods, are excellent tools for analyzing this type of data. Two-mode clustering methods allow for analysis of the behavior of subsets of metabolites under different experimental conditions. In addition, the results are easily visualized. In this paper we introduce a two-mode clustering method based on a genetic algorithm that uses a criterion searching for homogeneous clusters. Furthermore, we introduce a cluster stability criterion to validate the clusters and provide an extended knee plot to select the optimal number of clusters in both the experimental and metabolite modes. The genetic-algorithm-based two-mode clustering gave biologically relevant results when applied to two real-life metabolomics data sets. It was, for instance, able to identify a catabolic pathway for growth on several of the carbon sources.
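A minimal sketch of two-mode clustering on a metabolites x conditions matrix. The paper's method is a genetic algorithm with a homogeneity criterion; here scikit-learn's spectral co-clustering stands in purely to illustrate clustering both modes at once, on simulated block-structured data.

```python
# Two-mode (co-)clustering of a metabolites x conditions matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(8)
# 40 metabolites x 12 conditions with two hidden blocks of co-behavior
data = rng.random((40, 12)) * 0.3
data[:20, :6] += 2.0    # block 1: metabolites 0-19 high in conditions 0-5
data[20:, 6:] += 2.0    # block 2: metabolites 20-39 high in conditions 6-11

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(data)
print("metabolite clusters:", model.row_labels_[:5], "...")
print("condition clusters: ", model.column_labels_)
```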

14.
One important problem when calculating structures of biomolecules from NMR data is distinguishing converged structures from outlier structures. This paper describes how Principal Components Analysis (PCA) has the potential to classify calculated structures automatically, according to correlated structural variation across the population. PCA has the additional advantage that it highlights regions of proteins which vary across the population. To apply PCA, protein structures have to be reduced in complexity, and this paper describes two different representations of protein structures which achieve this. The calculated structures of a 28 amino acid peptide are used to demonstrate the methods. The two different representations of protein structure are shown to give equivalent results, and correct results are obtained even though the ensemble of structures used as an example contains two different protein conformations. The PCA also correctly identifies the structural differences between the two conformations.
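A minimal sketch of the approach on synthetic data: each structure is flattened into a coordinate vector (one simple complexity-reducing representation; the paper's two representations differ), and the leading PC separates two conformations while its loadings flag the variable region. Real analyses would superimpose the structures first.

```python
# PCA on an ensemble of structures to separate two conformations.
import numpy as np

rng = np.random.default_rng(9)
n_atoms, n_struct = 28, 40
ref = rng.normal(size=(n_atoms, 3))              # reference coordinates
shift = np.zeros((n_atoms, 3))
shift[10:15] = 1.0                               # one mobile loop region

ensemble = []
for i in range(n_struct):
    conf = ref + (shift if i % 2 else 0) + 0.1 * rng.normal(size=(n_atoms, 3))
    ensemble.append(conf.ravel())                # flatten to a vector
X = np.array(ensemble) - np.mean(ensemble, axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = U[:, 0] * s[0]
print("PC1 separates the two conformations:", np.round(pc1[:6], 1))
# Large loadings in Vt[0] point to the atoms that vary across the ensemble.
```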

15.
The purpose of this study was to determine the swimmers' loss of speed during the underwater gliding motion of a grab start, and to identify the kinematical variables influencing this loss of speed. Eight French national-level swimmers participated in this study. The swimmers were filmed with 4 mini-DV cameras during the entire underwater phase. Using the DLT technique and Dempster's anthropometric data, the swimmers' movements were reconstructed. Two principal component analyses (PCA) were used to study the relations between the kinematical variables influencing the loss of speed. The swimmers reached a velocity between 2.2 and 1.9 m s⁻¹ after their centre of mass had covered a distance ranging between 5.63 and 6.01 m from the start wall; over this velocity range, the head position was between 6.02 and 6.51 m. The first PCA showed that the kinematical parameters at immersion (the first image at which the swimmer's whole body was under water) are included in the first two components. The second PCA showed that the knee, hip and shoulder angles can be included in the same component. The present study identified the optimal instant for initiating underwater leg movements after a grab start. It also showed that performance during the underwater gliding motion is determined as much by the variables at immersion as by the swimmer's loss of speed, and it seems that the synergetic action of the knee, the hip and the shoulder is essential to holding the streamlined position.

16.
Multiple imputation (MI) is used to handle missing at random (MAR) data. Despite warnings from statisticians, continuous variables are often recoded into binary variables. With MI it is important that the imputation and analysis models are compatible; variables should be imputed in the same form in which they appear in the analysis model. For a binary variable derived by recoding, more accurate imputations may be obtained by imputing the underlying continuous variable. We conducted a simulation study to explore how best to impute a binary variable that was created from an underlying continuous variable. We generated a completely observed continuous outcome associated with an incomplete binary covariate that is a categorized version of an underlying continuous covariate, and an auxiliary variable associated with the underlying continuous covariate. We simulated data with several sample sizes, and set 25% and 50% of the data in the covariate to MAR dependent on the outcome and the auxiliary variable. We compared the performance of five imputation methods: (a) imputation of the binary variable using logistic regression; (b) imputation of the continuous variable using linear regression, then categorizing into the binary variable; (c, d) imputation of both the continuous and binary variables using fully conditional specification (FCS) and multivariate normal imputation; (e) substantive-model-compatible (SMC) FCS. Bias and standard errors were large when only the continuous variable was imputed. The other methods performed adequately. Imputation of both the binary and continuous variables using FCS often encountered mathematical difficulties. We recommend the SMC-FCS method, as it performed best in our simulation studies.
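A minimal sketch of the core contrast, reduced to single regression imputation for brevity: impute the binary covariate directly with logistic regression versus impute the underlying continuous covariate with linear regression and then categorize. Proper MI adds repeated stochastic draws and Rubin's rules; this only illustrates the simulation setup.

```python
# Simulation skeleton: two ways to impute a recoded binary covariate.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(10)
n = 2000
z = rng.normal(size=n)                      # underlying continuous covariate
b = (z > 0).astype(int)                     # derived binary covariate
aux = z + rng.normal(size=n)                # auxiliary variable
y = 1.0 * b + rng.normal(size=n)            # completely observed outcome

# MAR: missingness in the covariate depends on observed y and aux
miss = rng.random(n) < 1 / (1 + np.exp(-(y + aux - 1)))
obs = ~miss
print(f"fraction missing: {miss.mean():.0%}")

# (a) impute the binary variable directly via logistic regression
la = LogisticRegression().fit(np.c_[y, aux][obs], b[obs])
b_a = b.copy(); b_a[miss] = la.predict(np.c_[y, aux][miss])

# (b) impute the continuous variable, then categorize at the cut point
lb = LinearRegression().fit(np.c_[y, aux][obs], z[obs])
b_b = b.copy(); b_b[miss] = (lb.predict(np.c_[y, aux][miss]) > 0).astype(int)

for name, bb in [("binary imputation", b_a), ("continuous-then-cut", b_b)]:
    beta = LinearRegression().fit(bb.reshape(-1, 1), y).coef_[0]
    print(f"{name}: estimated effect = {beta:.2f} (true 1.0)")
```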

17.
MOTIVATION: Microarrays are capable of determining the expression levels of thousands of genes simultaneously. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. The aim of this paper is to systematically benchmark the role of non-linear versus linear techniques and dimensionality reduction methods. RESULTS: A systematic benchmarking study is performed by comparing linear versions of standard classification and dimensionality reduction techniques with their non-linear versions based on non-linear kernel functions with a radial basis function (RBF) kernel. A total of 9 binary cancer classification problems, derived from 7 publicly available microarray datasets, and 20 randomizations of each problem are examined. CONCLUSIONS: Three main conclusions can be formulated based on the performances on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) (without dimensionality reduction), RBF kernels can be used without risking too much overfitting. The results obtained with well-tuned RBF kernels are never worse and sometimes even statistically significantly better compared to results obtained with a linear kernel in terms of test set receiver operating characteristic and test set accuracy performances. (2) Even for classification with linear classifiers like LS-SVM with linear kernel, using regularization is very important. (3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication for overfitting. Kernel PCA with linear kernel gives better results.
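A minimal sketch of the benchmarked pipeline: kernel PCA with a linear vs. RBF kernel before a regularized linear classifier. scikit-learn's RidgeClassifier stands in for the paper's LS-SVM, and the synthetic high-dimensional data stand in for the microarray sets.

```python
# Kernel PCA (linear vs. RBF) before a regularized linear classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import RidgeClassifier   # regularized, per finding (2)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)          # few samples, many features
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ("linear", "rbf"):
    kpca = KernelPCA(n_components=10, kernel=kernel).fit(Xtr)
    clf = RidgeClassifier().fit(kpca.transform(Xtr), ytr)
    print(f"kernel PCA ({kernel}): test accuracy = "
          f"{clf.score(kpca.transform(Xte), yte):.2f}")
# With few samples and many features, the RBF variant is the one prone to
# overfit, in line with the paper's third conclusion.
```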

18.
BACKGROUND: Vibrio alginolyticus is known to enter a viable but nonculturable (VBNC) state in response to environmental conditions unfavorable to growth. Cells in the VBNC condition pose a public health threat because they are potentially pathogenic. METHODS: We constructed a pathway for the identification of the most significant variables and the characterization of those variables able to discriminate between the groups under investigation. Different parameters measured by the image processing software were chosen as the most representative of V. alginolyticus cell morphology (a length index for dimension) and metabolic activity (density profile indexes). To detect relationships between the treatment groups, we carried out a principal components analysis (PCA). RESULTS: The PCA indicated that increasing coccoid shape transformation was related to both metabolic and dimension variations, delineating a well-defined graph profile. Indeed, we discovered that specific morphological variations occur when cells in the culturable state pass into the VBNC condition, namely comma-shaped culturable bacteria are converted into coccoid-shaped VBNC cells. The results were also supported by scanning electron microscopy analysis. CONCLUSIONS: This technique allows the analysis of a large number of vibrio samples in a short period of time. The multiparameter information obtained may complement genetic/molecular analyses, facilitating, in an automatic fashion, further studies to evaluate the potential risk of this pathogen in the environment. It may also be a useful tool for large-scale cell biology studies and high content screening.
