Similar Articles
1.
A single determinant dominates the rate of yeast protein evolution
A gene's rate of sequence evolution is among the most fundamental evolutionary quantities in common use, but what determines evolutionary rates has remained unclear. Here, we carry out the first combined analysis of seven predictors (gene expression level, dispensability, protein abundance, codon adaptation index, gene length, number of protein-protein interactions, and the gene's centrality in the interaction network) previously reported to have independent influences on protein evolutionary rates. Strikingly, our analysis reveals a single dominant variable linked to the number of translation events which explains 40-fold more variation in evolutionary rate than any other, suggesting that protein evolutionary rate has a single major determinant among the seven predictors. The dominant variable explains nearly half the variation in the rate of synonymous and protein evolution. We show that the two most commonly used methods to disentangle the determinants of evolutionary rate, partial correlation analysis and ordinary multivariate regression, produce misleading or spurious results when applied to noisy biological data. We overcome these difficulties by employing principal component regression, a multivariate regression of evolutionary rate against the principal components of the predictor variables. Our results support the hypothesis that translational selection governs the rate of synonymous and protein sequence evolution in yeast.
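The principal component regression used in this abstract is compact enough to sketch. Below is a minimal illustration on simulated data; the seven predictors, the dominant latent axis, and all numbers are hypothetical stand-ins, not the paper's yeast data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for seven correlated predictors: one shared latent
# axis (think "number of translation events") plus independent noise.
n = 200
latent = rng.normal(size=n)
X = latent[:, None] + 0.3 * rng.normal(size=(n, 7))
y = 2.0 * latent + rng.normal(size=n)          # evolutionary-rate proxy

# Principal component regression: regress y on the orthogonal PC scores
# instead of on the raw, collinear predictors.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                          # PC scores, mutually orthogonal
beta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)

pc1_share = eigvals[0] / eigvals.sum()         # dominance of the first PC
```

Because the PC scores are mutually orthogonal, each component's coefficient can be estimated without the instability that collinearity causes in ordinary multivariate regression.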

2.
Reflections on univariate and multivariate analysis of metabolomics data
Metabolomics experiments usually result in a large quantity of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from the data with the aim of providing biological knowledge on the problem studied. Despite the fact that statistical tools like the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis constitute the backbone of the statistical part of the vast majority of metabolomics papers, many basic but rather fundamental questions are still often asked, such as: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if you have already applied a multivariate method? Why do I sometimes see an effect multivariately that I do not see univariately? In the present paper we address some aspects of univariate and multivariate analysis, with the aim of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis and partial least squares discriminant analysis are shown on both real and simulated metabolomics data to provide an overview of fundamental aspects of univariate and multivariate methods.
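The question of why a multivariate view can reveal what univariate tests miss can be illustrated with a small simulation (all data below are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups of 30 "samples" and two strongly correlated "metabolites".
# The group difference lies along the minor axis of the covariance, so
# neither variable separates the groups well on its own.
n = 30
shared = rng.normal(size=(2 * n, 1))
X = np.hstack([shared, shared]) + 0.2 * rng.normal(size=(2 * n, 2))
X[n:] += [0.3, -0.3]                       # small, opposite shifts

def t_stat(a, b):
    # Welch t statistic (no SciPy needed)
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

t_uni = [abs(t_stat(X[:n, j], X[n:, j])) for j in range(2)]

# A multivariate view: the difference of the two variables cancels the
# shared variance and exposes the group effect.
d = X[:, 0] - X[:, 1]
t_multi = abs(t_stat(d[:n], d[n:]))
```

Each variable alone yields a weak t statistic, while a direction that accounts for the correlation between them separates the groups clearly.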

3.
It is not uncommon for biological anthropologists to analyze incomplete bioarcheological or forensic skeletal specimens. As many quantitative multivariate analyses cannot handle incomplete data, missing data imputation or estimation is a common preprocessing practice for such data. Using William W. Howells' Craniometric Data Set and the Goldman Osteometric Data Set, we evaluated the performance of multiple popular statistical methods for imputing missing metric measurements. Results indicated that multiple imputation methods outperformed single imputation methods such as Bayesian principal component analysis (BPCA). Multiple imputation with Bayesian linear regression (implemented in the R package norm2), the Expectation–Maximization (EM) with Bootstrapping algorithm (implemented in Amelia), and the Predictive Mean Matching (PMM) method and several derivative linear regression models (implemented in mice) all perform well regarding accuracy, robustness, and speed. Based on these findings, we suggest a practical procedure for choosing appropriate imputation methods.
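As a rough sketch of why model-based imputation outperforms simple substitution on correlated measurements, the following compares mean imputation with regression imputation on synthetic data (a minimal stand-in for the R engines named above, not their actual algorithms):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical craniometric-style data: three highly correlated
# measurements driven by one size factor, with ~20% of the third
# measurement missing at random.
n = 500
size_factor = rng.normal(size=n)
X_full = size_factor[:, None] + 0.1 * rng.normal(size=(n, 3))
miss = rng.random(n) < 0.2
obs = ~miss

# Baseline: mean imputation ignores the inter-trait correlations.
mean_fill = np.full(miss.sum(), X_full[obs, 2].mean())

# Regression imputation: predict the missing trait from the two
# observed ones via ordinary least squares.
A = np.column_stack([np.ones(obs.sum()), X_full[obs, :2]])
coef, *_ = np.linalg.lstsq(A, X_full[obs, 2], rcond=None)
reg_fill = np.column_stack([np.ones(miss.sum()), X_full[miss, :2]]) @ coef

truth = X_full[miss, 2]
rmse_mean = np.sqrt(np.mean((mean_fill - truth) ** 2))
rmse_reg = np.sqrt(np.mean((reg_fill - truth) ** 2))
```

On data this correlated, exploiting the observed traits reduces the imputation error by an order of magnitude relative to mean filling.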

4.
MOTIVATION: The identification of physiological processes underlying and generating the expression pattern observed in microarray experiments is a major challenge. Principal component analysis (PCA) is a linear multivariate statistical method that is regularly employed for that purpose as it provides a reduced-dimensional representation for subsequent study of possible biological processes responding to the particular experimental conditions. Making explicit the data assumptions underlying PCA highlights their lack of biological validity thus making biological interpretation of the principal components problematic. A microarray data representation which enables clear biological interpretation is a desirable analysis tool. RESULTS: We address this issue by employing the probabilistic interpretation of PCA and proposing alternative linear factor models which are based on refined biological assumptions. A practical study on two well-understood microarray datasets highlights the weakness of PCA and the greater biological interpretability of the linear models we have developed.

5.
Graphical techniques have become powerful tools for the visualization and analysis of complicated biological systems. However, such a graphical representation cannot be drawn in 2D/3D space when the data have more than three dimensions. The proposed method, a combination dimensionality reduction approach (CDR), consists of two parts: (i) principal component analysis (PCA) with a newly defined parameter ρ and (ii) locally linear embedding (LLE) with a proposed graphical selection for its optional parameter k. The CDR approach with ρ and k not only avoids loss of principal information, but also preserves the global high-dimensional structure well in low-dimensional (2D or 3D) space. Applications of the CDR to characteristic analysis at different codon positions in genomes show that the method is a useful tool with which biologists can uncover biological knowledge.

6.
Because most macroecological and biodiversity data are spatially autocorrelated, special tools for describing spatial structures and dealing with hypothesis testing are usually required. Unfortunately, most of these methods have not been available in a single statistical package. Consequently, using these tools is still a challenge for most ecologists and biogeographers. In this paper, we present sam (Spatial Analysis in Macroecology), a new, easy-to-use, freeware package for spatial analysis in macroecology and biogeography. Through an intuitive, fully graphical interface, this package allows the user to describe spatial patterns in variables and provides an explicit spatial framework for standard techniques of regression and correlation. Moran's I autocorrelation coefficient can be calculated based on a range of matrices describing spatial relationships, for original variables as well as for residuals of regression models, which can also include filtering components (obtained by standard trend surface analysis or by principal coordinates of neighbour matrices). sam also offers tools for correcting the number of degrees of freedom when calculating the significance of correlation coefficients. Explicit spatial modelling using several forms of autoregression and generalized least-squares models is also available. We believe this new tool will provide researchers with the basic statistical tools to resolve autocorrelation problems and, simultaneously, to explore spatial components in macroecological and biogeographical data. Although the program was designed primarily for applications in macroecology and biogeography, most of sam's statistical tools will be useful for all kinds of surface pattern spatial analysis. The program is freely available at http://www.ecoevol.ufg.br/sam (permanent URL at http://purl.oclc.org/sam/).
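Moran's I, the coefficient at the core of sam's descriptive tools, is straightforward to compute from a spatial weights matrix. A minimal sketch on a hypothetical gridded variable (not sam's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# 100 "sites" on a 10x10 grid with a smooth north-south richness
# gradient (hypothetical data), and binary rook-contiguity weights
# (sites sharing a grid edge are neighbours).
coords = np.array([(i, j) for i in range(10) for j in range(10)])
richness = coords[:, 0] + 0.5 * rng.normal(size=100)
W = (np.abs(coords[:, None] - coords[None, :]).sum(axis=2) == 1).astype(float)

def morans_i(x, W):
    # I = (n / S0) * (z' W z) / (z' z), with z the centred variable
    z = x - x.mean()
    return len(x) / W.sum() * (z @ W @ z) / (z @ z)

i_gradient = morans_i(richness, W)                   # strong positive autocorrelation
i_shuffled = morans_i(rng.permutation(richness), W)  # ~ -1/(n-1) under randomness
```

A spatially structured surface gives a large positive I, while the same values randomly permuted over the grid give a value near zero, which is the contrast used to test for autocorrelation.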

7.
The objective of this study was to investigate the usefulness of an array of statistical techniques to describe relationships between instrumental data and non-oral sensory texture profiling scores, using a range of model processed cheese analogs as an example. Pairwise correlation, used as an exploratory tool, showed no significant correlation for flexibility and greasiness with any individual instrumental parameter. Stepwise regression, principal component regression and partial least squares regression were used to generate models for firmness, stickiness and curdiness of the analogs studied. No models could be generated for flexibility and greasiness, and models for rubberiness had poor quality of fit compared with the other sensory attributes. In general, firmness, stickiness and curdiness were satisfactorily modeled by using chemical data and small deformation rheological parameters. Compression data (large deformation), often used in correlation studies regarding the texture of cheese, did not necessarily lead to better correlation results in comparison with the other instrumental parameters used in this research.

8.

Root cause analysis (RCA) is one of the most prominent tools used to comprehensively evaluate a biopharmaceutical production process. Despite its widespread use in industry, the Food and Drug Administration has observed many unsuitable approaches to RCA in recent years. The reasons for these unsuitable approaches are the use of incorrect variables during the analysis and a lack of process understanding, both of which impede correct model interpretation. Two major approaches to performing RCA currently dominate the chemical and pharmaceutical industries: raw data analysis and the feature-based approach. Both techniques are able to identify the significant variables causing the variance of the response. Although they differ in data unfolding, the same tools, such as principal component analysis and partial least squares regression, are used in both concepts. In this article we demonstrate the strengths and weaknesses of both approaches and show that fusing them yields a comprehensive and effective workflow that improves process understanding; we demonstrate this workflow with an example. The presented workflow saves analysis time and reduces the effort of data mining by easily detecting the most important variables within a given dataset. The resulting process knowledge can then be translated into new hypotheses, which can be tested experimentally and thereby effectively improve process robustness.

9.
10.
This paper briefly presents the aims, requirements and results of partial least squares regression analysis (PLSR), and its potential utility in ecological studies. This statistical technique is particularly well suited to analyzing a large array of related predictor variables (i.e. not truly independent), with a sample size that is small relative to the number of independent variables, and to cases in which complex phenomena or syndromes must be approached as a combination of several independently obtained variables. A simulation experiment was carried out to compare this technique with multiple regression (MR) and with a combination of principal component analysis and multiple regression (PCA+MR), varying the number of predictor variables and sample sizes. PLSR models explained an amount of variance similar to that obtained by MR and PCA+MR. However, PLSR was more reliable than the other techniques when identifying relevant variables and their magnitudes of influence, especially in cases of small sample size and low tolerance. Finally, we present an example of PLSR to illustrate its application and interpretation in ecology.
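A single-component PLSR fit (one step of the NIPALS algorithm) is compact enough to sketch directly; the data below are simulated to mimic the small-sample, collinear-predictor situation the paper describes:

```python
import numpy as np

rng = np.random.default_rng(4)

# 15 "observations" and 10 collinear predictors: fewer samples than a
# comfortable ratio to variables, which is where PLSR is intended to
# outperform multiple regression. All numbers are illustrative.
n, p = 15, 10
latent = rng.normal(size=n)
X = latent[:, None] + 0.5 * rng.normal(size=(n, p))
y = latent + 0.2 * rng.normal(size=n)

# One PLS component: weight predictors by their covariance with the
# response, then regress y on the resulting latent score.
Xc, yc = X - X.mean(axis=0), y - y.mean()
w = Xc.T @ yc
w /= np.linalg.norm(w)              # weight vector
t = Xc @ w                          # latent score
b = (t @ yc) / (t @ t)              # inner regression coefficient

y_hat = b * t + y.mean()
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

Because the component is built from the covariance with the response, one latent score already captures most of the predictable variance despite there being nearly as many predictors as observations.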

11.
The present review attempts to cover a number of methods that have appeared in the last few years for performing quantitative proteome analysis. However, due to the large number of methods described for both electrophoretic and chromatographic approaches, we have limited this review to conventional two-dimensional (2-D) map analysis, which orthogonally couples a charge-based step (isoelectric focusing) to a size-based separation step (sodium dodecyl sulfate electrophoresis). The first and oldest method applied to 2-D map data reduction is based on statistical analysis performed on sets of gels via powerful software packages, such as Melanie, PDQuest, Z3 and Z4000, Phoretix and Progenesis. This method calls for separately running a number of replicas for control and treated samples. The two sets of data are then merged and compared via a number of software packages, which we describe. In addition to commercially available systems, a number of home-made approaches for 2-D map comparison have been described recently and are also reviewed. They are based on fuzzification of the digitized 2-D gel image coupled to linear discriminant analysis, three-way principal component analysis, or a combination of principal component analysis and soft independent modeling of class analogy. These statistical tools appear to perform well in differential proteomic studies.

12.
Geometric morphometrics is the statistical analysis of form based on Cartesian landmark coordinates. After separating shape from the overall size, position, and orientation of the landmark configurations, the resulting Procrustes shape coordinates can be used for statistical analysis. Kendall shape space, the mathematical space induced by the shape coordinates, is a metric space that can be approximated locally by a Euclidean tangent space. Thus, notions of distance (similarity) between shapes or of the length and direction of developmental and evolutionary trajectories can be meaningfully assessed in this space. Results of statistical techniques that preserve these convenient properties—such as principal component analysis, multivariate regression, or partial least squares analysis—can be visualized as actual shapes or shape deformations. The Procrustes distance between a shape and its relabeled reflection is a measure of bilateral asymmetry. Shape space can be extended to form space by augmenting the shape coordinates with the natural logarithm of Centroid Size, a measure of size in geometric morphometrics that is uncorrelated with shape for small isotropic landmark variation. The thin-plate spline interpolation function is the standard tool to compute deformation grids and 3D visualizations. It is also central to the estimation of missing landmarks and to the semilandmark algorithm, which permits the inclusion of outlines and surfaces in geometric morphometric analysis. The powerful visualization tools of geometric morphometrics and the typically large number of shape variables give rise to a specific exploratory style of analysis, allowing the identification and quantification of previously unknown shape features.
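The Procrustes superimposition underlying these shape coordinates can be sketched as follows, using a hypothetical landmark configuration and a rotated, scaled, translated copy of it:

```python
import numpy as np

rng = np.random.default_rng(5)

# A hypothetical configuration of six 2D landmarks, and a copy that has
# been rotated, scaled, and translated (same shape, different size/pose).
ref = rng.normal(size=(6, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
target = 2.5 * ref @ R.T + np.array([3.0, -1.0])

def superimpose(a, b):
    # Ordinary Procrustes: remove position (centre) and centroid size
    # (scale to unit norm), then find the rotation of b that best
    # matches a via the SVD of b' a.
    a = a - a.mean(axis=0); b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a); b = b / np.linalg.norm(b)
    u, _, vt = np.linalg.svd(b.T @ a)
    return a, b @ (u @ vt)

a_al, b_al = superimpose(ref, target)
proc_dist = np.linalg.norm(a_al - b_al)   # Procrustes distance; ~0 here
```

After superimposition the two configurations coincide up to numerical precision, since they differ only in size, position, and orientation; between genuinely different shapes, the same residual norm is the Procrustes distance used for statistics in the tangent space.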

13.
Application of independent component analysis to microarrays
We apply linear and nonlinear independent component analysis (ICA) to project microarray data into statistically independent components that correspond to putative biological processes, and to cluster genes according to over- or under-expression in each component. We test the statistical significance of enrichment of gene annotations within clusters. ICA outperforms other leading methods, such as principal component analysis, k-means clustering and the Plaid model, in constructing functionally coherent clusters on microarray datasets from Saccharomyces cerevisiae, Caenorhabditis elegans and human.
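A minimal FastICA implementation (symmetric orthogonalization, tanh contrast) illustrates the decomposition on toy data; this is a generic sketch, not the authors' linear/nonlinear ICA pipeline:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two independent, non-Gaussian "source programmes" and a linear
# mixture of them (a toy analogue of mixed expression signatures).
n = 2000
S = np.vstack([rng.laplace(size=n),
               np.sign(rng.normal(size=n)) * rng.random(n) ** 2])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whiten the mixtures: zero mean, identity covariance.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Xc))
Z = np.diag(d ** -0.5) @ E.T @ Xc

# Symmetric FastICA with the tanh contrast function.
W = rng.normal(size=(2, 2))
for _ in range(100):
    G = np.tanh(W @ Z)
    W = G @ Z.T / n - np.diag((1 - G ** 2).mean(axis=1)) @ W
    u, _, vt = np.linalg.svd(W)
    W = u @ vt                        # symmetric decorrelation

Y = W @ Z                             # recovered sources (order/sign arbitrary)
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
```

Up to the usual permutation and sign ambiguity of ICA, each recovered component correlates almost perfectly with one of the true sources, which is what makes the components interpretable as separate underlying processes.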

14.
In this study, we wanted to inspect whether evolutionarily driven differences in primary sequences could correlate with, and thus predict, the genetic diversity of related marker loci, which is an important criterion for assessing the quality of any DNA marker. We adopted a new approach to quantitative symbolic DNA sequence analysis, called DNA random walk representation, to study multiallelic marker loci from Begonia × tuberhybrida Voss. We found a significant correlation between random walk-derived digital invariants and the genetic diversity of the marker loci. Specifically, on the 3D contour plot of a multivariate principal component analysis (PCA), we revealed a statistical correlation between the first two PCA factors and the number of alleles per marker locus. Based on that correlation, we suggest that the DNA walk representation may predict allele-rich loci solely from their primary sequences, which improves the current design of new DNA germplasm identifiers.

15.
In many applications of signal processing, especially in communications and biomedicine, preprocessing is necessary to remove noise from data recorded by multiple sensors. Typically, each sensor or electrode measures a noisy mixture of the original source signals. In this paper a noise reduction technique using independent component analysis (ICA) and subspace filtering is presented. In this approach we apply subspace filtering not to the observed raw data but to a demixed version of these data obtained by ICA. Finite impulse response filters are employed, whose parameter vectors are estimated based on signal subspace extraction. ICA allows us to filter independent components. After the noise is removed we reconstruct the enhanced independent components to obtain clean original signals; i.e., we project the data back to the sensor level. Simulations as well as real application results for EEG-signal noise elimination are included to show the validity and effectiveness of the proposed approach. Received: 6 November 2000 / Accepted in revised form: 12 November 2001

16.

Background  

Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional, inherently noisy and contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications.

17.
Biomedical spectroscopic experiments generate large volumes of data. For accurate, robust diagnostic tools the data must be analyzed for only a few characteristic observations per subject, and a large number of subjects must be studied. We describe here two of the current data analytic approaches applied to this problem: SIMCA (principal component analysis, partial least squares), and the statistical classification strategy (SCS). We demonstrate the application of the SCS by three examples of its use in analyzing 1H NMR spectra: screening for colon cancer, characterization of thyroid cancer, and distinguishing cancer from cholangitis in the biliary tract.

18.
Gene expression datasets are large and complex, with many variables and unknown internal structure. We apply independent component analysis (ICA) to derive a less redundant representation of the expression data. The decomposition produces components with minimal statistical dependence and reveals biologically relevant information. We then apply cluster analysis (an important and popular tool for gaining an initial understanding of data, usually employed for class discovery) to the transformed data. The proposed self-organizing map (SOM)-based clustering algorithm automatically determines the number of 'natural' subgroups of the data, aided by the available prior knowledge of the functional categories of genes. An entropy criterion allows each gene to be assigned to multiple classes, which is closer to the biological reality. These features, however, are not achieved at the cost of the algorithm's simplicity, since the map grows on a simple grid structure and the learning algorithm remains the same as Kohonen's.

19.
H Gao, T Zhang, Y Wu, Y Wu, L Jiang, J Zhan, J Li, R Yang. Heredity (2014) 113(6):526-532
Given the drawbacks of implementing multivariate analysis for mapping multiple traits in genome-wide association studies (GWAS), principal component analysis (PCA) has been widely used to generate independent 'super traits' from the original multivariate phenotypic traits for univariate analysis. However, parameter estimates in this framework may not be the same as those from the joint analysis of all traits, leading to spurious linkage results. In this paper, we propose to perform the PCA on the residual covariance matrix instead of the phenotypic covariance matrix, based on which multiple traits are transformed to a group of pseudo principal components. The PCA of the residual covariance matrix allows analyzing each pseudo principal component separately. In addition, all parameter estimates are equivalent to those obtained from the joint multivariate analysis under a linear transformation. A fast least absolute shrinkage and selection operator (LASSO) for estimating the sparse oversaturated genetic model greatly reduces the computational costs of this procedure. Extensive simulations show the statistical and computational efficiency of the proposed method. We illustrate this method in a GWAS for 20 slaughtering traits and meat quality traits in beef cattle.

20.
Process Biochemistry (2007) 42(7):1124-1134
2D spectrofluorometry produces a large volume of spectral data during fermentation processes with recombinant E. coli, which can be analyzed using chemometric methods such as principal component analysis (PCA), principal component regression (PCR) and partial least squares regression (PLS). An analysis of the spectral data by PCA results in scores and loadings that are not only visualized in the score-loading plots but are also used to monitor the fermentation processes on-line. The score plots provided useful qualitative information on four fermentation processes for the production of extracellular 5-aminolevulinic acid (ALA). Two chemometric models (PCR and PLS) were used to examine the correlation between the 2D fluorescence spectra and a few parameters of the fermentation processes. The results showed that PLS had slightly better calibration and prediction performance than PCR.
