Similar articles
 Found 20 similar articles; search took 359 ms
1.
Metabolomics data are typically complex and high dimensional. Multivariate dimension-reducing techniques have thus been developed for analysing metabolomics data to disclose underlying relationships, with principal component analysis (PCA) as the most widely applied technique. Despite its widespread use in metabolomics, PCA has shortcomings that limit its applicability. Several approaches have been proposed to overcome these limitations, and we describe an advanced disjoint PCA (DPCA) model, termed concurrent class analysis and abbreviated as CONCA. CONCA is a new model, and is unique in linking DPCA models to a traditional PCA model. This is accomplished by restructuring the input data matrix, applying DPCA group models to the restructured data, and combining the DPCA models in order to replicate a traditional PCA. We applied the CONCA model to a metabolomics data set on isovaleric acidaemia (IVA), a rare inherited metabolic disorder. The outcome showed that three of the variables with high discrimination value identified through the CONCA analysis are prominent organic acid biomarkers for IVA. Moreover, three further minor metabolites associated with the disease, and two arising as a consequence of treatment, were likewise identified as important discriminatory variables. The benefit of the CONCA model is thus its ability to disclose information concerning each individual group and to identify the variables important in discrimination (VIDs), which are also responsible for group separation.
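The idea of fitting a separate (disjoint) PCA model to each group can be sketched as follows. This is only a minimal illustration in plain Python and not the CONCA procedure itself: the toy data, the group names, and the helpers `center`, `covariance` and `first_component` are assumptions made for the example, and only the first principal component of each group is extracted, by power iteration.

```python
import math
import random

def center(rows):
    """Subtract the column means so that every variable has mean zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of rows that are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(rows, iterations=500, seed=0):
    """Return (loading vector, explained variance) of the first PC."""
    cov = covariance(center(rows))
    p = len(cov)
    # Random start vector: almost surely not orthogonal to the leading eigenvector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance at all in this group
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the variance along the component.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue

# Hypothetical two-variable toy data, one list of samples per group.
groups = {
    "cases": [[1, 1], [2, 2], [3, 3]],
    "controls": [[1, 3], [2, 2], [3, 1]],
}

# One PCA model per group (disjoint models), each fitted on its own samples only.
disjoint_models = {name: first_component(rows) for name, rows in groups.items()}
```

In this toy example the two groups vary along different directions, which a single pooled PCA would blur together; the sign of each loading vector is arbitrary, as in any PCA.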

2.
Isovaleric acidemia (IVA, MIM 248600) can be a severe and potentially life-threatening disease in affected neonates, but with a positive prognosis on treatment for some phenotypes. This study presents the first application of metabolomics to evaluate the metabolite profiles derived from urine samples of untreated and treated IVA patients as well as of obligate heterozygotes. All IVA patients carried the same homozygous c.367 G > A nucleotide change in exon 4 of the IVD gene but manifested phenotypic diversity. Concurrent class analysis (CONCA) was used to compare all the metabolites from the original complete data set obtained from the three case and two control groups used in this investigation. This application of CONCA has not been reported previously, and is used here to compare four different modes of scaling of all metabolites. The variables important in discrimination from the CONCA thus enabled the recognition of different metabolic patterns encapsulated within the data sets that would not have been revealed by using only one mode of scaling. Application of multivariate and univariate analyses disclosed 11 important metabolites that distinguished untreated IVA from controls. These included well-established diagnostic biomarkers of IVA, endogenous detoxification markers, and 3-hydroxycaproic acid, an indicator of ketosis, but not reported previously for this disease. Nine metabolites were identified that reflected the effect of treatment of IVA. They included detoxification products and indicators related to the high carbohydrate and low protein diet which formed the hallmark of the treatment. This investigation also provides the first comparative metabolite profile for heterozygotes of this inherited metabolic disorder. The detection of informative metabolites in even very low concentrations in all three experimental groups highlights the potential advantage of the holistic mode of analysis of inherited metabolic diseases in a metabolomics investigation.  

3.
PCA (principal components analysis) and ANN (artificial neural network) are two broadly used pattern recognition methods in metabolomics data mining. Yet their limitations are sometimes great obstacles for researchers. In this paper the wavelet transform (WT) method was integrated with PCA and ANN to improve their performance in handling metabolomics data. A dataset was decomposed by wavelets and then reconstructed. The "hard thresholding" algorithm was used, through which the detail information was discarded, and the entire "metabolomics image" was reconstructed from the significant information. It was assumed that the most relevant information was captured after this process. It was found that, thanks to its ability to denoise data, the WT method could significantly improve the performance of the non-linear essence-extracting method ANN in classifying samples; further integration of WT with PCA showed that WT could greatly enhance the ability of PCA to distinguish one group of samples from another and also its ability to identify potential biomarkers. The results highlighted WT as a promising solution for bridging the gap between huge volumes of data and the instructive biological information.
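The decompose, threshold and reconstruct cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses the Haar wavelet (the abstract does not name the wavelet family), requires a signal whose length is a power of two, and the function names and the threshold value are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal):
    """Full Haar decomposition; details[0] holds the finest-scale coefficients."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose, starting from the coarsest level."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx

def hard_threshold(details, threshold):
    """Keep a detail coefficient unchanged if it is large, otherwise set it to zero."""
    return [[d if abs(d) > threshold else 0.0 for d in level] for level in details]

def denoise(signal, threshold):
    """Decompose, discard small detail coefficients, and reconstruct."""
    approx, details = haar_decompose(signal)
    return haar_reconstruct(approx, hard_threshold(details, threshold))

# Hypothetical noisy two-level profile (length 8, a power of two).
noisy = [4.0, 4.1, 3.9, 4.0, 9.0, 9.1, 8.9, 9.0]
smooth = denoise(noisy, threshold=0.5)
```

With this threshold only the large coefficient that encodes the step between the two halves survives, so the small fluctuations within each half are flattened. Hard thresholding leaves the surviving coefficients untouched, in contrast to soft thresholding, which would also shrink them.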

4.
In this paper, we propose and implement a hybrid model combining two-directional two-dimensional principal component analysis ((2D)2PCA) and a Radial Basis Function Neural Network (RBFNN) to forecast stock market behavior. First, 36 stock market technical variables are selected as the input features, and a sliding window is used to obtain the input data of the model. Next, (2D)2PCA is utilized to reduce the dimension of the data and extract its intrinsic features. Finally, an RBFNN accepts the data processed by (2D)2PCA to forecast the next day's stock price or movement. The proposed model is applied to the Shanghai stock market index, and the experiments show that the model achieves a good level of fitness. The proposed model is then compared with models that use the traditional dimension reduction methods principal component analysis (PCA) and independent component analysis (ICA). The empirical results show that the proposed model outperforms the PCA-based model, as well as alternative models based on ICA and on the multilayer perceptron.

5.
A new approach to nonlinear modeling and adaptive monitoring using fuzzy principal component regression (FPCR) is proposed and then applied to a real wastewater treatment plant (WWTP) data set. First, principal component analysis (PCA) is used to reduce the dimensionality of data and to remove collinearity. Second, the adaptive credibilistic fuzzy-c-means method is used to appropriately monitor diverse operating conditions based on the PCA score values. Then a new adaptive discrimination monitoring method is proposed to distinguish between a large process change and a simple fault. Third, an FPCR method is proposed, where the Takagi-Sugeno-Kang (TSK) fuzzy model is employed to model the relation between the PCA score values and the target output to avoid the over-fitting problem with original variables. Here, the rule bases, the centers and the widths of the TSK fuzzy model are found by heuristic methods. The proposed FPCR method is applied to predict the output variable, the reduction of chemical oxygen demand in the full-scale WWTP. The result shows that it has the ability to model the nonlinear process and multiple operating conditions and is able to identify various operating regions and discriminate between a sustained fault and a simple fault (or abnormalities) occurring within the process data.

6.

Principal component analysis (PCA) is probably one of the most used methods for exploratory data analysis. However, it may not always be effective when there are multiple influential factors. In this paper, the use of multiblock PCA for analysing such types of data is demonstrated through a real metabolomics study combined with a series of data simulating two underlying influential factors with different types of interactions based on 2 × 2 experimental designs. The performance of multiblock PCA is compared with those of PCA and also ANOVA-PCA, which is another PCA extension developed to solve similar problems. The results demonstrate that multiblock PCA is highly efficient at analysing such types of data which contain multiple influential factors. These models give the most comprehensive view of data compared to the other two methods. The combination of super scores and block scores shows not only the general trends of change caused by each of the influential factors but also the subtle changes within each combination of the factors and their levels. It is also highly resistant to the addition of ‘irrelevant’ competing information, and the first PC remains the most discriminant one, which neither of the other two methods was able to achieve. The reason for this property was demonstrated by employing a 2 × 3 experimental design. Finally, the validity of the results shown by the multiblock PCA was tested using permutation tests and the results suggested that the inherent risk of over-fitting of this type of approach is low.
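The permutation testing mentioned at the end of this abstract can be illustrated with a generic label-permutation test. The sketch below is an assumption-laden example, not the validation scheme of the paper: the statistic (difference of class means along a single score), the toy scores and the function names are all hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def separation(scores, labels):
    """Absolute difference between the mean scores of class 0 and class 1."""
    a = [s for s, l in zip(scores, labels) if l == 0]
    b = [s for s, l in zip(scores, labels) if l == 1]
    return abs(mean(a) - mean(b))

def permutation_p_value(scores, labels, n_permutations=999, seed=0):
    """Share of label shufflings that separate the classes at least as well."""
    rng = random.Random(seed)
    observed = separation(scores, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between scores and labels
        if separation(scores, shuffled) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical scores on one component for two groups of five samples.
scores = [0.10, 0.20, 0.15, 0.05, 0.12, 2.10, 2.00, 2.20, 1.90, 2.05]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(scores, labels)
```

A small p-value indicates that the observed separation is unlikely to arise from chance grouping alone, which is the sense in which a permutation test guards against over-fitted models.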


7.
Clustering and correlation analysis techniques have become popular tools for the analysis of data produced by metabolomics experiments. The results obtained from these approaches provide an overview of the interactions between objects of interest. Often in these experiments, one is more interested in information about the nature of these relationships, e.g., cause-effect relationships, than in the actual strength of the interactions. Finding such relationships is of crucial importance as most biological processes can only be understood in this way. Bayesian networks allow representation of these cause-effect relationships among variables of interest in terms of whether and how they influence each other given that a third, possibly empty, group of variables is known. This technique also allows the incorporation of prior knowledge as established from the literature or from biologists. The representation of these relationships as a directed graph is highly intuitive and helps to understand these processes. This paper describes how constraint-based Bayesian networks can be applied to metabolomics data and can be used to uncover the important pathways which play a significant role in the ripening of fresh tomatoes. We also show here that this method of reconstructing pathways is intuitive and performs better than classical techniques. Methods for learning Bayesian network models are powerful tools for the analysis of data of the magnitude generated by metabolomics experiments. They allow one to model cause-effect relationships and help in understanding the underlying processes.

8.
Analysis of longitudinal metabolomics data   Total citations: 7, self-citations: 0, citations by others: 7
MOTIVATION: Metabolomics datasets are generally large and complex. Using principal component analysis (PCA), a simplified view of the variation in the data is obtained. The PCA model can be interpreted and the processes underlying the variation in the data can be analysed. In metabolomics, often a priori information is present about the data. Various forms of this information can be used in an unsupervised data analysis with weighted PCA (WPCA). A WPCA model will give a view on the data that is different from the view obtained using PCA, and it will add to the interpretation of the information in a metabolomics dataset. RESULTS: A method is presented to translate spectra of repeated measurements into weights describing the experimental error. These weights are used in the data analysis with WPCA. The WPCA model will give a view on the data where the non-uniform experimental error is accounted for. Therefore, the WPCA model will focus more on the natural variation in the data. AVAILABILITY: M-files for MATLAB for the algorithm used in this research are available at http://www-its.chem.uva.nl/research/pac/Software/pcaw.zip.

9.
Nuclear magnetic resonance (NMR) spectroscopy acts as the best tool that can be used in tissue engineering scaffolds to investigate unknown metabolites. Moreover, metabolomics is a systems approach for examining in vivo and in vitro metabolic profiles, which promises to provide data on cancer metabolic alterations. Metabolomic profiling allows the activity of small molecules and metabolic alterations to be measured. Furthermore, metabolic profiling also provides high spectral resolution, which can then be linked to potential metabolic relationships. An altered metabolism is a hallmark of cancer that can control many malignant properties to drive tumorigenesis. Metabolite targeting and metabolic engineering contribute to carcinogenesis by proliferation and metabolic differentiation. The resulting metabolic differences are examined with traditional chemometric methods such as principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). In this review, we examine NMR-based activity metabolomic platforms that can be used to analyze various fluxomics and for multivariate statistical analysis in cancer. We also aim to provide the reader with a basic understanding of NMR spectroscopy, cancer metabolomics, target profiling, chemometrics, and multifunctional tools for metabolomics discrimination, with a focus on metabolic phenotypic diversity for cancer therapeutics.

10.
Being a relatively new addition to the 'omics' field, metabolomics is still evolving its own computational infrastructure and assessing its own computational needs. Due to its strong emphasis on chemical information and because of the importance of linking that chemical data to biological consequences, metabolomics must combine elements of traditional bioinformatics with traditional cheminformatics. This is a significant challenge as these two fields have evolved quite separately and require very different computational tools and skill sets. This review is intended to familiarize readers with the field of metabolomics and to outline the needs, the challenges and the recent progress being made in four areas of computational metabolomics: (i) metabolomics databases; (ii) metabolomics LIMS; (iii) spectral analysis tools for metabolomics and (iv) metabolic modeling.

11.
Principal component analysis (PCA) has been applied to a fed-batch fermentation for the production of streptokinase to identify the variables which are essential to formulate an adequate model. To mimic an industrial situation, Gaussian noise was introduced in the feed rate of the substrate. Both in the presence and in the absence of noise, the same five variables out of seven were selected by PCA. The minimal model, trained separately without and with noise, was able to predict satisfactorily the course of the fermentation for a condition not employed in training. These observations attest to the suitability of PCA for formulating minimal models for industrial-scale fermentations.

12.
As a systematic and holistic study of metabolites in plants, animals, and human beings, metabolomics has advanced considerably in recent years, due largely to the rapid development of analytical technology and the application of multivariate data analysis methods. Exploratory data analysis, which has played a crucial role in this advance, aims to examine the natural data structure to reveal important information. Principal components analysis (PCA) is probably the most widely used technique for exploratory data analysis, but projection pursuit (PP) is another important method that often outperforms PCA because it is based on distributional rather than variance optimization. Recent algorithmic improvements have made the implementation of PP easier, but, when the sample size is small compared to the number of variables, it is found that PP (with kurtosis as a projection index) fails to give meaningful information. Mathematically, this involves the ill-posed inverse problem that also occurs for many other multivariate data analysis methods and results in overfitting. In this work, a regularized projection pursuit (RPP) method is proposed to solve this problem and iterative optimization algorithms are developed for both step-wise univariate and multivariate PP. The utility of the algorithms is established using simulated data, which also demonstrates the use of ridge trace plots for the optimization of the ridge parameter. Three experimental data sets in the public domain are also analyzed, including a study on soy bean disease (47 samples × 35 variables), NMR spectral data for glomerulonephritis patients (50 × 200) and metabolomics data from a bovine diet study (39 × 47). In all cases, RPP showed superior class separation compared to PCA or ordinary PP.

13.
The purpose of this study was to investigate whether visible and near-infrared (Vis-NIR) spectroscopy can be used for the diagnosis of anti-phospholipid syndrome (APS). Vis-NIR spectra from 90 plasma samples [anti-phospholipid antibodies (aPLs)-positive group, n=48; aPLs-negative group, n=42] were subjected to principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) to develop multivariate models to discriminate between aPLs-positive and aPLs-negative samples. Both PCA and SIMCA models were further assessed by the prediction of 84 masked other determinations. The PCA model predicted successful discrimination of the masked samples with respect to aPLs-positive and aPLs-negative. The SIMCA model predicted 42 of 48 (87.5%) aPLs-positive patients and 33 of 36 (91.7%) aPLs-negative patients of Vis-NIR spectra from masked samples correctly. These results suggest that Vis-NIR spectroscopy combined with multivariate analysis could provide a promising tool to objectively diagnose APS.
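The reported SIMCA figures can be checked with a few lines of arithmetic. The counts below are taken from the abstract; the variable names, and reading the two rates as sensitivity and specificity, are assumptions of this sketch.

```python
def rate(correct, total):
    """Percentage of correctly predicted samples."""
    return 100.0 * correct / total

# Counts for the masked samples as stated in the abstract.
positive_correct, positive_total = 42, 48   # aPLs-positive patients
negative_correct, negative_total = 33, 36   # aPLs-negative patients

sensitivity = rate(positive_correct, positive_total)   # 87.5
specificity = rate(negative_correct, negative_total)   # about 91.7
overall = rate(positive_correct + negative_correct,
               positive_total + negative_total)        # 75 of 84, about 89.3
```

The two group sizes add up to 48 + 36 = 84, which matches the 84 masked determinations mentioned in the abstract.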

14.

Partial Least Squares-Discriminant Analysis (PLS-DA) is a PLS regression method with a special binary ‘dummy’ y-variable and it is commonly used for classification purposes and biomarker selection in metabolomics studies. Several statistical approaches are currently in use to validate outcomes of PLS-DA analyses, e.g. double cross-validation procedures or permutation testing. However, there is great inconsistency in the optimization and the assessment of performance of PLS-DA models due to the many different diagnostic statistics currently employed in metabolomics data analyses. In this paper, properties of four diagnostic statistics of PLS-DA, namely the number of misclassifications (NMC), the Area Under the Receiver Operating Characteristic (AUROC), Q² and Discriminant Q² (DQ²), are discussed. All four diagnostic statistics are used in the optimization and the performance assessment of PLS-DA models of three metabolomics data sets of different sizes, obtained with two different types of analytical platforms and with different levels of known differences between two groups: control and case groups. Statistical significance of the obtained PLS-DA models was evaluated with permutation testing. PLS-DA models obtained with NMC and AUROC are more powerful in detecting very small differences between groups than models obtained with Q² and DQ². Reproducibility of the outcomes of the obtained PLS-DA models, model complexity and permutation test distributions are also investigated to explain this phenomenon. DQ² and Q² (in contrast to NMC and AUROC) prefer PLS-DA models with lower complexity and require a higher number of permutation tests and submodels to accurately estimate the statistical significance of the model performance. NMC and AUROC seem to be more efficient and more reliable diagnostic statistics and should be recommended in two-group discrimination metabolomic studies.
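The four diagnostic statistics can be computed from the dummy class labels and the (ideally cross-validated) predicted y-values. The sketch below is illustrative only: the toy predictions, the 0.5 class cutoff and the function names are assumptions, and the DQ² rule follows the usual description of that statistic (prediction errors beyond the class label are not penalized) rather than anything stated in the abstract.

```python
def nmc(y_true, y_pred, cutoff=0.5):
    """Number of misclassifications when predictions above the cutoff are called class 1."""
    return sum(1 for t, p in zip(y_true, y_pred) if (p > cutoff) != (t == 1))

def auroc(y_true, y_pred):
    """Fraction of (case, control) pairs ranked correctly; ties count one half."""
    cases = [p for t, p in zip(y_true, y_pred) if t == 1]
    controls = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def q2(y_true, y_pred, discriminant=False):
    """Q2 = 1 - PRESS / TSS; with discriminant=True, overshoots past the label cost nothing."""
    y_mean = sum(y_true) / len(y_true)
    tss = sum((t - y_mean) ** 2 for t in y_true)
    press = 0.0
    for t, p in zip(y_true, y_pred):
        beyond_label = (t == 1 and p > 1) or (t == 0 and p < 0)
        if discriminant and beyond_label:
            continue  # DQ2: prediction is on the correct side, beyond the class label
        press += (t - p) ** 2
    return 1.0 - press / tss

# Hypothetical dummy labels (0 = control, 1 = case) and predicted y-values.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [-0.2, 0.3, 0.6, 0.4, 0.9, 1.3]

stats = {
    "NMC": nmc(y_true, y_pred),
    "AUROC": auroc(y_true, y_pred),
    "Q2": q2(y_true, y_pred),
    "DQ2": q2(y_true, y_pred, discriminant=True),
}
```

NMC and AUROC depend only on which side of the cutoff, or in which order, the predictions fall, whereas Q² and DQ² also depend on how far each prediction lies from its 0 or 1 target.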


15.
Aims Preserving and restoring Tamarix ramosissima is urgently required in the Tarim Basin, Northwest China. Species distribution models are regularly used to predict the biogeographical distribution of species in conservation and other management activities. However, uncertainty in the data and models inevitably reduces their predictive power. The major purpose of this study is to assess the impacts of predictor variables and species distribution models on simulating T. ramosissima distribution, to explore the relationships between predictor variables and species distribution models and to model the potential distribution of T. ramosissima in this basin.

Methods Three models, namely the generalized linear model (GLM), classification and regression tree (CART) and Random Forests, were selected and were processed on the BIOMOD platform. The presence/absence data of T. ramosissima in the Tarim Basin, which were calculated from vegetation maps, were used as response variables. Climate, soil and digital elevation model (DEM) data variables were divided into four datasets and then used as predictors. The four datasets were (i) climate variables, (ii) soil, climate and DEM variables, (iii) principal component analysis (PCA)-based climate variables and (iv) PCA-based soil, climate and DEM variables.

Important findings The results indicate that predictive variables for species distribution models should be chosen carefully, because too many predictors can reduce the predictive power. The effectiveness of using PCA to reduce the correlation among predictors and enhance the modelling power depends on the chosen predictor variables and models. Our results implied that it is better to reduce the correlated predictors before model processing. The Random Forests model was more precise than the GLM and CART models. The best model for T. ramosissima was the Random Forests model with climate predictors alone. The soil variables considered in this study could not significantly improve the model's prediction accuracy for T. ramosissima. The potential distribution area of T. ramosissima in the Tarim Basin is ~3.57 × 10⁴ km², so restoring T. ramosissima there has the potential to mitigate global warming and produce bioenergy.

16.
In human metabolic profiling studies, between-subject variability is often the dominant feature and can mask the potential classifications of clinical interest. Conventional models such as principal component analysis (PCA) are usually not effective in such situations and it is therefore highly desirable to find a suitable model which is able to discover the underlying pattern hidden behind the high between-subject variability. In this study we employed two clinical metabolomics data sets as the testing grounds, in which such variability had been observed, and we demonstrate that a proper choice of chemometrics model can help to overcome this issue of high between-subject variability. Two data sets were used to represent two different types of experimental designs. The first data set was obtained from a small-scale study investigating volatile organic compounds (VOCs) collected from chronic wounds using a skin patch device and analysed by thermal desorption-gas chromatography-mass spectrometry. Five patients were recruited, and for each patient three sites were sampled in triplicate: healthy skin, the boundary of the lesion and the top of the lesion. The aim was to discriminate these three types of samples based on their VOC profiles. The second data set was from a much larger study involving 35 healthy subjects, 47 patients with chronic obstructive pulmonary disease and 33 with asthma. The VOCs in the breath of each subject were collected using a mask device and analysed again by GC–MS with the aim of discriminating the three types of subjects based on breath VOC profiles. Multilevel simultaneous component analysis, multilevel partial least squares for discriminant analysis, ANOVA-PCA, and a novel simplified ANOVA-PCA model, which we have named ANOVA-Mean Centre (ANOVA-MC), were applied to these two data sets. Significantly improved results were obtained by using these models. We also present a novel validation procedure to verify statistically the results obtained from those models.

17.
To investigate whether visible and near-infrared (Vis-NIR) spectroscopy enables chronic fatigue syndrome (CFS) diagnosis, we subjected sera from CFS patients as well as healthy donors to Vis-NIR spectroscopy. Vis-NIR spectra in the 600-1100 nm region for sera from 77 CFS patients and 71 healthy donors were subjected to principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) to develop multivariate models to discriminate between CFS patients and healthy donors. The model was further assessed by the prediction of 99 masked other determinations (54 in the healthy group and 45 in the CFS patient group). The PCA model predicted successful discrimination of the masked samples. The SIMCA model predicted 54 of 54 (100%) healthy donors and 42 of 45 (93.3%) CFS patients of Vis-NIR spectra from masked serum samples correctly. These results suggest that Vis-NIR spectroscopy for sera combined with chemometrics analysis could provide a promising tool to objectively diagnose CFS.

18.
Capsule The highest densities of Meadow Pipits in Central Europe are found in lowland and upland wet meadows.

Aims To create a large-scale predictive model of Meadow Pipit density.

Methods We analysed factors affecting the density of the Meadow Pipit in Poland using data from 777 × 1 km study plots and a set of 22 environmental variables, including agriculture intensification and habitat-specific plant species as classifiers of meadow types. Predictors were selected using the variance inflation factor, then related to species density data using generalized additive models.

Results The best-supported model included 11 variables and was clearly better (Akaike information criterion weight = 0.47) than other models. The density of the Meadow Pipit reaches its highest levels on large areas of extensively used wet meadows as well as pastures where livestock graze and which show high photosynthetic activity in April.

Conclusion Some aspects of the environment that were not identified from remote sensing data were vital for determining relatively high density. Conservation efforts for preserving Meadow Pipit populations should focus on maintaining wet meadows and extensively grazed pastures. Given the results, the Meadow Pipit may be classified as a good indicator of traditional agriculture.
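The Akaike weight quoted in the Results is a normalized relative likelihood across the candidate models. A minimal sketch of the standard formula follows; the AIC values are hypothetical and are not taken from the study, and the function name is an assumption.

```python
import math

def akaike_weights(aic_values):
    """w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2), with delta_i = AIC_i - min(AIC)."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical AIC values for three candidate models (lower is better).
candidate_aic = [100.0, 102.0, 104.0]
weights = akaike_weights(candidate_aic)
```

The weights sum to one, so a value such as the reported 0.47 can be read as the share of support for the best model within the set of models that were compared.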

19.
20.
Abstract

Macroanalytic studies of the relationship between fertility and development have in the past been based mostly on cross‐sectional aggregate data from various countries. Because these countries belong to different models of the epidemiologic transition, variation in the dynamic relationship among these models should be allowed for. In this paper, various techniques (including linear and quadratic regression, a minimum‐maximum method of plotting the relationship, and a special approach of stepwise regression) were applied to a data set from 85 countries. The crude birth rate was used as the dependent variable with several demographic, economic, social health, and family planning indicators as independent variables, measured over the period 1950–75. The results confirm the existence of submodels of countries with varying relationships between fertility and its correlates. The results disallow direct transferability of the experience of one group of countries (such as Europe) to another group belonging to another model (such as the less developed countries). The study also found the strength of the family planning effort to be a significant factor and one to be singled out as a major contributor to the fertility decline between 1965 and 1975 in the developing countries. Its effect, however, stands to be enhanced in various degrees by concurrent social and economic development.
