Similar literature
Found 20 similar articles (search time: 78 ms)
1.

Metabolomics data are typically complex and high dimensional. Multivariate dimension-reducing techniques have thus been developed for analysing metabolomics data to disclose underlying relationships, with principal component analysis (PCA) as the technique mostly applied. Despite its widespread use in metabolomics, PCA has shortcomings that limit its applicability. Several approaches have been made to overcome these limitations and we describe an advanced disjoint PCA (DPCA) model, termed concurrent class analysis and abbreviated as CONCA. CONCA is a new model, and is unique in linking DPCA models to a traditional PCA model. This is accomplished by restructuring the input data matrix, applying DPCA group models to the restructured data, and combining the DPCA models in order to replicate a traditional PCA. We applied the CONCA model to a metabolomics data set on isovaleric acidaemia (IVA), a rare inherited metabolic disorder. The outcome showed that three of the variables with high discrimination value identified through the CONCA analysis are prominent organic acid biomarkers for IVA. Moreover, three further minor metabolites associated with the disease, and two as a consequence of treatment, were likewise identified as important discriminatory variables. The benefit of the CONCA model thus is its ability to disclose information concerning each individual group and to identify the variables important in discrimination (VIDs) which are also responsible for group separation.


2.
Isovaleric acidemia (IVA, MIM 248600) can be a severe and potentially life-threatening disease in affected neonates, but with a positive prognosis on treatment for some phenotypes. This study presents the first application of metabolomics to evaluate the metabolite profiles derived from urine samples of untreated and treated IVA patients as well as of obligate heterozygotes. All IVA patients carried the same homozygous c.367 G > A nucleotide change in exon 4 of the IVD gene but manifested phenotypic diversity. Concurrent class analysis (CONCA) was used to compare all the metabolites from the original complete data set obtained from the three case and two control groups used in this investigation. This application of CONCA has not been reported previously, and is used here to compare four different modes of scaling of all metabolites. The variables important in discrimination from the CONCA thus enabled the recognition of different metabolic patterns encapsulated within the data sets that would not have been revealed by using only one mode of scaling. Application of multivariate and univariate analyses disclosed 11 important metabolites that distinguished untreated IVA from controls. These included well-established diagnostic biomarkers of IVA, endogenous detoxification markers, and 3-hydroxycaproic acid, an indicator of ketosis, but not reported previously for this disease. Nine metabolites were identified that reflected the effect of treatment of IVA. They included detoxification products and indicators related to the high carbohydrate and low protein diet which formed the hallmark of the treatment. This investigation also provides the first comparative metabolite profile for heterozygotes of this inherited metabolic disorder. The detection of informative metabolites in even very low concentrations in all three experimental groups highlights the potential advantage of the holistic mode of analysis of inherited metabolic diseases in a metabolomics investigation.  

3.
PCA (principal components analysis) and ANN (artificial neural network) are two widely used pattern recognition methods in metabolomics data mining, yet their limitations are sometimes serious obstacles for researchers. In this paper the wavelet transform (WT) method was integrated with PCA and ANN to improve their performance in handling metabolomics data. A dataset was decomposed by wavelets and then reconstructed. The "hard thresholding" algorithm was used, through which the detail information was discarded, and the entire "metabolomics image" was reconstructed from the significant information. It was assumed that the most relevant information was captured by this process. Thanks to its ability to denoise data, the WT method significantly improved the performance of ANN, a non-linear essence-extracting method, in classifying samples; further integration of WT with PCA showed that WT could greatly enhance the ability of PCA to distinguish one group of samples from another and to identify potential biomarkers. The results highlight WT as a promising solution for bridging the gap between huge volumes of data and instructive biological information.
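The decompose, threshold and reconstruct cycle described in abstract 3 can be illustrated with a minimal sketch. The abstract does not name the wavelet family, the decomposition depth or the threshold, so the one-level Haar transform, the function names and the threshold value below are all assumptions chosen for brevity, not the authors' implementation.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence (assumed wavelet)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward: rebuild the signal from approximation and detail coefficients."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def hard_threshold(coeffs, threshold):
    """Hard thresholding: zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

def denoise(signal, threshold):
    """Decompose, discard small detail coefficients, then reconstruct."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, hard_threshold(detail, threshold))

# Hypothetical intensities: the small 1.0/1.1 wiggle is smoothed, the large 5.0/9.0 step is kept.
smoothed = denoise([1.0, 1.1, 5.0, 9.0], threshold=1.0)
```

With a threshold of zero nothing is discarded and the input is recovered, which is a quick sanity check that the transform pair is consistent.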

4.
In this paper, we propose and implement a hybrid model combining two-directional two-dimensional principal component analysis ((2D)²PCA) and a Radial Basis Function Neural Network (RBFNN) to forecast stock market behavior. First, 36 stock market technical variables are selected as the input features, and a sliding window is used to obtain the input data of the model. Next, (2D)²PCA is utilized to reduce the dimension of the data and extract its intrinsic features. Finally, an RBFNN accepts the data processed by (2D)²PCA to forecast the next day's stock price or movement. The proposed model is applied to the Shanghai stock market index, and the experiments show that the model achieves a good level of fitness. The proposed model is then compared with models that use the traditional dimension reduction methods principal component analysis (PCA) and independent component analysis (ICA). The empirical results show that the proposed model outperforms the PCA-based model, as well as alternative models based on ICA and on the multilayer perceptron.
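The sliding-window step in abstract 4 can be sketched as follows. The abstract gives neither the window length nor the data layout, so the function name `make_samples`, the window of 3 days and the toy two-variable rows are hypothetical; in the study each day would carry 36 technical variables, making every window a matrix of the kind (2D)²PCA operates on.

```python
def make_samples(features, targets, window):
    """Pair each run of `window` consecutive daily feature rows with the next day's target.

    features: list of per-day feature rows; targets: list of per-day target values.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(features) != len(targets):
        raise ValueError("features and targets must cover the same days")
    samples = []
    for start in range(len(features) - window):
        window_matrix = features[start:start + window]  # shape: window x n_variables
        next_day_target = targets[start + window]       # value to forecast
        samples.append((window_matrix, next_day_target))
    return samples

# Toy data: 5 days, 2 variables per day (the study used 36 variables).
days = [[d, d * 10] for d in range(5)]
closing = [100, 101, 102, 103, 104]
samples = make_samples(days, closing, window=3)
```

Here five days and a window of three yield two training pairs, because the last usable window must still leave one following day to serve as the target.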

5.
A new approach to nonlinear modeling and adaptive monitoring using fuzzy principal component regression (FPCR) is proposed and then applied to a real wastewater treatment plant (WWTP) data set. First, principal component analysis (PCA) is used to reduce the dimensionality of data and to remove collinearity. Second, the adaptive credibilistic fuzzy-c-means method is used to appropriately monitor diverse operating conditions based on the PCA score values. Then a new adaptive discrimination monitoring method is proposed to distinguish between a large process change and a simple fault. Third, an FPCR method is proposed, where the Takagi-Sugeno-Kang (TSK) fuzzy model is employed to model the relation between the PCA score values and the target output to avoid the over-fitting problem with original variables. Here, the rule bases, the centers and the widths of the TSK fuzzy model are found by heuristic methods. The proposed FPCR method is applied to predict the output variable, the reduction of chemical oxygen demand in the full-scale WWTP. The result shows that it has the ability to model the nonlinear process and multiple operating conditions and is able to identify various operating regions and discriminate between a sustained fault and a simple fault (or abnormalities) occurring within the process data.

6.
Clustering and correlation analysis techniques have become popular tools for the analysis of data produced by metabolomics experiments. The results obtained from these approaches provide an overview of the interactions between objects of interest. Often in these experiments, one is more interested in information about the nature of these relationships, e.g., cause-effect relationships, than in the actual strength of the interactions. Finding such relationships is of crucial importance as most biological processes can only be understood in this way. Bayesian networks allow representation of these cause-effect relationships among variables of interest in terms of whether and how they influence each other given that a third, possibly empty, group of variables is known. This technique also allows the incorporation of prior knowledge as established from the literature or from biologists. The representation of these relationships as a directed graph is highly intuitive and helps to understand these processes. This paper describes how constraint-based Bayesian networks can be applied to metabolomics data and can be used to uncover the important pathways which play a significant role in the ripening of fresh tomatoes. We also show here how this method of reconstructing pathways is intuitive and performs better than classical techniques. Methods for learning Bayesian network models are powerful tools for the analysis of data of the magnitude generated by metabolomics experiments. They allow one to model cause-effect relationships and help in understanding the underlying processes.

7.
In human metabolic profiling studies, between-subject variability is often the dominant feature and can mask the potential classifications of clinical interest. Conventional models such as principal component analysis (PCA) are usually not effective in such situations and it is therefore highly desirable to find a suitable model which is able to discover the underlying pattern hidden behind the high between-subject variability. In this study we employed two clinical metabolomics data sets as the testing grounds, in which such variability had been observed, and we demonstrate that a proper choice of chemometrics model can help to overcome this issue of high between-subject variability. Two data sets were used to represent two different types of experiment designs. The first data set was obtained from a small-scale study investigating volatile organic compounds (VOCs) collected from chronic wounds using a skin patch device and analysed by thermal desorption-gas chromatography-mass spectrometry. Five patients were recruited and for each patient three sites were sampled in triplicate: healthy skin, boundary of the lesion and top of the lesion; the aim was to discriminate these three types of samples based on their VOC profile. The second data set was from a much larger study involving 35 healthy subjects, 47 patients with chronic obstructive pulmonary disease and 33 with asthma. The VOCs in the breath of each subject were collected using a mask device and analysed again by GC–MS with the aim of discriminating the three types of subjects based on breath VOC profiles. Multilevel simultaneous component analysis, multilevel partial least squares for discriminant analysis, ANOVA-PCA, and a novel simplified ANOVA-PCA model, which we have named ANOVA-Mean Centre (ANOVA-MC), were applied to these two data sets. Significantly improved results were obtained by using these models. We also present a novel validation procedure to verify statistically the results obtained from those models.
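One way to see why multilevel models help with high between-subject variability in abstract 7 is to look at the within-subject part of the data, obtained by subtracting each subject's own mean profile. The sketch below shows only that centring step; it is an illustration under assumptions (the function name, the data layout and the toy values are hypothetical) and is not claimed to reproduce ANOVA-MC or any of the models named in the abstract.

```python
from collections import defaultdict

def within_subject_center(samples):
    """Subtract each subject's mean profile from that subject's samples.

    samples: list of (subject_id, [feature values]) pairs.
    Returns the centred feature lists in the original order.
    """
    totals = {}
    counts = defaultdict(int)
    for subject, values in samples:
        if subject not in totals:
            totals[subject] = [0.0] * len(values)
        totals[subject] = [t + v for t, v in zip(totals[subject], values)]
        counts[subject] += 1
    means = {s: [t / counts[s] for t in total] for s, total in totals.items()}
    return [[v - m for v, m in zip(values, means[subject])]
            for subject, values in samples]

# Hypothetical data: two patients with very different baselines but the same site effect.
data = [("p1", [10.0, 2.0]), ("p1", [12.0, 4.0]),
        ("p2", [100.0, 0.0]), ("p2", [102.0, 2.0])]
centred = within_subject_center(data)
```

After centring, both patients show the same pattern even though their raw baselines differ tenfold, which is the kind of structure that an uncentred PCA would tend to bury under the between-subject offset.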

8.
Analysis of longitudinal metabolomics data   Total citations: 7 (self-citations: 0; citations by others: 7)
MOTIVATION: Metabolomics datasets are generally large and complex. Using principal component analysis (PCA), a simplified view of the variation in the data is obtained. The PCA model can be interpreted and the processes underlying the variation in the data can be analysed. In metabolomics, often a priori information is present about the data. Various forms of this information can be used in an unsupervised data analysis with weighted PCA (WPCA). A WPCA model will give a view on the data that is different from the view obtained using PCA, and it will add to the interpretation of the information in a metabolomics dataset. RESULTS: A method is presented to translate spectra of repeated measurements into weights describing the experimental error. These weights are used in the data analysis with WPCA. The WPCA model will give a view on the data where the non-uniform experimental error is accounted for. Therefore, the WPCA model will focus more on the natural variation in the data. AVAILABILITY: M-files for MATLAB for the algorithm used in this research are available at http://www-its.chem.uva.nl/research/pac/Software/pcaw.zip.

9.
Nuclear magnetic resonance (NMR) spectroscopy acts as the best tool that can be used in tissue engineering scaffolds to investigate unknown metabolites. Moreover, metabolomics is a systems approach for examining in vivo and in vitro metabolic profiles, which promises to provide data on cancer metabolic alterations. Metabolomic profiling allows the activity of small molecules and metabolic alterations to be measured. Furthermore, metabolic profiling also provides high spectral resolution, which can then be linked to potential metabolic relationships. An altered metabolism is a hallmark of cancer that can control many malignant properties to drive tumorigenesis. Metabolite targeting and metabolic engineering contribute to carcinogenesis by proliferation and metabolic differentiation. The resulting metabolic differences are examined with traditional chemometric methods such as principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). In this review, we examine NMR-based activity metabolomic platforms that can be used to analyze various fluxomics and for multivariate statistical analysis in cancer. We also aim to provide the reader with a basic understanding of NMR spectroscopy, cancer metabolomics, target profiling, chemometrics, and multifunctional tools for metabolomics discrimination, with a focus on metabolic phenotypic diversity for cancer therapeutics.

10.
As a systematic and holistic study of metabolites in plants, animals, and human beings, metabolomics has advanced considerably in recent years, due largely to the rapid development of analytical technology and the application of multivariate data analysis methods. Exploratory data analysis, which has played a crucial role in this advance, aims to examine the natural data structure to reveal important information. Principal components analysis (PCA) is probably the most widely used technique for exploratory data analysis, but projection pursuit (PP) is another important method that often outperforms PCA because it is based on distributional rather than variance optimization. Recent algorithmic improvements have made the implementation of PP easier, but, when the sample size is small compared to the number of variables, it is found that PP (with kurtosis as a projection index) fails to give meaningful information. Mathematically, this involves the ill-posed inverse problem that also occurs for many other multivariate data analysis methods that result in overfitting. In this work, a regularized projection pursuit (RPP) method is proposed to solve this problem and iterative optimization algorithms are developed for both step-wise univariate and multivariate PP. The utility of the algorithms is established using simulated data, which also demonstrates the use of ridge trace plots for the optimization of the ridge parameter. Three experimental data sets in the public domain are also analyzed, including a study on soybean disease (47 samples × 35 variables), NMR spectral data for glomerulonephritis patients (50 × 200) and metabolomics data from a bovine diet study (39 × 47). In all cases, RPP showed superior class separation compared to PCA or ordinary PP.

11.
Being a relatively new addition to the 'omics' field, metabolomics is still evolving its own computational infrastructure and assessing its own computational needs. Due to its strong emphasis on chemical information and because of the importance of linking that chemical data to biological consequences, metabolomics must combine elements of traditional bioinformatics with traditional cheminformatics. This is a significant challenge as these two fields have evolved quite separately and require very different computational tools and skill sets. This review is intended to familiarize readers with the field of metabolomics and to outline the needs, the challenges and the recent progress being made in four areas of computational metabolomics: (i) metabolomics databases; (ii) metabolomics LIMS; (iii) spectral analysis tools for metabolomics and (iv) metabolic modeling.

12.
Principal component analysis (PCA) has been applied to a fed-batch fermentation for the production of streptokinase to identify the variables which are essential to formulate an adequate model. To mimic an industrial situation, Gaussian noise was introduced in the feed rate of the substrate. Both in the presence and in the absence of noise, the same five variables out of seven were selected by PCA. The minimal model, trained separately without and with noise, was able to predict satisfactorily the course of the fermentation for a condition not employed in training. These observations attest to the suitability of PCA for formulating minimal models for industrial-scale fermentations.

13.
Aims: Preserving and restoring Tamarix ramosissima is urgently required in the Tarim Basin, Northwest China. Species distribution models are regularly used to predict the biogeographical distribution of species in conservation and other management activities. However, the uncertainty in the data and models inevitably reduces their prediction power. The major purpose of this study is to assess the impacts of predictor variables and species distribution models on simulating T. ramosissima distribution, to explore the relationships between predictor variables and species distribution models and to model the potential distribution of T. ramosissima in this basin. Methods: Three models (the generalized linear model (GLM), classification and regression tree (CART) and Random Forests) were selected and were processed on the BIOMOD platform. The presence/absence data of T. ramosissima in the Tarim Basin, which were calculated from vegetation maps, were used as response variables. Climate, soil and digital elevation model (DEM) data variables were divided into four datasets and then used as predictors. The four datasets were (i) climate variables, (ii) soil, climate and DEM variables, (iii) principal component analysis (PCA)-based climate variables and (iv) PCA-based soil, climate and DEM variables. Important findings: The results indicate that predictive variables for species distribution models should be chosen carefully, because too many predictors can reduce the prediction power. The effectiveness of using PCA to reduce the correlation among predictors and enhance the modelling power depends on the chosen predictor variables and models. Our results implied that it is better to reduce the correlated predictors before model processing. The Random Forests model was more precise than the GLM and CART models. The best model for T. ramosissima was the Random Forests model with climate predictors alone. Soil variables considered in this study could not significantly improve the model's prediction accuracy for T. ramosissima. The potential distribution area of T. ramosissima in the Tarim Basin is ~3.57 × 10⁴ km², which has the potential to mitigate global warming and produce bioenergy through restoring T. ramosissima in the Tarim Basin.

14.
There are a large number of tomato cultivars with a wide range of morphological, chemical, nutritional and sensorial characteristics. Many factors are known to affect the nutrient content of tomato cultivars. A complete understanding of the effect of these factors would require an exhaustive experimental design, a multidisciplinary scientific approach and a suitable statistical method. Some multivariate analytical techniques such as Principal Component Analysis (PCA) or Factor Analysis (FA) have been widely applied to search for patterns in the behaviour of a data set and to reduce its dimensionality by means of a new set of uncorrelated latent variables. However, in some cases it is not useful to replace the original variables with these latent variables. In this study, the Automatic Interaction Detection (AID) algorithm and Artificial Neural Network (ANN) models were applied as alternatives to PCA, FA and other multivariate analytical techniques in order to identify the relevant phytochemical constituents for characterization and authentication of tomatoes. To prove the feasibility of the AID algorithm and ANN models for this purpose, both methods were applied to a data set with twenty-five chemical parameters analysed in 167 tomato samples from Tenerife (Spain). Each tomato sample was defined by three factors: cultivar, agricultural practice and harvest date. The tree structure of the General Linear Model linked to AID (GLM-AID) was organized into 3 levels according to the number of factors. p-Coumaric acid was the compound that allowed the tomato samples to be distinguished according to the day of harvest. More than one chemical parameter was necessary to distinguish among different agricultural practices and among the tomato cultivars. Several ANN models, with 25 and 10 input variables, were developed for the prediction of cultivar, agricultural practice and harvest date. Finally, the models with 10 input variables were chosen, with goodness of fit between 44 and 100%. The lowest fits were for the cultivar classification; this low percentage suggests that other kinds of chemical parameters should be used to identify tomato cultivars.

15.
The purpose of this study was to investigate whether visible and near-infrared (Vis-NIR) spectroscopy can be used for diagnoses of anti-phospholipid syndrome (APS). Vis-NIR spectra from 90 plasma samples [anti-phospholipid antibodies (aPLs)-positive group, n=48; aPLs-negative group, n=42] were subjected to principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) to develop multivariate models to discriminate between aPLs-positive and aPLs-negative. Both PCA and SIMCA models were further assessed by the prediction of 84 masked other determinations. The PCA model predicted successful discrimination of the masked samples with respect to aPLs-positive and aPLs-negative. The SIMCA model predicted 42 of 48 (87.5%) aPLs-positive patients and 33 of 36 (91.7%) aPLs-negative patients of Vis-NIR spectra from masked samples correctly. These results suggest that Vis-NIR spectroscopy combined with multivariate analysis could provide a promising tool to objectively diagnose APS.
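The percentages reported for the SIMCA model in abstract 15 follow directly from the stated counts, as the short check below shows. The helper name `percent_correct` is hypothetical, and the combined figure over all 84 masked samples is derived here from those counts rather than reported in the abstract.

```python
def percent_correct(correct, total):
    """Share of correctly predicted samples, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total

# Counts taken from the abstract: 42 of 48 aPLs-positive, 33 of 36 aPLs-negative.
positive_rate = percent_correct(42, 48)
negative_rate = percent_correct(33, 36)
# Derived, not reported in the abstract: all 84 masked samples together.
overall_rate = percent_correct(42 + 33, 48 + 36)

print(f"{positive_rate:.1f} {negative_rate:.1f} {overall_rate:.1f}")  # prints "87.5 91.7 89.3"
```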

16.
Due to the complexity of host-parasite relationships, discrimination between fish populations using parasites as biological tags is difficult. This study introduces, to our knowledge for the first time, random forests (RF) as a new modelling technique in the application of parasite community data as biological markers for population assignment of fish. This novel approach is applied to a dataset with a complex structure comprising 763 parasite infracommunities in population samples of Atlantic cod, Gadus morhua, from the spawning/feeding areas in five regions in the North East Atlantic (Baltic, Celtic, Irish and North seas and Icelandic waters). The learning behaviour of RF is evaluated in comparison with two other algorithms applied to class assignment problems, the linear discriminant function analysis (LDA) and artificial neural networks (ANN). The three algorithms are used to develop predictive models applying three cross-validation procedures in a series of experiments (252 models in total). The comparative approach to RF, LDA and ANN algorithms applied to the same datasets demonstrates the competitive potential of RF for developing predictive models since RF exhibited better accuracy of prediction and outperformed LDA and ANN in the assignment of fish to their regions of sampling using parasite community data. The comparative analyses and the validation experiment with a 'blind' sample confirmed that RF models performed more effectively with a large and diverse training set and a large number of variables. The discrimination results obtained for a migratory fish species with largely overlapping parasite communities reflect the high potential of RF for developing predictive models using data that are both complex and noisy, and indicate that it is a promising tool for parasite tag studies. Our results suggest that parasite community data can be used successfully to discriminate individual cod from the five different regions of the North East Atlantic studied using RF.

17.
Nuclear magnetic resonance (NMR) and Mass Spectroscopy (MS) are the two most common spectroscopic analytical techniques employed in metabolomics. The large spectral datasets generated by NMR and MS are often analyzed using data reduction techniques like Principal Component Analysis (PCA). Although rapid, these methods are susceptible to solvent and matrix effects, high rates of false positives, lack of reproducibility and limited data transferability from one platform to the next. Given these limitations, a growing trend in both NMR and MS-based metabolomics is towards targeted profiling or "quantitative" metabolomics, wherein compounds are identified and quantified via spectral fitting prior to any statistical analysis. Despite the obvious advantages of this method, targeted profiling is hindered by the time required to perform manual or computer-assisted spectral fitting. In an effort to increase data analysis throughput for NMR-based metabolomics, we have developed an automatic method for identifying and quantifying metabolites in one-dimensional (1D) proton NMR spectra. This new algorithm is capable of using carefully constructed reference spectra and optimizing thousands of variables to reconstruct experimental NMR spectra of biofluids using rules and concepts derived from physical chemistry and NMR theory. The automated profiling program has been tested against spectra of synthetic mixtures as well as biological spectra of urine, serum and cerebral spinal fluid (CSF). Our results indicate that the algorithm can correctly identify compounds with high fidelity in each biofluid sample (except for urine). Furthermore, the metabolite concentrations exhibit a very high correlation with both simulated and manually-detected values.

18.
To investigate visible and near-infrared (Vis-NIR) spectroscopy enabling chronic fatigue syndrome (CFS) diagnosis, we subjected sera from CFS patients as well as healthy donors to Vis-NIR spectroscopy. Vis-NIR spectra in the 600-1100 nm region for sera from 77 CFS patients and 71 healthy donors were subjected to principal component analysis (PCA) and soft independent modeling of class analogy (SIMCA) to develop multivariate models to discriminate between CFS patients and healthy donors. The model was further assessed by the prediction of 99 masked other determinations (54 in the healthy group and 45 in the CFS patient group). The PCA model predicted successful discrimination of the masked samples. The SIMCA model predicted 54 of 54 (100%) healthy donors and 42 of 45 (93.3%) CFS patients of Vis-NIR spectra from masked serum samples correctly. These results suggest that Vis-NIR spectroscopy for sera combined with chemometrics analysis could provide a promising tool to objectively diagnose CFS.

19.
Face recognition has emerged as the fastest growing biometric technology and has expanded considerably in the last few years. Many new algorithms and commercial systems have been proposed and developed. Most of them use Principal Component Analysis (PCA) as a base for their techniques. Different and even conflicting results have been reported by researchers comparing these algorithms. The purpose of this study is to provide an independent comparative analysis, considering both performance and computational complexity, of six appearance-based face recognition algorithms, namely PCA, 2DPCA, A2DPCA, (2D)²PCA, LPP and 2DLPP, under equal working conditions. This study was motivated by the lack of an unbiased, comprehensive comparative analysis of some recent subspace methods with diverse distance metric combinations. For comparison with other studies, the FERET, ORL and YALE databases have been used, with the evaluation criteria of the FERET evaluations, which closely simulate real-life scenarios. A comparison of results with previous studies is performed and anomalies are reported. An important contribution of this study is that it presents the suitable performance conditions for each of the algorithms under consideration.

20.
Zou Yingbin, Mi Xiangcheng, Shi Jicheng. Acta Ecologica Sinica (《生态学报》), 2004, 24(12): 2967-2972
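Using rice population tillering dynamics as an example, this study trained and simulated a back-propagation (BP) artificial neural network model of rice growth with cross-validation and independent validation, and compared the results with an accumulated-temperature statistical model, a basic dynamic model and a compound tillering model of rice population tillering. The results show that the neural network model has some extrapolation ability, but that this ability depends on a large number of training samples. The neural network model fits well because it has many model parameters; training it therefore requires a large number of training samples to keep its parameters from overfitting. For a neural network model to extrapolate, the minimum number of training samples should be more than 6.75 times and less than 13.5 times the number of network parameters. Therefore, when a neural network model includes many input variables, techniques such as principal component analysis and correspondence analysis can be used to condense the information in the input variables and correspondingly reduce the number of network parameters. On the other hand, when training samples are insufficient, it is best to use the neural network model only to simulate conditions within the same system, and to extrapolate with caution. Neural network models give crop-simulation researchers a "point-and-shoot" tool: for agricultural researchers unfamiliar with mathematical modelling, artificial neural networks can stand in for mathematical modelling in simulation experiments; for researchers proficient in mathematical modelling, they are at least a complementary nonlinear data-processing method that can serve as a point of comparison.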
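The sample-size rule in abstract 20 (a minimum training set between 6.75 and 13.5 times the number of network parameters) can be turned into a quick calculation. The sketch below assumes a single-hidden-layer BP network with one bias per hidden and output unit; the abstract does not state the architecture, so the parameter count, the function names and the layer sizes in the example are hypothetical.

```python
def count_bp_parameters(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a single-hidden-layer network (assumed architecture)."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

def minimum_sample_bounds(n_parameters, low=6.75, high=13.5):
    """Lower and upper bound on the minimum training-set size, per the abstract's multipliers."""
    return low * n_parameters, high * n_parameters

# Hypothetical network: 4 inputs, 3 hidden units, 1 output.
full = count_bp_parameters(4, 3, 1)        # 19 parameters
full_bounds = minimum_sample_bounds(full)  # (128.25, 256.5)

# Same network after condensing the 4 inputs into 2 components, e.g. by PCA.
reduced = count_bp_parameters(2, 3, 1)           # 13 parameters
reduced_bounds = minimum_sample_bounds(reduced)  # (87.75, 175.5)
```

The comparison illustrates the abstract's recommendation: condensing the inputs shrinks the parameter count, and with it the number of training samples the rule calls for.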

