Similar literature
1.

Background  

The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.

2.
SUMMARY: The development of statistical models linking the molecular state of a cell to its physiology is one of the most important tasks in the analysis of Functional Genomics data. Because of the large number of variables measured, a comprehensive evaluation of variable subsets cannot be performed with available computational resources. It follows that an efficient variable selection strategy is required. However, although software packages for performing univariate variable selection are available, a comprehensive software environment to develop and evaluate multivariate statistical models using a multivariate variable selection strategy is still needed. In order to address this issue, we developed GALGO, an R package based on a genetic algorithm variable selection strategy, primarily designed to develop statistical models from large-scale datasets.

3.
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data calls for the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by a factor of 10^3, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine, SVM) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing the stability and predictive value of the resulting biomarker list is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
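The selection-bias point can be pictured with a small sketch: feature ranking is repeated inside every resampling split, on training samples only. This is a minimal illustration in plain Python, assuming a hypothetical mean-difference ranking and a nearest-centroid classifier as stand-ins for the SVM and ranking components named in the abstract; none of the function names come from the reviewed protocol.

```python
from statistics import mean

def rank_features(X, y, k):
    """Return indices of the k features with the largest absolute
    difference in class means (a stand-in for RFE or I-Relief)."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(x[j] for x, label in zip(X, y) if label == 0)
        m1 = mean(x[j] for x, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def centroid_predict(X_train, y_train, x, feats):
    """Nearest-centroid prediction restricted to the selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[j] for r in rows) for j in feats]
        dist = sum((x[j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy with ranking repeated inside each split,
    so the held-out sample never influences which features are kept."""
    hits = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        feats = rank_features(X_tr, y_tr, k)  # training data only
        hits += centroid_predict(X_tr, y_tr, X[i], feats) == y[i]
    return hits / len(X)
```

Ranking once on all samples and then cross-validating only the classifier would let every held-out sample influence the feature list, which is the leakage the abstract warns about.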

4.
MOTIVATION: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method to select top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure. RESULTS: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select 'hub' genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than the univariate feature selection based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant types of models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show that false negative prognosis can be controlled via a cost-adjusted discriminant function. AVAILABILITY: R code for the ELDA algorithm is available from the author upon request.

5.
In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate a more reliable classifier and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques, which belong to the family of embedded feature selection methods, for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data.
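To make "embedded" selection concrete, here is a minimal sketch of how an L1 penalty sets coefficients exactly to zero. It assumes a squared-error (lasso regression) objective solved by cyclic coordinate descent, chosen only for brevity; the review covers classification objectives and several penalties, and the function names here are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.
    Assumes no feature column is identically zero."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of feature j with the residual that ignores feature j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / scale
    return b
```

Features whose correlation with the residual stays below `lam` receive a coefficient of exactly zero, so selection happens as a by-product of fitting rather than as a separate step.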

6.
A new multiple trait strategy based on discriminant analysis was studied for efficient detection of linked QTL in outbred sib families, in comparison with a multivariate likelihood technique. The discriminant analysis technique describes the segregation of a linear combination of the traits in a univariate likelihood. This combination is calculated for each pair of positions depending on the inheritance of the pairs of QTL haplotypes in the progeny. The gains in power and accuracy for position estimations of multiple trait methods in grid searches were evaluated in reference to single trait detections of linked QTL. The methods were applied to simulated designs with two correlated traits submitted to various effects from the linked QTL. Multiple trait strategies were generally more powerful and accurate than the single trait technique. Linked QTL were distinguished when they were separated enough to identify informative recombinations: at least two genetic markers and 25 cM between the QTL under the simulated conditions. Except in a particular case, discriminant analysis was at least as powerful as the multivariate technique and its implementation was five times faster. Combining the advantages from both methodologies, we finally propose a complete strategy for rapid and efficient systematic multivariate detections in outbred populations.

7.
Continuous pharmaceutical manufacturing processes are of increased industrial interest and require uni- and multivariate Process Analytical Technology (PAT) data from different unit operations to be aligned and explored within the Quality by Design (QbD) context. Real-time pharmaceutical process verification is accomplished by monitoring univariate (temperature, pressure, etc.) and multivariate (spectra, images, etc.) process parameters and quality attributes, to provide an accurate state estimation of the process, required for advanced control strategies. This paper describes the development and use of such tools for a continuous hot melt extrusion (HME) process, monitored with generic sensors and a near-infrared (NIR) spectrometer in real-time, using SIPAT (Siemens platform to collect, display, and extract process information) and additional components developed as needed. The IT architecture of such a monitoring procedure based on uni- and multivariate sensor systems and their integration in SIPAT is shown. SIPAT aligned spectra from the extrudate (in the die section) with univariate measurements (screw speed, barrel temperatures, material pressure, etc.). A multivariate supervisory quality control strategy was developed for the process to monitor the hot melt extrusion process on the basis of principal component analysis (PCA) of the NIR spectra. Monitoring the first principal component and the time-aligned reference feed rate enables the determination of the residence time in real-time.

8.
MOTIVATION: Protein expression profiling for differences indicative of early cancer holds promise for improving diagnostics. Due to the high dimensionality of proteomic data from mass spectrometers, their statistical analysis is challenging in many respects, such as dimension reduction, feature subset selection and the construction of classification rules. The search for an optimal feature subset, commonly known as the feature subset selection (FSS) problem, is an important step towards disease classification/diagnostics with biomarkers. METHODS: We develop a parsimonious threshold-independent feature selection (PTIFS) method based on the concept of area under the curve (AUC) of the receiver operating characteristic (ROC). To reduce computational complexity to a manageable level, we use a sigmoid approximation to the empirical AUC as the criterion function. Starting from an anchor feature, the PTIFS method selects a feature subset through an iterative updating algorithm. Highly correlated features that have similar discriminating power are precluded from being selected simultaneously. The classification rule is then determined from the resulting feature subset. RESULTS: The performance of the proposed approach is investigated by extensive simulation studies, and by applying the method to two mass spectrometry data sets of prostate cancer and of liver cancer. We compare the new approach with the threshold gradient descent regularization (TGDR) method. The results show that our method can achieve comparable performance to that of the TGDR method in terms of disease classification, but with fewer features selected. AVAILABILITY: Supplementary Material and the PTIFS implementations are available at http://staff.ustc.edu.cn/~ynyang/PTIFS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
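The idea of replacing the empirical AUC with a sigmoid surrogate can be sketched as follows. This is only an illustration of the general technique, not the PTIFS implementation: the scale parameter `sigma` and both function names are assumptions, and the iterative subset update described in the abstract is not shown.

```python
import math

def _logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def empirical_auc(pos, neg):
    """Fraction of (case, control) score pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))

def sigmoid_auc(pos, neg, sigma=0.1):
    """Smooth surrogate: the indicator 1[p > n] is replaced by a
    logistic function of (p - n) / sigma, which is differentiable."""
    total = sum(_logistic((p - n) / sigma) for p in pos for n in neg)
    return total / (len(pos) * len(neg))
```

As `sigma` shrinks, the surrogate approaches the step-function AUC; a larger `sigma` gives a smoother criterion that is easier to optimise but further from the empirical value.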

9.
Selecting relevant features is a common task in most OMICs data analysis, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) or the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement. If the same features are selected in multiple independent iterations, they are more likely to be reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and a greater stability as compared to other Random Forest-based feature selection methods.
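Stability across repeated selections can be quantified in several ways. The abstract does not say which measure the authors used, so the sketch below assumes two common choices: per-feature selection frequency and the mean pairwise Jaccard similarity of the selected lists. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def selection_frequency(runs):
    """Fraction of runs in which each feature was selected."""
    counts = Counter(f for run in runs for f in set(run))
    return {f: c / len(runs) for f, c in counts.items()}

def mean_jaccard(runs):
    """Average Jaccard similarity over all pairs of selected lists.
    Needs at least two runs, each with at least one feature."""
    pairs = list(combinations([set(r) for r in runs], 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

A feature with frequency close to 1 is chosen almost regardless of the resampling, which is the property the abstract associates with reliable biomarkers.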

10.

Background

The majority of ovarian cancer biomarker discovery efforts focus on the identification of proteins that can improve the predictive power of presently available diagnostic tests. We here show that metabolomics, the study of metabolic changes in biological systems, can also provide characteristic small molecule fingerprints related to this disease.

Results

In this work, new approaches to automatic classification of metabolomic data produced from sera of ovarian cancer patients and benign controls are investigated. The performance of support vector machines (SVM) for the classification of liquid chromatography/time-of-flight mass spectrometry (LC/TOF MS) metabolomic data, focusing on recognizing combinations or "panels" of potential metabolic diagnostic biomarkers, was evaluated. Utilizing LC/TOF MS, sera from 37 ovarian cancer patients and 35 benign controls were studied. Optimum panels of spectral features observed in positive and/or negative ion mode electrospray (ESI) MS with the ability to distinguish between control and ovarian cancer samples were selected using state-of-the-art feature selection methods such as recursive feature elimination and L1-norm SVM.

Conclusion

Three evaluation processes (leave-one-out-cross-validation, 12-fold-cross-validation, 52-20-split-validation) were used to examine the SVM models based on the selected panels in terms of their ability to differentiate control vs. disease serum samples. The statistical significance of these feature selection results was comprehensively investigated. Classification of the serum sample test set was over 90% accurate, indicating promise that the above approach may lead to the development of an accurate and reliable metabolomic-based approach for detecting ovarian cancer.
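The three evaluation schemes differ only in how the 72 sera (37 patients plus 35 controls) are partitioned. The sketch below generates such index splits in plain Python; it reads "52-20-split" as 52 training and 20 test samples, which is an interpretation of the abstract, and it omits class stratification. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists; k == n gives leave-one-out."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def holdout_split(n, n_train, seed=0):
    """Single random split into n_train training and n - n_train test indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]
```

With `n = 72`, calling `train_test_splits(72, 72)` gives leave-one-out, `train_test_splits(72, 12)` gives folds of six samples each, and `holdout_split(72, 52)` gives the single split.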

11.
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select among an exponential number of alternative gene subsets the one expected to return the highest generalization in classification. Blocking is an experimental design strategy which produces similar experimental conditions to compare alternative stochastic configurations in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a sole learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the amount of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently come up with a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection for a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that improvements take place independent of the classification algorithm used after the selection step. 
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The biological validation that follows is based on the use of the Bioconductor package GOstats in order to perform Gene Ontology statistical analysis.

12.
Phase contrast X-ray computed tomography (PCI-CT) has been demonstrated as a novel imaging technique that can visualize human cartilage with high spatial resolution and soft tissue contrast. Different textural approaches have been previously investigated for characterizing chondrocyte organization on PCI-CT to enable classification of healthy and osteoarthritic cartilage. However, the large size of feature sets extracted in such studies motivates an investigation into algorithmic feature reduction for computing efficient feature representations without compromising their discriminatory power. For this purpose, geometrical feature sets derived from the scaling index method (SIM) were extracted from 1392 volumes of interest (VOI) annotated on PCI-CT images of ex vivo human patellar cartilage specimens. The extracted feature sets were subject to linear and non-linear dimension reduction techniques as well as feature selection based on evaluation of mutual information criteria. The reduced feature set was subsequently used in a machine learning task with support vector regression to classify VOIs as healthy or osteoarthritic; classification performance was evaluated using the area under the receiver-operating characteristic (ROC) curve (AUC). Our results show that the classification performance achieved by 9-D SIM-derived geometric feature sets (AUC: 0.96 ± 0.02) can be maintained with 2-D representations computed from both dimension reduction and feature selection (AUC values as high as 0.97 ± 0.02). Thus, such feature reduction techniques can offer a high degree of compaction to large feature sets extracted from PCI-CT images while maintaining their ability to characterize the underlying chondrocyte patterns.

13.
We present our approach to classifying the processed proteomic data that were made available to the participants of the classification competition. Although classification of the spectra was the goal of the competition, we feel that proteomic applications to cancer biomarker studies make certain additional demands. For example, one such requirement should be the identification of certain features which collectively could differentiate the two groups of samples. Ideally, the size of the feature set should also be small. To that end, we propose a linear discriminant classifier based on nine m/z intensity values. Construction and performance of this classifier are discussed.

14.
Proximity based GPCRs prediction in transform domain
In this work, we predict G-protein coupled receptors (GPCRs) using hydrophobicity of amino acid sequences and Fast Fourier Transform for feature generation. We analyze whether the GPCRs classification strategy depends on the way the feature space may be exploited. Consequently, we show that the sequence pattern based information could easily be exploited in the frequency domain using proximity rather than increasing margin of separation between the classes. We thus develop a simple proximity based approach known as nearest neighbor (NN) for classifying the 17 GPCRs subfamilies. The NN classifier has outperformed the one-against-all implementation of support vector machine using both Jackknife and independent dataset tests. The results validate the importance of understanding and efficiently exploiting the feature space. They also show that simple classification strategies may outperform complex ones because of the efficient exploitation of the feature space.
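A minimal sketch of the pipeline described above: map residues to hydrophobicity values, take a magnitude spectrum, and classify by nearest neighbor. Several details are assumptions, since the abstract does not give them: the Kyte-Doolittle scale, the fixed length of 64 points with zero-padding, the Euclidean distance, and all function names. A direct DFT is used to stay within the standard library; an FFT would return the same values.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (assumed; the abstract names no scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def spectrum(seq, n_points=64):
    """Magnitude spectrum of the hydrophobicity signal, truncated or
    zero-padded to n_points. Only the 20 standard residues are supported."""
    signal = [KD[a] for a in seq[:n_points]]
    signal += [0.0] * (n_points - len(signal))
    out = []
    for k in range(n_points // 2 + 1):  # non-redundant half for a real signal
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n_points)
                    for t, x in enumerate(signal))
        out.append(abs(coeff))
    return out

def nearest_neighbor(train, query_seq):
    """train: list of (sequence, label). Returns the label whose spectrum
    is closest to the query spectrum in Euclidean distance."""
    q = spectrum(query_seq)
    return min(train, key=lambda item: math.dist(spectrum(item[0]), q))[1]
```

Working with magnitudes discards phase, so a periodic hydrophobicity pattern yields a similar feature vector wherever it starts in the sequence.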

15.
Complex networks have been extensively used in the last decade to characterize and analyze complex systems, and they have been recently proposed as a novel instrument for the analysis of spectra extracted from biological samples. Yet, the high number of measurements composing spectra, and the consequent high computational cost, make a direct network analysis unfeasible. We here present a comparative analysis of three customary feature selection algorithms, including the binning of spectral data and the use of information theory metrics. Such algorithms are compared by assessing the score obtained in a classification task, where healthy subjects and people suffering from different types of cancers should be discriminated. Results indicate that a feature selection strategy based on Mutual Information outperforms the more classical data binning, while allowing a reduction of the dimensionality of the data set by two orders of magnitude.
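The two strategies compared above can be sketched in a few lines. The mutual information estimate below is the plain plug-in estimate for discrete values, and the binning averages fixed-width blocks; the abstract does not specify how the authors discretised intensities or chose bin widths, so those choices and the function names are assumptions.

```python
import math
from collections import Counter

def mutual_information(feature_bins, labels):
    """I(X;Y) in bits for two equally long sequences of discrete values."""
    n = len(labels)
    px = Counter(feature_bins)
    py = Counter(labels)
    pxy = Counter(zip(feature_bins, labels))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def bin_spectrum(intensities, width):
    """Classical binning: average consecutive blocks of `width` intensities."""
    blocks = [intensities[i:i + width] for i in range(0, len(intensities), width)]
    return [sum(b) / len(b) for b in blocks]
```

Binning reduces dimensionality without looking at the class labels, whereas ranking by mutual information keeps the individual measurements that carry the most information about them.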

16.
Feature selection algorithms play a crucial role in identifying and discovering important genes for cancer classification. Feature selection algorithms can be broadly categorized into two main groups: filter-based methods and wrapper-based methods. Filter-based methods have been quite popular in the literature due to their many advantages, including computational efficiency, simplistic architecture, and an intuitively simple means of discovering biological and clinical aspects. However, these methods have limitations, and the classification accuracy achieved with the selected genes tends to be lower. In this paper, we propose a set of univariate filter-based methods using a between-class overlapping criterion. The proposed techniques have been compared with many other univariate filter-based methods using an acute leukemia dataset. The following properties have been examined: classification accuracy of the selected individual genes and the gene subsets; redundancy check among selected genes using ridge regression and LASSO methods; similarity and sensitivity analyses; functional analysis; and stability analysis. A comprehensive experiment shows promising results for our proposed techniques. The univariate filter-based methods using the between-class overlapping criterion are accurate and robust, have biological significance, and are computationally efficient and easy to implement. Therefore, they are well suited for biological and clinical discoveries.

17.
OBJECTIVE: To test if a combination of biomarkers can increase the classification power of autoantibodies to cyclic citrullinated peptides (anti-CCP) in the diagnosis of rheumatoid arthritis (RA) depending on the diagnostic situation. METHODS: Biomarkers were subject to three inclusion/exclusion criteria (discrimination between RA patients and healthy blood donors, ability to identify anti-CCP-negative RA patients, specificity in a panel with major non-rheumatological diseases) before univariate ranking and multivariate analysis were carried out using a modelling panel (n = 906). To enable the evaluation of the classification power in different diagnostic settings, the disease controls (n = 542) were weighted according to the admission rates in rheumatology clinics, modelling a clinic panel, or according to the relative prevalences of musculoskeletal disorders in the general population seen by general practitioners, modelling a GP panel. RESULT: Out of 131 biomarkers considered originally, we evaluated 32 biomarkers in this study, of which only seven passed the three inclusion/exclusion criteria and were combined by multivariate analysis using four different mathematical models. In the modelled clinic panel, anti-CCP was the lead marker with a sensitivity of 75.8% and a specificity of 94.0%. Due to the lack of specificity of the markers other than anti-CCP in this diagnostic setting, any gain in sensitivity by any marker combination is offset by a corresponding loss in specificity. In the modelled GP panel, the best marker combination of anti-CCP and interleukin (IL)-6 resulted in a sensitivity gain of 7.6% (85.9% vs. 78.3%) at a minor loss in specificity of 1.6% (90.3% vs. 91.9%) compared with anti-CCP as the best single marker. CONCLUSION: Depending on the composition of the sample panel, anti-CCP alone or anti-CCP in combination with IL-6 has the highest classification power for the diagnosis of established RA.

18.
Image cytometry of DNA distribution in fine needle biopsies of breast carcinomas at first diagnosis was performed to see if there were significant differences in DNA histograms between patients having very different outcomes but the same tumor histological typing and similar therapy. Two groups of patients were considered retrospectively: the first (20 patients) with survival time shorter than 5 years and the second (20 patients) with survival time longer than 10 years. Seven benign tumors were used as controls. Ten ploidy classes were defined. The frequencies of cells in those classes were used as independent features in a supervised multivariate analysis. The advantages of this approach were pointed out with respect to the four-type classification of Auer. The scattering of DNA histograms within the feature space showed that a subgroup of patients with poor prognosis was clearly separated from a subgroup of patients with good prognosis, but both long survival patients and short survival patients were scattered in between. In order to replace the multivariate classification of histograms by a simpler approach, two parameters were computed which explained most of the scattering in the feature space: the ploidy balance (difference between the percentages of euploid and aneuploid cells) and the proliferation index (percentage of cells between peaks). The scattergram of patients according to these parameters showed again that some DNA distributions were specific for either good or bad prognosis. But the separation was uncertain for seven short-survival patients and six long-survival patients. For six patients, the DNA distributions were very similar between long and short survival times. Those patients thus could not be separated even by means of discriminant analysis. 
The main conclusion of this study was that, for a significant number of patients, the objective multivariate classification of tumor DNA profiles is of little assistance to the pathologist who has to give a prognosis for the one patient under consideration.

19.
This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples of which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. The question which classification strategy works best is yet unanswered. Validation is a crucial step for biomarker leads towards clinical use. Here we only discuss statistical validation, recognizing that biological and clinical validation is of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop that is wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific for a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; these data analysis aspects are all discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is not one unique way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery. 
For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.
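The recommended combination of cross-validation and permutation testing can be sketched as below. The scoring function here is a hypothetical leave-one-out 1-nearest-neighbour accuracy on one-dimensional data, chosen only to keep the example self-contained; any cross-validated performance estimate could be passed in its place, and the function names are not taken from the review.

```python
import random

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on 1-D data."""
    hits = 0
    for i, xi in enumerate(X):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: abs(X[k] - xi))
        hits += y[j] == y[i]
    return hits / len(X)

def permutation_p_value(score_fn, X, y, n_perm=200, seed=0):
    """Share of label permutations whose score reaches the observed score.
    score_fn(X, y) should return a cross-validated performance estimate."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    at_least = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # breaks any real link between data and labels
        if score_fn(X, y_perm) >= observed:
            at_least += 1
    # +1 in numerator and denominator counts the observed labelling itself
    return (at_least + 1) / (n_perm + 1)
```

Because the whole cross-validation is rerun for every shuffled labelling, a small p-value indicates that the observed performance is unlikely under uninformative labels, which also serves as a check on the validation procedure itself.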

20.
Contemporary protein microarrays such as the ProtoArray® are used for autoimmune antibody screening studies to discover biomarker panels. For ProtoArray data analysis, the software Prospector and a default workflow are suggested by the manufacturer. While analyzing a large data set of a discovery study for diagnostic biomarkers of Parkinson's disease (ParkCHIP), we have revealed the need for distinct improvements of the suggested workflow concerning raw data acquisition, normalization and preselection method availability, batch effects, feature selection, and feature validation. In this work, appropriate improvements of the default workflow are proposed. It is shown that completely automatic data acquisition as a batch, a re-implementation of Prospector's pre-selection method, multivariate or hybrid feature selection, and validation of the selected protein panel using an independent test set define in combination an improved workflow for large studies.
