Similar literature
1.
In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifiers and to provide more insight into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques--which belong to the family of embedded feature selection methods--for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data.
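As a rough illustration of how a penalty can perform embedded feature selection, the sketch below fits an L1-penalized logistic regression by proximal gradient descent. It is not taken from the reviewed article: the function names, the toy data, the step size and the penalty weight `lam` are all assumptions chosen for the example, and the model omits an intercept for brevity.

```python
import math

def soft_threshold(value, amount):
    """Shrink value toward zero by amount; anything within amount of zero becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.5, n_iter=500):
    """Proximal gradient descent on mean logistic loss + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            prob = 1.0 / (1.0 + math.exp(-z))
            for j in range(p):
                grad[j] += (prob - yi) * xi[j] / n
        # Gradient step on the loss, then the L1 proximal step (soft-thresholding).
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: feature 0 carries the class signal, feature 1 does not.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]

weights = fit_l1_logistic(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("weights:", weights)
print("selected features:", selected)
```

The soft-thresholding step is what sets weights exactly to zero, so the features that keep a nonzero weight are the selected ones. Raising `lam` selects fewer features; a large enough value removes all of them.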

2.
Osteosarcoma (OS) is the most common malignant bone tumor in children. To identify a plasma proteomic signature that can detect OS, we used SELDI MS to perform proteomic profiling on plasma specimens from 29 OS and 20 age-matched osteochondroma (OC) patients. Nineteen statistically significant ion peaks that were differentially expressed in OS when compared with OC patients were identified (p < 0.001 and false discovery rate < 10%). Using the proteomic profiles, we constructed a multivariate 3-nearest neighbors classifier to distinguish OS from OC patients with a sensitivity of 97% and a specificity of 80% based on external leave-one-out cross-validation. A permutation test showed that the classification result was statistically significant (p < 0.00005). One of the proteins (m/z 11 704) in the proteomic signature was identified as serum amyloid protein A (SAA) by PMF. The higher plasma level of SAA in OS patients was further validated by Western blotting, with OC patients and normal subjects as reference. The classifier based on this plasma proteomic signature may be useful to differentiate malignant bone cancer from benign bone tumors and for early detection of OS in high-risk individuals.
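The evaluation scheme described above can be sketched as follows: a 3-nearest-neighbors classifier scored by leave-one-out cross-validation, reporting sensitivity and specificity. This is a minimal sketch under stated assumptions, not the authors' pipeline. The two-feature toy data are invented, and the sketch leaves out the peak selection that an "external" cross-validation would repeat inside every fold.

```python
def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to query (squared Euclidean distance)."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = [label for _, label in by_distance[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0

def leave_one_out(X, y, k=3):
    """Hold out each sample once, predict it from the rest, return (sensitivity, specificity)."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        pred = knn_predict(rest_X, rest_y, X[i], k)
        if y[i] == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical peak intensities: label 1 = case, label 0 = control.
X = [[5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3],
     [1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.2, 1.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

sensitivity, specificity = leave_one_out(X, y)
print("sensitivity:", sensitivity, "specificity:", specificity)
```

On real data, any feature selection step belongs inside the loop, applied only to `rest_X` and `rest_y`, so that the held-out sample never influences which peaks are chosen.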

3.

Background

The goal of personalized medicine is to provide patients optimal drug screening and treatment based on individual genomic or proteomic profiles. Reverse-Phase Protein Array (RPPA) technology offers proteomic information of cancer patients which may be directly related to drug sensitivity. For cancer patients with different drug sensitivity, the proteomic profiling reveals important pathophysiologic information which can be used to predict chemotherapy responses.

Results

The goal of this paper is to present a framework for personalized medicine using both RPPA and drug sensitivity (drug resistance or intolerance). In the proposed personalized medicine system, the prediction of drug sensitivity is obtained by a proposed augmented naive Bayesian classifier (ANBC) whose edges between attributes are augmented in the network structure of naive Bayesian classifier. For discriminative structure learning of ANBC, local classification rate (LCR) is used to score augmented edges, and greedy search algorithm is used to find the discriminative structure that maximizes classification rate (CR). Once a classifier is trained by RPPA and drug sensitivity using cancer patient samples, the classifier is able to predict the drug sensitivity given RPPA information from a patient.

Conclusion

In this paper we proposed a framework for personalized medicine where a patient is profiled by RPPA and drug sensitivity is predicted by ANBC and LCR. Experimental results with lung cancer data demonstrate that RPPA can be used to profile patients for drug sensitivity prediction by a Bayesian network classifier, and that the proposed ANBC for personalized cancer medicine achieves better prediction accuracy on average than the naive Bayes classifier on small-sample data and outperforms other state-of-the-art classifiers in terms of classification accuracy.

4.
There is an increasing demand to develop cost-effective and accurate approaches to analyzing biological tissue samples. This is especially relevant in the fishing industry where closely related fish samples can be mislabeled, and the high market value of certain fish leads to the use of alternative species as substitutes, for example, Barramundi and Nile Perch (belonging to the same genus, Lates). There is a need to combine selective proteomic datasets with sophisticated computational analysis to devise a robust classification approach. This paper describes an integrated MS-based proteomics and bioinformatics approach to classifying a range of fish samples. A classifier is developed using training data that successfully discriminates between Barramundi and Nile Perch samples using a selected protein subset of the proteome. Additionally, the classifier is shown to successfully discriminate between test samples not used to develop the classifier, including samples that have been cooked, and to classify other fish species as neither Barramundi nor Nile Perch. This approach has applications to truth in labeling for fishmongers and restaurants, monitoring fish catches, and for scientific research into distances between species.

5.
A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from this data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims. Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.

6.
In clinical diagnostics, it is of utmost importance to correctly identify the source of a metastatic tumor, especially if no apparent primary tumor is present. Tissue-based proteomics might allow correct tumor classification. We therefore performed MALDI imaging to generate proteomic signatures for different tumors. These signatures were used to classify common cancer types. At first, a cohort comprised of tissue samples from six adenocarcinoma entities located at different organ sites (esophagus, breast, colon, liver, stomach, thyroid gland, n = 171) was classified using two algorithms for a training and test set. For the test set, Support Vector Machine and Random Forest yielded overall accuracies of 82.74 and 81.18%, respectively. Then, colon cancer liver metastasis samples (n = 19) were introduced into the classification. The liver metastasis samples could be discriminated with high accuracy from primary tumors of colon cancer and hepatocellular carcinoma. Additionally, colon cancer liver metastasis samples could be successfully classified by using colon cancer primary tumor samples for the training of the classifier. These findings demonstrate that MALDI imaging-derived proteomic classifiers can discriminate between different tumor types at different organ sites and in the same site.

7.
HER2-testing in breast and gastric cancers is mandatory for the treatment with trastuzumab. We hypothesized that imaging mass spectrometry (IMS) of breast cancers may be useful for generating a classifier that may determine HER2-status in other cancer entities irrespective of primary tumor site. A total of 107 breast (n = 48) and gastric (n = 59) cryo tissue samples were analyzed by IMS (HER2 was present in 29 cases). The obtained proteomic profiles were used to create HER2 prediction models using different classification algorithms. A breast cancer proteome derived classifier, with HER2 present in 15 cases, correctly predicted HER2-status in gastric cancers with a sensitivity of 65% and a specificity of 92%. To create a universal classifier for HER2-status, breast and nonbreast cancer samples were combined, which increased sensitivity to 78%, and specificity was 88%. Our proof of principle study provides evidence that HER2-status can be identified on a proteomic level across different cancer types, suggesting that HER2 overexpression may constitute a unique molecular event independent of the tumor site. Furthermore, these results indicate that IMS may be useful for the determination of potential druggable targets, as it offers a quicker, cheaper, and more objective analysis than the standard HER2-testing procedures, immunohistochemistry and fluorescence in situ hybridization.

8.
Proteomics is more than just generating lists of proteins that increase or decrease in expression as a cause or consequence of pathology. The goal should be to characterize the information flow through the intercellular protein circuitry that communicates with the extracellular microenvironment and then ultimately to the serum/plasma macroenvironment. The nature of this information can be a cause, or a consequence, of disease and toxicity-based processes. Serum proteomic pattern diagnostics is a new type of proteomic platform in which patterns of proteomic signatures from high dimensional mass spectrometry data are used as a diagnostic classifier. This approach has recently shown tremendous promise in the detection of early-stage cancers. The biomarkers found by SELDI-TOF-based pattern recognition analysis are mostly low molecular weight fragments produced at the specific tumor microenvironment.

9.

Background  

Generally speaking, different classifiers tend to work well for certain types of data and, conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons. For example, the order of performance may depend on the performance measure employed for such a comparison. In this work, we present a novel adaptive ensemble classifier constructed by combining bagging and rank aggregation that is capable of adaptively changing its performance depending on the type of data that is being classified. The attractive feature of the proposed classifier is its multi-objective nature where the classification results can be simultaneously optimized with respect to several performance measures, for example, accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance as judged on test samples than a more naive approach that attempts to directly identify the optimal classifier based on the training data performances of the individual classifiers.
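To make the rank aggregation idea concrete, the sketch below combines classifier rankings from several performance measures with a simple Borda count. The Borda count is a stand-in chosen for brevity and is not claimed to be the aggregation scheme of the paper. The classifier names and scores are invented, and the bagging step that would produce such scores on out-of-bag samples is omitted.

```python
def borda_aggregate(scores):
    """Combine per-measure rankings into one overall ranking.

    scores maps measure name -> {classifier name: value}, higher is better.
    Each measure awards (n - 1) points to its best classifier down to 0 for its worst.
    Ties in total points are broken alphabetically by classifier name.
    """
    points = {}
    for per_classifier in scores.values():
        ranked = sorted(per_classifier, key=per_classifier.get, reverse=True)
        for position, name in enumerate(ranked):
            points[name] = points.get(name, 0) + (len(ranked) - 1 - position)
    return sorted(points, key=lambda name: (-points[name], name))

# Hypothetical scores for three candidate classifiers on three measures.
toy_scores = {
    "accuracy":    {"svm": 0.90, "rf": 0.88, "lda": 0.80},
    "sensitivity": {"svm": 0.85, "rf": 0.92, "lda": 0.70},
    "specificity": {"svm": 0.93, "rf": 0.84, "lda": 0.88},
}

ranking = borda_aggregate(toy_scores)
print("aggregated ranking:", ranking)
```

In these toy scores no classifier is best on every measure, which is the situation where choosing by a single measure can mislead and an aggregated ranking is useful.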

10.
In this study, a novel spatial filter design method is introduced. Spatial filtering is an important processing step for feature extraction in motor imagery-based brain-computer interfaces. This paper introduces a new motor imagery signal classification method combined with spatial filter optimization. We simultaneously train the spatial filter and the classifier using a neural network approach. The proposed spatial filter network (SFN) is composed of two layers: a spatial filtering layer and a classifier layer. These two layers are linked to each other with non-linear mapping functions. The proposed method addresses two shortcomings of the common spatial patterns (CSP) algorithm. First, CSP aims to maximize the between-classes variance while ignoring the minimization of within-classes variances. Consequently, the features obtained using the CSP method may have large within-classes variances. Second, the maximizing optimization function of CSP increases the classification accuracy indirectly because an independent classifier is used after the CSP method. With SFN, we aimed to maximize the between-classes variance while minimizing within-classes variances and simultaneously optimizing the spatial filter and the classifier. To classify motor imagery EEG signals, we modified the well-known feed-forward structure and derived forward and backward equations that correspond to the proposed structure. We tested our algorithm on simple toy data. Then, we compared the SFN with conventional CSP and its multi-class version, called one-versus-rest CSP, on two data sets from BCI competition III. The evaluation results demonstrate that SFN is a good alternative for classifying motor imagery EEG signals with increased classification accuracy.

11.
Monte Carlo feature selection for supervised classification
MOTIVATION: Pre-selection of informative features for supervised classification is a crucial, albeit delicate, task. It is desirable that feature selection provides the features that contribute most to the classification task per se and which should therefore be used by any classifier later used to produce classification rules. In this article, a conceptually simple but computer-intensive approach to this task is proposed. The reliability of the approach rests on multiple construction of a tree classifier for many training sets randomly chosen from the original sample set, where samples in each training set consist of only a fraction of all of the observed features. RESULTS: The resulting ranking of features may then be used to advantage for classification via a classifier of any type. The approach was validated using Golub et al. leukemia data and the Alizadeh et al. lymphoma data. Not surprisingly, we obtained a significantly different list of genes. Biological interpretation of the genes selected by our method showed that several of them are involved in precursors to different types of leukemia and lymphoma rather than being genes that are common to several forms of cancers, which is the case for the other methods. AVAILABILITY: Prototype available upon request.
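The resampling idea described above can be sketched as follows. Each round draws a random subset of features and a random train/test split, then credits each feature with its gain in test accuracy over a majority-class baseline. This is a simplified illustration, not the published algorithm: a one-feature threshold rule (a decision stump) stands in for the tree classifier, and the scoring rule, function names, parameters and toy data are all assumptions.

```python
import random

def stump_accuracy(X, y, feature, cut, above):
    """Accuracy of the rule: predict `above` if x[feature] > cut, else the other class."""
    hits = sum((above if row[feature] > cut else 1 - above) == label
               for row, label in zip(X, y))
    return hits / len(y)

def fit_stump(X, y, feature):
    """Return (cut, above) for the one-feature threshold rule with the best training accuracy."""
    values = sorted({row[feature] for row in X})
    cuts = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates = [(cut, above) for cut in cuts for above in (0, 1)]
    return max(candidates, key=lambda c: stump_accuracy(X, y, feature, c[0], c[1]))

def monte_carlo_importance(X, y, subset_size, n_rounds, seed=0):
    """Accumulate, per feature, the test-accuracy gain over the majority-class baseline."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    scores = [0.0] * p
    for _ in range(n_rounds):
        features = rng.sample(range(p), subset_size)  # random feature subset
        order = list(range(n))
        rng.shuffle(order)                            # random train/test split
        split = (2 * n) // 3
        train, test = order[:split], order[split:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        baseline = max(y_te.count(0), y_te.count(1)) / len(y_te)
        for f in features:
            cut, above = fit_stump(X_tr, y_tr, f)
            gain = stump_accuracy(X_te, y_te, f, cut, above) - baseline
            scores[f] += max(0.0, gain)
    return scores

# Hypothetical data: feature 0 separates the classes; features 1-3 are constant (uninformative).
X = [[0.1 * i, 0.0, 0.0, 0.0] for i in range(10)] + \
    [[2.0 + 0.1 * i, 0.0, 0.0, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10

scores = monte_carlo_importance(X, y, subset_size=2, n_rounds=200)
ranking = sorted(range(len(scores)), key=lambda f: -scores[f])
print("importance scores:", scores)
```

Because every feature is scored only in the rounds where it was drawn, the ranking reflects performance averaged over many feature subsets and many training sets rather than a single fit.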

12.
I present an architecture for acoustic pattern classification using trinary-trinary template correlation. In spite of its computational simplicity, the algorithm and architecture represent a method which greatly reduces bandwidth of the input, storage requirements of the classifier memory, and power consumption of the system without compromising classification accuracy. The linear system should be amenable to training using recently developed methods such as Independent Component Analysis (ICA), and I predict that behavior will be qualitatively similar to that of structures in the auditory cortex.

13.
Proposed molecular classifiers may be overfit to idiosyncrasies of noisy genomic and proteomic data. Cross-validation methods are often used to obtain estimates of classification accuracy, but both simulations and case studies suggest that, when inappropriate methods are used, bias may ensue. Bias can be bypassed and generalizability can be tested by external (independent) validation. We evaluated 35 studies that have reported on external validation of a molecular classifier. We extracted information on study design and methodological features, and compared the performance of molecular classifiers in internal cross-validation versus external validation for 28 studies where both had been performed. We demonstrate that the majority of studies pursued cross-validation practices that are likely to overestimate classifier performance. Most studies were markedly underpowered to detect a 20% decrease in sensitivity or specificity between internal cross-validation and external validation [median power was 36% (IQR, 21-61%) and 29% (IQR, 15-65%), respectively]. The median reported classification performance for sensitivity and specificity was 94% and 98%, respectively, in cross-validation and 88% and 81% for independent validation. The relative diagnostic odds ratio was 3.26 (95% CI 2.04-5.21) for cross-validation versus independent validation. Finally, we reviewed all studies (n = 758) which cited those in our study sample, and identified only one instance of additional subsequent independent validation of these classifiers. In conclusion, these results document that many cross-validation practices employed in the literature are potentially biased and genuine progress in this field will require adoption of routine external validation of molecular classifiers, preferably in much larger studies than in current practice.

14.
It is important to understand the cause of amyloid illnesses by predicting the short protein fragments capable of forming amyloid-like fibril motifs, aiding in the discovery of sequence-targeted anti-aggregation drugs. It is extremely desirable to design computational tools to provide affordable in silico predictions owing to the limitations of molecular techniques for their identification. In this research article, we study, from a machine learning perspective, the performance of several machine learning classifiers that use heterogeneous features based on biochemical and biophysical properties of amino acids to discriminate between amyloidogenic and non-amyloidogenic regions in peptides. Four conventional machine learning classifiers, namely Support Vector Machine, Neural network, Decision tree and Random forest, were trained and tested to find the classifier that best fits the problem domain. Prior to classification, novel implementations of two biologically-inspired feature optimization techniques based on evolutionary algorithms and methodologies that mimic social life, and a multivariate method based on projection, are utilized in order to remove the unimportant and uninformative features. Among the dimensionality reduction algorithms considered in the study, prediction results show that algorithms based on evolutionary computation are the most effective. Among the classifiers considered, SVM best suits the problem domain. The best classifier is also compared with an online predictor to show the balance maintained between true positive rates and false positive rates in the proposed classifier. This exploratory study suggests that these methods are promising in providing amyloidogenicity prediction and may be further extended for large-scale proteomic studies.

15.
16.
We present a novel computational method for predicting which proteins from highly and abnormally expressed genes in diseased human tissues, such as cancers, can be secreted into the bloodstream, suggesting possible marker proteins for follow-up serum proteomic studies. A major challenge in tackling this problem is that our understanding of the downstream localization after proteins are secreted outside the cells is very limited and not sufficient to provide useful hints about secretion to the bloodstream. To bypass this difficulty, we have taken a data mining approach by first collecting, through extensive literature searches, human proteins that are known to be secreted into the bloodstream due to various pathological conditions as detected by previous proteomic studies, and then asking the question: 'what do these secreted proteins have in common in terms of their physical and chemical properties, amino acid sequence and structural features that can be used to predict them?' We have identified a list of features, such as signal peptides, transmembrane domains, glycosylation sites, disordered regions, secondary structural content, hydrophobicity and polarity measures that show relevance to protein secretion. Using these features, we have trained a support vector machine-based classifier to predict protein secretion to the bloodstream. On a large test set containing 98 secretory proteins and 6601 non-secretory human proteins, our classifier achieved approximately 90% prediction sensitivity and approximately 98% prediction specificity. Several additional datasets are used to further assess the performance of our classifier. On a set of 122 proteins that were found to be of abnormally high abundance in human blood due to various cancers, our program predicted 62 as blood-secreted proteins. By applying our program to abnormally highly expressed genes in gastric cancer and lung cancer tissues detected through microarray gene expression studies, we predicted 13 and 31 as blood secreted, respectively, suggesting that they could serve as potential biomarkers for these two cancers. Our study demonstrated that our method can provide highly useful information to link genomic and proteomic studies for disease biomarker discovery. Our software can be accessed at http://csbl1.bmb.uga.edu/cgi-bin/Secretion/secretion.cgi.

17.

Objectives

Epidermal growth factor receptor (EGFR) gene mutations in tumors predict tumor response to EGFR tyrosine kinase inhibitors (EGFR-TKIs) in non-small-cell lung cancer (NSCLC). However, obtaining tumor tissue for mutation analysis is challenging. Here, we aimed to detect serum peptides/proteins associated with EGFR gene mutation status, and test whether a classification algorithm based on serum proteomic profiling could be developed to analyze EGFR gene mutation status to aid therapeutic decision-making.

Patients and Methods

Serum collected from 223 stage IIIB or IV NSCLC patients with known EGFR gene mutation status in their tumors prior to therapy was analyzed by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) and ClinProTools software. Differences in serum peptides/proteins between patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes were detected in a training group of 100 patients; based on this analysis, a serum proteomic classification algorithm was developed to classify EGFR gene mutation status and tested in an independent validation group of 123 patients. The correlation between EGFR gene mutation status, as identified with the serum proteomic classifier and response to EGFR-TKIs was analyzed.

Results

Nine peptide/protein peaks were significantly different between NSCLC patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes in the training group. A genetic algorithm model consisting of five peptides/proteins (m/z 4092.4, 4585.05, 1365.1, 4643.49 and 4438.43) was developed from the training group to separate patients with EGFR gene TKI-sensitive mutations and wild-type EGFR genes. The classifier exhibited a sensitivity of 84.6% and a specificity of 77.5% in the validation group. In the 81 patients from the validation group treated with EGFR-TKIs, 28 (59.6%) of 47 patients whose matched samples were labeled as “mutant” by the classifier and 3 (8.8%) of 34 patients whose matched samples were labeled as “wild” achieved an objective response (p<0.0001). Patients whose matched samples were labeled as “mutant” by the classifier had a significantly longer progression-free survival (PFS) than patients whose matched samples were labeled as “wild” (p=0.001).

Conclusion

Peptides/proteins related to EGFR gene mutation status were found in the serum. Classification of EGFR gene mutation status using the serum proteomic classifier established in the present study in patients with stage IIIB or IV NSCLC is feasible and may predict tumor response to EGFR-TKIs.

18.

Background  

Overfitting the data is a salient issue for classifier design in small-sample settings. This is why selecting a classifier from a constrained family of classifiers, ones that do not possess the potential to too finely partition the feature space, is typically preferable. But overfitting is not merely a consequence of the classifier family; it is highly dependent on the classification rule used to design a classifier from the sample data. Thus, it is possible to consider families that are rather complex but for which there are classification rules that perform well for small samples. Such classification rules can be advantageous because they facilitate satisfactory classification when the class-conditional distributions are not easily separated and the sample is not large. Here we consider neural networks, from the perspectives of classical design based solely on the sample data and from noise-injection-based design.

19.
This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples for which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. The question of which classification strategy works best is as yet unanswered. Validation is a crucial step for biomarker leads towards clinical use. Here we only discuss statistical validation, recognizing that biological and clinical validation is of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop that is wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific for a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; these data analysis aspects are all discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is not one unique way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery. For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.
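The recommended combination of cross-validation and permutation testing can be sketched as follows. The whole cross-validated procedure is rerun on label-shuffled copies of the data, and the p-value is the share of shuffles that score at least as well as the real labels. This is a minimal sketch, not the framework of the review: the nearest-centroid classifier, the leave-one-out scheme, the toy data and the number of permutations are assumptions for illustration.

```python
import random

def nearest_centroid_predict(train_X, train_y, query):
    """Assign query to the class whose training mean is closest (squared Euclidean distance)."""
    best_label, best_dist = None, None
    for label in sorted(set(train_y)):
        members = [x for x, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y):
    """Leave-one-out cross-validated accuracy; the model is rebuilt for every held-out sample."""
    hits = 0
    for i in range(len(X)):
        pred = nearest_centroid_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        hits += int(pred == y[i])
    return hits / len(X)

def permutation_p_value(X, y, n_permutations=200, seed=0):
    """Compare the cross-validated accuracy with its distribution under shuffled labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loo_accuracy(X, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labeling itself, so the p-value is never zero.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)

# Hypothetical two-class data with well-separated groups.
X = [[5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3],
     [1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.2, 1.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

observed, p_value = permutation_p_value(X, y)
print("cross-validated accuracy:", observed, "permutation p-value:", p_value)
```

If feature selection or parameter tuning is part of model development, it belongs inside `loo_accuracy` so that it is repeated in every fold and for every permutation; otherwise both the accuracy and the p-value are optimistic.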

20.
As systems biology approaches to virology have become more tractable, highly studied viruses such as HIV can now be analyzed in new unbiased ways, including spatial proteomics. We employed here a differential centrifugation protocol to fractionate Jurkat T cells for proteomic analysis by mass spectrometry; these cells contain inducible HIV-1 genomes, enabling us to look for changes in the spatial proteome induced by viral gene expression. Using these proteomics data, we evaluated the merits of several reported machine learning pipelines for classification of the spatial proteome and identification of protein translocations. From these analyses, we found that classifier performance in this system was organelle dependent, with Bayesian t-augmented Gaussian mixture modeling outperforming support vector machine learning for mitochondrial and endoplasmic reticulum proteins but underperforming on cytosolic, nuclear, and plasma membrane proteins by QSep analysis. We also observed a generally higher performance for protein translocation identification using a Bayesian model, Bayesian analysis of differential localization experiments, on row-normalized data. Comparative Bayesian analysis of differential localization experiment analysis of cells induced to express the WT viral genome versus cells induced to express a genome unable to express the accessory protein Nef identified known Nef-dependent interactors such as T-cell receptor signaling components and coatomer complex. Finally, we found that support vector machine classification showed higher consistency and was less sensitive to HIV-dependent noise. These findings illustrate important considerations for studies of the spatial proteome following viral infection or viral gene expression and provide a reference for future studies of HIV-gene-dropout viruses.
