Similar literature
 20 similar articles found; search time 31 ms
1.
Ovarian cancer recurs at a rate of 75%, within a few months to several years after therapy. Early recurrence, though it responds better to treatment, is difficult to detect. Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry has shown the potential to accurately identify disease biomarkers that aid early diagnosis. A major challenge in interpreting SELDI-TOF data is the high dimensionality of the feature space. To tackle this problem, we have developed a multi-step data processing method composed of a t-test, binning and backward feature selection. A new algorithm, support vector machine-Markov blanket/recursive feature elimination (SVM-MB/RFE), is presented for the backward feature selection. This method integrates minimum-weight feature elimination by SVM-RFE with information-theory-based removal of redundant/irrelevant features by Markov blanket. Subsequently, SVM was used for classification. We applied the biomarker selection algorithm to 113 serum samples to identify early relapse in ovarian cancer patients after primary therapy. To validate the performance of the proposed algorithm, experiments were carried out in comparison with several other feature selection and classification algorithms.
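
The t-test step named in this abstract can be pictured with a small filter that scores each spectral feature by a two-sample t-statistic and keeps the features that exceed a cut-off. This is only a sketch under stated assumptions: the abstract does not say which t-test variant or threshold was used, so the Welch statistic, the `threshold` value, the function names and the toy intensities below are all hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for one feature measured in two groups."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_filter(group_a, group_b, threshold):
    """Return the indices of features whose |t| reaches the threshold.

    group_a and group_b are lists of samples; each sample lists its
    feature intensities in the same order.
    """
    kept = []
    # zip(*group) turns a list of samples into per-feature columns.
    for j, (col_a, col_b) in enumerate(zip(zip(*group_a), zip(*group_b))):
        if abs(welch_t(col_a, col_b)) >= threshold:
            kept.append(j)
    return kept


# Hypothetical toy intensities: 3 samples per group, 2 features (peaks).
relapse = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
no_relapse = [[4.0, 5.5], [5.0, 4.5], [6.0, 5.0]]

# Feature 0 differs between the groups; feature 1 has equal group means.
kept = t_filter(relapse, no_relapse, threshold=2.0)
```

A filter of this kind only shrinks the feature space; in the abstract it is followed by binning and a wrapper-style backward selection, which this sketch does not attempt to reproduce.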

2.
Metabolic markers are the core of metabonomic surveys. Hence, selection of differential metabolites is of great importance for both biological and clinical purposes. Here, a feature selection method was developed for complex metabonomic data sets. As an effective tool for metabonomics data analysis, support vector machine (SVM) was employed as the basic classifier. To find meaningful features effectively, support vector machine recursive feature elimination (SVM-RFE) was applied first. Then, genetic algorithm (GA) and random forest (RF), which consider the interaction among the metabolites and the independent performance of each metabolite in all samples, respectively, were used to obtain more informative metabolic differences and to reduce the risk of false positives. A data set from a plasma metabonomics study of rat liver diseases progressing from hepatitis and cirrhosis to hepatocellular carcinoma was used to validate the method. Besides the good classification results for the 3 kinds of liver diseases, 31 important metabolites, including lysophosphatidylethanolamine (LPE) C16:0, palmitoylcarnitine and lysophosphatidylcholine (LPC) C18:0, were also selected for further study. The current results show a good complementary effect among the three feature selection methods. The combined method also revealed more differential metabolites and provided more metabolic information for a “global” understanding of diseases than any single method. Furthermore, this method is also suitable for other complex biological data sets.

3.

Background  

This paper presents the use of Support Vector Machines (SVMs) for prediction and analysis of antisense oligonucleotide (AO) efficacy. The collected database comprises 315 AO molecules with 68 features each, a problem well suited to SVMs. The task of feature selection is crucial given the presence of noisy or redundant features, and the well-known problem of the curse of dimensionality. We propose a two-stage strategy to develop an optimal model: (1) feature selection using correlation analysis, mutual information, and SVM-based recursive feature elimination (SVM-RFE), and (2) AO prediction using standard and profiled SVM formulations. A profiled SVM gives different weights to different parts of the training data to focus the training on the most important regions.

4.

Background:  

In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is being widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, its performance can be easily affected by noise and outliers when it is applied to noisy microarray data with small sample sizes.
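
The weight-vector ranking described in this abstract can be sketched as a loop that trains a linear SVM, drops the feature with the smallest squared weight, and repeats. This is a minimal illustration, not the authors' implementation: the full-batch subgradient trainer, the hyperparameters (`lam`, `lr`, `epochs`), the function names and the toy data are assumptions, and real analyses would normally rely on an established SVM solver and standardized features.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    X is a list of feature lists, y a list of labels in {+1, -1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # sample lies inside the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep=1):
    """Recursively drop the feature with the smallest squared weight.

    Returns (surviving feature indices, indices in order of elimination).
    """
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated


# Hypothetical toy data: feature 0 carries the class signal, feature 1 is
# weakly informative, feature 2 is small noise unrelated to the label.
X = [
    [2.0, 0.3, 0.05], [1.8, 0.1, -0.05], [2.2, 0.2, 0.05],
    [-2.0, -0.3, 0.05], [-1.9, -0.1, -0.05], [-2.1, -0.2, 0.05],
]
y = [1, 1, 1, -1, -1, -1]

survivors, eliminated = svm_rfe(X, y, n_keep=1)
```

Because the ranking depends entirely on the fitted weights, a few outlying samples that shift the hyperplane can change which feature is dropped, which is the sensitivity the abstract points to.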

5.

Background

Acupuncture has been practiced in China for thousands of years as part of Traditional Chinese Medicine (TCM) and has gradually been accepted in western countries as an alternative or complementary treatment. However, the underlying mechanism of acupuncture, especially whether there is any difference between various acupoints, remains largely unknown, which hinders its widespread use.

Results

In this study, we develop a novel Linear Programming based Feature Selection method (LPFS) to understand the mechanism of the acupuncture effect at the molecular level by revealing the metabolite biomarkers for acupuncture treatment. Specifically, we generate and investigate the high-throughput metabolic profiles of acupuncture treatment at several acupoints in humans. To select the subsets of metabolites that best characterize the acupuncture effect for each meridian point, an optimization model is proposed to identify biomarkers from high-dimensional metabolic data from case and control samples. Importantly, we use nearest centroid as the prototype to simultaneously minimize the number of selected features and the leave-one-out cross-validation error of the classifier. We compared the performance of LPFS with several state-of-the-art methods, such as SVM recursive feature elimination (SVM-RFE) and the sparse multinomial logistic regression approach (SMLR). We find that our LPFS method tends to reveal a small set of metabolites with small standard deviation and large shifts, which is exactly what we require of a good biomarker. Biologically, several metabolite biomarkers for acupuncture treatment are revealed and serve as candidates for further mechanistic investigation. In addition, biomarkers derived from five meridian points, Zusanli (ST36), Liangmen (ST21), Juliao (ST3), Yanglingquan (GB34), and Weizhong (BL40), are compared for their similarities and differences, which provides evidence for the specificity of acupoints.
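
One quantity this abstract relies on, the leave-one-out cross-validation error of a nearest-centroid classifier on a chosen feature subset, can be sketched as follows. The linear-programming model of LPFS itself is not reproduced here; the function names and the toy metabolite values are assumptions made only for illustration.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean vector (centroid) is closest."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))


def loocv_error(X, y, features):
    """Leave-one-out error rate using only the given feature indices."""
    errors = 0
    for i in range(len(X)):
        train_X = [[row[j] for j in features] for k, row in enumerate(X) if k != i]
        train_y = [lab for k, lab in enumerate(y) if k != i]
        held_out = [X[i][j] for j in features]
        if nearest_centroid_predict(train_X, train_y, held_out) != y[i]:
            errors += 1
    return errors / len(X)


# Hypothetical toy profiles: metabolite 0 shifts between groups with small
# spread; metabolite 1 has large spread and no consistent shift.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.6], [3.1, 1.2], [2.9, 3.4]]
y = ["control", "control", "control", "case", "case", "case"]

error_informative = loocv_error(X, y, features=[0])
error_noisy = loocv_error(X, y, features=[1])
```

On this toy data the low-variance, large-shift metabolite gives a much lower leave-one-out error than the noisy one, which matches the kind of biomarker the abstract says LPFS tends to select.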

Conclusions

Our results demonstrate that metabolic profiling might be a promising method to investigate the molecular mechanism of acupuncture. Compared with other existing methods, LPFS performs better at selecting a small set of key molecules. In addition, LPFS is a general methodology and can be applied to other high-dimensional data analyses, for example in cancer genomics.

6.

Background

The majority of ovarian cancer biomarker discovery efforts focus on the identification of proteins that can improve the predictive power of presently available diagnostic tests. We here show that metabolomics, the study of metabolic changes in biological systems, can also provide characteristic small molecule fingerprints related to this disease.

Results

In this work, new approaches to automatic classification of metabolomic data produced from sera of ovarian cancer patients and benign controls are investigated. The performance of support vector machines (SVM) for the classification of liquid chromatography/time-of-flight mass spectrometry (LC/TOF MS) metabolomic data, focusing on recognizing combinations or "panels" of potential metabolic diagnostic biomarkers, was evaluated. Utilizing LC/TOF MS, sera from 37 ovarian cancer patients and 35 benign controls were studied. Optimum panels of spectral features observed in positive and/or negative ion mode electrospray (ESI) MS with the ability to distinguish between control and ovarian cancer samples were selected using state-of-the-art feature selection methods such as recursive feature elimination and L1-norm SVM.

Conclusion

Three evaluation processes (leave-one-out-cross-validation, 12-fold-cross-validation, 52-20-split-validation) were used to examine the SVM models based on the selected panels in terms of their ability to differentiate control vs. disease serum samples. The statistical significance of these feature selection results was comprehensively investigated. Classification of the serum sample test set was over 90% accurate, indicating that the above approach may lead to the development of an accurate and reliable metabolomic-based approach for detecting ovarian cancer.

7.
Optimal experimental design is important for the efficient use of modern high-throughput technologies such as microarrays and proteomics. Multiple factors, including the reliability of the measurement system, which itself must be estimated from prior experimental work, could influence design decisions. In this study, we describe how the optimal number of replicate measures (technical replicates) for each biological sample (biological replicate) can be determined. Different allocations of biological and technical replicates were evaluated by minimizing the variance of the ratio of technical variance (measurement error) to the total variance (sum of sampling error and measurement error). We demonstrate that if the number of biological replicates and the number of technical replicates per biological sample are variable, while the total number of available measures is fixed, then the optimal allocation of replicates for measurement evaluation experiments requires two technical replicates for each biological replicate. Therefore, it is recommended to use two technical replicates for each biological replicate if the goal is to evaluate the reproducibility of measurements.

8.
A strategy for processing of metabolomic GC/MS data is presented. By considering the relationship between quantity and quality of detected profiles, representative data suitable for multiple sample comparisons and metabolite identification were generated. Design of experiments (DOE) and multivariate analysis were used to relate the changes in settings of the hierarchical multivariate curve resolution (H-MCR) method to quantitative and qualitative characteristics of the output data. These characteristics included number of resolved profiles, chromatographic quality in terms of reproducibility between analytical replicates, and spectral quality defined by purity and number of spectra containing structural information. The strategy was exemplified in two datasets: one containing 119 common metabolites, 18 of which were varied according to a DOE protocol; and one consisting of rat urine samples from control rats and rats exposed to a liver toxin. It was shown that the performance of the data processing could be optimized to produce metabolite data of high quality that allowed reliable sample comparisons and metabolite identification. This is a general approach applicable to any type of data processing where the important processing parameters are known and relevant output data characteristics can be defined. The results imply that this type of data quality optimization should be carried out as an integral step of data processing to ensure high quality data for further modeling and biological evaluation. Within metabolomics, this degree of optimization will be of high importance to generate models and extract biomarkers or biomarker patterns of biological or clinical relevance.

9.
Although metastasis is the principal cause of death for colorectal cancer (CRC) patients, the molecular mechanisms underlying CRC metastasis are still not fully understood. In an attempt to identify metastasis-related genes in CRC, we obtained gene expression profiles of 55 early stage primary CRCs, 56 late stage primary CRCs, and 34 metastatic CRCs from the expression project in Oncology (http://www.intgen.org/expo/). We developed a novel gene selection algorithm (SVM-T-RFE), which extends the support vector machine recursive feature elimination (SVM-RFE) algorithm by incorporating the T-statistic. We achieved the highest classification accuracy (100%) with smaller gene subsets (10 and 6, respectively) when classifying between early and late stage primary CRCs, as well as between metastatic CRCs and late stage primary CRCs. We also compared the performance of the SVM-T-RFE and SVM-RFE gene selection algorithms on another large-scale CRC dataset and on five public microarray datasets. SVM-T-RFE outperformed the SVM-RFE algorithm by identifying more differentially expressed genes and achieving the highest prediction accuracy with an equal or smaller number of selected genes. A fraction of the selected genes have been reported to be associated with CRC development or metastasis.

10.
Conventional biomarker discovery focuses mostly on the identification of single markers and thus often has limited success in disease diagnosis and prognosis. This study proposes a method to identify an optimized protein biomarker panel based on MS studies for predicting the risk of major adverse cardiac events (MACE) in patients. Because the simplicity and concision required for the development of immunoassays can tolerate only prediction models built on very few discriminative biomarkers, established optimization methods, such as the conventional genetic algorithm (GA), fail in the high‐dimensional space. In this paper, we present a novel variant of GA that embeds the recursive local floating enhancement technique to discover a panel of protein biomarkers with far better prognostic value for prediction of MACE than existing methods, including the one approved recently by the FDA (Food and Drug Administration). The new pragmatic method applies the constraints of MACE relevance and biomarker redundancy to shrink the local search space in order to avoid the heavy computational penalty resulting from the local floating optimization. The proposed method is compared with standard GA and other variable selection approaches based on the MACE prediction experiments. Two powerful classification techniques, partial least squares logistic regression (PLS‐LR) and support vector machine classifier (SVMC), are deployed as the MACE predictors owing to their ability to deal with small‐scale and binary response data. New preprocessing algorithms, such as low‐level signal processing, elimination of duplicated spectra, and removal of outlier patient samples, are also included in the proposed method. The experimental results show that an optimized panel of seven selected biomarkers can provide more than 77.1% MACE prediction accuracy using SVMC.
The experimental results empirically demonstrate that the new GA algorithm with local floating enhancement (GA‐LFE) can achieve better MACE prediction performance than the existing techniques. The method has been applied to SELDI/MALDI MS datasets to discover an optimized panel of protein biomarkers to distinguish disease from control.

11.
Pluripotent stem cells are able to self-renew, and to differentiate into all adult cell types. Many studies report data describing these cells, and characterize them in molecular terms. Machine learning yields classifiers that can accurately identify pluripotent stem cells, but there is a lack of studies yielding minimal sets of best biomarkers (genes/features). We assembled gene expression data of pluripotent stem cells and non-pluripotent cells from the mouse. After normalization and filtering, we applied machine learning, classifying samples into pluripotent and non-pluripotent with high cross-validated accuracy. Furthermore, to identify minimal sets of best biomarkers, we used three methods: information gain, random forests and a wrapper of genetic algorithm and support vector machine (GA/SVM). We demonstrate that the GA/SVM biomarkers work best in combination with each other; pathway and enrichment analyses show that they cover the widest variety of processes implicated in pluripotency. The GA/SVM wrapper yields best biomarkers, no matter which classification method is used. The consensus best biomarker based on the three methods is Tet1, implicated in pluripotency just recently. The best biomarker based on the GA/SVM wrapper approach alone is Fam134b, possibly a missing link between pluripotency and some standard surface markers of unknown function processed by the Golgi apparatus.

12.
Accurate prediction of the phenotypic performance of a hybrid plant based on the molecular fingerprints of its parents should lead to a more cost-effective breeding programme, as it allows the number of expensive field evaluations to be reduced. The construction of a reliable prediction model requires a representative sample of hybrids for which both molecular and phenotypic information are accessible. This phenotypic information is usually readily available, as typical breeding programmes test numerous new hybrids in multi-location field trials on a yearly basis. Earlier studies indicated that a linear mixed model analysis of this typically unbalanced phenotypic data makes it possible to construct ɛ-insensitive support vector machine regression and best linear prediction models for predicting the performance of single-cross maize hybrids. We compare these prediction methods using different subsets of the phenotypic and marker data of a commercial maize breeding programme and evaluate the resulting prediction accuracies by means of a specifically designed field experiment. This balanced field trial makes it possible to assess the reliability of the cross-validation prediction accuracies reported here and in earlier studies. The limits of the predictive capabilities of both prediction methods are further examined by reducing the number of training hybrids and the size of the molecular fingerprints. The results indicate a considerable discrepancy between prediction accuracies obtained by cross-validation procedures and those obtained by correlating the predictions with the results of a validation field trial. The prediction accuracy of best linear prediction was less sensitive to a reduction of the number of training examples compared with that of support vector machine regression. The latter was, however, better at predicting hybrid performance when the size of the molecular fingerprints was reduced, especially if the initial set of markers had a low information content.

13.
MOTIVATION: Given the thousands of genes and the small number of samples, gene selection has emerged as an important research problem in microarray data analysis. Support Vector Machine-Recursive Feature Elimination (SVM-RFE) is one of a group of recently described algorithms which represent the state-of-the-art for gene selection. Just like SVM itself, SVM-RFE was originally designed to solve binary gene selection problems. Several groups have extended SVM-RFE to solve multiclass problems using one-versus-all techniques. However, the genes selected from one binary gene selection problem may reduce the classification performance in other binary problems. RESULTS: In the present study, we propose a family of four extensions to SVM-RFE (called MSVM-RFE) to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs. By simultaneously considering all classes during the gene selection stages, our proposed extensions identify genes leading to more accurate classification.

14.
Summary. Isoprostanes, non-enzymatic peroxidation products of arachidonic acid, are attractive biomarkers of oxidative stress in research in biology, medicine and nutrition. For the appropriate use of biomarkers, they must be both biologically and technically valid. Whereas the biological validity of isoprostanes is well established, it is technically quite complicated to measure isoprostanes and their metabolites in body fluids, and their rapid disappearance from plasma may hamper practical application. This paper briefly introduces isoprostanes as a biomarker for studies with humans, describes a novel fast and sensitive method for measuring isoprostanes in plasma by high-performance liquid chromatography and tandem mass spectrometry, and provides several examples of the use of the method in studies in humans. By taking care of the biological and technical validity of this biomarker, it is possible to establish the antioxidant effects of some food ingredients in studies with human volunteers.

15.
Adverse health risks from environmental agents are generally related to average (long-term) exposures. Because a given individual's contact with a pollutant is highly variable and dependent on activity patterns, local sources and exposure pathways, simple ‘snapshot’ measurements of surrounding environmental media may not accurately assign the exposure level. Furthermore, susceptibility to adverse effects from contaminants is considered highly variable in the population so that even similar environmental exposure levels may result in differential health outcomes in different individuals. The use of biomarker measurements coupled to knowledge of rates of uptake, metabolism and elimination has been suggested as a remedy for reducing this type of uncertainty. To demonstrate the utility of such an approach, we invoke results from a series of controlled human exposure tests and classical first-order rate kinetic calculations to estimate how well spot measurements of methyl tertiary butyl ether and the primary metabolite, tertiary butyl alcohol, can be expected to predict different hypothetical scenarios of previous exposures. We found that blood and breath biomarker measurements give similar results and that the biological damping effect of the metabolite production gives more stable estimates of previous exposure. We also explore the value of a potential urinary biomarker, 2-hydroxyisobutyrate, suggested in the literature. We find that individual biomarker measurements are a valuable tool in reconstruction of previous exposures and that a simple pharmacokinetic model can identify the time frames over which an exogenous chemical and the related chemical biomarker are useful. These techniques could be applied to broader ranges of environmental contaminants to assess cumulative exposure risks if ADME (Absorption, Distribution, Metabolism and Excretion) is understood and systemic biomarkers can be measured.
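
The classical first-order kinetics mentioned in this abstract can be illustrated with a single-compartment decay model, in which concentration falls as C(t) = C0·exp(−k·t) and a spot measurement is back-extrapolated to the level at the end of exposure. This is a simplified sketch: the abstract gives no rate constants, so the half-life, the concentrations and the function names below are hypothetical, and the parent-to-metabolite conversion discussed in the abstract is not modelled.

```python
import math


def rate_constant(half_life_h):
    """First-order elimination rate constant k (per hour) from a half-life."""
    return math.log(2) / half_life_h


def concentration(c0, k, t_h):
    """Concentration t_h hours after exposure ends: C(t) = C0 * exp(-k * t)."""
    return c0 * math.exp(-k * t_h)


def back_extrapolate(c_measured, k, t_h):
    """Estimate the end-of-exposure level from a spot sample taken t_h hours later."""
    return c_measured * math.exp(k * t_h)


# Hypothetical numbers: a 2-hour half-life and a spot sample of 5.0
# (arbitrary units) taken 4 hours after exposure ended.
k = rate_constant(2.0)
c0_estimate = back_extrapolate(5.0, k, 4.0)
```

Four hours is two half-lives in this example, so the back-extrapolated level is about four times the measured one. The same arithmetic shows why a fast-clearing compound is informative only over a short window, while a slower-clearing metabolite stays measurable for longer.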

16.
Moon Myungjin, Nakai Kenta. BMC Genomics, 2016, 17(13): 65-74
Background

Lately, biomarker discovery has become one of the most significant research issues in the biomedical field. Owing to the presence of high-throughput technologies, genomic data, such as microarray data and RNA-seq, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these kinds of data. However, they tend to be noisy with high-dimensional features and consist of a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility.

Results

In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently reduce irrelevant features, considering the stability of features. We define the stability score for each feature by aggregating the ensemble results, and utilize backward feature elimination on a purified feature set based on this score; therefore, it is possible to acquire an optimal set of features for performance without the need to set a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data.
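
One simple way to picture a stability score obtained by aggregating ensemble results is the fraction of ensemble members that selected each feature. The abstract does not give the authors' exact definition, so the selection-frequency formula, the function names and the example selections below are assumptions for illustration only.

```python
from collections import Counter


def stability_scores(selected_sets, n_features):
    """Fraction of ensemble members that selected each feature."""
    runs = len(selected_sets)
    counts = Counter(j for s in selected_sets for j in set(s))
    return [counts[j] / runs for j in range(n_features)]


def rank_by_stability(scores):
    """Feature indices ordered from most to least stable (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


# Hypothetical output of four sparse models, each trained on a different
# resample of the data: the set of feature indices with non-zero weight.
ensemble_selections = [{0, 2}, {0, 1}, {0, 2}, {0, 3}]

scores = stability_scores(ensemble_selections, n_features=5)
ranking = rank_by_stability(scores)
```

A backward elimination pass could then remove features from the bottom of this ranking while monitoring classification performance, which avoids fixing a score threshold in advance.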

Conclusion

A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of an L2-norm support vector machine-based recursive feature elimination, showed the superior performance of our method in terms of classification as well as stability in general. It is also shown that the proposed approach performs moderately on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.


17.
18.
Classification of gene microarrays by penalized logistic regression   Total citations: 2, self-citations: 0, citations by others: 2
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
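
The contrast this abstract draws, a class label from the SVM versus a probability estimate from penalized logistic regression, can be seen in a small sketch of L2-penalized logistic regression fitted by plain gradient descent. This is not the fast algorithm the abstract refers to; the penalty strength `lam`, the learning rate, the function names and the one-gene toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_plr(X, y, lam=0.1, lr=0.1, epochs=500):
    """Gradient descent on the logistic loss plus (lam / 2) * ||w||^2.

    X is a list of feature lists, y a list of labels in {0, 1}.
    The intercept b is left unpenalized.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical one-gene toy data: low expression in class 0, high in class 1.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

w, b = fit_plr(X, y)
p_high = predict_proba(w, b, [1.5])   # new sample with high expression
p_low = predict_proba(w, b, [-1.5])   # new sample with low expression
```

The fitted weights could also feed a recursive elimination loop in the same way SVM weights do, which is the PLR-plus-RFE combination the abstract evaluates.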

19.
In this study, we present a constructive algorithm for training cooperative support vector machine ensembles (CSVMEs). CSVME combines ensemble architecture design with cooperative training for individual SVMs in ensembles. Unlike most previous studies on training ensembles, CSVME puts emphasis on both accuracy and collaboration among individual SVMs in an ensemble. A group of SVMs selected on the basis of recursive classifier elimination is used in CSVME, and the number of the individual SVMs selected to construct CSVME is determined by 10-fold cross-validation. This kind of SVME has been tested on two ovarian cancer datasets previously obtained by proteomic mass spectrometry. By combining several individual SVMs, the proposed method achieves better performance than the SVME of all base SVMs.

20.

Oregon‐R, +3, and crossbred strains of Drosophila melanogaster were tested for their response to selection for abdominal bristle number. Various subsidiary tests, consisting of heritability estimations, testing for lethal second and third chromosomes, and chromosome assays were conducted on the selection replicates, which had undergone 14 generations of selection. Evidence showed that a plateau which occurred very early in the +3 high selection replicates was due to fixation of a few additive genes with large effects, thus accounting for the low phenotypic and additive genetic variance, the slight regression in abdominal bristle number on relaxation of selection, the absence of directional dominance, and the low frequency of recessive lethals.

High frequencies of second and third chromosome lethals were found in the Oregon‐R high and low replicates and in the +3 low replicates. That these lethals were not selected for heterozygote superiority for extreme bristle effect was indicated by the slight regression of these replicates on relaxation of selection, and by the absence of high, fluctuating phenotypic variances.

From chromosome assays it appears that the two parental strains had different arrays of genes affecting high bristle number, with these genes located mostly in chromosome II in the Oregon‐R high line but in chromosome III in the +3 high line. In the Crossbred high line, high bristle factors were located in both the second and third chromosomes. The low bristle factors were located mainly in the second chromosome in all three low selection lines.

It appears that the original cross had combined different genes favouring high bristle number, thus allowing greater response in the Crossbred high selection line. The same did not occur for low selection; the response from the Crossbred low line was similar to that of the parental low lines, suggesting that the gene arrays affecting low bristle number in the two original populations were comparable.
