Similar literature
1.

Background

Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

Methods

The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators.

Results

After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p<0.001).

Conclusion

The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
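
As a rough illustration of how odds ratios and intervals like those reported above relate to logistic regression output, the sketch below converts a coefficient and its standard error into an odds ratio with a Wald 95% confidence interval. The function name and the numeric inputs are hypothetical and are not taken from the study; the study's own estimates came from multiply imputed, survey-weighted models, which this sketch does not reproduce.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for one coefficient.

    beta: logistic regression coefficient (log-odds scale).
    se:   its standard error.
    z:    normal quantile; 1.96 gives an approximate 95% Wald interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error, for illustration only.
or_, lower, upper = odds_ratio_ci(beta=0.14, se=0.065)
print(f"OR {or_:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 corresponds to a coefficient whose Wald interval excludes 0 on the log-odds scale.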

2.
Early diagnosis of inborn errors of metabolism is commonly performed through biofluid metabolomics, which detects specific metabolic biomarkers whose concentration is altered due to genomic mutations. The identification of new biomarkers is of major importance to biomedical research and is usually performed through data mining of metabolomic data. After the recent publication of the genome‐scale network model of human metabolism, we present a novel computational approach for systematically predicting metabolic biomarkers in stoichiometric metabolic models. Applying the method to predict biomarkers for disruptions of red‐blood cell metabolism demonstrates a marked correlation with altered metabolic concentrations inferred through kinetic model simulations. Applying the method to the genome‐scale human model reveals a set of 233 metabolites whose concentration is predicted to be either elevated or reduced as a result of 176 possible dysfunctional enzymes. The method's predictions are shown to significantly correlate with known disease biomarkers and to predict many novel potential biomarkers. Using this method to prioritize metabolite measurement experiments to identify new biomarkers can provide on the order of a 10‐fold increase in biomarker detection performance.

3.
Biomarker selection is an important topic in the omics sciences, where holistic measurement methods routinely generate results for many variables simultaneously. Very often, only a small fraction of these variables are really associated with the phenomena of interest. Selection and identification of these biomarkers is essential for obtaining an understanding of the complex biological processes under study. Finding biomarkers, however, is a difficult task. Even if a relative order can be established, e.g., on the basis of p values, it is usually hard to determine where to stop including candidates in the final set. Higher Criticism (HC) is an approach for finding data-dependent cutoff values when comparing two distinct groups of samples. Here, we extend its use to multivariate data, providing a principled approach to compromise between not selecting too many variables and catching as many true positives as possible. The results show a marked improvement in biomarker selection, compared to the standard settings available for some methods. Interestingly, HC thresholds can differ considerably from what has been suggested in the literature before, again showing that it is not possible to use the same cutoff value for all data sets. The data-specific cutoff values provided by HC also open the way to fairer comparisons between biomarker selection methods, not biased by unlucky or suboptimal threshold choices.
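
The idea of a data-dependent cutoff can be sketched for the basic univariate case. The code below assumes one common form of the Higher Criticism objective, sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))), maximised over the smallest fraction alpha0 of the sorted p values; this form, the function name and the example p values are assumptions for illustration, and the multivariate extension described in the abstract is not reproduced here.

```python
import math


def higher_criticism_cutoff(p_values, alpha0=0.5):
    """Return (k, p_cut): how many variables to keep and the p-value cutoff.

    The objective is evaluated at each sorted p value p_(i), i = 1..alpha0*n,
    and the index with the largest value defines the cutoff.
    """
    p_sorted = sorted(p_values)
    n = len(p_sorted)
    best_i, best_hc = 1, -math.inf
    for i, p in enumerate(p_sorted, start=1):
        if i > alpha0 * n:
            break
        if p <= 0.0 or p >= 1.0:
            continue  # objective is undefined at exactly 0 or 1
        hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        if hc > best_hc:
            best_i, best_hc = i, hc
    return best_i, p_sorted[best_i - 1]


# Hypothetical p values: three small ones followed by unremarkable ones.
p_values = [0.001, 0.002, 0.003, 0.2, 0.35, 0.5, 0.6, 0.75, 0.8, 0.95]
k, p_cut = higher_criticism_cutoff(p_values)
print(k, p_cut)
```

Because the cutoff is recomputed from each data set's own p-value distribution, two data sets with the same number of variables can end up with quite different thresholds.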

4.
Introduction: Colorectal cancer (CRC) is one of the common types of cancer that affects a significant proportion of the population and is a major contributor to cancer related mortality. The relatively poor survival rate of CRC could be improved through the identification of clinically useful biomarkers.

Areas covered: This review highlights the need for biomarkers and discusses recent proteomics discoveries in the aspects of CRC clinical practice including diagnosis, prognosis, therapy, screening and molecular pathological epidemiology (MPE). Studies have been evaluated in relation to biomarker target, methodology, sample selection, limitations, and potential impact. Finally, the progress in proteomic approaches is briefly discussed and the main difficulties facing the translation of proteomics biomarkers into the clinical practice are highlighted.

Expert commentary: The establishment of specific guidelines, best practice recommendations and the improvement in proteomic strategies will significantly improve the prospects for developing clinically useful biomarkers.


5.

Background

In recent years, both single-nucleotide polymorphism (SNP) array and functional magnetic resonance imaging (fMRI) have been widely used for the study of schizophrenia (SCZ). In addition, a few studies have integrated SNP and fMRI data for comprehensive analysis.

Methods

In this study, a novel sparse representation based variable selection (SRVS) method has been proposed and tested on a simulation data set to demonstrate its multi-resolution properties. Then the SRVS method was applied to an integrative analysis of two different SCZ data sets, a single-nucleotide polymorphism (SNP) data set and a functional magnetic resonance imaging (fMRI) data set, including 92 cases and 116 controls. Biomarkers for the disease were identified and validated with a multivariate classification approach followed by leave-one-out (LOO) cross-validation. Then we compared the results with those of a previously reported sparse representation based feature selection method.

Results

Results showed that biomarkers from our proposed SRVS method gave significantly higher classification accuracy in discriminating SCZ patients from healthy controls than those from the previously reported sparse representation method. Furthermore, using biomarkers from both data sets led to better classification accuracy than using a single type of biomarker, which suggests the advantage of integrative analysis of different types of data.

Conclusions

The proposed SRVS algorithm is effective in identifying significant biomarkers for complex diseases such as SCZ. Integrating different types of data (e.g. SNP and fMRI data) may identify complementary biomarkers that improve the diagnostic accuracy for the disease.
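
The leave-one-out validation step mentioned in the Methods can be sketched as follows. A nearest-mean classifier is used purely as a stand-in, since the abstract does not specify the classifier or give the SRVS algorithm; the function names and the toy data are hypothetical.

```python
def nearest_mean_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(xs, ys):
    """Leave-one-out accuracy: each sample is predicted by a model fit on the rest."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += nearest_mean_predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)


# Hypothetical two-feature data: three controls (0) and three cases (1).
xs = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [2.0, 2.1], [2.2, 1.9], [1.8, 2.3]]
ys = [0, 0, 0, 1, 1, 1]
print(loo_accuracy(xs, ys))
```

In a real analysis, any variable selection would normally be repeated inside each loop iteration, on the training samples only, so that the held-out sample does not influence which biomarkers are chosen.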

6.
The characterization of the normal urinary proteome is steadily progressing and represents a major interest in the assessment of clinical urinary biomarkers. To estimate quantitatively the variability of the normal urinary proteome, urines of 20 healthy people were collected. We first evaluated the impact of the sample conservation temperature on urine proteome integrity. Keeping the urine sample at RT or at +4°C until storage at -80°C seems to be the best way for long-term storage of samples for 2D-GE analysis. The quantitative variability of the normal urinary proteome was estimated on the 20 urines mapped by 2D-GE. The occurrence of the 910 identified spots was analysed throughout the gels and represented in a virtual 2D gel. Sixteen percent of the spots were found to occur in all samples and 23% occurred in at least 90% of urines. About 13% of the protein spots were present in only 10% or less of the samples, thus representing the most variable part of the normal urinary proteome. Twenty proteins corresponding to a fraction of the fully conserved spots were identified by mass spectrometry. In conclusion, a "public" urinary proteome, common to healthy individuals, seems to coexist with a "private" urinary proteome, which is more specific to each individual.

7.
《Biomarkers》2013,18(1):65-73
Context: It is known that there are usually several biomarkers and/or medium combinations that can be applied to answer a specific exposure question. To help determine an appropriate combination for the specific question, we have developed a weight-of-evidence Framework that provides a relative appropriateness score for competing combinations.

Methods: The Framework is based on an expert assessor’s evaluation of the relevance and suitability of the biomarker and medium for the question based on a set of criteria. We provide a computer based modeling tool to guide the researcher through the process.

Results: We present an example with six biomarkers of benzene exposure in one matrix; the six are either the most commonly used biomarkers and/or have recent widespread usage. The example clearly demonstrates the usefulness of the Framework for scoring the choices, as well as the transparency of the method that provides the basis for discussion.

Conclusions: The Framework provides for the first time a method to transparently document the rationale behind selecting, from among a set of alternatives, the most scientifically supportable exposure biomarker to address a specific biomonitoring question, thus providing a reproducible account of expert opinions on the suitability of a biomarker.

8.
Lichao Zhang  Liang Kong 《Genomics》2019,111(3):457-464
Recombination spot identification plays an important role in revealing genome evolution and advancing the study of DNA function. Although some computational methods have been proposed, extracting discriminatory information embedded in DNA properties has not received enough attention. The DNA properties include dinucleotide flexibility, structure and thermodynamic parameters, which are significant for genome evolution research. To explore the potential effect of DNA properties, a novel feature extraction method, called iRSpot-PDI, is proposed. A wrapper feature selection method with best-first search is used to identify the best feature set. To verify the effectiveness of the proposed method, a support vector machine is employed on the obtained features. Prediction results are reported on two benchmark datasets. Compared with the recently reported methods, iRSpot-PDI achieves the highest values of individual specificity, Matthews correlation coefficient and overall accuracy. The experimental results confirm that iRSpot-PDI is effective for accurate identification of recombination spots. The datasets can be downloaded from the following URL: http://stxy.neuq.edu.cn/info/1095/1157.htm.

9.
In this report we present a catalogue of 162 proteins (including isoforms and variants) identified in a prototype proteomic map of breast cancer cells. This work is a continuation of previous studies describing the protein complement of breast cancer cells of the line 8701-BC, which has been well characterized for several parameters, proving to be a useful model for the study of breast cancer-associated candidate biomarkers. In particular, 110 spots were identified ex novo by PMF, or validated following the previous gel matching identification method; 30 were identified by N-terminal microsequencing and the remainder by gel matching with maps available from our former work. As a consequence of the expanded number of proteins, we have updated our previous classification, extending the number of protein groups from 4 to 13. In order to facilitate comparative proteome studies of different kinds of breast cancers, in this report we provide the whole complement of proteins so far identified and grouped into the new classification. A considerable number of them were not described before in other proteomic maps of breast cancer cells or tissues, and therefore they represent a valuable contribution to breast cancer protein databases and to future applications in basic and clinical research.

10.
A strategy is presented to build a discrimination model in proteomics studies. The model is built using cross-validation. This cross-validation step can simply be combined with a variable selection method, called rank products. The strategy is especially suitable for the low-samples-to-variables-ratio (undersampling) case, as is often encountered in proteomics and metabolomics studies. As a classification method, Principal Component Discriminant Analysis is used; however, the methodology can be used with any classifier. A data set containing serum samples from breast cancer patients and healthy controls is analysed. Double cross-validation shows that the sensitivity of the model is 82% and the specificity 86%. Putative biomarkers are identified using the variable selection method. In each cross-validation loop a classification model is built. The final classification uses a majority voting scheme from the ensemble classifier.
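
A minimal sketch of the rank-product idea, assuming each cross-validation loop yields one importance score per variable: variables are ranked within each loop, and the rank product is the geometric mean of a variable's ranks across loops, so consistently top-ranked variables get values near 1. The function name, the score layout and the numbers are hypothetical; the abstract does not give the exact scoring used.

```python
import math


def rank_products(scores_per_loop):
    """Geometric mean of each variable's rank across cross-validation loops.

    scores_per_loop: list of lists, one importance score per variable per loop
    (higher score = more discriminative). Rank 1 is the best variable in a loop;
    ties are broken by variable order.
    """
    n_vars = len(scores_per_loop[0])
    log_sum = [0.0] * n_vars
    for scores in scores_per_loop:
        order = sorted(range(n_vars), key=lambda j: -scores[j])
        for rank, j in enumerate(order, start=1):
            log_sum[j] += math.log(rank)
    k = len(scores_per_loop)
    return [math.exp(s / k) for s in log_sum]


# Hypothetical scores for three variables over three cross-validation loops.
loops = [[0.9, 0.1, 0.5], [0.8, 0.3, 0.2], [0.7, 0.2, 0.6]]
rp = rank_products(loops)
print(rp)
```

Working on ranks rather than raw scores keeps a single loop with unusually large scores from dominating the selection.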

11.
Stress responsiveness differs between individuals and is often categorized into different stress coping styles. Using these stress coping styles for selection in fish farming could be beneficial, since stress is one main factor affecting welfare. In Arctic charr (Salvelinus alpinus) carotenoid pigmentation is associated with stress responsiveness and stress coping styles. Thus this could be an important tool to use for selection of stress resilient charr. However, anaesthetics seem to affect carotenoid pigmentation, and it would be better if the method for selection could be implemented during normal maintenance, which usually includes anaesthetics. Therefore, this study investigated how the use of anaesthetics affected carotenoid pigmentation, i.e. number of spots, over time compared to no-anaesthetic treatment. Additionally, the stress indicators monoamines and glucocorticoids were investigated. The results indicate that the anaesthetic MS-222 affects the number of spots on the right side. This anaesthetic also increased dopaminergic activity in the telencephalon. Both brain dopaminergic and serotonergic activity were associated with spottiness. Further, behaviour during anaesthetization was associated with spots on the left side, but not the right side. Repetition of the same treatment seemed to affect spot numbers on the right side. In conclusion, this study shows that inducing stress in charr affects the carotenoid spots. Thus, it is possible to use anaesthetics when evaluating spottiness although careful planning is needed.

12.
An MS-based metabolomics strategy including variable selection and PLSDA analysis has been assessed as a tool to discriminate between non-steatotic and steatotic human liver profiles. Different chemometric approaches for uninformative variable elimination were performed by using two of the most common software packages employed in the field of metabolomics (i.e., MATLAB and SIMCA-P). The first considered approach was performed with MATLAB where the PLS regression vector coefficient values were used to classify variables as informative or not. The second approach was run under SIMCA-P, where variable selection was performed according to both the PLS regression vector coefficients and VIP scores. PLSDA model performance features, such as model validation, variable selection criteria, and potential biomarker output, were assessed for comparison purposes. One interesting finding is that variable selection improved the classification predictiveness of all the models by facilitating metabolite identification and providing enhanced insight into the metabolic information acquired by the UPLC-MS method. The results prove that the proposed strategy is a potentially straightforward approach to improve model performance. Among others, GSH, lysophospholipids and bile acids were found to be the most important altered metabolites in the metabolomic profiles studied. However, further research and more in-depth biochemical interpretations are needed to unambiguously propose them as disease biomarkers.

13.
Histological subgroups of non-small cell lung cancer have different prognoses and they require different therapeutic approaches. Accordingly, there is a clinical need in this field to supplement conventional pathological diagnostics with protein and genetic biomarkers that can help to recognize patients responsive to these therapies. Methods for subgroup classification and target identification were developed using surgical samples (surgical lung tumor specimens are available only in 20% of all lung cancer cases). The majority of lung cancer patients, however, have tumors that are irresectable at the time of diagnosis. Therefore, their diagnosis is usually based on bronchoscopically removed tissue or needle biopsy samples analyzed mainly by cytology. Because of the growing need for immunohistochemistry and molecular pathology in lung cancer diagnosis, emphasis should be given to diagnostic bronchoscopic procedures providing tissue samples. Combination of the different biopsy techniques (histology, cytology, bronchial brush, BAL, TBNA etc.), embedding the cells (preparing cell blocks) and, moreover, the availability of immunohistochemical and molecular pathological facilities are all required to set up the proper diagnosis and therapeutic strategy in human lung cancer. Strausz J, Tímár J. Non-surgical biopsy in lung cancer: a paradigm shift.

14.
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.

15.
Epidemiologic studies can play a central role in risk assessments. They are used in all risk assessment phases: hazard identification, dose-response, and exposure assessment. Epidemiologic studies have often been the first to show that a particular environmental exposure is a hazard to health. They have numerous advantages with respect to other sources of data which are used in risk assessments, the most important being that they do not require the assumption that they are generalizable to humans. For this reason, fewer and lower uncertainty factors may be appropriate in risk characterization based on epidemiologic studies. Unfortunately, epidemiologic studies have numerous problems, the most important being that the exposures are often not precisely measured. This article presents in detail the advantages of and problems with epidemiologic studies. It discusses two approaches to ensure their usefulness: biomarkers, and an ordinance which requires baseline and subsequent surveillance of possible exposures and health effects from newly sited potentially polluting facilities. Biomarkers are biochemical measures of exposure, susceptibility factors, or preclinical pathological changes. Biomarkers are a way of dealing with the problems of poor measures, differential susceptibility and lack of early measures of disease occurrence that are inherent in many environmental epidemiologic studies. The advantage of biomarkers is that they can provide objective information on exposure days, months or even years later and evidence of pathology perhaps years earlier. The ordinance makes possible the use of a powerful epidemiologic study design, the prospective cohort study, where confounder(s) are best measured, and exposures, pathological changes, and health effects can be detected as soon as possible.

16.
We are studying variable selection in multiple regression models in which molecular markers and/or gene-expression measurements as well as intensity measurements from protein spectra serve as predictors for the outcome variable (i.e., trait or disease state). Finding genetic biomarkers and searching genetic–epidemiological factors can be formulated as a statistical problem of variable selection, in which, from a large set of candidates, a small number of trait-associated predictors are identified. We illustrate our approach by analyzing the data available for chronic fatigue syndrome (CFS). CFS is a complex disease from several aspects, e.g., it is difficult to diagnose and difficult to quantify. To identify biomarkers we used microarray data and SELDI-TOF-based proteomics data. We also analyzed genetic marker information for a large number of SNPs for an overlapping set of individuals. The objectives of the analyses were to identify markers specific to fatigue that are also possibly exclusive to CFS. The use of such models can be motivated, for example, by the search for new biomarkers for the diagnosis and prognosis of cancer and measures of response to therapy. Generally, for this we use Bayesian hierarchical modeling and Markov Chain Monte Carlo computation.

17.
Discriminant functions
J M England 《Blood cells》1989,15(3):463-71; discussion 472-3
Discriminant Functions (DFs), first described by Fisher in 1936, have been applied to the classification of microcytic disorders such as iron deficiency and heterozygous thalassemia. Mathematically DFs are weighted linear combinations of variables. If the underlying assumption of multivariate normality is valid DFs provide the best possible classification. Variables may need to be transformed before the DF is derived. When two groups have to be classified it is easy to visualise the DF. With one variable the DF is represented by the point which provides the best separation. In the bivariate situation the two groups form ellipses and the DF is the best line of separation whilst in the trivariate case the two groups are ellipsoids and a plane forms the best separation. Ratios and power functions are equivalent to DFs but they are less efficient and less rigorously derived. To apply DFs in hematological practice it is necessary to carefully select the measurements to be included and to define the case selection criteria. Once the DF has been derived it should be tested on a new data set and its transferability assessed. Like any single test the DF will have sensitivity and specificity which may need to be adjusted by changing the "cut-off" if the DF is used for screening rather than for differential diagnosis.
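
The bivariate case described above, where the DF is a line separating two groups, can be sketched as a Fisher linear discriminant: the weights are the inverse pooled covariance matrix applied to the difference of group means, and the default cut-off sits midway between the projected means. The function name and the data are hypothetical, unitless illustrations and are not hematological measurements from the paper.

```python
def fisher_discriminant_2d(group_a, group_b):
    """Return (w, cutoff) for two groups of (x1, x2) observations.

    A new observation x is scored as w[0]*x[0] + w[1]*x[1]; scores above the
    cutoff fall on group A's side of the separating line.
    """
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        s11 = sum((r[0] - m[0]) ** 2 for r in rows)
        s12 = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
        s22 = sum((r[1] - m[1]) ** 2 for r in rows)
        return s11, s12, s22

    ma, mb = mean(group_a), mean(group_b)
    a11, a12, a22 = scatter(group_a, ma)
    b11, b12, b22 = scatter(group_b, mb)
    dof = len(group_a) + len(group_b) - 2
    # Pooled within-group covariance matrix [[s11, s12], [s12, s22]].
    s11, s12, s22 = (a11 + b11) / dof, (a12 + b12) / dof, (a22 + b22) / dof
    det = s11 * s22 - s12 * s12
    d0, d1 = ma[0] - mb[0], ma[1] - mb[1]
    # w = S^-1 (mean_a - mean_b), using the closed-form 2x2 inverse.
    w = [(s22 * d0 - s12 * d1) / det, (s11 * d1 - s12 * d0) / det]
    cutoff = 0.5 * (w[0] * (ma[0] + mb[0]) + w[1] * (ma[1] + mb[1]))
    return w, cutoff


# Hypothetical two-variable measurements for two groups.
group_a = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.6), (1.2, 2.4)]
group_b = [(3.0, 1.0), (3.5, 0.6), (2.8, 1.4), (3.3, 0.8)]
w, cutoff = fisher_discriminant_2d(group_a, group_b)
scores_a = [w[0] * x1 + w[1] * x2 for x1, x2 in group_a]
scores_b = [w[0] * x1 + w[1] * x2 for x1, x2 in group_b]
print(w, cutoff)
```

Moving the cut-off away from the midpoint trades sensitivity against specificity, which is the adjustment the abstract mentions for screening use.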

18.

Background

The Signal-to-Noise-Ratio (SNR) is often used for identification of biomarkers for two-class problems and no formal and useful generalization of SNR is available for multiclass problems. We propose innovative generalizations of SNR for multiclass cancer discrimination through introduction of two indices, Gene Dominant Index and Gene Dormant Index (GDIs). These two indices lead to the concepts of dominant and dormant genes with biological significance. We use these indices to develop methodologies for discovery of dominant and dormant biomarkers with interesting biological significance. The dominancy and dormancy of the identified biomarkers and their excellent discriminating power are also demonstrated pictorially using the scatterplot of individual gene and 2-D Sammon's projection of the selected set of genes. Using information from the literature we have shown that the GDI based method can identify dominant and dormant genes that play significant roles in cancer biology. These biomarkers are also used to design diagnostic prediction systems.
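
For orientation, the two-class SNR that the proposed indices generalize is commonly written as the difference of class means divided by the sum of class standard deviations. The sketch below assumes that definition and ranks genes by its absolute value; the function names, gene labels and expression values are hypothetical, and the Gene Dominant and Gene Dormant indices themselves are not defined in the abstract, so they are not implemented here.

```python
import statistics


def signal_to_noise(class1, class2):
    """Two-class SNR for one gene: (mean1 - mean2) / (sd1 + sd2)."""
    m1, m2 = statistics.mean(class1), statistics.mean(class2)
    s1, s2 = statistics.stdev(class1), statistics.stdev(class2)
    return (m1 - m2) / (s1 + s2)


def rank_genes_by_snr(expr1, expr2):
    """Return gene names sorted by decreasing |SNR|.

    expr1, expr2: dicts mapping gene name to a list of expression values
    for the samples in class 1 and class 2 respectively.
    """
    snr = {g: signal_to_noise(expr1[g], expr2[g]) for g in expr1}
    return sorted(snr, key=lambda g: abs(snr[g]), reverse=True)


# Hypothetical expression values for two genes in two classes.
expr1 = {"geneA": [5.0, 5.2, 4.8], "geneB": [2.0, 3.0, 1.0]}
expr2 = {"geneA": [1.0, 1.1, 0.9], "geneB": [2.1, 2.9, 1.2]}
ranking = rank_genes_by_snr(expr1, expr2)
print(ranking)
```

This definition compares exactly two classes, which is the limitation the multiclass generalization in the abstract sets out to address.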

Results and discussion

To evaluate the effectiveness of the GDIs, we have used four multiclass cancer data sets (Small Round Blue Cell Tumors, Leukemia, Central Nervous System Tumors, and Lung Cancer). For each data set we demonstrate that the new indices can find biologically meaningful genes that can act as biomarkers. We then use six machine learning tools, Nearest Neighbor Classifier (NNC), Nearest Mean Classifier (NMC), Support Vector Machine (SVM) classifier with linear kernel, and SVM classifier with Gaussian kernel, where both SVMs are used in conjunction with one-vs-all (OVA) and one-vs-one (OVO) strategies. We found GDIs to be very effective in identifying biomarkers with strong class specific signatures. With all six tools and for all data sets we could achieve better or comparable prediction accuracies usually with fewer marker genes than results reported in the literature using the same computational protocols. The dominant genes are usually easy to find while good dormant genes may not always be available as dormant genes require stronger constraints to be satisfied; but when they are available, they can be used for authentication of diagnosis.

Conclusion

Since GDI based schemes can find a small set of dominant/dormant biomarkers that is adequate to design diagnostic prediction systems, it opens up the possibility of using real-time qPCR assays or antibody based methods such as ELISA for an easy and low cost diagnosis of diseases. The dominant and dormant genes found by GDIs can be used in different ways to design more reliable diagnostic prediction systems.

19.
This paper addresses the question of biomarker discovery in proteomics. Given clinical data regarding a list of proteins for a set of individuals, the tackled problem is to extract a short subset of proteins whose concentrations are an indicator of the biological status (healthy or pathological). In this paper, it is formulated as a specific instance of variable selection. The originality is that the proteins are not investigated one after the other but the best partition between discriminant and non-discriminant proteins is directly sought. In this way, correlations between the proteins are intrinsically taken into account in the decision. The developed strategy is derived in a Bayesian setting, and the decision is optimal in the sense that it minimizes a global mean error. It is finally based on the posterior probabilities of the partitions. The main difficulty is to calculate these probabilities since they are based on the so-called evidence, which requires marginalization of all the unknown model parameters. Two models are presented that relate the status to the protein concentrations, depending on whether the latter are biomarkers or not. The first model accounts for biological variabilities by assuming that the concentrations are Gaussian distributed with a mean and a covariance matrix that depend on the status only for the biomarkers. The second one is an extension that also takes into account the technical variabilities that may significantly impact the observed concentrations. The main contributions of the paper are: (1) a new Bayesian formulation of the biomarker selection problem, (2) the closed-form expression of the posterior probabilities in the noiseless case, and (3) a suitable approximated solution in the noisy case. The methods are numerically assessed and compared to the state-of-the-art methods (t test, LASSO, Bhattacharyya distance, FOHSIC) on synthetic and real data from proteins quantified in human serum by mass spectrometry in selected reaction monitoring mode.

20.
Cholangiocarcinoma is one of the deadliest malignancies worldwide. Recent studies reported that treatment with gemcitabine was effective in prolonging survival. However, as the treatment only benefited a limited subset of patients, selection of patients before treatment is required. To discover biomarkers predictive of the response to gemcitabine treatment in cholangiocarcinoma, we examined the proteome of three types of material source: ten cell lines, nine xenografts and nine surgically resected primary tumors from patients who exhibited different responses to gemcitabine treatment. Two-dimensional difference gel electrophoresis generated quantitative protein expression profiles including 3571 protein spots. We detected 172 protein spots with significant correlation with response to gemcitabine treatment. All proteins corresponding to these 172 protein spots were identified by mass spectrometry. We found that the macrophage-capping protein (CapG) was associated with response to gemcitabine treatment in all three types of material source. Immunohistochemical validation in an additional set of 196 cholangiocarcinoma cases revealed that CapG expression was associated with lymphatic invasion status and overall survival. Multivariate analysis showed that CapG protein expression was an independent prognostic factor for overall survival. In conclusion, CapG was identified as a novel candidate biomarker to predict response to gemcitabine treatment and survival in cholangiocarcinoma.
