Similar literature
Found 20 similar articles (search time: 359 ms)
1.
Yasui Y  Pepe M  Hsu L  Adam BL  Feng Z 《Biometrics》2004,60(1):199-206
Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that a tissue that shows itself to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnosis certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning. The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.

2.
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores properties of the maximum likelihood parameter estimates and the estimates of the number of mislabeled observations.
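The last step described in this abstract, estimating the probability of true group membership among labeled controls, can be illustrated with a small sketch. This is not the authors' parameterization, which the abstract does not give. It assumes a single predictor x, a true disease probability expit(alpha + beta * x), and a constant mislabeling rate (miss_rate) that does not depend on x; the function and parameter names are hypothetical.

```python
import math


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_true_case_given_control_label(x, alpha, beta, miss_rate):
    """P(truly diseased | labeled control, x) by Bayes' rule.

    Assumption: a diseased individual is labeled as a control with
    constant probability miss_rate; healthy individuals are never
    labeled as cases.
    """
    p = expit(alpha + beta * x)              # P(truly diseased | x)
    labeled_case = (1.0 - miss_rate) * p     # "defective" curve: never reaches 1
    return miss_rate * p / (1.0 - labeled_case)


# With p = 0.5 and a 20% miss rate, one labeled control in six is a true case.
example = prob_true_case_given_control_label(0.0, 0.0, 0.0, 0.2)
```

In practice alpha, beta and miss_rate would come from a maximum likelihood fit; here they are simply supplied by hand.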

3.
The aim of this study was to investigate whether a machine learning algorithm utilizing triaxial accelerometer, gyroscope, and magnetometer data from an inertial measurement unit (IMU) could detect surface- and age-related differences in walking. Seventeen older (71.5 ± 4.2 years) and eighteen young (27.0 ± 4.7 years) healthy adults walked over flat and uneven brick surfaces wearing an IMU over the L5 vertebra. IMU data were binned into smaller data segments using 4-s sliding windows with 1-s step lengths. Ninety percent of the data were used as training inputs and the remaining ten percent were saved for testing. A deep learning network with long short-term memory units was used for training (fully supervised), prediction, and implementation. Four models were trained using the following inputs: all nine channels from every sensor in the IMU (fully trained model), accelerometer signals alone, gyroscope signals alone, and magnetometer signals alone. The fully trained models for surface and age outperformed all other models (area under the receiver operating characteristic curve, AUC = 0.97 and 0.96, respectively; p ≤ .045). The fully trained models for surface and age had high accuracy (96.3, 94.7%), precision (96.4, 95.2%), recall (96.3, 94.7%), and f1-score (96.3, 94.6%). These results demonstrate that processing the signals of a single IMU device with machine-learning algorithms enables the detection of surface conditions and age-group status from an individual’s walking behavior which, with further learning, may be utilized to facilitate identifying and intervening on fall risk.
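The segmentation step in this abstract (4-s windows advanced in 1-s steps) can be sketched as below. The sampling rate is not stated in the abstract, so rate_hz is an assumed parameter, and the function name is hypothetical. Only complete windows are kept in this sketch.

```python
def sliding_windows(samples, rate_hz, window_s=4.0, step_s=1.0):
    """Split a sequence of samples into overlapping fixed-length windows."""
    size = int(round(window_s * rate_hz))   # samples per window
    step = int(round(step_s * rate_hz))     # samples between window starts
    if size <= 0 or step <= 0:
        raise ValueError("window and step must cover at least one sample")
    # Stop at the last start index that still yields a full window.
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


# Toy signal: 10 samples at an assumed 1 Hz gives 7 windows of 4 samples each.
windows = sliding_windows(list(range(10)), rate_hz=1)
```

Each window would then be one input sequence to the classifier, with all nine IMU channels stacked per time step.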

4.
An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method can provide an effective framework for integrating additional biological and computational resources for better disease gene predictions.

5.

Background  

A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on performing multivariate ordination of groups, proved to be very efficient both for classification of samples into pre-defined groups and for disease class prediction of new unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes, and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the prediction power of BGA.

6.
7.
In this paper, I describe a set of procedures that automate forest disturbance mapping using a pair of Landsat images. The approach is built on the traditional pair-wise change detection method, but is designed to extract training data without user interaction and uses a robust classification algorithm capable of handling incorrectly labeled training data. The steps in this procedure include: i) creating masks for water, non-forested areas, clouds, and cloud shadows; ii) identifying training pixels whose value is above or below a threshold defined by the number of standard deviations from the mean value of the histograms generated from local windows in the short-wave infrared (SWIR) difference image; iii) filtering the original training data through a number of classification algorithms using an n-fold cross validation to eliminate mislabeled training samples; and finally, iv) mapping forest disturbance using a supervised classification algorithm. When applied to 17 Landsat footprints across the U.S. at five-year intervals between 1985 and 2010, the proposed approach produced forest disturbance maps with 80 to 95% overall accuracy, comparable to those obtained from traditional approaches to forest change detection. The primary sources of misclassification errors included inaccurate identification of forests (errors of commission), issues related to the land/water mask, and clouds and cloud shadows missed during image screening. The approach requires images from the peak growing season, at least for the deciduous forest sites, and cannot readily distinguish forest harvest from natural disturbances or other types of land cover change. The accuracy of detecting forest disturbance diminishes with the number of years between the images that make up the image pair. Nevertheless, the relatively high accuracies, little or no user input needed for processing, speed of map production, and simplicity of the approach make the new method especially practical for forest cover change analysis over very large regions.
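Step iii of the procedure, removing training samples whose labels disagree with cross-validated predictions, can be sketched as follows. The abstract mentions several classification algorithms but does not name them, so this sketch swaps in a single nearest-centroid classifier on one feature as a stand-in. The function names, the interleaved fold assignment and the toy data are all assumptions.

```python
from statistics import mean


def nearest_centroid_fit(values, labels):
    """Return {label: mean feature value}; a stand-in for a real classifier."""
    return {c: mean(v for v, l in zip(values, labels) if l == c)
            for c in set(labels)}


def nearest_centroid_predict(centroids, value):
    """Predict the label whose centroid is closest to value."""
    return min(centroids, key=lambda c: abs(centroids[c] - value))


def cross_validation_filter(values, labels, n_folds=5):
    """Keep indices whose label matches a model trained on the other folds."""
    keep = []
    for fold in range(n_folds):
        train = [i for i in range(len(values)) if i % n_folds != fold]
        held_out = [i for i in range(len(values)) if i % n_folds == fold]
        centroids = nearest_centroid_fit([values[i] for i in train],
                                         [labels[i] for i in train])
        keep += [i for i in held_out
                 if nearest_centroid_predict(centroids, values[i]) == labels[i]]
    return sorted(keep)


# Toy SWIR-difference values; index 6 is deliberately mislabeled.
values = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95, 0.12, 0.97, 0.11, 0.93]
labels = ["undisturbed"] * 3 + ["disturbed"] * 5 + ["undisturbed", "disturbed"]
kept = cross_validation_filter(values, labels)  # index 6 is filtered out
```

With several classifiers, a sample would typically be dropped only when a majority (or all) of them disagree with its label; that voting rule is a design choice the abstract leaves open.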

8.

Background

A tremendous amount of effort has been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take the cluster structure into consideration.

Results

We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data.

Conclusion

We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods.

9.
MOTIVATION: Many practical tasks in biomedicine require accessing specific types of information in scientific literature, e.g. information about the methods, results or conclusions of the study in question. Several approaches have been developed to identify such information in scientific journal articles. The best of these have yielded promising results and proved useful for biomedical text mining tasks. However, because they rely on fully supervised machine learning (ML) and a large body of annotated data, existing approaches are expensive to develop and port to different tasks. A potential solution to this problem is to employ weakly supervised learning instead. In this article, we investigate a weakly supervised approach to identifying information structure according to a scheme called Argumentative Zoning (AZ). We apply four weakly supervised classifiers to biomedical abstracts and evaluate their performance both directly and in a real-life scenario in the context of cancer risk assessment. RESULTS: Our best weakly supervised classifier (based on the combination of active learning and self-training) performs well on the task, outperforming our best supervised classifier: it yields a high accuracy of 81% when just 10% of the labeled data is used for training. When cancer risk assessors are presented with the resulting annotated abstracts, they find relevant information in them significantly faster than when presented with unannotated abstracts. These results suggest that weakly supervised learning could be used to improve the practical usefulness of information structure for real-life tasks in biomedicine.

10.
Hong H  Tong W  Perkins R  Fang H  Xie Q  Shi L 《DNA and cell biology》2004,23(10):685-694
The wealth of knowledge embedded in gene expression data from DNA microarrays portends rapid advances in both research and the clinic. Turning the prodigious and noisy data into knowledge is a challenge to the field of bioinformatics, and development of classifiers using supervised learning techniques is the primary methodological approach for clinical application using gene expression data. In this paper, we present a novel classification method, multiclass Decision Forest (DF), that is the direct extension of the two-class DF previously developed in our lab. Central to DF is the synergistic combining of multiple heterogeneous but comparable decision trees to reach a more accurate and robust classification model. The computationally inexpensive multiclass DF algorithm integrates gene selection and model development, and thus eliminates the bias of gene preselection in crossvalidation. Importantly, the method provides several statistical means for assessment of prediction accuracy, prediction confidence, and diagnostic capability. We demonstrate the method by application to gene expression data for 83 small round blue-cell tumor (SRBCT) samples, each belonging to one of four different classes. Based on 500 runs of 10-fold crossvalidation, tumor prediction accuracy was approximately 97%, sensitivity was approximately 95%, diagnostic sensitivity was approximately 91%, and diagnostic accuracy was approximately 99.5%. Among 25 genes selected to distinguish tumor class, 12 have functional information in the literature implicating their involvement in cancer. The four types of SRBCT samples are also distinguishable in a clustering analysis based on the expression profiles of these 25 genes. The results demonstrated that the multiclass DF is an effective classification method for analysis of gene expression data for the purpose of molecular diagnostics.

11.
The classification methodology based on morphometric data and supervised artificial neural networks (ANN) was tested on five fly species of the parasitoid genera Tachina and Ectophasia (Diptera, Tachinidae). Objects were initially photographed, then digitized; subsequently, the picture was scaled and measured by means of an image analyser. The 16 variables used for classification included the length of different wing veins or their parts and the width of antennal segments. Sex was found to have some influence on the data and was included in the study as another input variable. Better and more reliable classification was obtained when data from both the right and left wings were entered, although the data from one wing were found to be sufficient. The prediction success (correct identification of unknown test samples) varied from 88 to 100% throughout the study, depending especially on the number of specimens in the training set. Classification of the studied Diptera species using ANN is possible provided that a sufficiently high number (tens) of specimens of each species is available for the ANN training. The methodology proposed is quite general and can be applied to all biological objects where it is possible to define adequate diagnostic characters and create the appropriate database.

12.
Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques which are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model. The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two proposed algorithms in this paper are: 1) The GA-FAMR algorithm, which is new, consists of two stages: a) During the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data. This improves the generalization capability of the FAMR. b) In the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR is derived from a known algorithm. Instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4], [5]. We conclude that when inferring from small training sets, both techniques are efficient, in terms of generalization capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.

13.
MOTIVATION: Class distinction is a supervised learning approach that has been successfully employed in the analysis of high-throughput gene expression data. Identification of a set of genes that predicts differential biological states allows for the development of basic and clinical scientific approaches to the diagnosis of disease. The Independent Consistent Expression Discriminator (ICED) was designed to provide a more biologically relevant search criterion during predictor selection by embracing the inherent variability of gene expression in any biological state. The four components of ICED include (i) normalization of raw data; (ii) assignment of weights to genes from both classes; (iii) counting of votes to determine the optimal number of predictor genes for class distinction; (iv) calculation of prediction strengths for classification results. The search criterion employed by ICED is designed to identify not only genes that are consistently expressed at one level in one class and at a consistently different level in another class but also genes that are variable in one class and consistent in another. The result is a novel approach to accurately select biologically relevant predictors of differential disease states from a small number of microarray samples. RESULTS: The data described herein utilized ICED to analyze the large AML/ALL training and test data set (Golub et al., 1999, Science, 286, 531-537) in addition to a smaller data set consisting of an animal model of the childhood neurodegenerative disorder, Batten disease, generated for this study. Both of the analyses presented herein have correctly predicted biologically relevant perturbations that can be used for disease classification, irrespective of sample size. Furthermore, the results have provided candidate proteins for future study in understanding the disease process and the identification of potential targets for therapeutic intervention.

14.
MOTIVATION: Classification is widely used in medical applications. However, the quality of the classifier depends critically on the accurate labeling of the training data, and for many medical applications, labeling a sample or grading a biopsy can be subjective. Existing studies confirm this phenomenon and show that even a very small number of mislabeled samples could deeply degrade the performance of the obtained classifier, particularly when the sample size is small. The problem we address in this paper is to develop a method for automatically detecting samples that are possibly mislabeled. RESULTS: We propose two algorithms, a classification-stability algorithm and a leave-one-out-error-sensitivity algorithm, for detecting possibly mislabeled samples. For both algorithms, the key structure is the computation of the leave-one-out perturbation matrix. The classification-stability algorithm is based on measuring the stability of the label of a sample with respect to label changes of other samples, and the version of this algorithm based on the support vector machine appears to be quite accurate for three real datasets. The suspect list produced by this version is of high quality. Furthermore, when human intervention is not available, the correction heuristic appears to be beneficial.

15.
16.
17.
Rahman ME  Islam R  Islam S  Mondal SI  Amin MR 《Genomics》2012,99(4):189-194
MicroRNA (miRNA) is a special class of short noncoding RNA that serves the pivotal function of regulating gene expression. The computational prediction of new miRNA candidates involves various methods, such as learning methods and methods using expression data. This article proposes a reliable model, miRANN, which is a supervised machine learning approach. MiRANN uses known pre-miRNAs as the positive set and a novel negative set from human CDS regions. The set of known miRNAs is now large and diverse enough to cover almost all characteristics of unknown miRNAs, which increases the quality of the result (99.9% accuracy, 99.8% sensitivity, 100% specificity) and provides a more reliable prediction. MiRANN performs better than other state-of-the-art approaches and is presented as a highly promising tool for predicting novel miRNAs. We have also tested our result using a previous negative set. MiRANN opens new ground in using ANN for predicting pre-miRNAs, with a promise of better performance.

18.
Analysis of cellular phenotypes in large imaging data sets conventionally involves supervised statistical methods, which require user-annotated training data. This paper introduces an unsupervised learning method, based on temporally constrained combinatorial clustering, for automatic prediction of cell morphology classes in time-resolved images. We applied the unsupervised method to diverse fluorescent markers and screening data and validated accurate classification of human cell phenotypes, demonstrating fully objective data labeling in image-based systems biology.

19.
20.
Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell-receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.
