Similar literature
 20 similar documents found (search time: 343 ms)
1.
MOTIVATION: Classification is widely used in medical applications. However, the quality of the classifier depends critically on the accurate labeling of the training data, and for many medical applications, labeling a sample or grading a biopsy can be subjective. Existing studies confirm this phenomenon and show that even a very small number of mislabeled samples can severely degrade the performance of the resulting classifier, particularly when the sample size is small. The problem we address in this paper is to develop a method for automatically detecting samples that are possibly mislabeled. RESULTS: We propose two algorithms, a classification-stability algorithm and a leave-one-out-error-sensitivity algorithm, for detecting possibly mislabeled samples. For both algorithms, the key structure is the computation of the leave-one-out perturbation matrix. The classification-stability algorithm is based on measuring the stability of the label of a sample with respect to label changes of other samples, and the version of this algorithm based on the support vector machine appears to be quite accurate for three real datasets. The suspect list produced by this version is of high quality. Furthermore, when human intervention is not available, the correction heuristic appears to be beneficial.
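As a rough illustration of the classification-stability idea in this abstract, the Python sketch below flips the label of every other sample in turn, retrains with the sample of interest left out, and records how often the retrained classifier disagrees with that sample's recorded label. It is a toy under stated assumptions, not the authors' algorithm: the nearest-centroid classifier, the one-dimensional features, the binary labels, and the names `centroid_predict` and `suspicion_scores` are all hypothetical choices made here, whereas the abstract reports results for a support vector machine version.

```python
from statistics import mean


def centroid_predict(train_x, train_y, x):
    """Nearest-centroid prediction for 1-D features and labels in {0, 1}."""
    centroids = {}
    for label in (0, 1):
        members = [v for v, y in zip(train_x, train_y) if y == label]
        if members:  # skip a class that has no training samples
            centroids[label] = mean(members)
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def suspicion_scores(xs, ys):
    """For each sample i, return the fraction of single-label flips of the
    other samples under which a classifier trained without i disagrees
    with the recorded label of i (assumed scoring rule, for illustration)."""
    n = len(xs)
    scores = []
    for i in range(n):
        disagreements = 0
        for j in range(n):
            if j == i:
                continue
            flipped = list(ys)
            flipped[j] = 1 - flipped[j]          # perturb one other label
            train_x = xs[:i] + xs[i + 1:]        # leave sample i out
            train_y = flipped[:i] + flipped[i + 1:]
            if centroid_predict(train_x, train_y, xs[i]) != ys[i]:
                disagreements += 1
        scores.append(disagreements / (n - 1))
    return scores


# Hypothetical data: the last sample sits in the class-0 cluster but is labeled 1.
xs = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 0.15]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
scores = suspicion_scores(xs, ys)
suspect = scores.index(max(scores))
```

Samples with a high score are candidates for a suspect list; how such a list is thresholded or corrected is left open here.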

2.
MOTIVATION: Accurate diagnosis and prediction cannot be achieved unless the disease subtype status of every training sample used in the supervised learning step is accurately known. Such an assumption requires the existence of a perfect tool for disease diagnosis and classification, which is seldom available in the majority of cases. Thus, the supervised learning step has to be conducted with a statistical model that contemplates and handles potential mislabeling in the input data. RESULTS: A procedure for handling potential mislabeling among training samples in the prediction of disease subtypes using gene expression data was proposed. A real data-based simulation study of the estrogen receptor status (ER+/ER-) of breast cancer patients was conducted. The results demonstrated that when 1-4 training samples (N = 30) were artificially mislabeled, the proposed method was able not only to correct the ER status of the mislabeled training samples but also, more importantly, to predict the ER status of validation samples as well as when the 'true' training data were used.

3.
Machine learning of functional class from phenotype data   Total citations: 5, self-citations: 0, citations by others: 5
MOTIVATION: Mutant phenotype growth experiments are an important novel source of functional genomics data which have received little attention in bioinformatics. We applied supervised machine learning to the problem of using phenotype data to predict the functional class of Open Reading Frames (ORFs) in Saccharomyces cerevisiae. Three sources of data were used: TRansposon-Insertion Phenotypes, Localization and Expression in Saccharomyces (TRIPLES), European Functional Analysis Network (EUROFAN) and Munich Information Center for Protein Sequences (MIPS). The analysis of the data presented a number of challenges to machine learning: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large number of missing values. We modified the algorithm C4.5 to deal with these problems. RESULTS: Rules were learnt which are accurate and biologically meaningful. The rules predict the function of 83 ORFs of unknown function at an estimated accuracy of ≥ 80%.

4.
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples, when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores properties of the maximum likelihood parameter estimates and the estimates of the number of mislabeled observations.
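To make the "probability of true group membership among those classified as controls" concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each true case is mislabeled as a control with a constant probability `mislabel_rate` that does not depend on the covariates, and that controls are never mislabeled as cases; the abstract does not state the authors' exact parameterization, and the function names are hypothetical.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_labeled_case(p_case, mislabel_rate):
    """P(labeled case | x) under the assumed mechanism: a 'defective'
    logistic curve whose upper asymptote is 1 - mislabel_rate."""
    return (1.0 - mislabel_rate) * p_case


def prob_case_given_labeled_control(p_case, mislabel_rate):
    """Bayes' rule: P(true case | labeled control, x)."""
    labeled_control = 1.0 - prob_labeled_case(p_case, mislabel_rate)
    if labeled_control == 0.0:
        raise ValueError("no one is labeled control at this covariate value")
    return mislabel_rate * p_case / labeled_control


# Hypothetical numbers: true-case probability 0.5, 20% of cases mislabeled.
p = logistic(0.0)
posterior = prob_case_given_labeled_control(p, 0.2)
```

With these hypothetical inputs, one in six subjects labeled as controls at this covariate value would be a true case; in practice both quantities would come from fitted parameters rather than being fixed by hand.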

5.
Some African grey parrots (Psittacus erithacus), the most famous being Pepperberg's parrot Alex, are able to imitate human speech and produce labels referentially. In this study, the aim was to teach ten African grey parrots from two laboratories to label items. Training three parrots from the first laboratory for several months with the Model/Rival method, developed by Pepperberg, in which two humans interact in front of the subject to demonstrate the use of a label, led to disappointing results. Similarly, seven parrots from the second laboratory, having been trained with several variants of Model/Rival, attained little success. After the informal observation of the efficiency of other methods (i.e. learning to imitate labels either spontaneously or with specific learning methods and use of these labels referentially), four different teaching methods were tested with two birds: Model/Rival; Repetition/Association, which consisted of repeating a label and presenting the item only when the parrot produced the label; Intuitive, in which the experimenter handled an item and repeated its name in front of the subject; and Diffusion, in which labels with either variable or flat intonation were played back daily to the parrots. One bird learned three labels, one of which was used referentially, with the Repetition/Association method. He learned one label non-referentially with Model/Rival, but no labels were acquired using the other methods. The second bird did not learn any labels. This study demonstrates that different methods can be effective for teaching labels referentially, and it suggests that rearing conditions and interindividual variability are important features when assessing the learning ability of African grey parrots.

6.
In biomarker discovery studies, uncertainty associated with case and control labels is often overlooked. When label uncertainty is not taken into account, model parameters and the predictive risk can become biased, sometimes severely. The most common situation is when the control set contains an unknown number of undiagnosed, or future, cases. This has a marked impact in situations where the model needs to be well-calibrated, e.g., when the prediction performance of a biomarker panel is evaluated. Failing to account for class label uncertainty may lead to underestimation of classification performance and bias in parameter estimates. This can further affect meta-analyses that combine evidence from multiple studies. Using a simulation study, we outline how conventional statistical models can be modified to address class label uncertainty, leading to well-calibrated prediction performance estimates and reduced bias in meta-analysis. We focus on the problem of mislabeled control subjects in case-control studies, i.e., when some of the control subjects are undiagnosed cases, although the procedures we report are generic. The uncertainty in control status is a particular situation common in biomarker discovery studies in the context of genomic and molecular epidemiology, where control subjects are commonly sampled from the general population with an established expected disease incidence rate.

7.
Boosting for tumor classification with gene expression data   Total citations: 7, self-citations: 0, citations by others: 7
MOTIVATION: Microarray experiments generate large datasets with expression values for thousands of genes but not more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is by using boosting in conjunction with decision trees. RESULTS: We demonstrate that the generic boosting algorithm needs some modification to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure and a new approach for multi-categorical problems. This allows for a slight to drastic increase in performance and yields competitive results on several publicly available datasets. AVAILABILITY: Software for the modified boosting algorithms as well as for decision trees is available for free in R at http://stat.ethz.ch/~dettling/boosting.html.

8.
We propose a novel strategy for incorporating hierarchical supervised label information into nonlinear dimensionality reduction techniques. Specifically, we extend t-SNE, UMAP, and PHATE to include known or predicted class labels and demonstrate the efficacy of our approach on multiple single-cell RNA sequencing datasets. Our approach, “Haisu,” is applicable across domains and methods of nonlinear dimensionality reduction. In general, the mathematical effect of Haisu can be summarized as a variable perturbation of the high dimensional space in which the original data is observed. We thereby preserve the core characteristics of the visualization method and only change the manifold to respect known or assumed class labels when provided. Our strategy is designed to aid in the discovery and understanding of underlying patterns in a dataset that is heavily influenced by parent-child relationships. We show that using our approach can also help in semi-supervised settings where labels are known for only some datapoints (for instance when only a fraction of the cells are labeled). In summary, Haisu extends existing popular visualization methods to enable a user to incorporate labels known a priori into a visualization, including their hierarchical relationships as defined by a user input graph.

9.
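OBJECTIVE: Category decision-making is one of the important modes of human cognition. Context has an important influence on the accuracy of category decisions, but it has not received sufficient attention and has been little studied. METHODS: University students were recruited as participants, and event-related potentials (ERPs) were used to examine the role of contextual labels at the electrophysiological level. RESULTS: Classification based on contextual labels activated more brain regions; with contextual labels, participants showed shorter latencies and made faster and more accurate judgments. CONCLUSION: Contextual labels play an important role in category decision-making, and they can improve category decision performance.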

10.
Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. In order to avoid bias in data interpretation, it would be advantageous for the data analysis to include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret the changes related to unidentified peaks. In this paper, we address the challenge by predicting the class membership of unknown peaks by applying and comparing multiple supervised classifiers to selected lipidomics datasets. The employed classifiers include k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes methods, which are known to be effective and efficient in predicting the labels for unseen data. Here, the class label predictions are sought for unidentified lipid profiles coming from high throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that the k-NN and SVM classifiers outperform both the PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst among all models, and this observation seems logical as lipids are highly co-regulated and do not respect the Naive Bayes assumption that features are conditionally independent given the class. Common label predictions from k-NN and SVM can serve as a good starting point to explore the full data and thereby facilitate exploratory studies where label information is critical for the data interpretation.

11.
Scientists are using acoustic monitoring to assess the impact of altered soundscapes on wildlife communities and human systems. In the soundscape ecology field, monitoring and analysis approaches rely on the interdisciplinary intersection of ecology, acoustics, and computer science. Combining the theory and practice of each field in the context of Knowledge Discovery in Databases (KDD), soundscape ecologists provide innovative monitoring solutions for ecologically-driven research questions. We propose a soundscape content analysis framework for improved knowledge outcomes with the assistance of the new multi-label (ML) concept. Here, we investigated the effectiveness of an ML k-nearest neighbor algorithm (ML-kNN) for labeling concurrent soundscape components within a single recording. We manually labeled 1200 field recordings for the presence of soundscape components and extracted ecological acoustic features, audio profile features, and Gaussian-mixture model features for each recording. Then, we tested the accuracy of the ML-kNN algorithm with well-established metrics adapted to ML learning. We found that seventeen unique acoustic features could predict a set of biophonic, geophonic, and anthrophonic labels for a single field recording with an average precision of 0.767. However, certain labels were predicted incorrectly depending on the time of day and the co-occurrence of that label with another label, suggesting that further refinement is needed to improve the accuracy of predicted labels. Overall, this ML classification approach could enable researchers to label field recordings more quickly and generate an “alert” system for monitoring changes in a specific sound class. Ultimately, the adaptation of the ML algorithm may provide soundscape ecologists with new metadata labels that are searchable in large databases of soundscape field recordings.

12.
13.

Background  

Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies, for example those aiming to detect clinical marker genes, the conditions have not necessarily been well controlled, and condition labels are sometimes hard to obtain due to physical, financial, and time costs. In such a situation, we can consider an unsupervised case where labels are not available or a semi-supervised case where labels are available for a part of the whole sample set, rather than the well-studied supervised case where all samples have their labels.

14.
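Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1
陈鹤  陶晔  毛振镀  邢鹏 《微生物学报》 (Acta Microbiologica Sinica) 2022, 62(12): 4646-4662
Microorganisms are ubiquitous in the environment. They are not only key participants in biogeochemical cycles and environmental evolution but also play important roles in environmental monitoring, ecological remediation, and conservation. With the development of high-throughput technologies, large volumes of microbial data have been generated. Using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains models on microbial datasets that have both features and labels, so that labels can be inferred for data that have features but no labels, enabling classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. Machine learning models can often be regarded as "black boxes" that predict a particular outcome, meaning that little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and to improve our ability to extract valuable microbial information, it is particularly important to understand machine learning algorithms in depth and to improve model interpretability. This review introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models based on microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also surveys applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environments, discusses approaches to improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.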

15.
The classification of jobs or workers by exposure is an important undertaking in any occupational epidemiological study. Hitherto, exposure classification designs have been strongly motivated by a desire to generate a sufficient number of exposure classes for the determination of a potential exposure-response relationship. Thus, the partitioning of exposures has been more or less arbitrary. The misclassification problems created by the selection of an arbitrary number of exposure assignment classes have not been addressed. In any quantitative exposure classification scheme, specific job titles may be indistinguishable in existing employment records; therefore, between-worker variability must be addressed when characterizing worker exposures. Also, industrial hygiene exposure measurements frequently used to characterize worker exposures are often treated as valid representations of exposures, but they are neither random nor systematic evaluations of worker exposures. As a result, they do not represent sampling from the proper exposure stratification of workers. These observations suggest that the selection of exposure groups should be based on a more rigorous examination of the data and its limitations. Considering the probability of any given worker being placed into the proper class as the probability of finding the mean exposure for that worker within the class boundary, general equations are developed to quantify the misclassification rates for any classification design, as well as the exposure class limits and their widths for any acceptable misclassification rate. If between-worker variability cannot be calculated from the available exposure measurements, then it might be estimated from suitable data compiled from the literature. By considering an acceptable level of exposure misclassification, it is possible to calculate the allowable number of exposure classes and the proper partitioning ratio for these classes. Thus, the trade-off between misclassification and the number of exposure classes might be a satisfactory solution to this difficulty encountered in occupational epidemiology.

16.
Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
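For readers unfamiliar with the LGC method that this abstract names as the starting point for MCSL, the following Python sketch shows the standard LGC label-propagation iteration on a tiny graph. It illustrates plain LGC only, not MCSL: the interclass-consistency term is not modeled, and the four-node chain, the value of `alpha`, and the function name `lgc_propagate` are assumptions made for this example.

```python
import math


def lgc_propagate(weights, initial_labels, alpha=0.9, iterations=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y, where
    S = D^(-1/2) W D^(-1/2) is the symmetrically normalized adjacency."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    s = [
        [
            weights[i][j] / math.sqrt(degrees[i] * degrees[j]) if weights[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    n_classes = len(initial_labels[0])
    f = [row[:] for row in initial_labels]
    for _ in range(iterations):
        f = [
            [
                alpha * sum(s[i][j] * f[j][c] for j in range(n))
                + (1.0 - alpha) * initial_labels[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return f


# Hypothetical chain graph 0 - 1 - 2 - 3; only the two end nodes are labeled.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
F = lgc_propagate(W, Y)
predicted = [max(range(2), key=row.__getitem__) for row in F]
```

Each unlabeled node takes the class of the nearer labeled end of the chain, which is the smoothness behavior that the abstract's guiding intuition relies on.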

17.
Coherent anti-Stokes Raman scattering (CARS) is an emerging tool for label-free characterization of living cells. Here, unsupervised multivariate analysis of CARS datasets was used to visualize the subcellular compartments. In addition, a supervised learning algorithm, using the “random forest” ensemble learning method as a classifier, was trained with CARS spectra using immunofluorescence images as a reference. The supervised classifier was then used, to our knowledge for the first time, to automatically identify lipid droplets, nucleus, nucleoli, and endoplasmic reticulum in datasets that were not used for training. These four subcellular components were monitored simultaneously and label-free, instead of using several fluorescent labels. These results open new avenues for label-free time-resolved investigation of subcellular components in different cells, especially cancer cells.

19.
The label switching problem occurs as a result of the nonidentifiability of the posterior distribution over various permutations of component labels when a Bayesian approach is used to estimate parameters in mixture models. For cases where the number of components is fixed and known, we propose a relabelling algorithm, an allocation variable-based probabilistic relabelling approach (denoted AVP), to deal with the label switching problem. We establish a model for the posterior distribution of allocation variables under the label switching phenomenon. The AVP algorithm stochastically relabels the posterior samples according to the posterior probabilities of the established model. Some existing deterministic and other probabilistic algorithms are compared with the AVP algorithm in simulation studies, and the success of the proposed approach is demonstrated in simulation studies and on a real dataset.

20.
A genotype calling algorithm for Affymetrix SNP arrays   Total citations: 11, self-citations: 0, citations by others: 11
MOTIVATION: A classification algorithm, based on a multi-chip, multi-SNP approach, is proposed for Affymetrix SNP arrays. Current procedures for calling genotypes on SNP arrays process all the features associated with one chip and one SNP at a time. Using a large training sample where the genotype labels are known, we develop a supervised learning algorithm to obtain more accurate classification results on new data. The method we propose, RLMM, is based on a robustly fitted, linear model and uses the Mahalanobis distance for classification. The chip-to-chip non-biological variance is reduced through normalization. This model-based algorithm captures the similarities across genotype groups and probes, as well as across thousands of SNPs, for accurate classification. In this paper, we apply RLMM to Affymetrix 100 K SNP array data, present classification results and compare them with genotype calls obtained from the Affymetrix procedure DM, as well as with the publicly available genotype calls from the HapMap project.
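The classification step mentioned in this abstract, assigning a sample to the genotype group with the smallest Mahalanobis distance, can be sketched in Python as follows. This is not RLMM itself: the robust linear model fit and the normalization are omitted, and the two-dimensional allele-intensity features, the cluster means and covariances, and the names `mahalanobis_sq` and `call_genotype` are hypothetical values chosen for the example.

```python
def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance in 2-D, using the closed-form
    inverse of a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # inverse(cov) = (1 / det) * [[d, -b], [-c, a]]
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def call_genotype(point, clusters):
    """Return the genotype whose cluster is nearest in Mahalanobis distance."""
    return min(clusters, key=lambda g: mahalanobis_sq(point, *clusters[g]))


# Hypothetical cluster parameters: (mean, covariance) per genotype group,
# with features (allele A intensity, allele B intensity).
clusters = {
    "AA": ((2.0, 0.0), ((0.1, 0.0), (0.0, 0.1))),
    "AB": ((1.0, 1.0), ((0.1, 0.0), (0.0, 0.1))),
    "BB": ((0.0, 2.0), ((0.1, 0.0), (0.0, 0.1))),
}
genotype = call_genotype((1.8, 0.2), clusters)
```

In a supervised setting such as the one described, the means and covariances would be estimated from training samples with known genotype labels rather than set by hand.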
