Similar literature
1.
A recently proposed optimal Bayesian classification paradigm addresses optimal error rate analysis for small-sample discrimination, including optimal classifiers, optimal error estimators, and error estimation analysis tools with respect to the probability of misclassification under binary classes. Here, we address multi-class problems and optimal expected risk with respect to a given risk function, which are common settings in bioinformatics. We present Bayesian risk estimators (BRE) under arbitrary classifiers, the mean-square error (MSE) of arbitrary risk estimators under arbitrary classifiers, and optimal Bayesian risk classifiers (OBRC). We provide analytic expressions for these tools under several discrete and Gaussian models and present a new methodology to approximate the BRE and MSE when analytic expressions are not available. Of particular note, we present analytic forms for the MSE under Gaussian models with homoscedastic covariances, which are new even in binary classification.

2.
MOTIVATION: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean. RESULTS: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors. AVAILABILITY: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
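The recommendation to report class error rates together with a conditional risk can be illustrated with a short sketch. The abstract does not give a formula, so the code below assumes the textbook expected-cost definition (prior of the true class, times the probability of each predicted class, times the cost of that outcome). The function names, the confusion matrix, the priors and the costs are all hypothetical values chosen for illustration, not figures from the study.

```python
def class_error_rates(confusion):
    """Per-class error rates; confusion[i][j] counts class-i samples predicted as class j."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append((total - row[i]) / total)
    return rates


def conditional_risk(confusion, priors, costs):
    """Expected cost: sum_i prior_i * sum_j P(predict j | true i) * costs[i][j].

    Assumed definition (standard expected misclassification cost); the
    abstract itself does not spell out the formula.
    """
    risk = 0.0
    for i, row in enumerate(confusion):
        total = sum(row)
        for j, count in enumerate(row):
            risk += priors[i] * (count / total) * costs[i][j]
    return risk


# Hypothetical two-class example: class 0 = non-diseased, class 1 = diseased.
confusion = [[40, 10],
             [5, 45]]
priors = [0.9, 0.1]          # assumed population class probabilities
costs = [[0, 1],             # a false positive costs 1
         [5, 0]]             # a false negative costs 5

print(class_error_rates(confusion))
print(conditional_risk(confusion, priors, costs))
```

Under these made-up numbers the overall error rate on the balanced sample (15%) differs from the risk once priors and costs are taken into account, which is the kind of difference the abstract suggests reporting.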

3.
This paper addresses the question of maximizing classifier accuracy for classifying task-related mental activity from magnetoencephalography (MEG) data. We propose the use of different sources of information and introduce an automatic channel selection procedure. To determine an informative set of channels, our approach combines a variety of machine learning algorithms: feature subset selection methods, classifiers based on regularized logistic regression, information fusion, and multiobjective optimization based on probabilistic modeling of the search space. The experimental results show that our proposal is able to improve classification accuracy compared to approaches whose classifiers use only one type of MEG information or for which the set of channels is fixed a priori.

4.
The ribosomal rRNA genes are widely used as genetic markers for taxonomic identification of microbes. The small subunit (SSU; 16S/18S) rRNA gene in particular is frequently used for species‐ or genus‐level identification, but the large subunit (LSU; 23S/28S) rRNA gene is also employed in taxonomic assignment. The Metaxa software tool is a popular utility for extracting partial rRNA sequences from large sequencing data sets and assigning them to an archaeal, bacterial, nuclear eukaryote, mitochondrial or chloroplast origin. This study describes Metaxa2, a comprehensive update to Metaxa that extends the capabilities of the tool, introducing support for the LSU rRNA gene, a greatly improved classifier allowing classification down to genus or species level, as well as enhanced support for short‐read (100 bp) and paired‐end sequences, among other changes. The performance of Metaxa2 was compared to other commonly used taxonomic classifiers, showing that Metaxa2 often outperforms previous methods in terms of making correct predictions while maintaining a low misclassification rate. Metaxa2 is freely available from http://microbiology.se/software/metaxa2/ .

5.
Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high quality knowledge for the ensemble classifier and improve classification performance.

6.
Bayesian networks are knowledge representation tools that model the (in)dependency relationships among variables for probabilistic reasoning. Classification with Bayesian networks aims to compute the class with the highest probability given a case. This special kind is referred to as Bayesian network classifiers. Since learning the Bayesian network structure from a dataset can be viewed as an optimization problem, heuristic search algorithms may be applied to build high-quality networks in medium- or large-scale problems, as exhaustive search is often feasible only for small problems. In this paper, we present our new algorithm, ABC-Miner, and propose several extensions to it. ABC-Miner uses ant colony optimization for learning the structure of Bayesian network classifiers. We report extended computational results comparing the performance of our algorithm with eight other classification algorithms, namely six variations of well-known Bayesian network classifiers, cAnt-Miner for discovering classification rules and a support vector machine algorithm.

7.

Background  

Generally speaking, different classifiers tend to work well for certain types of data and, conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons. For example, the order of performance may depend on the performance measure employed for such a comparison. In this work, we present a novel adaptive ensemble classifier constructed by combining bagging and rank aggregation that is capable of adaptively changing its performance depending on the type of data that is being classified. The attractive feature of the proposed classifier is its multi-objective nature where the classification results can be simultaneously optimized with respect to several performance measures, for example, accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance as judged on test samples than a more naive approach that attempts to directly identify the optimal classifier based on the training data performances of the individual classifiers.
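To make the idea of ranking classifiers under several performance measures concrete, here is a minimal sketch of rank aggregation. It uses a Borda count as a stand-in, which is an assumption: the abstract does not say which aggregation scheme the authors use, and their method may well differ. The classifier names and scores are invented for illustration.

```python
def borda_aggregate(scores):
    """Combine per-measure rankings into one ordering (best first).

    scores maps measure name -> {classifier name: value}, higher is better.
    Each measure awards n-1 points to its best classifier, n-2 to the next,
    and so on; ties in total points are broken alphabetically.
    """
    points = {}
    for per_classifier in scores.values():
        ranked = sorted(per_classifier, key=per_classifier.get, reverse=True)
        n = len(ranked)
        for position, name in enumerate(ranked):
            points[name] = points.get(name, 0) + (n - 1 - position)
    return sorted(points, key=lambda name: (-points[name], name))


# Hypothetical scores for three classifiers on three measures.
scores = {
    "accuracy":    {"svm": 0.90, "knn": 0.85, "tree": 0.80},
    "sensitivity": {"svm": 0.70, "knn": 0.88, "tree": 0.75},
    "specificity": {"svm": 0.95, "knn": 0.80, "tree": 0.82},
}

print(borda_aggregate(scores))
```

In this toy example no classifier wins on every measure, yet the aggregate ordering is still well defined, which is the situation the abstract describes when the order of performance depends on the measure.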

8.
A Bayesian network classification methodology for gene expression data.   Cited by: 5 (self-citations: 0, citations by others: 5)
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. 
Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

9.
MOTIVATION: In the context of sample (e.g. tumor) classifications with microarray gene expression data, many methods have been proposed. However, almost all the methods ignore existing biological knowledge and treat all the genes equally a priori. On the other hand, because some genes have been identified by previous studies to have biological functions or to be involved in pathways related to the outcome (e.g. cancer), incorporating this type of prior knowledge into a classifier can potentially improve both the predictive performance and interpretability of the resulting model. RESULTS: We propose a simple and general framework to incorporate such prior knowledge into building a penalized classifier. As two concrete examples, we apply the idea to two penalized classifiers, nearest shrunken centroids (also called PAM) and penalized partial least squares (PPLS). Instead of treating all the genes equally a priori as in standard penalized methods, we group the genes according to their functional associations based on existing biological knowledge or data, and adopt group-specific penalty terms and penalization parameters. Simulated and real data examples demonstrate that, if prior knowledge on gene grouping is indeed informative, our new methods perform better than the two standard penalized methods, yielding higher predictive accuracy and screening out more irrelevant genes.

10.
Functional characterization of proteins belonging to the MHC I superfamily involves knowing their cognate ligands, which can be peptides, lipids or none. However, the experimental identification of these ligands is not an easy task and generally requires some a priori knowledge of their chemical nature (ligand-type specificity). Here, we trained k-nearest neighbor and support vector machine classifiers that predict the ligand-type specificity of MHC I proteins with great accuracy. Moreover, we applied these classifiers to human and mouse MHC I proteins with uncharacterized ligands, obtaining results that can be instrumental in unraveling the function of these proteins.

11.
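A knowledge-base-driven land cover change detection method using the spectral vector similarity of image objects   Cited by: 1 (self-citations: 0, citations by others: 1)
Song Xiang, Yan Changzhen. Acta Ecologica Sinica (《生态学报》), 2014, 34(24): 7175-7180
Land use/cover change detection is an important part of global change research in China and abroad, and choosing an appropriate change detection method for studying land use/cover change in Northwest China is of great significance to the "Ecological Decade Project". A typical and representative area of Northwest China, TM path/row 134033, was selected as the test area for validating the change detection method. Using two Landsat TM images from 2005 and 2010 and the eCognition Developer 8.64 software, changes were detected with a method based on the similarity of the spectral feature vectors of image objects. The 2010 land cover data served as a prior knowledge base for classifying the changed areas; land use/cover change information was then extracted and the change results were analyzed quantitatively. The results show that, for mapping land use/cover change in the test area, the image-object spectral feature vector similarity method is fast and accurate, and that it is suitable for detecting land use/cover change both in the test area and across Northwest China. Finally, this method and post-classification comparison were used to produce land use/cover classification maps of Northwest China covering the ten years from 2000 to 2010.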
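The change detection step described in this abstract, comparing the spectral feature vectors of the same image object at two dates, can be sketched as follows. This is only an illustration under stated assumptions: the abstract names neither the similarity measure nor a decision threshold, so cosine similarity and the cutoff of 0.98 are hypothetical choices, and the object ids and band values are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two spectral vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def changed_objects(spectra_t1, spectra_t2, threshold=0.98):
    """Return ids of image objects whose spectra differ between two dates.

    spectra_t1 / spectra_t2 map object id -> mean band values of that object.
    An object is flagged as changed when its similarity falls below the
    (assumed) threshold.
    """
    return [obj for obj in spectra_t1
            if cosine_similarity(spectra_t1[obj], spectra_t2[obj]) < threshold]


# Hypothetical mean band values for two image objects at two dates.
t1 = {"a": [30, 40, 50, 90], "b": [20, 25, 30, 80]}
t2 = {"a": [31, 41, 52, 88], "b": [60, 55, 50, 30]}

print(changed_objects(t1, t2))
```

Only the flagged objects would then need to be reclassified, for instance against an existing land cover map used as prior knowledge, which matches the workflow outlined in the abstract.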

12.
Data transformations prior to analysis may be beneficial in classification tasks. In this article we investigate a set of such transformations on 2D graph-data derived from facial images and their effect on classification accuracy in a high-dimensional setting. These transformations are low-variance in the sense that each involves only a fixed small number of input features. We show that classification accuracy can be improved when penalized regression techniques are employed, as compared to a principal component analysis (PCA) pre-processing step. In our data example classification accuracy improves from 47% to 62% when switching from PCA to penalized regression. A second goal is to visualize the resulting classifiers. We develop importance plots highlighting the influence of coordinates in the original 2D space. Features used for classification are mapped to coordinates in the original images and combined into an importance measure for each pixel. These plots assist in assessing plausibility of classifiers, interpretation of classifiers, and determination of the relative importance of different features.

13.
In classification, prior knowledge is incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution. A posterior distribution is then derived from the prior and the sample data. An optimal Bayesian classifier (OBC) minimizes the expected misclassification error relative to the posterior distribution. From an application perspective, prior construction is critical. The prior distribution is formed by mapping a set of mathematical relations among the features and labels, the prior knowledge, into a distribution governing the probability mass across the uncertainty class. In this paper, we consider prior knowledge in the form of stochastic differential equations (SDEs). We consider a vector SDE in integral form involving a drift vector and dispersion matrix. Having constructed the prior, we develop the optimal Bayesian classifier between two models and examine, via synthetic experiments, the effects of uncertainty in the drift vector and dispersion matrix. We apply the theory to a set of SDEs for the purpose of differentiating the evolutionary history between two species.

14.
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not "anticonservative" using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
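The two criteria named here, calibration and refinement, can be illustrated with the classical decomposition of the Brier score. Note the assumption: the abstract says the authors develop their own evaluation measures but does not define them, so the sketch below uses the textbook calibration-refinement split instead, and the predicted probabilities and labels are invented.

```python
from collections import defaultdict


def brier_decomposition(probs, labels):
    """Return (brier, calibration, refinement) for binary labels in {0, 1}.

    Cases are grouped by their distinct predicted probability. Within a
    group with forecast p and observed event rate o:
      calibration term = (p - o)^2   (forecast vs. what actually happened)
      refinement term  = o * (1 - o) (how mixed the outcomes are in the group)
    With this grouping, brier == calibration + refinement up to rounding.
    """
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    n = len(probs)
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    calibration = 0.0
    refinement = 0.0
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)
        weight = len(ys) / n
        calibration += weight * (p - observed) ** 2
        refinement += weight * observed * (1 - observed)
    return brier, calibration, refinement


# Hypothetical predictions: the classifier only ever outputs 0.2 or 0.8.
probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

print(brier_decomposition(probs, labels))
```

A small calibration term means the stated probabilities match observed frequencies; a small refinement term means the classifier separates the classes sharply. A classifier can do well on one and poorly on the other, which is why the two are assessed separately.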

15.
We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. We achieve accurate classification by combining the decisions of multiple methods using the consensus of a committee (or an ensemble) classifier. Our technique, based on decision trees, is rooted in machine learning, which shows that by judiciously employing component classifiers, an ensemble classifier can be constructed to outperform its components. We use two sequence- and three structure-comparison tools as component classifiers. Given a protein structure and using the joint hypothesis, we first determine if the protein belongs to an existing category (family, superfamily, fold) in the SCOP hierarchy. For the proteins that are predicted as members of the existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that we can significantly improve the classification accuracy compared to the individual component classifiers. In particular, we achieve error rates that are 3-12 times less than the individual classifiers' error rates at the family level, 1.5-4.5 times less at the superfamily level, and 1.1-2.4 times less at the fold level.

16.
Bats are a species-rich order of mammals providing key ecosystem services. Because bats are threatened by human action and also serve as important bioindicators, monitoring their populations is of utmost importance. However, surveying bats is difficult because of their nocturnal habits, elusiveness and sensitivity to disturbance. Bat detectors allow echolocating bats to be surveyed non-invasively and record species that would otherwise be difficult to observe by capture or roost inspection. Unfortunately, several bat species cannot be identified confidently from their calls, so acoustic classification remains ambiguous or impossible in some cases. The popularity of automated classifiers of bat echolocation calls has escalated rapidly, including that of several packages available for purchase. Such products have filled a vacant niche on the market, mostly in relation to the expanding monitoring efforts related to the development of wind energy production worldwide. We highlight that no classifier has yet proven capable of providing correct classifications in 100% of cases or of getting close enough to this ideal performance. Besides, from the literature available and our own experience we argue that such tools have not yet been tested sufficiently in the field. Visual inspection of calls whose automated classification is judged suspicious is often recommended, but human intervention a posteriori represents a circular argument and requires considerable experience. We are concerned that neophytes – including consultants with little experience with bats but specialized in other taxonomic groups – will passively accept the automated responses of tools still awaiting sufficient validation. We remark that bat call identification is a serious practical issue because biases in the assessment of bat distribution or habitat preferences may lead to wrong management decisions with serious conservation consequences.
Automated classifiers may crucially aid bat research and certainly merit further investigation, but the boost in commercially available software may have come too early. Thorough field tests need to be carried out to assess the limitations and strengths of these tools.

17.
Question: Are direct and indirect trait‐based approaches similar in their usefulness to synthesize species responses to successional stages? Location: Northern hardwood forests, Québec, Canada (45°01′–45°08′N; 73°58′–74°21′W). Methods: Two different trait‐based approaches were used to relate plant functional traits to succession on an old‐field – deciduous forest chronosequence: (i) a frequently used approach based on co‐occurrence of traits (emergent groups), and (ii) a new version of a direct functional approach at the trait level (the fourth‐corner method). Additionally, we selected two different cut‐off levels for the herb subset of the emergent group classification in order to test its robustness and ecological relevance. Results: Clear patterns of trait associations with stand developmental stages emerged from both the emergent group and the direct approach at the trait level. However, the emergent group classification was found to hide some trait‐level differences such as a shift in seed size, light requirement and plant form along the chronosequence. Contrasting results were obtained for the seven or nine group classification of the herbaceous subset, illustrating how critical is the number of groups for emergent group classification. Conclusion: The simultaneous use of two different trait‐based approaches provided a robust and comprehensive characterization of vegetation responses in the old‐field – deciduous forest chronosequence. It also underlines the different goals as well as the limitations and benefits of these two approaches. Both approaches indicated that abandoned pastures of the northern hardwood biome have good potential for natural recovery. Conversion of these lands to other functions may lead to irremediable loss of biodiversity.

18.
Although habitat fragmentation is one of the greatest threats to biodiversity worldwide, virtually no attention has been paid to the quantification of error in fragmentation statistics. Landscape pattern indices (LPIs), such as mean patch size and number of patches, are routinely used to quantify fragmentation and are often calculated using remote-sensing imagery that has been classified into different land-cover classes. No classified map is ever completely correct, so we asked if different maps with similar misclassification rates could result in widely different errors in pattern indices. We simulated landscapes with varying proportions of habitat and clumpiness (autocorrelation) and then simulated classification errors on the same maps. We simulated higher misclassification at patch edges (as is often observed), and then used a smoothing algorithm routinely used on images to correct salt-and-pepper classification error. We determined how well classification errors (and smoothing) corresponded to errors seen in four pattern indices. Maps with low misclassification rates often yielded errors in LPIs of much larger magnitude and substantial variability. Although smoothing usually improved classification error, it sometimes increased LPI error and reversed the direction of error in LPIs introduced by misclassification. Our results show that classification error is not always a good predictor of errors in LPIs, and some types of image postprocessing (for example, smoothing) might result in the underestimation of habitat fragmentation. Furthermore, our results suggest that there is potential for large errors in nearly every landscape pattern analysis ever published, because virtually none quantify the errors in LPIs themselves.
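A toy example helps show why a low misclassification rate can still produce a large error in pattern indices. The sketch below computes two of the indices named in the abstract, number of patches and mean patch size, on a small binary habitat map. It assumes 4-connectivity for defining a patch, which the abstract does not specify, and the maps are invented rather than taken from the study's simulations.

```python
def patch_sizes(grid):
    """Sizes of 4-connected patches of habitat cells (value 1) in a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one patch starting from this unvisited habitat cell.
            stack = [(r, c)]
            seen[r][c] = True
            size = 0
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes


def pattern_indices(grid):
    """Return (number of patches, mean patch size)."""
    sizes = patch_sizes(grid)
    mean_size = sum(sizes) / len(sizes) if sizes else 0.0
    return len(sizes), mean_size


# Hypothetical "true" map: two habitat blocks joined by a one-cell corridor.
true_map = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

# Same map with a single misclassified cell (the corridor), i.e. 1 of 15 cells.
classified_map = [row[:] for row in true_map]
classified_map[1][2] = 0

print(pattern_indices(true_map))
print(pattern_indices(classified_map))
```

Here one wrong cell out of fifteen doubles the number of patches and more than halves the mean patch size, a small-scale version of the mismatch between classification error and LPI error that the abstract reports.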

19.
MOTIVATION: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method to select top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure. RESULTS: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select 'hub' genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than the univariate feature selection based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant type of models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show that false negative prognosis can be controlled via a cost-adjusted discriminant function. AVAILABILITY: R code for the ELDA algorithm is available from the author upon request.

20.
Multiclass classification is one of the fundamental tasks in bioinformatics and typically arises in cancer diagnosis studies by gene expression profiling. There have been many studies of aggregating binary classifiers to construct a multiclass classifier based on one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, as well as some comparison studies between them. However, the studies found that the best coding depends on each situation. Therefore, a new problem, which we call the "optimal coding problem," has arisen: how can we determine which coding is the optimal one in each situation? To approach this optimal coding problem, we propose a novel framework for constructing a multiclass classifier, in which each binary classifier to be aggregated has a weight value to be optimally tuned based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can be a consistent answer to the problem. We apply this method to various classification problems including a synthesized data set and some cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method can improve classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors.
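The idea of aggregating binary classifiers with per-classifier weights can be sketched as weighted decoding against a coding matrix. This is an illustration only: the abstract does not describe how the authors tune the weights, so the weights below are set by hand, and the `decode` function, the binary outputs and the coding matrix layout are assumptions made for the example.

```python
def decode(code_matrix, weights, outputs):
    """Pick the class whose code row best agrees with the weighted binary outputs.

    code_matrix[c][b] is +1 if binary classifier b treats class c as positive,
    -1 if negative, and 0 if class c is not involved in classifier b.
    outputs[b] is the signed output of classifier b (positive = positive side).
    Returns (index of the winning class, list of class scores).
    """
    scores = []
    for row in code_matrix:
        scores.append(sum(w * m * f for w, m, f in zip(weights, row, outputs)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# One-versus-one coding for 3 classes; columns are (0 vs 1), (0 vs 2), (1 vs 2).
one_vs_one = [
    [+1, +1, 0],
    [-1, 0, +1],
    [0, -1, -1],
]

# Hypothetical signed outputs of the three binary classifiers for one sample.
outputs = [0.2, -0.6, 0.9]

uniform_choice, _ = decode(one_vs_one, [1.0, 1.0, 1.0], outputs)
weighted_choice, _ = decode(one_vs_one, [1.0, 3.0, 0.2], outputs)
print(uniform_choice, weighted_choice)
```

With uniform weights this reduces to plain voting on signed outputs; changing the weights changes which class wins for the same binary outputs, which is why tuning them from data amounts to choosing a coding.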
