Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
This study presents a 2-stage heartbeat classifier of supraventricular (SVB) and ventricular (VB) beats. Stage 1 makes a computationally efficient classification of SVB beats, using a simple correlation threshold criterion to find a close match with a predominant normal (reference) beat template. The non-matched beats are next subjected to measurement of 20 basic features, tracking the beat and reference template morphology and RR-variability, for subsequent refined classification into the SVB or VB class by Stage 2. Four linear classifiers are compared: cluster, fuzzy, linear discriminant analysis (LDA) and classification tree (CT), all subjected to iterative training for selection of the optimal feature space from an extended 210-sized set, embodying interactive second-order effects between the 20 independent features. The optimization process minimizes at equal weight the false positives in the SVB class and the false negatives in the VB class. Training with the European ST-T, AHA and MIT-BIH Supraventricular Arrhythmia databases found the best performance settings of all classification models: Cluster (30 features), Fuzzy (72 features), LDA (142 coefficients), CT (221 decision nodes), with the top-3 best scored features being normalized current RR-interval, higher/lower frequency content ratio, and beat-to-template correlation. Unbiased test-validation with the MIT-BIH Arrhythmia database rates the classifiers in descending order of their specificity for the SVB class: CT (99.9%), LDA (99.6%), Cluster (99.5%), Fuzzy (99.4%); sensitivity for ventricular ectopic beats as part of the VB class (commonly reported in published beat-classification studies): CT (96.7%), Fuzzy (94.4%), LDA (94.2%), Cluster (92.4%); positive predictivity: CT (99.2%), Cluster (93.6%), LDA (93.0%), Fuzzy (92.4%). CT has superior accuracy by 0.3–6.8 percentage points, with the advantage of easy model complexity configuration by pruning the tree, which consists of easily interpretable ‘if-then’ rules.
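The Stage 1 idea in this abstract, accepting a beat as supraventricular when it correlates closely with a reference template, can be illustrated with a minimal sketch. The function names, the sample waveforms and the threshold of 0.95 are assumptions made for illustration; the abstract does not state the paper's actual threshold or signal preprocessing.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat segment carries no shape information
    return cov / math.sqrt(var_a * var_b)


def stage1_matches_template(beat, template, threshold=0.95):
    """Return True when the beat is close enough to the reference template.

    The default threshold is a hypothetical value, not the paper's setting.
    """
    return pearson(beat, template) >= threshold


# Toy waveforms (hypothetical): a narrow reference beat, a near copy of it,
# and a wide beat with a different morphology.
template = [0.0, 0.1, 1.0, 0.1, 0.0, -0.1, 0.0]
normal_like = [0.0, 0.12, 0.95, 0.08, 0.0, -0.12, 0.0]
wide_beat = [0.0, -0.5, -1.0, -0.4, 0.6, 0.8, 0.3]

matched = stage1_matches_template(normal_like, template)
unmatched = stage1_matches_template(wide_beat, template)
```

In such a scheme only the beats for which the check fails would go on to the costlier feature measurement of Stage 2.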

2.
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used contains over 16,000 genes, and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm performs slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.

3.
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often gives very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.

4.
The most widely used measure of performance, accuracy, suffers from a paradox: predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. Despite optimizing classification error rate, high accuracy models may fail to capture crucial information transfer in the classification task. We present evidence of this behavior by means of a combinatorial analysis where every possible contingency matrix of 2-, 3- and 4-class classifiers is depicted on the entropy triangle, a more reliable information-theoretic tool for classification assessment. Motivated by this, we develop from first principles a measure of classification performance that takes into consideration the information learned by classifiers. We are then able to obtain the entropy-modulated accuracy (EMA), a pessimistic estimate of the expected accuracy with the influence of the input distribution factored out, and the normalized information transfer factor (NIT), a measure of how efficient the transmission of information is from the input to the output set of classes. The EMA is a more natural measure of classification performance than accuracy when the heuristic to maximize is the transfer of information through the classifier instead of classification error count. The NIT factor measures the effectiveness of the learning process in classifiers and also makes it harder for them to “cheat” using techniques like specialization, while also promoting the interpretability of results. Their use is demonstrated in a mind reading task competition that aims at decoding the identity of a video stimulus based on magnetoencephalography recordings. We show how the EMA and the NIT factor reject rankings based on accuracy, choosing more meaningful and interpretable classifiers.
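The accuracy paradox described in this abstract can be reproduced on a small example. The sketch below does not compute the EMA or the NIT factor, whose exact definitions are not given here; as a stand-in it uses plain mutual information between true and predicted labels, and the two confusion matrices are hypothetical.

```python
import math


def accuracy(cm):
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mutual_information_bits(cm):
    """Mutual information (bits) between true class (rows) and prediction (columns)."""
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(len(cm))) for j in range(len(cm[0]))]
    mi = 0.0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if count > 0:
                # p(x, y) / (p(x) * p(y)) simplifies to count * total / (row * col)
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += (count / total) * math.log2(ratio)
    return mi


# Hypothetical 2-class problem with a 90/10 class imbalance.
majority_vote = [[90, 0], [10, 0]]    # always predicts class 0
informative = [[75, 15], [0, 10]]     # finds every class-1 sample

acc_majority = accuracy(majority_vote)            # 0.90
acc_informative = accuracy(informative)           # 0.85
mi_majority = mutual_information_bits(majority_vote)
mi_informative = mutual_information_bits(informative)
```

The majority-vote classifier wins on accuracy yet transmits no information about the class, which is the kind of ranking reversal the abstract attributes to information-theoretic measures.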

5.
Fuzzy J-Means and VNS methods for clustering genes from microarray data   Total citations: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: In the interpretation of gene expression data from a group of microarray experiments that include samples from either different patients or conditions, special consideration must be given to the pleiotropic and epistatic roles of genes, as observed in the variation of gene coexpression patterns. Crisp clustering methods assign each gene to one cluster, thereby omitting information about the multiple roles of genes. RESULTS: Here, we present the application of a local search heuristic, Fuzzy J-Means, embedded into the variable neighborhood search metaheuristic for the clustering of microarray gene expression data. We show that for all the datasets studied this algorithm outperforms the standard Fuzzy C-Means heuristic. Different methods for the utilization of cluster membership information in determining gene coregulation are presented. The clustering and data analyses were performed on simulated datasets as well as experimental cDNA microarray data for breast cancer and human blood from the Stanford Microarray Database. AVAILABILITY: The source code of the clustering software (C programming language) is freely available from Nabil.Belacel@nrc-cnrc.gc.ca

6.
Validation of computational methods in genomics   Total citations: 1 (self-citations: 1, citations by others: 0)
High-throughput technologies for genomics provide tens of thousands of genetic measurements, for instance, gene-expression measurements on microarrays, and the availability of these measurements has motivated the use of machine learning (inference) methods for classification, clustering, and gene networks. Generally, a design method will yield a model that satisfies some model constraints and fits the data in some manner. On the other hand, a scientific theory consists of two parts: (1) a mathematical model to characterize relations between variables, and (2) a set of relations between model variables and observables that are used to validate the model via predictive experiments. Although machine learning algorithms are constructed to hopefully produce valid scientific models, they do not ipso facto do so. In some cases, such as classifier estimation, there is a well-developed error theory that relates to model validity according to various statistical theorems, but in others such as clustering, there is a lack of understanding of the relationship between the learning algorithms and validation. The issue of validation is especially problematic in situations where the sample size is small in comparison with the dimensionality (number of variables), which is commonplace in genomics, because the convergence theory of learning algorithms is typically asymptotic and the algorithms often perform in counter-intuitive ways when used with samples that are small in relation to the number of variables. For translational genomics, validation is perhaps the most critical issue, because it is imperative that we understand the performance of a diagnostic or therapeutic procedure to be used in the clinic, and this performance relates directly to the validity of the model behind the procedure. This paper treats the validation issue as it appears in two classes of inference algorithms relating to genomics - classification and clustering. 
It formulates the problem and reviews salient results.

7.
Classification is a data mining task the goal of which is to learn a model, from a training dataset, that can predict the class of a new data instance, while clustering aims to discover natural instance-groupings within a given dataset. Learning cluster-based classification systems involves partitioning a training set into data subsets (clusters) and building a local classification model for each data cluster. The class of a new instance is predicted by first assigning the instance to its nearest cluster and then using that cluster’s local classification model to predict the instance’s class. In this paper, we present an ant colony optimization (ACO) approach to building cluster-based classification systems. Our ACO approach optimizes the number of clusters, the positioning of the clusters, and the choice of classification algorithm to use as the local classifier for each cluster. We also present an ensemble approach that allows the system to decide on the class of a given instance by considering the predictions of all local classifiers, employing a weighted voting mechanism based on the fuzzy degree of membership in each cluster. Our experimental evaluation employs five widely used classification algorithms: naïve Bayes, nearest neighbour, Ripper, C4.5, and support vector machines, and results are reported on a suite of 54 popular UCI benchmark datasets.
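The prediction scheme in this abstract (assign an instance to its nearest cluster, then ask that cluster's local model) can be sketched as follows. The cluster centres are fixed by hand and the two local learners are toy stand-ins; the ACO search over cluster count, positions and learner choice is not reproduced, and all names are hypothetical.

```python
import math
from collections import Counter


def nearest(point, centers):
    """Index of the cluster centre closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


class MajorityClassifier:
    """Toy local model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class OneNearestNeighbour:
    """Toy local model: predicts the label of the closest training point."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        i = min(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))
        return self.y[i]


def train_cluster_based(X, y, centers, local_factories):
    """Fit one local model per cluster (assumes no cluster ends up empty)."""
    members = [([], []) for _ in centers]
    for xi, yi in zip(X, y):
        k = nearest(xi, centers)
        members[k][0].append(xi)
        members[k][1].append(yi)
    return [make().fit(Xk, yk) for make, (Xk, yk) in zip(local_factories, members)]


def predict_cluster_based(x, centers, models):
    """Route the instance to its nearest cluster and use that cluster's model."""
    return models[nearest(x, centers)].predict(x)


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (11, 11)]
y = ["a", "a", "a", "b", "c", "b", "c"]
centers = [(0.5, 0.5), (10.5, 10.5)]
models = train_cluster_based(X, y, centers, [MajorityClassifier, OneNearestNeighbour])
```

Using a different learner per cluster mirrors the idea that the choice of local classifier is itself something to optimize; here the pairing is simply hard-coded.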

8.
Prototype based classifiers are effective algorithms in modeling classification problems and have been applied in multiple domains. While many supervised learning algorithms have been successfully extended to kernels to improve the discrimination power by means of the kernel concept, prototype based classifiers are typically still used with Euclidean distance measures. Kernelized variants of prototype based classifiers are currently too complex to be applied for larger data sets. Here we propose an extension of Kernelized Generalized Learning Vector Quantization (KGLVQ) employing a sparsity and approximation technique to reduce the learning complexity. We provide generalization error bounds and experimental results on real world data, showing that the extended approach is comparable to SVM on different public data.

9.
Finding subtypes of heterogeneous diseases is the biggest challenge in the area of biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by high classification accuracy obtained in the data set of van de Vijver et al.

10.
Whole-cell biosensors are mostly non-specific with respect to their detection capabilities for toxicants, and therefore offer an interesting perspective in environmental monitoring. However, to fully employ this feature, a robust classification method needs to be implemented into these sensor systems to allow further identification of detected substances. Substance-specific information can be extracted from signals derived from biosensors harbouring one or multiple biological components. Here, a major task is the identification of substance-specific information among considerable amounts of biosensor data. For this purpose, several approaches make use of statistical methods or machine learning algorithms. Genetic Programming (GP), a heuristic machine learning technique, offers several advantages compared to other machine learning approaches and consequently may be a promising tool for biosensor data classification. In the present study, we have evaluated the use of GP for the classification of herbicides and herbicide classes (chemical classes) by analysis of substance-specific patterns derived from a whole-cell multi-species biosensor. We re-analysed data from a previously described array-based biosensor system employing diverse microalgae (Podola and Melkonian, 2005), aiming at the identification of five individual herbicides as well as two herbicide classes. GP analyses were performed using the commercially available GP software 'Discipulus', resulting in classifiers (computer programs) for the binary classification of each individual herbicide or herbicide class. GP-generated classifiers both for individual herbicides and herbicide classes were able to perform a statistically significant identification of herbicides or herbicide classes, respectively. The majority of classifiers were able to perform correct classifications (sensitivity) of about 80-95% of test data sets, whereas the false positive rate (specificity) was lower than 20% for most classifiers. 
Results suggest that a higher number of data sets may lead to a better classification performance. In the present paper, GP-based classification was combined with a biosensor for the first time. Our results demonstrate that GP was able to identify substance-specific information within complex biosensor response patterns and furthermore use this information for successful toxicant classification in unknown samples. This suggests further research to assess perspectives and limitations of this approach in the field of biosensors.

11.

Background  

Machine learning techniques have been shown to improve bacterial species classification based on fatty acid methyl ester (FAME) data. Nonetheless, FAME analysis has a limited resolution for discrimination of bacteria at the species level. In this paper, we approach the species classification problem from a taxonomic point of view. Such a taxonomy or tree is typically obtained by applying clustering algorithms on FAME data or on 16S rRNA gene data. The knowledge gained from the tree can then be used to evaluate FAME-based classifiers, resulting in a novel framework for bacterial species classification.

12.
Inference from clustering with application to gene-expression microarrays.   Total citations: 7 (self-citations: 0, citations by others: 7)
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. 
A large amount of generated output is available over the web.
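The evaluation loop described in this abstract (model each process as a mean plus independent noise, generate points, cluster them, count the points placed in the wrong cluster) can be sketched in a few lines. This is not the toolbox itself: the k-means routine, its farthest-point initialisation, the model means and the noise level are all assumptions chosen for illustration.

```python
import itertools
import random


def generate(means, sigma, n_per_class, rng):
    """Draw n_per_class points around each mean with independent Gaussian noise."""
    points, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(n_per_class):
            points.append([m + rng.gauss(0.0, sigma) for m in mean])
            labels.append(label)
    return points, labels


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def clustering_error(true_labels, assignment, k):
    """Misclustered point count under the best matching of clusters to processes."""
    best = len(true_labels)
    for perm in itertools.permutations(range(k)):
        errors = sum(1 for t, a in zip(true_labels, assignment) if perm[a] != t)
        best = min(best, errors)
    return best


rng = random.Random(0)
means = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [0.0, 10.0, 0.0]]  # hypothetical profiles
points, labels = generate(means, sigma=0.5, n_per_class=20, rng=rng)
error = clustering_error(labels, kmeans(points, k=3), k=3)
```

Repeating the run while raising `sigma` or changing `n_per_class` gives a rough picture of how error depends on process variance and replication, which is the question the abstract poses.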

13.
Identification and characterization of antigenic determinants on proteins has received considerable attention, utilizing both experimental and computational methods. For computational routines, mostly structural as well as physicochemical parameters have been utilized for predicting the antigenic propensity of protein sites. However, the performance of computational routines has been low when compared to experimental alternatives. Here we describe the construction of machine learning based classifiers to enhance the prediction quality for identifying linear B-cell epitopes on proteins. Our approach combines several parameters previously associated with antigenicity, and includes novel parameters based on frequencies of amino acids and amino acid neighborhood propensities. We utilized machine learning algorithms for deriving antigenicity classification functions assigning antigenic propensities to each amino acid of a given protein sequence. We compared the prediction quality of the novel classifiers with respect to established routines for epitope scoring, and tested prediction accuracy on experimental data available for HIV proteins. The major finding is that machine learning classifiers clearly outperform the reference classification systems on the HIV epitope validation set.

14.
An improved algorithm for clustering gene expression data   Total citations: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are its inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm as compared to the average linkage method, Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real life data sets. The biological relevance of the clustering solutions is also analyzed.

15.
Liu Z, Tan M. Biometrics, 2008, 64(4): 1155-1161
SUMMARY: In medical diagnosis, the diseased and nondiseased classes are usually unbalanced and one class may be more important than the other depending on the diagnosis purpose. Most standard classification methods, however, are designed to maximize the overall accuracy and cannot incorporate different costs to different classes explicitly. In this article, we propose a novel nonparametric method to directly maximize the weighted specificity and sensitivity of the receiver operating characteristic curve. Combining advances in machine learning, optimization theory, and statistics, the proposed method has excellent generalization property and assigns different error costs to different classes explicitly. We present experiments that compare the proposed algorithms with support vector machines and regularized logistic regression using data from a study on HIV-1 protease as well as six publicly available datasets. Our main conclusion is that the performance of the proposed algorithm is significantly better in most cases than the other classifiers tested. Software package in MATLAB is available upon request.
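The weighted objective mentioned in this abstract, a combination of sensitivity and specificity with class-dependent importance, can be illustrated on a much simpler problem than the paper's nonparametric classifier. The sketch below only picks a decision threshold on precomputed scores; the scores, labels and weights are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, labels, weight):
    """Threshold maximising weight * sensitivity + (1 - weight) * specificity."""

    def objective(t):
        sens, spec = sensitivity_specificity(scores, labels, t)
        return weight * sens + (1.0 - weight) * spec

    # Candidate thresholds are the observed scores themselves.
    return max(sorted(set(scores)), key=objective)


# Hypothetical classifier scores; label 1 marks the diseased class.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

screening_cutoff = best_threshold(scores, labels, weight=0.9)     # favours sensitivity
confirmatory_cutoff = best_threshold(scores, labels, weight=0.1)  # favours specificity
```

Shifting the weight moves the operating point along the ROC curve: a screening setting accepts more false alarms to catch every positive, while a confirmatory setting does the opposite.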

16.
IRBM, 2020, 41(4): 229-239
Feature selection algorithms are a cornerstone of machine learning. As the number of samples and of their properties grows, a feature selection algorithm picks out the significant features; this is the function shared by all methods that go by that name. Their general purpose is to select the properties most relevant to the data classes and to increase classification performance. Features can therefore be selected on the basis of their classification performance. In this study, we have developed a feature selection algorithm, P-Score, based on the classification performance of decision support vectors. The method can work according to two different selection criteria. We tested the classification performance of the features selected with P-Score using three different classifiers. In addition, we compared P-Score with 13 feature selection algorithms from the literature. According to the results of the study, the P-Score feature selection algorithm is a method that can be used in the field of machine learning.

17.
A Bayesian network classification methodology for gene expression data.   Total citations: 5 (self-citations: 0, citations by others: 5)
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. 
Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

18.
MOTIVATION: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. RESULTS: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time that GCC has been applied to class discovery for microarray data. Given a prespecified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. AVAILABILITY: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.

19.
Strict assignment of genes to one class, dimensionality reduction, a priori specification of the number of classes, the need for a training set, nonunique solution, and complex learning mechanisms are some of the inadequacies of current clustering algorithms. Existing algorithms cluster genes on the basis of high positive correlations between their expression patterns. However, genes with strong negative correlations can also have similar functions and are most likely to have a role in the same pathways. To address some of these issues, we propose the adaptive centroid algorithm (ACA), which employs an analysis of variance (ANOVA)-based performance criterion. The ACA also uses Euclidean distances, the center-of-mass principle for heterogeneously distributed mass elements, and the given data set to give unique solutions. The proposed approach involves three stages. In the first stage a two-way ANOVA of the gene expression matrix is performed. The two factors in the ANOVA are gene expression and experimental condition. The residual mean squared error (MSE) from the ANOVA is used as a performance criterion in the ACA. Finally, correlated clusters are found based on the Pearson correlation coefficients. To validate the proposed approach, a two-way ANOVA is again performed on the discovered clusters. The results from this last step indicate that the MSEs of the clusters are significantly lower than that of the fibroblast-serum gene expression matrix. The ACA is employed in this study for single- as well as multi-cluster gene assignments.
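The residual MSE that this abstract uses as a performance criterion can be computed directly for a genes-by-conditions matrix. The sketch assumes a two-way ANOVA without replication and without an interaction term, with genes as rows and conditions as columns; the abstract does not spell out the exact ANOVA layout, and the matrices are toy data.

```python
def two_way_residual_mse(matrix):
    """Residual mean squared error of an additive row + column model.

    Rows are taken to be genes and columns experimental conditions
    (an assumed layout). Residual for cell (i, j):
        x_ij - row_mean_i - col_mean_j + grand_mean
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    ss_residual = sum(
        (matrix[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return ss_residual / ((n_rows - 1) * (n_cols - 1))


# Genes whose profiles run in parallel are explained by row and column effects alone.
parallel_profiles = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
# Two genes moving in opposite directions leave a large residual.
opposed_profiles = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]

mse_parallel = two_way_residual_mse(parallel_profiles)
mse_opposed = two_way_residual_mse(opposed_profiles)
```

A cluster of genes with parallel profiles yields a residual MSE near zero, which is consistent with the abstract's report that discovered clusters have lower MSE than the full expression matrix.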

20.
Bayesian networks are knowledge representation tools that model the (in)dependency relationships among variables for probabilistic reasoning. Classification with Bayesian networks aims to compute the class with the highest probability given a case. This special kind is referred to as Bayesian network classifiers. Since learning the Bayesian network structure from a dataset can be viewed as an optimization problem, heuristic search algorithms may be applied to build high-quality networks in medium- or large-scale problems, as exhaustive search is often feasible only for small problems. In this paper, we present our new algorithm, ABC-Miner, and propose several extensions to it. ABC-Miner uses ant colony optimization for learning the structure of Bayesian network classifiers. We report extended computational results comparing the performance of our algorithm with eight other classification algorithms, namely six variations of well-known Bayesian network classifiers, cAnt-Miner for discovering classification rules and a support vector machine algorithm.
