Similar literature
 Found 20 similar articles; search took 15 ms
1.
PCP: a program for supervised classification of gene expression profiles   Total citations: 1, self-citations: 0, citations by others: 1
PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns (vectors of measurements). The principal use of PCP in bioinformatics is the design and evaluation of classifiers for use in clinical diagnostic tests based on measurements of gene expression. PCP implements leading pattern classification and gene selection algorithms and incorporates cross-validation estimation of classifier performance. Importantly, the implementation integrates the gene selection and class prediction stages, which is vital for computing reliable performance estimates in small-sample scenarios. Additionally, the program includes automated and efficient model selection (optimization of parameters) for the support vector machine (SVM) classifier. The distribution includes Linux and Windows/Cygwin binaries. The program can easily be ported to other platforms. AVAILABILITY: Free download at http://pcp.sourceforge.net
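
The point about integrating gene selection with class prediction can be illustrated with a minimal sketch. It is not PCP's code: the function names, the mean-difference ranking, and the nearest-centroid classifier are all assumptions chosen for brevity. What it shows is that feature selection is repeated inside every leave-one-out fold, so the held-out sample never influences which features are chosen.

```python
from statistics import mean

def select_features(X, y, k):
    """Rank features by absolute difference of the two class means; return top-k indices."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(mean(a) - mean(b)))
    return sorted(range(len(X[0])), key=lambda j: -scores[j])[:k]

def fit_centroids(X, y, features):
    """Compute one centroid per class, restricted to the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return centroids

def predict(centroids, features, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def loocv_error(X, y, k):
    """Leave-one-out error with feature selection redone inside each fold."""
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)  # uses training fold only
        centroids = fit_centroids(X_train, y_train, features)
        errors += predict(centroids, features, X[i]) != y[i]
    return errors / len(X)

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.2, 4.0], [0.1, 6.0], [1.0, 5.5], [1.2, 4.5], [0.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
print(loocv_error(X, y, k=1))
```

Selecting features once on the full data set and then cross-validating only the classifier would let the held-out sample leak into the selection step, which tends to make the error estimate optimistic when samples are few.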

2.
ABSTRACT: BACKGROUND: Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring 'N' (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification. RESULTS: TSN was tested on nine cancer datasets, showing statistically significant differences in classification accuracy between different classifier sizes (choices of N). TSN also performed competitively against a wide variety of different classification methods, including artificial neural networks, classification trees, discriminant analysis, k-nearest neighbor, naive Bayes, and support vector machines, when tested on the Microarray Quality Control II datasets. Furthermore, TSN exhibits low levels of overfitting on training data compared to other methods, giving confidence that results obtained during cross-validation will be more generally applicable to external validation sets. CONCLUSIONS: TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracy when fewer measured features are available.

3.
In this paper, an immune-inspired model, named innate and adaptive artificial immune system (IA-AIS), is proposed and applied to the problem of identifying unsolicited bulk e-mail messages (SPAM). It integrates entities analogous to macrophages and B and T lymphocytes, modeling both the innate and the adaptive immune systems. An implementation of the algorithm was capable of identifying more than 99% of legitimate or SPAM messages in particular parameter configurations. It was compared to an optimized version of the naïve Bayes classifier, which has attained extremely high correct classification rates. It has been concluded that IA-AIS has a greater ability to identify SPAM messages, although its identification of legitimate messages is not as accurate as that of the implemented naïve Bayes classifier.

4.
MOTIVATION: Several kernel-based methods have been recently introduced for the classification of small molecules. Most available kernels on molecules are based on 2D representations obtained from chemical structures, but far less work has focused so far on the definition of effective kernels that can also exploit 3D information. RESULTS: We introduce new ideas for building kernels on small molecules that can effectively use and combine 2D and 3D information. We tested these kernels in conjunction with support vector machines for binary classification on the 60 NCI cancer screening datasets as well as on the NCI HIV data set. Our results show that 3D information leveraged by these kernels can consistently improve prediction accuracy in all datasets. AVAILABILITY: An implementation of the small molecule classifier is available from http://www.dsi.unifi.it/neural/src/3DDK.

5.
6.
Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. The decision tree classifier, which is a simple yet widely used classification technique, employs training data to yield decision rules; moreover, it can create thresholds and then split the list of continuous attributes into discrete intervals for handling continuous attributes (Quinlan in Journal of Artificial Intelligence Research 4:77–90, 1996). Rough set theory (Pawlak in International Journal of Computer and Information Sciences 11:341–356, 1982; International Journal of Man-Machine Studies 20:469–483, 1984; Rough sets: theoretical aspects of reasoning about data. Kluwer, Dordrecht, 1991) has been applied to a wide variety of decision analysis problems for the extraction of rules from databases. This paper proposes a hybrid approach that combines the decision tree and rough sets classifiers and applies it to plant classification. The introduced approach starts with the decision tree classifier (C4.5) as a preprocessing technique to perform interval discretization, and subsequently uses the rough set method to extract rules. The proposed approach aims to find classification rules by analyzing lamina attributes (leaf stalk, leaf width, leaf length, length/width ratio) of Cinnamomum, which were gathered and measured by plant specialists in the field in Taiwan. A comparison with widely used algorithms (e.g., decision tree, multilayer perceptrons, naïve Bayes, and rough sets classifier) is carried out to show numerous advantages of the proposed approach. Finally, using test data in which the species are unknown, the classification results are confirmed by consulting the relevant plant specialists.

7.
MOTIVATION: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. RESULTS: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
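
The two error estimators described in this abstract can be sketched for a k-nearest-neighbor rule on one-dimensional data. This is only an illustration under assumptions: the function names and the toy data are hypothetical, ties in distance are broken by input order, and nothing here reproduces the paper's model-based experiments.

```python
def knn_predict(train, x, k=3):
    """Majority vote among the k training points nearest to x.

    train is a list of (value, label) pairs with labels 0 or 1.
    Distance ties are broken by input order because sorted() is stable.
    """
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def resubstitution_error(data, k=3):
    """Fit on all data, then test on the same data."""
    wrong = sum(knn_predict(data, x, k) != y for x, y in data)
    return wrong / len(data)

def leave_one_out_error(data, k=3):
    """Hold out each observation in turn and classify it from the rest."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        wrong += knn_predict(rest, x, k) != y
    return wrong / len(data)

# Hypothetical toy sample with overlapping classes.
data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 0), (4.0, 1), (5.0, 1)]
print(resubstitution_error(data, k=1), leave_one_out_error(data, k=1))
```

With k=1 and distinct sample values, every point is its own nearest neighbor, so the resubstitution error is zero regardless of how much the classes overlap. That extreme case makes the optimism of resubstitution easy to see, while the leave-one-out loop costs one extra classifier construction per observation.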

8.
ABSTRACT: BACKGROUND: Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers. RESULTS: We compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from http://bioinformatics.monash.edu.au/downloads/. CONCLUSIONS: The examples demonstrate the methods are suitable for both small and large data sets, applicable to the wide range of bioinformatics classification problems, and robust to dependence between classifiers. 
In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three out of four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal.
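
A short sketch may make the "union of classifiers" rule of thumb concrete. Note the assumption: the abstract's Bayesian model estimates sensitivity and specificity without a gold standard, whereas this illustration uses known labels purely to show what the union does to the two quantities. The function names and data are hypothetical.

```python
def union(*classifier_outputs):
    """Logical OR of binary classifiers: positive if any classifier says positive."""
    return [any(votes) for votes in zip(*classifier_outputs)]

def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) against known labels.

    Assumes truth contains at least one positive and one negative.
    """
    pairs = [(bool(p), bool(t)) for p, t in zip(predicted, truth)]
    tp = sum(p and t for p, t in pairs)
    fn = sum((not p) and t for p, t in pairs)
    tn = sum((not p) and (not t) for p, t in pairs)
    fp = sum(p and (not t) for p, t in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical outputs of two classifiers on six items.
truth = [1, 1, 1, 0, 0, 0]
a = [1, 0, 0, 0, 0, 0]
b = [0, 1, 0, 0, 1, 0]
print(sensitivity_specificity(a, truth))
print(sensitivity_specificity(union(a, b), truth))
```

The union can only add positive calls, so its sensitivity is never lower than that of any member and its specificity is never higher. Whether that trade is worthwhile depends on the ranking criterion, which is why the abstract reports the result per criterion.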

9.
MOTIVATION: A major problem of pattern classification is estimation of the Bayes error when only small samples are available. One way to estimate the Bayes error is to design a classifier based on some classification rule applied to sample data, estimate the error of the designed classifier, and then use this estimate as an estimate of the Bayes error. Relative to the Bayes error, the expected error of the designed classifier is biased high, and this bias can be severe with small samples. RESULTS: This paper provides a correction for the bias by subtracting a term derived from the representation of the estimation error. It does so for Boolean classifiers, these being defined on binary features. Although the general theory applies to any Boolean classifier, a model is introduced to reduce the number of parameters. A key point is that the expected correction is conservative. Properties of the corrected estimate are studied via simulation. The correction applies to binary predictors because they are mathematically identical to Boolean classifiers. In this context the correction is adapted to the coefficient of determination, which has been used to measure nonlinear multivariate relations between genes and design genetic regulatory networks. An application using gene-expression data from a microarray experiment is provided on the website http://gspsnap.tamu.edu/smallsample/ (user: 'smallsample', password: 'smallsample').

10.
This paper proposes a Bayesian approach to the classification of EEG patterns associated with imagined movements of the extremities, based on analysis of covariance matrices of native EEG recordings. The efficacy of a Brain-Computer Interface (BCI) based on the proposed classifier is evaluated. The Bayesian classifier is shown to be competitive with the MCSP (Multiclass Common Spatial Patterns) classifier, known from the literature as one of the efficient variants for BCI implementation. The influence of eye movement and blinking artifacts on BCI performance is investigated. It is shown that the presence of such artifacts does not affect the classification accuracy.

11.
《Ecological Engineering》2005,24(1-2):5-15
In this paper, the implementation of a pilot computerized system for the classification of landscape images (SCAPEVIEWER) is presented. A total of 108 landscape photographs have been organized, according to the mean estimation of scenic beauty from seven experts, into three classes: indistinctive (C1), typical or common (C2), and distinctive (C3). For each of the landscape photographs, 10 indices are estimated. These indices are then fed to a classifier based on neural network (NN) technology. In order to examine whether NNs are suitable for this specific application, two different approaches have been tested and compared against a linear discrimination method (LDM) classifier. The first approach is a feed-forward NN (Classic-NN), while the second approach (Hybrid-NN) is based on the Classic-NN modified by using genetic algorithms (GAs). The correct classification rates achieved by the Classic-NN and the Hybrid-NN were 87% and 84%, respectively, while that of the LDM classifier was only 68%. Although the Classic-NN achieved slightly better results than the Hybrid-NN, the latter is preferred for its capacity for index selection and automatic adjustment of internal NN parameters. The pilot system has shown the feasibility of classifying landscape photographs according to scenic beauty by means of a computerized system combining the knowledge of an expert with an NN classifier.

12.
R.D. Badgujar  P.J. Deore 《IRBM》2019,40(2):69-77
Background: Diabetic retinopathy can result in loss of vision if not detected in its early stages. Exudates are lesions that play a crucial role in the early diagnosis of diabetic retinopathy. The localization of exudate lesions with high values of performance metrics is complicated by the presence of blood vessels and other noisy artifacts. Method: We present a computer-aided system for classification of retinal fundus images using a novel nature-inspired spider monkey optimization for parameter tuning of a gradient boosting machine classifier. Image enhancement is performed with histogram equalization and the contourlet transform. The pixels belonging to the optic disc region are detected and eliminated using the circular Hough transform and Otsu's segmentation method. We employ Kirsch's matrices for blood vessel detection. GLCM-based feature vector extraction is employed for textural features. Classification is performed with the hybrid SMO-GBM classifier. Result: We utilized the STARE database for validation of the proposed technique. The proposed system can effectively classify the entire image set from the test data. The SMO-GBM classifier can further segregate images into subclasses with an average accuracy of 97.5%. Conclusion: The proposed approach provides detection and grading of diabetic retinopathy. The abnormality is further categorized as soft, moderate or severe. The hybrid SMO-GBM classifier yields better statistical metrics than existing exudate classification approaches.

13.
MOTIVATION: Temporal gene expression profiles provide an important characterization of gene function, as biological systems are predominantly developmental and dynamic. We propose a method of classifying collections of temporal gene expression curves in which individual expression profiles are modeled as independent realizations of a stochastic process. The method uses a recently developed functional logistic regression tool based on functional principal components, aimed at classifying gene expression curves into known gene groups. The number of eigenfunctions in the classifier can be chosen by leave-one-out cross-validation with the aim of minimizing the classification error. RESULTS: We demonstrate that this methodology provides low-error-rate classification for both yeast cell-cycle gene expression profiles and Dictyostelium cell-type specific gene expression patterns. It also works well in simulations. We compare our functional principal components approach with a B-spline implementation of functional discriminant analysis for the yeast cell-cycle data and simulations. This indicates comparative advantages of our approach which uses fewer eigenfunctions/base functions. The proposed methodology is promising for the analysis of temporal gene expression data and beyond. AVAILABILITY: MATLAB programs are available upon request.

14.
BACKGROUND: Multiplex or multicolor fluorescence in situ hybridization (M-FISH) is a recently developed cytogenetic technique for cancer diagnosis and research on genetic disorders. By simultaneously viewing the multiply labeled specimens in different color channels, M-FISH facilitates the detection of subtle chromosomal aberrations. The success of this technique largely depends on the accuracy of pixel classification (color karyotyping). Improvements in classifier performance would allow the elucidation of more complex and more subtle chromosomal rearrangements. Normalization of M-FISH images has a significant effect on the accuracy of classification. In particular, misalignment or misregistration across multiple channels seriously affects classification accuracy. Image normalization, including automated registration, must be done before pixel classification. METHODS AND RESULTS: We studied several image normalization approaches that affect image classification. In particular, we developed an automated registration technique to correct misalignment across the different fluor images (caused by chromatic aberration and other factors). This new registration algorithm is based on wavelets and spline approximations that have computational advantages and improved accuracy. To evaluate the performance improvement brought about by these data normalization approaches, we used the downstream pixel classification accuracy as a measurement. A Bayesian classifier assumed that each of 24 chromosome classes had a normal probability distribution. The effects that this registration and other normalization steps have on subsequent classification accuracy were evaluated on a comprehensive M-FISH database established by Advanced Digital Imaging Research (http://www.adires.com/05/Project/MFISH_DB/MFISH_DB.shtml). CONCLUSIONS: Pixel misclassification errors result from different factors. These include uneven hybridization, spectral overlap among fluors, and image misregistration. 
Effective preprocessing of M-FISH images can decrease the effects of those factors and thereby increase pixel classification accuracy. The data normalization steps described in this report, such as image registration and background flattening, can significantly improve subsequent classification accuracy. An improved classifier in turn would allow subtle DNA rearrangements to be identified in genetic diagnosis and cancer research.

15.
MOTIVATION: DNA microarray data analysis has been used previously to identify marker genes which discriminate cancer from normal samples. However, due to the limited sample size of each study, there are few common markers among different studies of the same cancer. With the rapid accumulation of microarray data, it is of great interest to integrate inter-study microarray data to increase sample size, which could lead to the discovery of more reliable markers. RESULTS: We present a novel, simple method of integrating different microarray datasets to identify marker genes and apply the method to prostate cancer datasets. In this study, by applying a new statistical method, referred to as the top-scoring pair (TSP) classifier, we have identified a pair of robust marker genes (HPN and STAT6) by integrating microarray datasets from three different prostate cancer studies. Cross-platform validation shows that the TSP classifier built from the marker gene pair, which simply compares relative expression values, achieves high accuracy, sensitivity and specificity on independent datasets generated using various array platforms. Our findings suggest a new model for the discovery of marker genes from accumulated microarray data and demonstrate how the great wealth of microarray data can be exploited to increase the power of statistical analysis. CONTACT: leixu@jhu.edu.
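
A classifier that "simply compares relative expression values" of one gene pair can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual TSP scoring, in which a pair (i, j) is scored by how much the frequency of x[i] < x[j] differs between the two classes, and all function names and data are hypothetical.

```python
from itertools import combinations

def pair_score(X, y, i, j):
    """Return (score, p0, p1), where p_c is the fraction of class-c samples with x[i] < x[j]."""
    def freq(label):
        rows = [r for r, l in zip(X, y) if l == label]
        return sum(r[i] < r[j] for r in rows) / len(rows)
    p0, p1 = freq(0), freq(1)
    return abs(p0 - p1), p0, p1

def fit_tsp(X, y):
    """Pick the gene pair whose ordering differs most between the two classes."""
    pairs = combinations(range(len(X[0])), 2)
    i, j = max(pairs, key=lambda ij: pair_score(X, y, *ij)[0])
    _, p0, p1 = pair_score(X, y, i, j)
    class_if_less = 0 if p0 > p1 else 1  # class voted for when x[i] < x[j]
    return i, j, class_if_less

def predict_tsp(model, x):
    """Classify one sample from the ordering of the two selected genes."""
    i, j, class_if_less = model
    return class_if_less if x[i] < x[j] else 1 - class_if_less

# Hypothetical expression matrix: rows are samples, columns are genes.
X = [[1.0, 2.0, 1.5], [2.0, 3.0, 4.0], [3.0, 1.0, 2.0], [5.0, 4.0, 6.0]]
y = [0, 0, 1, 1]
model = fit_tsp(X, y)
print(model, predict_tsp(model, [0.5, 0.9, 10.0]))
```

Because the decision depends only on which of two values is larger within a sample, it is unchanged by any monotone rescaling applied to that sample, which is one plausible reason such a rule transfers across array platforms as the abstract reports.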

16.

Background  

Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.

17.
18.
Monte Carlo feature selection for supervised classification   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Pre-selection of informative features for supervised classification is a crucial, albeit delicate, task. It is desirable that feature selection provides the features that contribute most to the classification task per se and which should therefore be used by any classifier later used to produce classification rules. In this article, a conceptually simple but computer-intensive approach to this task is proposed. The reliability of the approach rests on multiple construction of a tree classifier for many training sets randomly chosen from the original sample set, where samples in each training set consist of only a fraction of all of the observed features. RESULTS: The resulting ranking of features may then be used to advantage for classification via a classifier of any type. The approach was validated using Golub et al. leukemia data and the Alizadeh et al. lymphoma data. Not surprisingly, we obtained a significantly different list of genes. Biological interpretation of the genes selected by our method showed that several of them are involved in precursors to different types of leukemia and lymphoma rather than being genes that are common to several forms of cancers, which is the case for the other methods. AVAILABILITY: Prototype available upon request.

19.
Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp).
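
The abstract does not say how sequences are turned into features, so the following sketch rests on assumptions: sequences are split into overlapping k-letter words, each taxon gets add-one smoothed word frequencies, and all taxa have equal prior probability. The function names, the word length, and the toy sequences are hypothetical, and this is not the Ribosomal Database Project tool.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k):
    """All overlapping words of length k in the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(labeled_seqs, k=3):
    """Count words per taxon; labeled_seqs is a list of (sequence, taxon) pairs."""
    counts = defaultdict(Counter)
    for seq, taxon in labeled_seqs:
        counts[taxon].update(kmers(seq, k))
    vocab = {word for c in counts.values() for word in c}
    return counts, vocab, k

def classify(model, seq):
    """Return the taxon with the highest log-likelihood (uniform priors assumed)."""
    counts, vocab, k = model
    best, best_score = None, -math.inf
    for taxon, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing keeps unseen words from zeroing the likelihood.
        score = sum(math.log((c[w] + 1) / (total + len(vocab)))
                    for w in kmers(seq, k))
        if score > best_score:
            best, best_score = taxon, score
    return best

# Hypothetical training sequences for two made-up taxa.
model = train([("AAAAAAAA", "taxon_A"), ("CCCCCCCC", "taxon_C")])
print(classify(model, "AAAAC"))
```

Classification here needs only one pass over the query's words per taxon, with no pairwise alignment, which is consistent with the large speed advantage over BLASTN reported in the abstract.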

20.