期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Feature selection based on empirical-risk function to detect lesions in vascular computed tomography

《IRBM》2014,35(5):244-254

ObjectiveThe overall goal of the study is to detect coronary artery lesions regardless their nature, calcified or hypo-dense. To avoid explicit modelling of heterogeneous lesions, we adopted an approach based on machine learning and using unsupervised or semi-supervised classifiers. The success of the classifiers based on machine learning strongly depends on the appropriate choice of features differentiating between lesions and regular appearance. The specific goal of this article is to propose a novel strategy devised to select the best feature set for the classifiers used, out of a given set of candidate features.Materials and methodsThe features are calculated in image planes orthogonal to the artery centerline, and the classifier assigns to each of these cross-sections a label “healthy” or “diseased”. The contribution of this article is a feature-selection strategy based on the empirical risk function that is used as a criterion in the initial feature ranking and in the selection process itself. We have assessed this strategy in association with two classifiers based on the density-level detection approach that seeks outliers from the distribution corresponding to the regular appearance. The method was evaluated using a total of 13,687 cross-sections extracted from 53 coronary arteries in 15 patients.ResultsUsing the feature subset selected by the risk-based strategy, balanced error rates achieved by the unsupervised and semi-supervised classifiers respectively were equal to 13.5% and 15.4%. These results were substantially better than the rates achieved using feature subsets selected by supervised strategies. The unsupervised and semi-supervised methods also outperformed supervised classifiers using feature subsets selected by the corresponding supervised strategies.DiscussionSupervised methods require large data sets annotated by experts, both to select the features and to train the classifiers, and collecting these annotations is time-consuming. With these methods, lesions whose appearance differs from the training data may remain undetected. Lesion-detection problem is highly imbalanced, since healthy cross-sections usually are much more numerous than the diseased ones. Training the classifiers based on the density-level detection approach needs a small number of annotations or no annotations at all. The same annotations are sufficient to compute the empirical risk and to perform the selection. Therefore, our strategy associated with an unsupervised or semi-supervised classifier requires a considerably smaller number of annotations as compared to conventional supervised selection strategies. The approach proposed is also better suited for highly imbalanced problems and can detect lesions differing from the training set.ConclusionThe risk-based selection strategy, associated with classifiers using the density-level detection approach, outperformed other strategies and classifiers when used to detect coronary artery lesions. It is well suited for highly imbalanced problems, where the lesions are represented as low-density regions of the feature space, and it can be used in other anomaly detection problems interpretable as a binary classification problem where the empirical risk can be calculated. 相似文献

2.

Predictive gene lists for breast cancer prognosis: A topographic visualisation study

Mingmanas Sivaraksa David Lowe 《BMC medical genomics》2008,1(1):1-23

Background

The controversy surrounding the non-uniqueness of predictive gene lists (PGL) of small selected subsets of genes from very large potential candidates as available in DNA microarray experiments is now widely acknowledged [1]. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high dimensional spaces. In this work we outline a different approach based around an unsupervised patient-specific nonlinear topographic projection in predictive gene lists.

Methods

We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, the Stochastic Neighbor Embedding(SNE) and the Locally Linear Embedding(LLE) techniques have been used to construct two-dimensional projective visualisation plots of 70 dimensional PGLs per patient, classifiers are also constructed to identify the prognosis indicator of each patient using the resulting projections from those visualisation techniques and investigate whether a-posteriori two prognosis groups are separable on the evidence of the gene lists. A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, but based on the projections derived from the original dataset.

Results

The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between two prognosis patients. Uncertainty and diversity across multiple gene expressions prevents unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results.

Conclusion

The random correlation effect to an arbitrary outcome induced by small subset selection from very high dimensional interrelated gene expression profiles leads to an outcome with associated uncertainty. This continuum and uncertainty precludes any attempts at constructing discriminative classifiers. However a patient's gene expression profile could possibly be used in treatment planning, based on knowledge of other patients' responses. We conclude that many of the patients involved in such medical studies are intrinsically unclassifiable on the basis of provided PGL evidence. This additional category of 'unclassifiable' should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided. 相似文献

3.

StAR: a simple tool for the statistical comparison of ROC curves

Ismael A Vergara Tomás Norambuena Evandro Ferrada Alex W Slater Francisco Melo 《BMC bioinformatics》2008,9(1):265

Background

As in many different areas of science and technology, most important problems in bioinformatics rely on the proper development and assessment of binary classifiers. A generalized assessment of the performance of binary classifiers is typically carried out through the analysis of their receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) constitutes a popular indicator of the performance of a binary classifier. However, the assessment of the statistical significance of the difference between any two classifiers based on this measure is not a straightforward task, since not many freely available tools exist. Most existing software is either not free, difficult to use or not easy to automate when a comparative assessment of the performance of many binary classifiers is intended. This constitutes the typical scenario for the optimization of parameters when developing new classifiers and also for their performance validation through the comparison to previous art. 相似文献

4.

Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

Ajay Basavanhally Satish Viswanath Anant Madabhushi 《PloS one》2015,10(5)

Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data as well as three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling for all three datasets. 相似文献

5.

Detecting epileptic seizure with different feature extracting strategies using robust machine learning classification techniques by applying advance parameter optimization approach

Lal Hussain 《Cognitive neurodynamics》2018,12(3):271-294

Epilepsy is a neurological disorder produced due to abnormal excitability of neurons in the brain. The research reveals that brain activity is monitored through electroencephalogram (EEG) of patients suffered from seizure to detect the epileptic seizure. The performance of EEG detection based epilepsy require feature extracting strategies. In this research, we have extracted varying features extracting strategies based on time and frequency domain characteristics, nonlinear, wavelet based entropy and few statistical features. A deeper study was undertaken using novel machine learning classifiers by considering multiple factors. The support vector machine kernels are evaluated based on multiclass kernel and box constraint level. Likewise, for K-nearest neighbors (KNN), we computed the different distance metrics, Neighbor weights and Neighbors. Similarly, the decision trees we tuned the paramours based on maximum splits and split criteria and ensemble classifiers are evaluated based on different ensemble methods and learning rate. For training/testing tenfold Cross validation was employed and performance was evaluated in form of TPR, NPR, PPV, accuracy and AUC. In this research, a deeper analysis approach was performed using diverse features extracting strategies using robust machine learning classifiers with more advanced optimal options. Support Vector Machine linear kernel and KNN with City block distance metric give the overall highest accuracy of 99.5% which was higher than using the default parameters for these classifiers. Moreover, highest separation (AUC = 0.9991, 0.9990) were obtained at different kernel scales using SVM. Additionally, the K-nearest neighbors with inverse squared distance weight give higher performance at different Neighbors. Moreover, to distinguish the postictal heart rate oscillations from epileptic ictal subjects, and highest performance of 100% was obtained using different machine learning classifiers. 相似文献

6.

A critical evaluation of network and pathway-based classifiers for outcome prediction in breast cancer

Staiger C Cadot S Kooter R Dittrich M Müller T Klau GW Wessels LF 《PloS one》2012,7(4):e34796

Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single genes classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single genes classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single genes classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single genes sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single genes classifiers for predicting outcome in breast cancer. 相似文献

7.

Comparison of Two In Vitro Extraction Protocols for Assessing Metals’ Bioaccessibility Using Dust and Soil Reference Materials

Matt Dodd Pat E. Rasmussen Marc Chénier 《人类与生态风险评估》2013,19(4):1014-1027

The bioaccessibility of arsenic, cadmium, chromium, copper, lead, nickel, and zinc in four National Institute of Standards and Technology (NIST) standard reference materials and two Canadian dust samples as determined using the Solubility/Bioavailability Research Consortium (SBRC) in vitro procedure ranged from a low of 1.8% for chromium in standard reference material NIST 2711 to a high of 95.2% for cadmium in NIST 2584. The SBRC data were compared to data generated using a modified EN-71 Toy Safety protocol conducted at two different laboratories. Results for the two extraction methods compared well with differences between the means (SBRC vs. modified EN-71) generally less than 10% for the majority of the metals. These differences between the two extraction methods were negligible compared to variability caused by (a) the inherent heterogeneity of typical house dust samples and (b) differences in ICP-MS analytical approaches employed in the different laboratories. Results indicate that the modified EN-71 method is useful and appropriate as a relatively simple, rapid, and reproducible screening test for estimating metals’ bioaccessibility in soil and dust samples. 相似文献

8.

Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art

Walia RR Caragea C Lewis BA Towfic FG Terribilini M El-Manzalawy Y Dobbs D Honavar V 《BMC bioinformatics》2012,13(1):89

ABSTRACT: BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naive Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons. 相似文献

9.

High performance set of PseAAC and sequence based descriptors for protein classification

Loris Nanni Sheryl Brahnam 《Journal of theoretical biology》2010,266(1):1-3036

相似文献

10.

Integrating new data balancing technique with committee networks for imbalanced data: GRSOM approach

Danaipong Chetchotsak Sirorat Pattanapairoj Banchar Arnonkijpanich 《Cognitive neurodynamics》2015,9(6):627-638

To deal with imbalanced data in a classification problem, this paper proposes a data balancing technique to be used in conjunction with a committee network. The proposed data balancing technique is based on the concept of the growing ring self-organizing map (GRSOM) which is an unsupervised learning algorithm. GRSOM balances the data through growing new data on a well-defined ring structure, which is iteratively developed based on the winning node nearby the samples. Accordingly, the new balanced data still preserve the topology of the original data. The performance of our proposed method is evaluated using four real data sets from the UCI Machine Learning Repository and the classification performance is measured using the fivefold cross validation method. Classifiers with most common data balancing techniques, namely the Minority Over-Sampling Technique (SMOTE) and the Random under-sampling Technique (RT), are used as the baseline methods in this study. The results reveal that a committee of classifiers constructed using GRSOM performs at least as well as the baseline methods. The results also suggest that classifiers constructed using neural networks with the backpropagation algorithm are more robust than those using the support vector machine. 相似文献

11.

PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines

Chatterjee P Basu S Kundu M Nasipuri M Plewczynski D 《Journal of molecular modeling》2011,17(9):2191-2201

Secondary structure prediction is a crucial task for understanding the variety of protein structures and performed biological functions. Prediction of secondary structures for new proteins using their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel technique to predict protein secondary structures based on position-specific scoring matrices (PSSMs) and physico-chemical properties of amino acids. It is a two stage approach involving multiclass support vector machines (SVMs) as classifiers for three different structural conformations, viz., helix, sheet and coil. In the first stage, PSSMs obtained from PSI-BLAST and five specially selected physicochemical properties of amino acids are fed into SVMs as features for sequence-to-structure prediction. Confidence values for forming helix, sheet and coil that are obtained from the first stage SVM are then used in the second stage SVM for performing structure-to-structure prediction. The two-stage cascaded classifiers (PSP_MCSVM) are trained with proteins from RS126 dataset. The classifiers are finally tested on target proteins of critical assessment of protein structure prediction experiment-9 (CASP9). PSP_MCSVM with brainstorming consensus procedure performs better than the prediction servers like Predator, DSC, SIMPA96, for randomly selected proteins from CASP9 targets. The overall performance is found to be comparable with the current state-of-the art. PSP_MCSVM source code, train-test datasets and supplementary files are available freely in public domain at: and 相似文献

12.

Comparison of disease prevalence in two populations in the presence of misclassification

Man‐Lai Tang Shi‐Fang Qiu Wai‐Yin Poon 《Biometrical journal. Biometrische Zeitschrift》2012,54(6):786-807

Comparing disease prevalence in two groups is an important topic in medical research, and prevalence rates are obtained by classifying subjects according to whether they have the disease. Both high‐cost infallible gold‐standard classifiers or low‐cost fallible classifiers can be used to classify subjects. However, statistical analysis that is based on data sets with misclassifications leads to biased results. As a compromise between the two classification approaches, partially validated sets are often used in which all individuals are classified by fallible classifiers, and some of the individuals are validated by the accurate gold‐standard classifiers. In this article, we develop several reliable test procedures and approximate sample size formulas for disease prevalence studies based on the difference between two disease prevalence rates with two independent partially validated series. Empirical studies show that (i) the Score test produces close‐to‐nominal level and is preferred in practice; and (ii) the sample size formula based on the Score test is also fairly accurate in terms of the empirical power and type I error rate, and is hence recommended. A real example from an aplastic anemia study is used to illustrate the proposed methodologies. 相似文献

13.

Identification of Plasmodium vivax proteins with potential role in invasion using sequence redundancy reduction and profile hidden Markov models

Restrepo-Montoya D Becerra D Carvajal-Patiño JG Mongui A Niño LF Patarroyo ME Patarroyo MA 《PloS one》2011,6(10):e25189

相似文献

14.

A Bayesian method for comparing and combining binary classifiers in the absence of a gold standard

JM Keith CM Davey SE Boyd 《BMC bioinformatics》2012,13(1):179

ABSTRACT: BACKGROUND: Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers. RESULTS: We compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from http://bioinformatics.monash.edu.au/downloads/. CONCLUSIONS: The examples demonstrate the methods are suitable for both small and large data sets, applicable to the wide range of bioinformatics classification problems, and robust to dependence between classifiers. In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three out of four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal. 相似文献

15.

Malignancy associated changes in epithelial cells of buccal mucosa: a potential cancer detection test

Us-Krasovec M Erzen J Zganec M Strojan-Flezar M Lavrencak J Garner D Doudkine A Palcic B 《Analytical and quantitative cytology and histology / the International Academy of Cytology [and] American Society of Cytology》2005,27(5):254-262

OBJECTIVE: To analyze the presence of malignancy associated changes (MACs) in normal buccal mucosa cells of lung and breast cancer patients and their relationship to tumor subtype, stage and size. STUDY DESIGN: Buccal mucosa smears of 107 lung cancer and 100 breast cancer patients and corresponding healthy subjects were collected, stained by the DNA-specific Feulgen-thionin method and scanned using an automated high-resolution cytometer. Nuclear texture features of a minimum of 500 nuclei per slide were calculated, and statistical classifiers using Gaussian models of class-probability distribution were designed, trained and tested in 3 parts: (1) ability to separate cancer patient samples from controls, (2) cross-validation of classifiers for different cancer types, and (3) correlation of MAC expression with tumor subtype, stage and size. RESULTS: Lung and breast cancer induce MACs in normal buccal mucosa cells. The classifiers based on the selected nuclear features correctly recognized >80% of lung and breast cancer cases. The results indicate that MAC detection is not dependent on the tumor subtype, stage or size. CONCLUSION: The presence of MACs in buccal mucosa cells offers the potential for developing a new noninvasive cancer screening test. 相似文献

16.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction 总被引：1，自引：0，他引：1

Qi Y Bar-Joseph Z Klein-Seetharaman J 《Proteins》2006,63(3):490-500

Protein–protein interactions play a key role in many biological systems. High‐throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false‐positive and false‐negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co‐complex relationship, and (3) pathway co‐membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity‐based k‐Nearest‐Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co‐complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top‐ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast‐2‐hybrid system were not among the top‐ranking features under any condition. Proteins 2006. © 2006 Wiley‐Liss, Inc. 相似文献

17.

An eco-informatics tool for microbial community studies: supervised classification of Amplicon Length Heterogeneity (ALH) profiles of 16S rRNA

Yang C Mills D Mathee K Wang Y Jayachandran K Sikaroodi M Gillevet P Entry J Narasimhan G 《Journal of microbiological methods》2006,65(1):49-62

Support vector machines (SVM) and K-nearest neighbors (KNN) are two computational machine learning tools that perform supervised classification. This paper presents a novel application of such supervised analytical tools for microbial community profiling and to distinguish patterning among ecosystems. Amplicon length heterogeneity (ALH) profiles from several hypervariable regions of 16S rRNA gene of eubacterial communities from Idaho agricultural soil samples and from Chesapeake Bay marsh sediments were separately analyzed. The profiles from all available hypervariable regions were concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. Each profile was labeled with information about the location or time of its sampling. We hypothesized that after a learning phase using feature vectors from labeled ALH profiles, both these classifiers would have the capacity to predict the labels of previously unseen samples. The resulting classifiers were able to predict the labels of the Idaho soil samples with high accuracy. The classifiers were less accurate for the classification of the Chesapeake Bay sediments suggesting greater similarity within the Bay's microbial community patterns in the sampled sites. The profiles obtained from the V1+V2 region were more informative than that obtained from any other single region. However, combining them with profiles from the V1 region (with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition of profiles from the V 9 region appeared to confound the classifiers. Our results show that SVM and KNN classifiers can be effectively applied to distinguish between eubacterial community patterns from different ecosystems based only on their ALH profiles. 相似文献

18.

Comparison of Hybrid Classifiers for Crop Classification Using Normalized Difference Vegetation Index Time Series: A Case Study for Major Crops in North Xinjiang,China

Pengyu Hao Li Wang Zheng Niu 《PloS one》2015,10(9)

A range of single classifiers have been proposed to classify crop types using time series vegetation indices, and hybrid classifiers are used to improve discriminatory power. Traditional fusion rules use the product of multi-single classifiers, but that strategy cannot integrate the classification output of machine learning classifiers. In this research, the performance of two hybrid strategies, multiple voting (M-voting) and probabilistic fusion (P-fusion), for crop classification using NDVI time series were tested with different training sample sizes at both pixel and object levels, and two representative counties in north Xinjiang were selected as study area. The single classifiers employed in this research included Random Forest (RF), Support Vector Machine (SVM), and See 5 (C 5.0). The results indicated that classification performance improved (increased the mean overall accuracy by 5%~10%, and reduced standard deviation of overall accuracy by around 1%) substantially with the training sample number, and when the training sample size was small (50 or 100 training samples), hybrid classifiers substantially outperformed single classifiers with higher mean overall accuracy (1%~2%). However, when abundant training samples (4,000) were employed, single classifiers could achieve good classification accuracy, and all classifiers obtained similar performances. Additionally, although object-based classification did not improve accuracy, it resulted in greater visual appeal, especially in study areas with a heterogeneous cropping pattern. 相似文献

19.

Ensembles of neural networks based on the alteration of input feature values

Akhand MA Murase K 《International journal of neural systems》2012,22(1):77-87

An ensemble performs well when the component classifiers are diverse yet accurate, so that the failure of one is compensated for by others. A number of methods have been investigated for constructing ensemble in which some of them train classifiers with the generated patterns. This study investigates a new technique of training pattern generation. The method alters input feature values of some patterns using the values of other patterns to generate different patterns for different classifiers. The effectiveness of neural network ensemble based on the proposed technique was evaluated using a suite of 25 benchmark classification problems, and was found to achieve performance better than or competitive with related conventional methods. Experimental investigation of different input values alteration techniques finds that alteration with pattern values in the same class is better for generalization, although other alteration techniques may offer more diversity. 相似文献

20.

Eigenspace Time Frequency Based Features for Accurate Seizure Detection from EEG Data

M. Deriche S. Arafat S. Al-Insaif M. Siddiqui 《IRBM》2019,40(2):122-132

Background

Epilepsy is a neurological disorder that affects over 2% of the world population. Epilepsy patients suffer from recurring seizures that can be very harmful. The unpredictability of seizures is a major concern for medical practitioners because uncontrollable seizures can lead to sudden death and morbidity. A system that could warn patients and doctors alike about the impending seizure event would dramatically enhance the quality of life for patients.

Methods

While most previous research works focused on using signal processing tools appropriate for stationary signals, we propose here to use time and frequency (TF) analysis to extract features capable of discriminating normal from abnormal EEG traces (both ictal and interictal). The features are extracted using Singular Value Decomposition (SVD) of the EEG signal Time Frequency matrix. The left singular vectors of the time frequency matrix are used to obtain robust feature vectors. In contrast to existing techniques, the proposed TF-based technique can be used to detect the specific moments of seizure occurrences in time so that this information is used to discriminate interictal from ictal EEG traces. Instead of extracting the features directly from the TF matrix, we transform the left eigenvectors obtained from the SVD of the TF matrix into a feature vector that behaves like to a probability density function.

Results

We show that almost all classical classification techniques achieve excellent seizure detection results when used with the proposed TF features, irrespective of the classifier used. Contrary to existing works, we test our approach across several real-life scenarios covering 2, 3, and 5 possible classes of data. Our tests provided consistent results across different scenarios. The results, under different scenarios, outperformed existing ones achieving consistently more than 97.3% and up to 99.5% in terms of accuracy, sensitivity, and specificity.

Conclusion

Experimental results show that the novel features have successfully represented the characteristics of the underlying disease phenomenon from EEG data. Also, we conclude that learning based classifiers are better suited for this application, compared to Bayesian classifiers that have difficulty in adapting to the varying nature of the features' probability distribution function. 相似文献