
1.

An eco-informatics tool for microbial community studies: supervised classification of Amplicon Length Heterogeneity (ALH) profiles of 16S rRNA

Yang C Mills D Mathee K Wang Y Jayachandran K Sikaroodi M Gillevet P Entry J Narasimhan G 《Journal of microbiological methods》2006,65(1):49-62

Support vector machines (SVM) and K-nearest neighbors (KNN) are two computational machine learning tools that perform supervised classification. This paper presents a novel application of such supervised analytical tools for microbial community profiling and to distinguish patterning among ecosystems. Amplicon length heterogeneity (ALH) profiles from several hypervariable regions of the 16S rRNA gene of eubacterial communities from Idaho agricultural soil samples and from Chesapeake Bay marsh sediments were separately analyzed. The profiles from all available hypervariable regions were concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. Each profile was labeled with information about the location or time of its sampling. We hypothesized that after a learning phase using feature vectors from labeled ALH profiles, both classifiers would have the capacity to predict the labels of previously unseen samples. The resulting classifiers were able to predict the labels of the Idaho soil samples with high accuracy. The classifiers were less accurate for the classification of the Chesapeake Bay sediments, suggesting greater similarity within the Bay's microbial community patterns in the sampled sites. The profiles obtained from the V1+V2 region were more informative than those obtained from any other single region. However, combining them with profiles from the V1 region (with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition of profiles from the V9 region appeared to confound the classifiers. Our results show that SVM and KNN classifiers can be effectively applied to distinguish between eubacterial community patterns from different ecosystems based only on their ALH profiles.
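
The concatenate-then-classify step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the region names, bin counts, abundance values, site labels, and the helper names `concatenate_profiles` and `knn_predict` are all assumptions made for the example, and plain Euclidean distance is assumed as the KNN metric.

```python
import math
from collections import Counter

def concatenate_profiles(region_profiles):
    """Join per-region ALH profiles (lists of relative peak abundances) into one vector."""
    combined = []
    for region in sorted(region_profiles):  # fixed region order keeps features aligned
        combined.extend(region_profiles[region])
    return combined

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two regions with three length bins each.
train = [
    ({"V1": [0.7, 0.2, 0.1], "V3": [0.6, 0.3, 0.1]}, "site_A"),
    ({"V1": [0.6, 0.3, 0.1], "V3": [0.7, 0.2, 0.1]}, "site_A"),
    ({"V1": [0.1, 0.2, 0.7], "V3": [0.1, 0.3, 0.6]}, "site_B"),
    ({"V1": [0.2, 0.2, 0.6], "V3": [0.1, 0.2, 0.7]}, "site_B"),
]
train_vectors = [concatenate_profiles(profiles) for profiles, _ in train]
train_labels = [label for _, label in train]

# A previously unseen sample, encoded the same way as the training data.
unseen = concatenate_profiles({"V1": [0.65, 0.25, 0.10], "V3": [0.6, 0.3, 0.1]})
predicted = knn_predict(train_vectors, train_labels, unseen, k=3)
```

Dropping a region from the dictionaries before concatenation is one way to test whether that region helps or confounds the classifier, which is the kind of comparison the abstract reports for V9.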

2.

Recognition of cognitive load with a stacking network ensemble of denoising autoencoders and abstracted neurophysiological features

Zixuan Cao Zhong Yin Jianhua Zhang 《Cognitive neurodynamics》2021,15(3):425

The safety of human–machine systems can be indirectly evaluated based on the operator’s cognitive load levels at each temporal instant. However, relevant features of cognitive states are hidden in multiple sources of cortical neural responses. In this study, we developed a novel neural network ensemble, SE-SDAE, based on stacked denoising autoencoders (SDAEs), which identifies different levels of cognitive load from electroencephalography (EEG) signals. To improve the generalization capability of the ensemble framework, a stacking-based approach is adopted to fuse the abstracted EEG features from activations of deep-structured hidden layers. In particular, we also combine multiple K-nearest neighbor and naive Bayesian classifiers with SDAEs to generate a heterogeneous classification committee to enhance the ensemble’s diversity. Finally, we validate the proposed SE-SDAE by comparing its performance with mainstream pattern classifiers for cognitive load evaluation to show its effectiveness.

3.

Aesthetic preference recognition of 3D shapes using EEG

Lin Hou Chew Jason Teo James Mountstephens 《Cognitive neurodynamics》2016,10(2):165-173

Recognition and identification of aesthetic preference is indispensable in industrial design. Humans tend to pursue products with aesthetic values and make buying decisions based on their aesthetic preferences. Neuromarketing aims to understand consumer responses toward marketing stimuli by using imaging techniques and recognition of physiological parameters. Numerous studies have been done to understand the relationship between humans, art and aesthetics. In this paper, we present a novel preference-based measurement of user aesthetics using electroencephalogram (EEG) signals for virtual 3D shapes with motion. The 3D shapes are designed to appear like bracelets, which are generated by using the Gielis superformula. EEG signals were collected by using a medical grade device, the B-Alert X10 from Advanced Brain Monitoring, with a sampling frequency of 256 Hz and resolution of 16 bits. The signals obtained when viewing 3D bracelet shapes were decomposed into alpha, beta, theta, gamma and delta rhythms by using time–frequency analysis, then classified into two classes, namely like and dislike, by using support vector machine and K-nearest neighbors (KNN) classifiers. Classification accuracy of up to 80% was obtained by using KNN with the alpha, theta and delta rhythms as the features extracted from frontal channels Fz, F3 and F4 to classify the two classes, like and dislike.

4.

Detection of Seizure Event and Its Onset/Offset Using Orthonormal Triadic Wavelet Based Features

G. Chandel P. Upadhyaya O. Farooq Y.U. Khan 《IRBM》2019,40(2):103-112

Background

Epileptic seizures are unpredictable in nature and their quick detection is important for immediate treatment of patients. In the last few decades researchers have proposed different algorithms for onset and offset detection of seizures using Electroencephalogram (EEG) signals.

Methods

In this paper, a combined approach for onset and offset detection is proposed using Triadic wavelet decomposition based features. Standard deviation, variance and higher order moments were extracted as significant features to represent different EEG activities. Classification between seizure and non-seizure EEG was carried out using linear discriminant analysis (LDA) and k-nearest neighbour (KNN) classifiers. The method was tested using two benchmark EEG datasets in the field of seizure detection. The CHBMIT EEG dataset was used for evaluating the performance of the proposed seizure onset and offset detection method. Further, to test the robustness of the algorithm, the effect of the signal-to-noise ratio on the detection accuracy was also investigated using the Bonn University EEG dataset.

Results

The seizure onset and offset detection method yielded classification accuracy, specificity and sensitivity of 99.45%, 99.62% and 98.36% respectively, with 6.3 s onset and −1.17 s offset latency using the KNN classifier. The seizure detection method using the Bonn University EEG dataset achieved classification accuracy of 92% when SNR = 5 dB, 94% when SNR = 10 dB, and 96% when SNR = 20 dB, while it also yielded 96% accuracy for noiseless EEG.

Conclusion

The present study focuses on detection of seizure onset and offset rather than only seizure detection. The major contribution of this work is that the novel triadic wavelet transform based method is developed for the analysis of EEG signals. The results show improvement over other existing dyadic wavelet based Triadic techniques.

5.

Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition (total citations: 1; self-citations: 0; citations by others: 1)

Li LQ Zhang Y Zou LY Zhou Y Zheng XQ 《Protein and peptide letters》2012,19(4):375-387

Many proteins bear multi-locational characteristics, and this phenomenon is closely related to biological function. However, most of the existing methods can only deal with single-location proteins. Therefore, an automatic and reliable ensemble classifier for protein subcellular multi-localization is needed. We propose a new ensemble classifier combining the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic, Gram-negative bacterial and viral proteins based on the general form of Chou's pseudo amino acid composition, i.e., GO (gene ontology) annotations, dipeptide composition and AmPseAAC (Amphiphilic pseudo amino acid composition). This ensemble classifier was developed by fusing many basic individual classifiers through a voting system. The overall prediction accuracies obtained by the KNN-SVM ensemble classifier are 95.22, 93.47 and 80.72% for the eukaryotic, Gram-negative bacterial and viral proteins, respectively. Our prediction accuracies are significantly higher than those by previous methods and reveal that our strategy better predicts subcellular locations of multi-location proteins.

6.

Epileptic Seizure Detection Based on New Hybrid Models with Electroencephalogram Signals

《IRBM》2020,41(6):331-353

Objectives: Epileptic seizures are one of the most common diseases in society and are difficult to detect. In this study, a new method was proposed to automatically detect and classify epileptic seizures from EEG (Electroencephalography) signals. Methods: In the proposed method, EEG signals were classified into five classes (eyes open, eyes closed, healthy, from the tumor region, and epileptic seizure) using the support vector machine (SVM) and normalization methods comprising the z-score, minimum-maximum, and MAD normalizations. To classify the EEG signals, support vector machine classifiers with different kernel functions, including Linear, Cubic, and Medium Gaussian, were used. To evaluate the performance of the proposed hybrid models, the confusion matrix, ROC curves, and classification accuracy were used. The SVM models used are Linear SVM, Cubic SVM, and Medium Gaussian SVM. Results: Without the normalizations, the obtained classification accuracies are 76.90%, 82.40%, and 81.70% using Linear SVM, Cubic SVM, and Medium Gaussian SVM, respectively. After applying the z-score normalization to the multi-class EEG signals dataset, the obtained classification accuracies are 77.10%, 82.30%, and 81.70% using Linear SVM, Cubic SVM, and Medium Gaussian SVM, respectively. With the minimum-maximum normalization, the obtained classification accuracies are 77.20%, 82.40%, and 81.50% using Linear SVM, Cubic SVM, and Medium Gaussian SVM, respectively. Finally, after applying the MAD normalization to the multi-class EEG signals dataset, the obtained classification accuracies are 76.70%, 82.50%, and 81.40% using Linear SVM, Cubic SVM, and Medium Gaussian SVM, respectively. Conclusion: The obtained results have shown that the best hybrid model is the combination of cubic SVM and MAD normalization in the five-class classification of EEG signals.
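
The three normalizations named in the abstract can be sketched for a single feature column as below. This is an illustrative sketch only. The abstract does not define "MAD", so the median absolute deviation is assumed here (the mean absolute deviation is another common reading), and the function names and sample values are hypothetical.

```python
import statistics

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mad_normalize(values):
    """Center on the median and scale by the median absolute deviation (assumed meaning of MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]

# Hypothetical feature column; a constant column would divide by zero in all three helpers.
feature = [2.0, 4.0, 6.0, 8.0, 10.0]
z = z_score(feature)
mm = min_max(feature)
md = mad_normalize(feature)
```

Each helper would be applied per feature before the values are passed to the SVM, so that features on different scales contribute comparably to the kernel.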

7.

EEG-based epileptic seizure prediction: features and recognition

单宝莲 张力新 徐舫舟 许敏鹏 于海情 魏斯文 明东 《生物化学与生物物理进展》2023,50(2):322-333

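Decoding the abnormal epileptiform discharge activity of neuronal populations in pre-seizure EEG signals, so that seizures can be predicted effectively and interventions applied before onset, can markedly reduce the damage caused by the disease and is one of the research hotspots in epilepsy prevention and treatment. The key to EEG-based seizure prediction lies in recognizing the abnormal states of the interictal and preictal periods. Studying the differences in neurodynamic features between these two states is of great value for clarifying the pathogenesis of epilepsy, selecting highly discriminative features, and thereby effectively identifying the seizure stage of this progressive disease. Researchers have so far thoroughly surveyed the mainstream feature extraction and pattern recognition methods, but have overlooked the significance of changes in neural dynamic features for seizure prediction. Accordingly, this paper summarizes five typical classes of feature analysis methods for seizure prediction together with their advantages and disadvantages, focuses on the dynamic changes in neurophysiological features from the interictal to the preictal period and their dynamical properties, and compares the mainstream machine learning and deep learning feature recognition methods in this field, with the aim of providing new ideas for establishing accurate and efficient seizure prediction techniques.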

8.

Argumentation Based Joint Learning: A Novel Ensemble Learning Approach

Junyi Xu Li Yao Le Li 《PloS one》2015,10(5)

Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high quality knowledge for the ensemble classifier and improve the performance of classification.

9.

Predicting the potential distribution of wheatear birds using stacked generalization-based ensembles

《Ecological Informatics》2023

Habitat suitability models, usually referred to as species distribution models (SDMs), are widely applied in ecology for many purposes, including species conservation, habitat discovery, and gaining evolutionary insights by estimating the distribution of species. Machine learning algorithms as well as statistical models have recently been used to predict the distribution of species. However, they appear to have some limitations due to the data and the models used. Therefore, this study proposes a novel approach for assessing habitat suitability based on ensemble learning techniques. Three heterogeneous ensembles were built using the stacked generalization method to model the distribution of four wheatear species (Oenanthe deserti, Oenanthe leucopyga, Oenanthe leucura, and Oenanthe oenanthe) located in Morocco. Initially, a set of base learners was constructed by training six machine learning algorithms on each species' dataset (Multi-Layer Perceptron (MLP), Support Vector Classifier (SVC), K-nearest neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GB), and Random Forest (RF)). Then, the predictions of these base learners were fed as training data to train three meta-learners (Logistic Regression (LR), SVC, and MLP). To evaluate and assess the performance of the proposed approaches, we used: (1) six performance criteria (accuracy, recall, precision, F1-score, AUC, and TSS), (2) the Borda Count (BC) ranking method based on multiple criteria to rank the best-performing models, and (3) the Scott Knott (SK) test to statistically compare the performance of the presented models. The results based on the six evaluation metrics showed that stacked ensembles outperformed their single-model counterparts in all species datasets, and the stacked model with SVC as a meta-learner outperformed the other two ensembles.
The results showed the potential of using ensemble learning techniques to model species distribution, and we recommend the stacked generalization technique as a combination strategy since it gave better results than single models on the four wheatear species datasets. Moreover, to assess the impact of future climate changes on the distribution of the four wheatear species, the best-performing distribution model was selected and projected into the current and future climatic conditions. The distributions of the Moroccan wheatear birds were found to be slightly affected by future climate changes.
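
Five of the six performance criteria listed in this abstract follow directly from a binary presence/absence confusion matrix, as the sketch below illustrates. The labels and predictions are made-up values, the function names are hypothetical, and AUC is left out because it needs continuous scores rather than hard labels. TSS (true skill statistic) is computed with its standard definition, sensitivity + specificity − 1.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels coded 1 (presence) and 0 (absence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def presence_absence_metrics(y_true, y_pred):
    """Return accuracy, recall, precision, F1-score and TSS as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "tss": recall + specificity - 1,
    }

# Hypothetical ground truth and model output for eight sites.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = presence_absence_metrics(y_true, y_pred)
```

Unlike accuracy, TSS ranges from −1 to 1 and is not inflated when one class dominates, which is one reason it is commonly reported for species distribution models.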

10.

Multiway analysis of epilepsy tensors (total citations: 1; self-citations: 0; citations by others: 1)

Acar E Aykut-Bingol C Bingol H Bro R Yener B 《Bioinformatics (Oxford, England)》2007,23(13):i10-i18

MOTIVATION: The success or failure of an epilepsy surgery depends greatly on the localization of the epileptic focus (origin of a seizure). We address the problem of identification of a seizure origin through an analysis of ictal electroencephalogram (EEG), which is proven to be an effective standard in epileptic focus localization. SUMMARY: With a goal of developing an automated and robust way of visual analysis of large amounts of EEG data, we propose a novel approach based on multiway models to study epilepsy seizure structure. Our contributions are 3-fold. First, we construct an Epilepsy Tensor with three modes, i.e. time samples, scales and electrodes, through wavelet analysis of multi-channel ictal EEG. Second, we demonstrate that multiway analysis techniques, in particular parallel factor analysis (PARAFAC), provide promising results in modeling the complex structure of an epilepsy seizure, localizing a seizure origin and extracting artifacts. Third, we introduce an approach for removing artifacts using multilinear subspace analysis and discuss its merits and drawbacks. RESULTS: Ictal EEG analysis of 10 seizures from 7 patients is included in this study. Our results for 8 seizures match with clinical observations in terms of seizure origin and extracted artifacts. On the other hand, for 2 of the seizures, seizure localization is not achieved using an initial trial of PARAFAC modeling. In these cases, first, we apply an artifact removal method and subsequently apply the PARAFAC model on the epilepsy tensor from which potential artifacts have been removed. This method successfully identifies the seizure origin in both cases.

11.

Automated detection of driver fatigue based on EEG signals using gradient boosting decision tree model

Jianfeng Hu Jianliang Min 《Cognitive neurodynamics》2018,12(4):431-440

Driver fatigue is increasingly a contributing factor for traffic accidents, so an effective method to automatically detect driver fatigue is urgently needed. In this study, in order to catch the main characteristics of the EEG signals, four types of entropies (based on the EEG signal of a single channel) were calculated as the feature sets, including sample entropy, fuzzy entropy, approximate entropy and spectral entropy. All feature sets were used as the input of a gradient boosting decision tree (GBDT), a fast and highly accurate boosting ensemble method. The output of GBDT determined whether a driver was in a fatigue state or not based on their EEG signals. Three state-of-the-art classifiers, k-nearest neighbor, support vector machine and neural network were also employed. To assess our method, several experiments including parameter setting and classification performance comparison were performed on 22 subjects. The results indicated that it is possible to use only one EEG channel to detect a driver fatigue state. The average highest recognition rate in this work was up to 94.0%, which could meet the needs of daily applications. Our GBDT-based method may assist in the detection of driver fatigue.

12.

Automatic detection of epileptic EEG signals using higher order cumulant features

Acharya UR Sree SV Suri JS 《International journal of neural systems》2011,21(5):403-414

The unpredictability of the occurrence of epileptic seizures makes it difficult to detect and treat this condition effectively. An automatic system that characterizes epileptic activities in EEG signals would allow patients or the people near them to take appropriate precautions, would allow clinicians to better manage the condition, and could provide more insight into these phenomena thereby revealing important clinical information. Various methods have been proposed to detect epileptic activity in EEG recordings. Because of the nonlinear and dynamic nature of EEG signals, the use of nonlinear Higher Order Spectra (HOS) features is a seemingly promising approach. This paper presents the methodology employed to extract HOS features (specifically, cumulants) from normal, interictal, and epileptic EEG segments and to use significant features in classifiers for the detection of these three classes. In this work, 300 sets of EEG data belonging to the three classes were used for feature extraction and classifier development and evaluation. The results show that the HOS based measures have unique ranges for the different classes with high confidence level (p-value < 0.0001). On evaluating several classifiers with the significant features, it was observed that the Support Vector Machine (SVM) presented a high detection accuracy of 98.5% thereby establishing the possibility of effective EEG segment classification using the proposed technique.

13.

A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis (total citations: 8; self-citations: 0; citations by others: 8)

Statnikov A Aliferis CF Tsamardinos I Hardin D Levy S 《Bioinformatics (Oxford, England)》2005,21(5):631-643

MOTIVATION: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. RESULTS: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT: alexander.statnikov@vanderbilt.edu.

14.

Designing boosting ensemble of relational fuzzy systems

Scherer R 《International journal of neural systems》2010,20(5):381-388

A method frequently used in classification systems for improving classification accuracy is to combine the outputs of several classifiers. Among various types of classifiers, fuzzy ones are tempting because they use intelligible fuzzy if-then rules. In the paper we build an AdaBoost ensemble of relational neuro-fuzzy classifiers. Relational fuzzy systems bond input and output fuzzy linguistic values by a binary relation; thus, compared with traditional fuzzy systems, fuzzy rules have additional weights - elements of a fuzzy relation matrix. Thanks to this, the system is better adjustable to data during learning. In the paper an ensemble of relational fuzzy systems is proposed. The problem is that such an ensemble contains separate rule bases which cannot be directly merged. As the systems are separate, we cannot treat fuzzy rules coming from different systems as rules from the same (single) system. In the paper, the problem is addressed by a novel design of the fuzzy systems constituting the ensemble, resulting in normalization of individual rule bases during learning. The method described in the paper is tested on several known benchmarks and compared with other machine learning solutions from the literature.

15.

Efficient Kernelized prototype based classification

Schleif FM Villmann T Hammer B Schneider P 《International journal of neural systems》2011,21(6):443-457

Prototype based classifiers are effective algorithms in modeling classification problems and have been applied in multiple domains. While many supervised learning algorithms have been successfully extended to kernels to improve the discrimination power by means of the kernel concept, prototype based classifiers are typically still used with Euclidean distance measures. Kernelized variants of prototype based classifiers are currently too complex to be applied for larger data sets. Here we propose an extension of Kernelized Generalized Learning Vector Quantization (KGLVQ) employing a sparsity and approximation technique to reduce the learning complexity. We provide generalization error bounds and experimental results on real world data, showing that the extended approach is comparable to SVM on different public data.

16.

Spatio-temporal air quality analysis and PM2.5 prediction over Hyderabad City, India using artificial intelligence techniques

《Ecological Informatics》2023

Air pollution is one of the most serious environmental issues faced by humans, and it affects the quality of life in cities. PM2.5 forecasting models can be used to create strategies for assessing and warning the public about anticipated harmful levels of air pollution. Accurate pollutant concentration measurements and forecasting are critical criteria for assessing air quality and are the foundation for making the right strategic decisions. Data-driven machine learning models for PM2.5 forecasting have gained attention in the recent past. In this study, PM2.5 prediction for Hyderabad city was carried out using various machine learning models viz. Multi-Linear Regression (MLR), decision tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), and XGBoost. A deep learning model, the Long Short-Term Memory (LSTM) model, was also used in this study. The results obtained were finally compared based on error and R² value. The best model was selected based on its maximum R² value and minimal error. The model's performance was further improved using the randomized search CV hyperparameter optimization technique. Spatio-temporal air quality analysis was initially conducted, and it was found that the average winter PM2.5 concentrations were 68% higher than the concentrations in summer. The analysis revealed that XGBoost regression was the best-performing machine learning model with an R² value of 0.82 and a Mean Absolute Error (MAE) of 7.01 μg/m³, whereas the LSTM deep learning model performed better than XGBoost regression for PM2.5 modeling with an R² value of 0.89 and an MAE of 5.78 μg/m³.
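
The two scores used above to compare the regression models, R² and MAE, can be computed as in the following sketch. The concentration values are invented for illustration and are not taken from the study, and the function names are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values (same units as the data)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 minus residual sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical PM2.5 concentrations in μg/m³.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 67.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
```

MAE is reported in the units of the target, so it reads directly as a typical error in μg/m³, while R² is unitless and expresses how much of the variance in the observations the model explains.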

17.

Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition

Afridi TH Khan A Lee YS 《Amino acids》2012,42(4):1443-1454

Mitochondria are all-important organelles of eukaryotic cells since they are involved in processes associated with cellular mortality and human diseases. Therefore, trustworthy techniques are highly required for the identification of new mitochondrial proteins. We propose the Mito-GSAAC system for prediction of mitochondrial proteins. The aim of this work is to investigate an effective feature extraction strategy and to develop an ensemble approach that can better exploit the advantages of this feature extraction strategy for mitochondria classification. We investigate four kinds of protein representations for prediction of mitochondrial proteins: amino acid composition, dipeptide composition, pseudo amino acid composition, and split amino acid composition (SAAC). Individual classifiers such as support vector machine (SVM), k-nearest neighbor, multilayer perceptron, random forest, AdaBoost, and bagging are first trained. An ensemble classifier is then built using genetic programming (GP) for evolving a complex but effective decision space from the individual decision spaces of the trained classifiers. The highest prediction performance for the jackknife test is 92.62% using the GP-based ensemble classifier on SAAC features, which is the highest accuracy reported so far on the Mitochondria dataset being used. On the Malaria Parasite Mitochondria dataset, the highest accuracy is obtained by SVM using SAAC, and it is further enhanced to 93.21% using the GP-based ensemble. It is observed that SAAC has better discrimination power for mitochondria prediction than the rest of the feature extraction strategies. Thus, the improved prediction performance is largely due to the better capability of SAAC for discriminating between mitochondria and non-mitochondria proteins at the N and C terminus and the effective combination capability of GP. Mito-GSAAC can be accessed at .
It is expected that the novel approach and the accompanying predictor will have a major impact on Molecular Cell Biology, Proteomics, Bioinformatics, System Biology, and Drug Development.

18.

IDM-PhyChm-Ens: Intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids

Safdar Ali Abdul Majid Asifullah Khan 《Amino acids》2014,46(4):977-993


19.

An ensemble of K-local hyperplanes for predicting protein-protein interactions

Nanni L Lumini A 《Bioinformatics (Oxford, England)》2006,22(10):1207-1210

Prediction of protein-protein interaction is a difficult and important problem in biology. In this paper, we propose a new method based on an ensemble of K-local hyperplane distance nearest neighbor (HKNN) classifiers, where each HKNN is trained using a different physicochemical property of the amino acids. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Grams amino acid composition. A fusion of HKNN classifiers combined with the 'Sum rule' enables us to obtain an improvement over other state-of-the-art methods. The approach is demonstrated by building a learning system based on experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori and in a Human dataset.
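
The 'Sum rule' mentioned above is a standard way to fuse classifiers: each classifier outputs one score per class, the scores are added class-wise, and the class with the largest total wins. The sketch below shows only that fusion step, with made-up scores standing in for the outputs of three classifiers. It does not implement HKNN itself, and the function name `sum_rule` is hypothetical.

```python
def sum_rule(score_lists):
    """Fuse per-classifier class scores by class-wise summation.

    score_lists holds one list per classifier, each with one score per class.
    Returns the index of the winning class and the summed scores.
    """
    n_classes = len(score_lists[0])
    totals = [sum(scores[c] for scores in score_lists) for c in range(n_classes)]
    best = max(range(n_classes), key=totals.__getitem__)
    return best, totals

# Hypothetical scores for one protein pair: index 0 = non-interacting, 1 = interacting.
scores = [
    [0.60, 0.40],  # classifier trained on one physicochemical property
    [0.30, 0.70],  # classifier trained on a second property
    [0.45, 0.55],  # classifier trained on a third property
]
winner, totals = sum_rule(scores)
```

In this toy case the first classifier alone would choose class 0, but the summed evidence favors class 1, which illustrates how fusion can overrule a single dissenting member.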

20.

Ensemble Positive Unlabeled Learning for Disease Gene Identification

Peng Yang Xiaoli Li Hon-Nian Chua Chee-Keong Kwoh See-Kiong Ng 《PloS one》2014,9(5)

An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.