Similar literature (20 results found)
1.
2.
《Genomics》2020,112(5):3089-3096
Automatic classification of glaucoma from fundus images is a vital diagnostic tool for Computer-Aided Diagnosis (CAD) systems. In this work, a novel fused feature extraction technique and ensemble classifier fusion are proposed for the diagnosis of glaucoma. The proposed method comprises three stages. Initially, the fundus images are subjected to preprocessing, followed by feature extraction and feature fusion by Intra-Class and Extra-Class Discriminative Correlation Analysis (IEDCA). The feature fusion approach eliminates between-class correlation while retaining sufficient Feature Dimension (FD) for Correlation Analysis (CA). The fused features are then fed individually to three classifiers, namely Support Vector Machine (SVM), Random Forest (RF) and K-Nearest Neighbor (KNN). Finally, a classifier fusion is designed that combines the decisions of the ensemble of classifiers using a Consensus-based Combining Method (CCM). CCM-based classifier fusion adjusts the weights iteratively after comparing the outputs of all the classifiers. The proposed fusion classifier improves accuracy and convergence compared with the individual algorithms. A classification accuracy of 99.2% is achieved by the two-level hybrid fusion approach. The method is evaluated on the public High Resolution Fundus (HRF) and DRIVE datasets with cross-dataset validation.

3.
Real-time plant species recognition in an unconstrained environment is a challenging and time-consuming process. The recognition model should cope with computer vision challenges such as scale variations, illumination changes, changes in camera viewpoint or object orientation, cluttered backgrounds and leaf structure (simple or compound). In this paper, a bilateral convolutional neural network (CNN) combined with machine learning classifiers is investigated for real-time plant species recognition. The CNN models considered are MobileNet, Xception and DenseNet-121. In the bilateral CNNs (homogeneous or heterogeneous type), the models are connected using the cascade early fusion strategy. The bilateral CNN is used for feature extraction. The extracted features are then classified using different machine learning classifiers: Linear Discriminant Analysis (LDA), multinomial Logistic Regression (MLR), Naïve Bayes (NB), k-Nearest Neighbor (k-NN), Classification and Regression Tree (CART), Random Forest Classifier (RF), Bagging Classifier (BC), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). The experimental investigation shows that the multinomial Logistic Regression classifier performed better than the other classifiers, irrespective of the bilateral CNN model (Homogeneous - MoMoNet, XXNet, DeDeNet; Heterogeneous - MoXNet, XDeNet, MoDeNet). It is also observed that the MoDeNet + MLR model attained state-of-the-art results (Flavia: 98.71%, Folio: 96.38%, Swedish Leaf: 99.41%, custom-created Leaf-12: 99.39%), irrespective of the dataset. The number of mispredictions per class is greatly reduced by using the MoDeNet + MLR model for real-time plant species recognition.

4.

Background

Previous epidemiological studies have examined the prevalence and risk factors for a variety of parasitic illnesses, including protozoan and soil-transmitted helminth (STH, e.g., hookworms and roundworms) infections. Despite advancements in machine learning for data analysis, the majority of these studies use traditional logistic regression to identify significant risk factors.

Methods

In this study, we used data from a survey of 54 risk factors for intestinal parasitosis in 954 Ethiopian school children. We investigated whether machine learning approaches can supplement traditional logistic regression in identifying intestinal parasite infection risk factors. We used feature selection methods such as InfoGain (IG), ReliefF (ReF), Joint Mutual Information (JMI), and Minimum Redundancy Maximum Relevance (MRMR). Additionally, we predicted children's parasitic infection status using classifiers such as Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and XGBoost (XGB), and compared their accuracy and area under the receiver operating characteristic curve (AUROC) scores. For optimal model training, we performed tenfold cross-validation and tuned the classifier hyperparameters. We balanced our dataset using the Synthetic Minority Oversampling Technique (SMOTE). Additionally, we used association rule learning to establish a link between risk factors and parasitic infections.

Key findings

Our study demonstrated that machine learning could be used in conjunction with logistic regression. Using machine learning, we developed models that accurately predicted four parasitic infections: any parasitic infection at 79.9% accuracy, helminth infection at 84.9%, any STH infection at 95.9%, and protozoan infection at 94.2%. The Random Forests (RF) and Support Vector Machines (SVM) classifiers achieved the highest accuracy when the top 20 risk factors were selected using Joint Mutual Information (JMI) or when all features were used. The best predictors of infection were socioeconomic, demographic, and hematological characteristics.

Conclusions

We demonstrated that feature selection and association rule learning are useful strategies for detecting risk factors for parasite infection. Additionally, we showed that advanced classifiers might be utilized to predict children's parasitic infection status. When combined with standard logistic regression models, machine learning techniques can identify novel risk factors and predict infection risk.
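
The two scores this abstract uses to compare classifiers, accuracy and AUROC, can be sketched in plain Python. This is only an illustrative sketch: the function names, the 0.5 decision threshold and the toy labels and scores are assumptions, not values from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


# Hypothetical toy data: 1 = infected, 0 = not infected.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auroc(toy_labels, toy_scores), accuracy(toy_labels, toy_scores))
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality over all thresholds, which is one reason studies on imbalanced data often report both.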

5.
Li Y  Wang N  Perkins EJ  Zhang C  Gong P 《PloS one》2010,5(10):e13715
Monitoring, assessment and prediction of the environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with the explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied to independently refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively, with 11 common genes being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracies of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high-dimensional datasets and generate classification models of acceptable precision for multiple classes.

6.
BQ Li  KY Feng  L Chen  T Huang  YD Cai 《PloS one》2012,7(8):e43927
Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on the Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and an MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. Site-specific feature analysis also showed that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction.
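
The abstract reports both overall accuracy and the Matthews correlation coefficient (MCC). As a minimal sketch of how the two are computed from a binary confusion matrix (the function names and the example counts are hypothetical and are not taken from the paper):

```python
import math


def accuracy_from_counts(tp, fp, tn, fn):
    """Overall accuracy from confusion-matrix counts."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for interaction-site (positive) vs. non-site (negative) residues.
print(accuracy_from_counts(60, 30, 70, 40), mcc(60, 30, 70, 40))
```

MCC ranges from -1 to 1 and stays near 0 for a predictor that ignores the input, so it is less flattering than accuracy when the classes are unbalanced.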

7.
《IRBM》2020,41(4):229-239
Feature selection algorithms are a cornerstone of machine learning. As the number of samples and of sample properties grows, a feature selection algorithm selects the significant features. The general purpose of feature selection algorithms is to select the most relevant properties of the data classes and to increase classification performance. Thus, features can be selected based on their classification performance. In this study, we have developed a feature selection algorithm, P-Score, based on the classification performance of decision support vectors. The method can work according to two different selection criteria. We tested the classification performance of the features selected with P-Score using three different classifiers. In addition, we compared P-Score with 13 feature selection algorithms from the literature. According to the results of the study, the P-Score feature selection algorithm is a method that can be used in the field of machine learning.

8.
Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not among the top-ranking features under any condition. Proteins 2006. © 2006 Wiley-Liss, Inc.
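
The Gini-based importance estimate mentioned in this abstract rests on how much a split reduces class impurity. The sketch below computes that reduction for a single threshold split; in a random forest, such decreases are typically accumulated over every split that uses a feature. The function names and toy data are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(labels, feature_values, threshold):
    """Impurity reduction obtained by splitting on feature <= threshold."""
    left = [y for y, x in zip(labels, feature_values) if x <= threshold]
    right = [y for y, x in zip(labels, feature_values) if x > threshold]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Hypothetical toy data: 1 = interacting pair, 0 = non-interacting pair.
interacting = [1, 1, 0, 0]
informative = [0.9, 0.8, 0.2, 0.1]    # separates the classes perfectly
uninformative = [0.9, 0.1, 0.8, 0.2]  # unrelated to the class
print(gini_decrease(interacting, informative, 0.5))
print(gini_decrease(interacting, uninformative, 0.5))
```

A feature that separates the classes yields a large decrease, while an unrelated feature yields none, which is what makes the summed decrease usable as an importance ranking.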

9.

Background

The goal of this work is to develop a non-invasive method to help detect Alzheimer's disease in its early stages by applying voice analysis techniques based on machine learning algorithms.

Methods

We extract temporal and acoustical voice features (e.g. Jitter and Harmonics-to-Noise Ratio) from read speech of patients in Early Stage of Alzheimer's Disease (ES-AD), with Mild Cognitive Impairment (MCI), and from a Healthy Control (HC) group. Three classification methods are used to evaluate the efficiency of these features, namely kNN, SVM and decision Tree. To assess the effectiveness of this set of features, we compare them with two sets of feature parameters that are widely used in speech and speaker recognition applications. A two-stage feature selection process is conducted to optimize classification performance. For these experiments, the data samples of HC, ES-AD and MCI groups were collected at AP-HP Broca Hospital, in Paris.

Results

First, a wrapper feature selection method for each feature set is evaluated and the relevant features for each classifier are selected. By combining, for each classifier, the features selected from each initial set, we improve the classification accuracy by a relative gain of more than 30% for all classifiers. Then the same feature selection procedure is performed anew on the combination of selected feature sets, resulting in an additional significant improvement of classification accuracy.

Conclusion

The proposed method improved the classification accuracy for the ES-AD, MCI and HC groups and shows the promise of speech analysis and machine learning techniques for helping to detect pathological diseases.

10.
刘鲁霞  庞勇  桑国庆  李增元  胡波 《生态学报》2022,42(20):8398-8413
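Monsoon evergreen broad-leaved forest is the typical zonal vegetation of the southern subtropics of China and an important forest type in the Pu'er region of Yunnan Province. Remote sensing estimation of tree species diversity in monsoon evergreen broad-leaved forest plays an important role in studying regional-scale biodiversity patterns and the rules that govern them. Following the spectral heterogeneity hypothesis and the environmental heterogeneity hypothesis, spectral diversity features and vertical structure features were first extracted from airborne hyperspectral data and LiDAR data at 1 m spatial resolution. Recursive feature elimination based on the random forest algorithm was then used to select the remote sensing features that best explain the forest tree species diversity index in the study area, and the Shannon-Wiener species diversity index was modeled and mapped. The results show that: (1) the vertical structure features extracted from airborne LiDAR data and the spectral diversity features extracted from airborne hyperspectral data both explain forest tree species diversity in the study area well, with random forest model estimates of R2=0.48, RMSE=0.46 and R2=0.5, RMSE=0.45, respectively; fusing the two data sources further improves the accuracy of remote sensing estimation of forest tree species diversity, with the random forest estimation model reaching an R2 of 0.69 and an RMSE of 0.37. (2) Airborne LiDAR data estimate tree species diversity in the mixed coniferous and broad-leaved forest of the study area better than airborne hyperspectral data. (3) Machine learning methods help, from high-dimensional remote sensing...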
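
The Shannon-Wiener diversity index that this study models, and the R2 and RMSE scores it reports, can be illustrated with a short sketch. The function names and the plot data below are hypothetical; the natural logarithm is assumed for the index because the abstract does not state the base.

```python
import math


def shannon_index(counts):
    """Shannon-Wiener index H' = -sum(p_i * ln p_i) from per-species stem counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical plot with four equally abundant tree species.
print(shannon_index([10, 10, 10, 10]))  # equals ln(4)
```

The index grows both with the number of species and with how evenly individuals are spread among them, so a plot dominated by one species scores close to zero.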

11.
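Objective: To explore the feasibility of applying statistical learning methods to the large volumes of data obtained from psychological tests and, based on the findings, to further investigate the personality characteristics of the flying profession, providing new ideas for the selection and assessment of flight personnel. Methods: A total of 1020 male subjects were randomly drawn from an airline, comprising 510 flight personnel and 510 non-flight personnel, and were assessed with the Cattell 16 Personality Factor test. A support vector machine was then trained on the resulting 16 factor scores using randomly partitioned training and test groups, and the learning results were analyzed. Results: Four factors were selected as the characteristic factors for classification; the classifier built on a linear support vector machine achieved an average accuracy of 64% under cross-validation. Conclusion: The classifier built with SVM has a certain degree of reliability and validity.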

12.
Random forests for genomic data analysis
Chen X  Ishwaran H 《Genomics》2012,99(6):323-329
Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to "large p, small n" problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progress of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.

13.

Background

Pertussis is highly contagious; thus, prompt identification of cases is essential to control outbreaks. Clinicians experienced with the disease can easily identify classic cases, where patients have bursts of rapid coughing followed by gasps, and a characteristic whooping sound. However, many clinicians have never seen a case, and thus may miss initial cases during an outbreak. The purpose of this project was to use voice-recognition software to distinguish pertussis coughs from croup and other coughs.

Methods

We collected a series of recordings representing pertussis, croup and miscellaneous coughing by children. We manually categorized coughs as either pertussis or non-pertussis, and extracted features for each category. We used Mel-frequency cepstral coefficients (MFCC), a sampling rate of 16 kHz, a frame duration of 25 ms, and a frame step of 10 ms. The coughs were filtered. Each cough was divided into 3 sections in the proportion 3-4-3. The average of the 13 MFCCs for each section was computed, and the three averages were combined into a 39-element feature vector used for the classification. We used the following machine learning algorithms: Neural Networks, K-Nearest Neighbor (KNN), and a 200-tree Random Forest (RF). Data were reserved for cross-validation of the KNN and RF. The Neural Network was trained 100 times, and the averaged results are presented.
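
The construction of the 39-element feature vector described above can be sketched as follows, assuming the per-frame MFCCs have already been computed by some audio library. The function name and the rule for rounding section boundaries are assumptions; the abstract only specifies the 3-4-3 proportion and the per-section averaging.

```python
def cough_feature_vector(frames, proportions=(3, 4, 3)):
    """Average per-frame MFCC vectors within each section and concatenate them.

    frames: list of equal-length MFCC vectors, one per analysis frame.
    """
    n = len(frames)
    total = sum(proportions)
    # Section boundaries at the cumulative proportions of the frame count.
    bounds, cumulative = [0], 0
    for p in proportions:
        cumulative += p
        bounds.append(round(n * cumulative / total))
    features = []
    for start, end in zip(bounds, bounds[1:]):
        section = frames[start:end]
        if not section:
            raise ValueError("cough is too short to split into all sections")
        n_coeffs = len(section[0])
        features.extend(
            sum(frame[k] for frame in section) / len(section) for k in range(n_coeffs)
        )
    return features


# Hypothetical cough of 10 frames with 13 coefficients each; frame i holds the value i.
toy_frames = [[float(i)] * 13 for i in range(10)]
vector = cough_feature_vector(toy_frames)
print(len(vector))  # 3 sections x 13 coefficients
```

Averaging within sections gives every cough a vector of the same length regardless of its duration, which is what lets fixed-input classifiers such as KNN and Random Forest be applied.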

Results

After categorization, we had 16 examples of non-pertussis coughs and 31 examples of pertussis coughs. Over 90% of all pertussis coughs were properly classified as pertussis. The error rates were: Type I errors of 7%, 12%, and 25% and Type II errors of 8%, 0%, and 0%, using the Neural Network, Random Forest, and KNN, respectively.

Conclusion

Our results suggest that we can build a robust classifier to assist clinicians and the public to help identify pertussis cases in children presenting with typical symptoms.

14.
15.
Information on plant species is fundamental to forest ecosystems in the context of biodiversity monitoring and forest management. Traditional methods for plant species inventories are generally inefficient in terms of cost and performance, and there is high demand for a quick and feasible approach. Among the various attempts, remote sensing has emerged as an active approach for plant species classification, but most studies have concentrated on image processing, and only a few have used hyperspectral information despite the wealth of information it contains. In this study, plant species are classified from hyperspectral leaf information using different machine learning models, coupled with feature reduction and selection methods, and their performance is optimized through Bayesian optimization. The results show that including feature selection and Bayesian optimization increases the classification accuracy of machine learning models. Among these, the Bayesian optimization-based support vector machine (SVM) model, combined with the recursive feature elimination (RFE) feature selection method, yields the best output, with an overall accuracy of 86% and a kappa coefficient of 0.85. Furthermore, the confusion matrix revealed that the number of samples correlates with classification accuracy. The support vector machine using informative bands after Bayesian optimization performed best in classifying plant species. The results of this study facilitate a better understanding of how spectral (phenotype) information relates to plant species (genotype) and help to bridge hyperspectral information with ecosystem functions.
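
Overall accuracy and the kappa coefficient, the two scores quoted in this abstract, are both derived from the confusion matrix. A minimal sketch follows, with a hypothetical two-species matrix rather than the study's data:

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the row and column margins.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix for two plant species.
print(overall_accuracy_and_kappa([[45, 5], [5, 45]]))
```

Kappa discounts the agreement that would occur by chance, so it is lower than the overall accuracy whenever chance agreement is above zero.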

16.
Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:
  • Support Vector Machine Recursive Feature Elimination (SVMRFE)
  • Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
  • Gradient based Leave-one-out Gene Selection (GLGS)
To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is closely tied to the learning classifier. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method to phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

17.
Epi-macrobenthic species richness, abundance and composition are linked with the type, assemblage and structural complexity of seabed habitat within coastal ecosystems. However, the evaluation of these habitats is highly hindered by limitations related to both waterborne surveys (slow acquisition, shallow water and low reactivity) and water clarity (turbid for most coastal areas). Substratum type/diversity and bathymetric features were elucidated using a supervised method applied to airborne bathymetric LiDAR waveforms over Saint-Siméon–Bonaventure's nearshore area (Gulf of Saint-Lawrence, Québec, Canada). High-resolution underwater photographs were taken at three hundred stations across an 8-km2 study area. Seven models based upon state-of-the-art machine learning techniques, namely Naïve Bayes, Regression Tree, Classification Tree, C 4.5, Random Forest, Support Vector Machine, and CN2 learners, were tested for predicting eight epi-macrobenthic species diversity metrics as a function of the class number. The Random Forest outperformed the other models with a three-discretized Simpson index applied to epi-macrobenthic communities, explaining 69% (Classification Accuracy) of its variability by mean bathymetry, time range and skewness derived from the LiDAR waveform. Corroborating marine ecological theory, areas with low Simpson epi-macrobenthic diversity responded to low water depths, high skewness and time range, whereas higher Simpson diversity relied upon deeper bottoms (correlated with stronger hydrodynamics) and low skewness and time range. The degree of species heterogeneity was therefore positively linked with the degree of structural complexity of the benthic cover. This work underpins that fully exploited bathymetric LiDAR (not only bathymetrically derived by-products), coupled with a proficient machine learner, is able to rapidly predict habitat characteristics at a spatial resolution relevant to epi-macrobenthos diversity, ranging from clear to turbid waters. This method might serve both to nurture marine ecological theory and to manage areas with high species heterogeneity where navigation is hazardous and water clarity opaque to passive optical sensors.
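
A discretized Simpson index of the kind used as the prediction target in this abstract can be sketched as below. The abstract does not say which form of the index or which class boundaries were used, so the 1 - sum(p_i^2) form and the equal-width classes here are assumptions, as are the function names and the station counts.

```python
def simpson_diversity(counts):
    """Simpson diversity in the 1 - sum(p_i^2) form, from per-species counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no individuals counted")
    return 1.0 - sum((c / total) ** 2 for c in counts)


def discretize(value, n_classes=3):
    """Map a value in [0, 1] to an equal-width class index 0..n_classes-1."""
    return min(int(value * n_classes), n_classes - 1)


# Hypothetical station with four equally abundant epi-macrobenthic species.
diversity = simpson_diversity([1, 1, 1, 1])
print(diversity, discretize(diversity))
```

Turning the continuous index into a few ordered classes converts the regression problem into a classification problem, which is why the result is reported as a classification accuracy.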

18.
《IRBM》2021,42(6):466-473

Objective

In the last few decades, the consumption of cannabis-based products for recreational purposes has dramatically increased. Unfortunately, cannabis consumption has been associated with the incidence of cardiovascular diseases. Hence, there is a necessity for understanding the plausible mechanics of cardiophysiological changes due to cannabis consumption. Accordingly, the current study was designed to understand the suitability of the recurrence quantification analysis (RQA) method in detecting the changes in the heart rate variability (HRV) time-series signals due to the consumption of cannabis (bhang). Further, a machine learning model has been proposed for the automated detection of cannabis takers.

Materials and Methods

The RQA of the HRV time-series signals from 200 healthy Indian male paddy-field workers was carried out. The obtained parameters were statistically analyzed using the Mann-Whitney U test. Further, decision trees, weight-based feature ranking, and dimensionality reduction methods were employed to identify the relevant features for the development of a suitable machine learning model.

Results

Observable changes in the patterns of the recurrence plots between the bhang-consuming and non-consuming groups were noticed. However, there were no significant differences in the RQA parameters. Among the developed machine learning models, the SVM model obtained with the "Information gain ratio" feature selection method exhibited the highest accuracy, 69.09 ± 9.33%.

Conclusion

Our study suggests that the RQA method is not as effective as the time and frequency domain methods for detecting the alterations in the HRV time-series signals due to cannabis consumption. The SVM model was found to be the best model for the automated detection of cannabis takers. The selection of the features by the information gain ratio method played an important role in the development of the optimized SVM model.

19.
It is important to understand the cause of amyloid illnesses by predicting the short protein fragments capable of forming amyloid-like fibril motifs, which aids the discovery of sequence-targeted anti-aggregation drugs. Owing to the limitations of molecular techniques for identifying such fragments, computational tools that provide affordable in silico predictions are highly desirable. In this research article, we study, from a machine learning perspective, the performance of several machine learning classifiers that use heterogeneous features based on biochemical and biophysical properties of amino acids to discriminate between amyloidogenic and non-amyloidogenic regions in peptides. Four conventional machine learning classifiers, namely Support Vector Machine, Neural Network, Decision Tree and Random Forest, were trained and tested to find the classifier that best fits the problem domain. Prior to classification, novel implementations of two biologically inspired feature optimization techniques, based on evolutionary algorithms and on methodologies that mimic social life, and a multivariate method based on projection are used to remove unimportant and uninformative features. Among the dimensionality reduction algorithms considered in the study, prediction results show that algorithms based on evolutionary computation are the most effective. Among the classifiers considered, SVM best suits the problem domain. The best classifier is also compared with an online predictor to demonstrate the balance the proposed classifier maintains between true positive rates and false positive rates. This exploratory study suggests that these methods are promising for amyloidogenicity prediction and may be further extended to large-scale proteomic studies.

20.

Background

With an increasing number of plant genome sequences, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, prediction accuracy of these still requires further improvement. The limitations of these methods can be addressed by selecting appropriate features for distinguishing promoters and non-promoters.

Methods

In this study, we proposed two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet-pairs (hexamer sequences) were selected based on the difference in the frequency of hexamers between promoters and non-promoters. In RTPFSGA, random triplet-pairs (RTPs) were selected by exploiting a genetic algorithm that distinguishes frequencies of non-adjacent triplet pairs between promoters and non-promoters. Then, a support vector machine (SVM), a nonlinear machine-learning algorithm, was used to classify promoters and non-promoters by combining these two feature selection approaches. We referred to this novel algorithm as PromoBot.
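
The idea behind FDAFSA, ranking hexamers by how differently they occur in promoters and non-promoters, can be illustrated with a short sketch. This is not the PromoBot code: the function names, the use of the absolute difference in relative frequency as the score, and the toy sequences are all assumptions.

```python
from collections import Counter


def hexamer_frequencies(sequences, k=6):
    """Relative frequency of every overlapping k-mer across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def rank_by_frequency_difference(promoters, non_promoters, top_n=10):
    """Hexamers sorted by absolute frequency difference between the two classes."""
    p = hexamer_frequencies(promoters)
    q = hexamer_frequencies(non_promoters)
    diffs = {kmer: abs(p.get(kmer, 0.0) - q.get(kmer, 0.0)) for kmer in set(p) | set(q)}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(diffs, key=lambda kmer: (-diffs[kmer], kmer))[:top_n]


# Hypothetical toy sequences, not taken from PlantProm or PlantGDB.
print(rank_by_frequency_difference(["TATAAATATAAA"], ["GCGCGCGCGCGC"], top_n=3))
```

The counts of the selected hexamers in a new sequence could then serve as the numeric features passed to a classifier such as an SVM.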

Results

Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA of PlantGDB and plant miRNA of miRBase. Then, in order to validate the proposed algorithm, we applied a 5-fold cross validation test. Training data sets were used to select features based on FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity.
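
The validation scheme described above, 5-fold cross-validation scored by sensitivity and specificity, can be sketched as follows. The interleaved fold assignment and the function names are assumptions; the abstract does not say how the folds were drawn.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) with 1 = promoter, 0 = non-promoter.

    Assumes both classes occur in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


for train_idx, test_idx in k_fold_indices(10, k=5):
    # Feature selection and classifier training would use train_idx only;
    # test_idx is held out for scoring.
    print(len(train_idx), len(test_idx))
```

Selecting features inside each training fold, as the abstract describes, keeps the held-out fold from influencing which features are chosen and so avoids an optimistic bias in the reported scores.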

Conclusions

We compared our PromoBot algorithm to five other algorithms. The sensitivity and specificity of PromoBot were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet-pairs, can be successfully incorporated into a supervised machine learning method for the promoter classification problem. As such, we expect that PromoBot can be used to help identify new plant promoters. Source code and analysis results of this work can be provided upon request.
