Abstract: | An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was
proposed based upon combining a set of heterogeneous algorithms including support vector
machine (SVM), k-nearest neighbors (kNN) and random forest (RF), then aggregating their
predictions through a voting system. In addition to the proposed algorithm,
classification performance was improved by using discriminative features,
self-containment and its derivatives, which reflect the unique structural robustness
characteristics of pre-miRNAs. These features are applicable across different species. By applying
preprocessing methods—both a correlation-based feature selection (CFS) with genetic
algorithm (GA) search method and a modified-Synthetic Minority Oversampling Technique
(SMOTE) bagging rebalancing method—improvement in the performance of this ensemble
was observed. The overall prediction accuracy obtained via 10 runs of 5-fold
cross-validation (CV) was 96.54%, with a sensitivity of 94.8% and a specificity of
98.3%, a better trade-off between sensitivity and specificity than
that of other state-of-the-art methods. The ensemble model was applied to animal, plant
and virus pre-miRNAs and achieved high accuracy (>93%). The
discriminative set of selected features also suggests that pre-miRNAs possess high
intrinsic structural robustness compared with other stem loops. Our heterogeneous
ensemble method gave more reliable predictions than methods using a single
classifier. Our program is available at http://ncrna-pred.com/premiRNA.html. |
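
The voting step and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `majority_vote` and `evaluate` are hypothetical, the label lists are made-up toy values standing in for SVM, kNN and RF outputs, and simple unweighted majority voting is assumed because the abstract does not specify the voting scheme.

```python
from collections import Counter


def majority_vote(per_classifier_labels):
    """Return one label per sample: the label predicted by most classifiers."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_classifier_labels)]


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true, otherwise a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of true pre-miRNAs recovered
        "specificity": tn / (tn + fp),  # fraction of other stem loops rejected
    }


# Toy predictions (1 = pre-miRNA, 0 = other stem loop); illustrative only.
svm_labels = [1, 1, 0, 0, 1, 0]
knn_labels = [1, 0, 0, 1, 1, 0]
rf_labels = [0, 1, 0, 0, 1, 1]
y_true = [1, 1, 0, 0, 1, 1]

ensemble = majority_vote([svm_labels, knn_labels, rf_labels])
metrics = evaluate(y_true, ensemble)
print(ensemble, metrics)
```

With an odd number of classifiers and two classes a tie cannot occur, which is one reason a three-member ensemble is convenient for this kind of vote.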