
1.

In silico, in vitro, and in vivo machine learning in synthetic biology and metabolic engineering

《Current opinion in chemical biology》2021

Among the main learning methods reviewed in this study and used in synthetic biology and metabolic engineering are supervised learning, reinforcement and active learning, and in vitro or in vivo learning. In the context of biosynthesis, supervised machine learning is being exploited to predict biological sequence activities, predict structures and engineer sequences, and optimize culture conditions. Active and reinforcement learning methods use training sets acquired through an iterative process generally involving experimental measurements. They are applied to design, engineer, and optimize metabolic pathways and bioprocesses. The nascent but promising developments with in vitro and in vivo learning comprise molecular circuits performing simple tasks such as pattern recognition and classification.

2.

A comparison of methods for classifying clinical samples based on proteomics data: a case study for statistical and machine learning approaches

Sampson DL Parker TJ Upton Z Hurst CP 《PloS one》2011,6(9):e24973

The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called “omics” disciplines of the biological sciences. Such variability is uncovered by multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistical approaches. Typically, proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n≪p constraint and therefore require pre-treatment to reduce the dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This problem might be solved using a statistical model-based approach in which not only is the importance of each individual protein explicit, but the proteins are also combined into a readily interpretable classification rule without relying on a black box approach. Here we apply the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.

3.

Feature selection and nearest centroid classification for protein mass spectrometry

Ilya Levner 《BMC bioinformatics》2005,6(1):68

Background

The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.
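The two steps named in this entry's title, feature selection followed by nearest centroid classification, can be illustrated with a minimal sketch. The abstract does not say which selection criterion the paper uses, so the separation score below is an assumption chosen for brevity, and the function names and toy intensities are hypothetical.

```python
from statistics import mean, pstdev


def rank_features(X, y):
    """Rank feature indices by a simple two-class separation score (assumed criterion)."""
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == classes[0]]
        b = [row[j] for row, label in zip(X, y) if label == classes[1]]
        # |difference of class means| over summed spread; epsilon avoids 0/0
        scores.append(abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def fit_centroids(X, y, keep):
    """Compute one mean vector per class, using only the selected features."""
    centroids = {}
    for label in set(y):
        rows = [[row[j] for j in keep] for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, row, keep):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    x = [row[j] for j in keep]

    def sqdist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

    return min(centroids, key=lambda label: sqdist(centroids[label]))


# Toy "spectra": three intensity features per sample (made-up numbers).
X = [[1.0, 5.0, 0.2], [1.1, 4.0, 0.9], [0.9, 6.0, 0.5],
     [3.0, 5.5, 0.4], [3.2, 4.5, 0.8], [2.9, 5.0, 0.1]]
y = ["healthy"] * 3 + ["disease"] * 3

keep = rank_features(X, y)[:1]          # keep only the top-ranked feature
centroids = fit_centroids(X, y, keep)
label = predict(centroids, [1.05, 9.0, 0.3], keep)
```

Only the first feature separates the two toy classes, so it is the one retained, and the uninformative features no longer influence the distance to each centroid.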

4.

A stable iterative method for refining discriminative gene clusters

Xu M Zhu M Zhang L 《BMC genomics》2008,9(Z2):S18

Background

Microarray technology is often used to identify the genes that are differentially expressed between two biological conditions. On the other hand, since microarray datasets contain a small number of samples and a large number of genes, it is usually desirable to identify small gene subsets with distinct patterns between sample classes. Such gene subsets are highly discriminative in phenotype classification because of their tightly coupled features. Unfortunately, classifiers identified in this way tend to generalize poorly to test samples because of overfitting.

Results

We propose a novel approach that combines supervised and unsupervised learning techniques to generate increasingly discriminative gene clusters in an iterative manner. Our experiments on both simulated and real datasets show that our method can produce a series of robust gene clusters with good classification performance compared with existing approaches.

Conclusion

This backward approach for refining a series of highly discriminative gene clusters for classification purposes proves to be very consistent and stable when applied to various types of training samples.


5.

Discovery of Ongoing Selective Sweeps within Anopheles Mosquito Populations Using Deep Learning

Alexander T Xue Daniel R Schrider Andrew D Kern Agg Consortium 《Molecular biology and evolution》2021,38(3):1168

Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC’s performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.

6.

BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data

Yang Guo Shuhui Liu Zhanhuai Li Xuequn Shang 《BMC bioinformatics》2018,19(5):118

Background

The classification of cancer subtypes is of great importance to cancer diagnosis and therapy. Many supervised learning approaches, especially deep learning based approaches, have been applied to cancer subtype classification in the past few years. Recently, the deep forest model has been proposed as an alternative to deep neural networks that learns hyper-representations by using a cascade of ensemble decision trees. The deep forest model has been shown to perform competitively with, or even better than, deep neural networks to some extent. However, the standard deep forest model may face overfitting and ensemble diversity challenges when dealing with small-sample-size, high-dimensional biological data.

Results

In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology datasets; it can be viewed as a modification of the standard deep forest model. BCDForest differs from the standard deep forest model in two main contributions. First, a multi-class-grained scanning method is proposed to train multiple binary classifiers to encourage ensemble diversity; the fitting quality of each classifier is also considered in representation learning. Second, we propose a boosting strategy that emphasizes the more important features in cascade forests, propagating the benefits of discriminative features among cascade layers to improve classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms state-of-the-art methods for cancer subtype classification.

Conclusions

The multi-class-grained scanning and boosting strategy in our model provide an effective solution to ease the overfitting challenge and improve the robustness of deep forest model working on small-scale data. Our model provides a useful approach to the classification of cancer subtypes by using deep learning on high-dimensional and small-scale biology data.


7.

Further Evidence of Increasing Diversity of Plasmodium vivax in the Republic of Korea in Recent Years

Jung-Yeon Kim Youn-Kyoung Goo Young-Gun Zo So-Young Ji Hidayat Trimarsanto Sheren To Taane G. Clark Ric N. Price Sarah Auburn 《PloS one》2016,11(3)

Background

Vivax malaria was successfully eliminated from the Republic of Korea (ROK) in the late 1970s but re-emerged in 1993. Two decades later as the ROK enters the final stages of malaria elimination, dedicated surveillance of the local P. vivax population is critical. We apply a population genetic approach to gauge P. vivax transmission dynamics in the ROK between 2010 and 2012.

Methodology/Principal Findings

P. vivax positive blood samples from 98 autochthonous cases were collected from patients attending health centers in the ROK in 2010 (n = 27), 2011 (n = 48) and 2012 (n = 23). Parasite genotyping was undertaken at 9 tandem repeat markers. Although not reaching significance, a trend of increasing population diversity was observed from 2010 (H_E = 0.50 ± 0.11) to 2011 (H_E = 0.56 ± 0.08) and 2012 (H_E = 0.60 ± 0.06). Conversely, linkage disequilibrium declined during the same period: I_AS = 0.15 in 2010 (P = 0.010), 0.09 in 2011 (P = 0.010) and 0.05 in 2012 (P = 0.010). In combination with data from other ROK studies undertaken between 1994 and 2007, our results are consistent with increasing parasite divergence since re-emergence. Polyclonal infections were rare (3% of infections), suggesting that local out-crossing alone was unlikely to explain the increased divergence. Cases introduced from an external reservoir may therefore have contributed to the increased diversity. Aside from one isolate, all infections carried a short MS20 allele (142 or 149 bp), not observed in other studies in tropical endemic countries despite high diversity, implying that these regions are unlikely reservoirs.

Conclusions

Whilst a number of factors may explain the observed population genetic trends, the available evidence suggests that an external geographic reservoir with moderate diversity sustains the majority of P. vivax infection in the ROK, with important implications for malaria elimination.
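The abstract reports expected heterozygosity (H_E) per year but does not state the estimator. As a hedged illustration, the sketch below uses a commonly applied sample-size-corrected form, H_E = [n/(n-1)](1 - Σ p_i²), where p_i is the frequency of allele i among n genotyped isolates at one marker. The function names and the toy allele calls are hypothetical, and the paper's exact procedure may differ.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """H_E at one marker from a list of allele calls (one per isolate)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())  # sum of squared frequencies
    return (n / (n - 1)) * (1.0 - sum_sq)


def mean_heterozygosity(calls_by_marker):
    """Average H_E across markers; input maps marker name -> list of allele calls."""
    values = [expected_heterozygosity(a) for a in calls_by_marker.values()]
    return sum(values) / len(values)


# Made-up allele sizes (bp) at two hypothetical markers for four isolates.
toy = {
    "marker_1": [142, 142, 149, 149],   # two alleles at equal frequency
    "marker_2": [200, 200, 200, 200],   # monomorphic marker
}
he_marker_1 = expected_heterozygosity(toy["marker_1"])
he_mean = mean_heterozygosity(toy)
```

A monomorphic marker contributes H_E = 0, so the multi-marker mean rises only as additional alleles appear and their frequencies become more even.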

8.

Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine

Qiang Gu Anup Kumar Simon Bray Allison Creason Alireza Khanteymoori Vahid Jalili Björn Grüning Jeremy Goecks 《PLoS computational biology》2021,17(6)

Supervised machine learning is an essential but difficult to use approach in biomedical data analysis. The Galaxy-ML toolkit (https://galaxyproject.org/community/machine-learning/) makes supervised machine learning more accessible to biomedical scientists by enabling them to perform end-to-end reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy (https://galaxyproject.org), a biomedical computational workbench used by tens of thousands of scientists across the world, with a suite of tools for all aspects of supervised machine learning.

This is a PLOS Computational Biology Software paper.


9.

Defining reference sequences for Nocardia species by similarity and clustering analyses of 16S rRNA gene sequence data

Helal M Kong F Chen SC Bain M Christen R Sintchenko V 《PloS one》2011,6(6):e19517

Background

The intra- and inter-species genetic diversity of bacteria and the absence of ‘reference’, or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance of several clustering and classification algorithms to identify the species of 364 sequences of 16S rRNA gene with a defined species in GenBank, and 110 sequences of 16S rRNA gene with no defined species, all within the genus Nocardia.

Methods

A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm or the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for the dimensionality reduction and visualization.

Results

The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species have been identified as ‘centroids’ in respective clusters from which the distances to all other sequences were minimized; 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578.

Conclusion

The identification of centroids of 16S rRNA gene sequence clusters using novel distance matrix clustering enables the identification of the most representative sequences for each individual species of Nocardia and allows the quantitation of inter- and intra-species variability.
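Two ideas from this abstract lend themselves to a short sketch: picking the cluster member with the smallest summed distance to all other members as the representative "centroid", and labelling an unassigned sequence by k-nearest-neighbour (kNN) voting. The code assumes a precomputed pairwise distance matrix; it is not the authors' LM algorithm, and the distances and species labels are placeholders.

```python
from collections import Counter


def representative(dist, members):
    """Index of the member minimizing the summed distance to all members."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def knn_predict(dist_to_refs, ref_labels, k=3):
    """Majority label among the k reference sequences nearest to the query."""
    order = sorted(range(len(ref_labels)), key=lambda i: dist_to_refs[i])
    votes = Counter(ref_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up symmetric distance matrix for four sequences.
dist = [
    [0, 1, 4, 9],
    [1, 0, 2, 8],
    [4, 2, 0, 7],
    [9, 8, 7, 0],
]
centre = representative(dist, [0, 1, 2])  # sequence 1 sits in the middle

# Distances from one genus-level query to four labelled references.
query_dist = [0.1, 0.2, 0.9, 0.8]
labels = ["species_A", "species_A", "species_B", "species_B"]
guess = knn_predict(query_dist, labels, k=3)
```

Because the representative is chosen from the existing sequences rather than averaged, it is always a real sequence that can serve as a reference.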

10.

Tapping the potential of intact cell mass spectrometry with a combined data analytical approach applied to Yersinia spp.: detection, differentiation and identification of Y. pestis

Wittwer M Heim J Schär M Dewarrat G Schürch N 《Systematic and applied microbiology》2011,34(1):12-19

In the everyday routine of an analytic lab, one is often confronted with the challenge to identify an unknown microbial sample lacking prior information to set the search limits. In the present work, we propose a workflow, which uses the spectral diversity of a commercial database (SARAMIS) to narrow down the search field at a certain taxonomic level, followed by a refined classification by supervised modelling. As supervised learning algorithm, we have chosen a shrinkage discriminant analysis approach, which takes collinearity of the data into account and provides a scoring system for biomarker ranking. This ranking can be used to tailor specific biomarker subsets, which optimize discrimination between subgroups, allowing a weighting of misclassification. The suitability of the approach was verified based on a dataset containing the mass spectra of three Yersinia species Yersinia enterocolitica, Y. pseudotuberculosis and Yersinia pestis. Thereby, we laid the emphasis on the discrimination between the highly related species Yersinia pseudotuberculosis and Y. pestis. All three species were correctly identified at the genus level by the commercial database. Whereas Y. enterocolitica was correctly identified at the species level, discrimination between the highly related Y. pseudotuberculosis and Y. pestis strains was ambiguous. With the use of the supervised modelling approach, we were able to accurately discriminate all the species even when grown under different culture conditions.

11.

Interaction profile-based protein classification of death domain

Drew Lett Michael Hsing Frederic Pio 《BMC bioinformatics》2004,5(1):75

Background

The increasing number of protein sequences and 3D structures obtained from genomic initiatives is leading many of us to focus on proteomics, and to dedicate our experimental and computational efforts to the creation and analysis of information derived from 3D structure. In particular, the high-throughput generation of protein-protein interaction data from a few organisms makes such an approach very important for understanding the molecular recognition that makes up the entire protein-protein interaction network. Since the generation of sequences and experimental protein-protein interaction data is increasing faster than the 3D structure determination of protein complexes, there is tremendous interest in developing in silico methods that generate such structures for prediction and classification purposes. In this study we focused on classifying protein family members based on their protein-protein interaction distinctiveness. Structure-based classification of protein-protein interfaces has been described initially by Ponstingl et al. [1] and more recently by Valdar et al. [2] and Mintseris et al. [3], from complex structures that have been solved experimentally. However, little has been done on protein classification based on the prediction of protein-protein complexes obtained from homology modeling and docking simulation.

Results

We have developed an in silico classification system entitled HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPS) are used to summarize protein-protein interaction characteristics. This system applied to a dataset of 64 proteins of the death domain superfamily was used to classify each member into its proper subfamily. Two classification methods were attempted, heuristic and support vector machine learning. Both methods were tested with a 5-fold cross-validation. The heuristic approach yielded a 61% average accuracy, while the machine learning approach yielded an 89% average accuracy.

Conclusion

We have confirmed the reliability and potential value of classifying proteins via their predicted interactions. Our results are in the same range of accuracy as other studies that classify protein-protein interactions from 3D complex structures obtained experimentally. While our classification scheme does not directly take sequence information into account, our results agree with functional and sequence-based classification of death domain family members.
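The 5-fold cross-validation mentioned in the Results can be sketched as follows. This is a generic illustration, not the HODOCO code: the shuffling seed, the `fit`/`predict` callables and the majority-class baseline are assumptions, and the paper's heuristic and support vector machine classifiers are not reproduced here.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_accuracy(X, y, fit, predict, k=5):
    """Average test accuracy over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Placeholder classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


splits = list(k_fold_indices(64, k=5))      # 64 samples, as in the dataset size
```

Averaging over folds, as the abstract does when it reports 61% and 89% average accuracy, means every sample contributes to the estimate while never being scored by a model that saw it during training.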


12.

Plant microRNA-Target Interaction Identification Model Based on the Integration of Prediction Tools and Support Vector Machine

Jun Meng Lin Shi Yushi Luan 《PloS one》2014,9(7)

Background

Confident identification of microRNA-target interactions is significant for studying the function of microRNA (miRNA). Although some computational miRNA target prediction methods have been proposed for plants, results of various methods tend to be inconsistent and usually lead to more false positives. To address these issues, we developed an integrated model for identifying plant miRNA–target interactions.

Results

Three online miRNA target prediction toolkits and machine learning algorithms were integrated to identify and analyze Arabidopsis thaliana miRNA-target interactions. Principal component analysis (PCA) feature extraction and self-training technology were introduced to improve the performance. Results showed that the proposed model outperformed the previously existing methods. The results were validated by using degradome sequencing supported Arabidopsis thaliana miRNA-target interactions. The proposed model constructed on Arabidopsis thaliana was run over Oryza sativa and Vitis vinifera to demonstrate that our model is effective for other plant species.

Conclusions

The integrated model of online predictors and a local PCA-SVM classifier yielded credible, high-quality miRNA-target interactions. The supervised PCA-SVM classifier was employed in plant miRNA target identification for the first time. Its performance can be substantially improved if more experimentally validated training samples are provided.

13.

Availability of MudPIT data for classification of biological samples

Dario Di Silvestre Italo Zoppis Francesca Brambilla Valeria Bellettato Giancarlo Mauri Pierluigi Mauri 《Journal of clinical bioinformatics》2013,3(1):1-9

Background

Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying support vector machine (SVM) on experimental data obtained by the MudPIT approach. In particular, we compared the performance capabilities of SVM by using two independent collections of complex samples and different data types, such as mass spectra (m/z), peptides and proteins.

Results

Globally, protein and peptide data provided better discriminative information than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features available for the effective classification of samples. In addition, the protein and peptide features selected by SVM showed an 80% match with the differentially expressed proteins identified by the MAProMa software.

Conclusions

These findings confirm the availability of the most label-free quantitative methods based on processing of spectral count and SEQUEST-based SCORE values. On the other hand, they stress the usefulness of MudPIT data for correct grouping of sample phenotypes by applying both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and is a good starting point for translating proteomic methodology to clinical application.

14.

Machine learning-based investigation of the relationship between gut microbiome and obesity status

《Microbes and infection / Institut Pasteur》2022,24(2):104892

Gut microbiota is believed to play a crucial role in obesity. However, consistent findings among published studies regarding the microbiome–obesity interaction are relatively rare, and one of the underlying causes could be the limited sample size of cohort studies. In order to identify gut microbiota changes between normal-weight individuals and obese individuals, fecal samples along with phenotype information from 2262 Chinese individuals were collected and analyzed. Compared with normal-weight individuals, the obese individuals exhibit lower diversity of species and higher diversity of metabolic pathways. In addition, various machine learning models were employed to quantify the relationship between obesity status and body mass index (BMI) values, of which the support vector machine model achieves the best performance, with a classification accuracy of 0.716 and an R² score of 0.485. In addition to two well-established obesity-associated species, three species that have potential to be obesity-related biomarkers, including Bacteroides caccae, Odoribacter splanchnicus and Roseburia hominis, were identified. Further analyses of functional pathways also reveal some enriched pathways in obese individuals. Collectively, our data demonstrate a tight relationship between obesity and gut microbiota in a large-scale Chinese population. These findings may provide potential targets for the prevention and treatment of obesity.
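The abstract quotes two different metrics: classification accuracy (for the categorical obesity status) and an R² score (for the continuous BMI value). The sketch below shows the standard definitions of both; the toy labels and BMI values are made up and are unrelated to the study's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Classification: obese (1) versus normal weight (0), toy values.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Regression: predicting BMI; always predicting the mean gives R² = 0.
r2_mean_model = r2_score([20.0, 25.0, 30.0], [25.0, 25.0, 25.0])
r2_perfect = r2_score([20.0, 25.0, 30.0], [20.0, 25.0, 30.0])
```

Read this way, an R² of 0.485 means the regression model accounts for roughly half of the BMI variance relative to always predicting the cohort mean.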

15.

Automatic structure classification of small proteins using random forest

Pooja Jain Jonathan D Hirst 《BMC bioinformatics》2010,11(1):364


16.

Towards large-scale FAME-based bacterial species identification using machine learning techniques

Bram Slabbinck Bernard De Baets Peter Dawyndt Paul De Vos 《Systematic and applied microbiology》2009

In the last decade, bacterial taxonomy witnessed a huge expansion. The swift pace of bacterial species (re-)definitions has a serious impact on the accuracy and completeness of first-line identification methods. Consequently, back-end identification libraries need to be synchronized with the List of Prokaryotic names with Standing in Nomenclature. In this study, we focus on bacterial fatty acid methyl ester (FAME) profiling as a broadly used first-line identification method. From the BAME@LMG database, we have selected FAME profiles of individual strains belonging to the genera Bacillus, Paenibacillus and Pseudomonas. Only those profiles resulting from standard growth conditions have been retained. The corresponding data set covers 74, 44 and 95 validly published bacterial species, respectively, represented by 961, 378 and 1673 standard FAME profiles. Through the application of machine learning techniques in a supervised strategy, different computational models have been built for genus and species identification. Three techniques have been considered: artificial neural networks, random forests and support vector machines. Nearly perfect identification has been achieved at genus level. Notwithstanding the known limited discriminative power of FAME analysis for species identification, the computational models have resulted in good species identification results for the three genera. For Bacillus, Paenibacillus and Pseudomonas, random forests have resulted in sensitivity values of 0.847, 0.901 and 0.708, respectively. The random forests models outperform those of the other machine learning techniques. Moreover, our machine learning approach also outperformed the Sherlock MIS (MIDI Inc., Newark, DE, USA). These results show that machine learning proves very useful for FAME-based bacterial species identification. Besides good bacterial identification at species level, speed and ease of taxonomic synchronization are major advantages of this computational species identification strategy.
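The sensitivity values quoted above measure, for a given class, the fraction of true members that the model identifies correctly (true positives divided by true positives plus false negatives). The abstract does not say how per-species values were aggregated into one figure per genus, so the macro-average in this sketch is an assumption, and the labels are placeholders.

```python
from collections import defaultdict


def per_class_sensitivity(y_true, y_pred):
    """Map each true class to TP / (TP + FN) for that class."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}


def macro_sensitivity(y_true, y_pred):
    """Unweighted mean of the per-class sensitivities (assumed aggregation)."""
    values = per_class_sensitivity(y_true, y_pred).values()
    return sum(values) / len(values)


# Toy species-level identification results.
truth = ["sp_a", "sp_a", "sp_b", "sp_b", "sp_b", "sp_c"]
preds = ["sp_a", "sp_b", "sp_b", "sp_b", "sp_a", "sp_c"]
sens = per_class_sensitivity(truth, preds)
```

With many species per genus and uneven numbers of profiles per species, an unweighted average keeps rare species from being swamped by well-represented ones.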

17.

Generalizable brain network markers of major depressive disorder across multiple imaging sites

Ayumu Yamashita Yuki Sakai Takashi Yamada Noriaki Yahata Akira Kunimatsu Naohiro Okada Takashi Itahashi Ryuichiro Hashimoto Hiroto Mizuta Naho Ichikawa Masahiro Takamura Go Okada Hirotaka Yamagata Kenichiro Harada Koji Matsuo Saori C. Tanaka Mitsuo Kawato Kiyoto Kasai Nobumasa Kato Hidehiko Takahashi Yasumasa Okamoto Okito Yamashita Hiroshi Imamizu 《PLoS biology》2020,18(12)

Many studies have highlighted the difficulty inherent to the clinical application of fundamental neuroscience knowledge based on machine learning techniques. It is difficult to generalize machine learning brain markers to the data acquired from independent imaging sites, mainly due to large site differences in functional magnetic resonance imaging. We address the difficulty of finding a generalizable marker of major depressive disorder (MDD) that would distinguish patients from healthy controls based on resting-state functional connectivity patterns. For the discovery dataset with 713 participants from 4 imaging sites, we removed site differences using our recently developed harmonization method and developed a machine learning MDD classifier. The classifier achieved an approximately 70% generalization accuracy for an independent validation dataset with 521 participants from 5 different imaging sites. The successful generalization to a perfectly independent dataset acquired from multiple imaging sites is novel and ensures scientific reproducibility and clinical applicability.

Biomarkers for psychiatric disorders based on neuroimaging data have yet to be put to practical use. This study overcomes the problems of inter-site differences in fMRI data by using a novel harmonization method, thereby successfully constructing a generalizable brain network marker of major depressive disorder across multiple imaging sites.

18.

Feasibility study of stain-free classification of cell apoptosis based on diffraction imaging flow cytometry and supervised machine learning techniques

Jingwen Feng Tong Feng Chengwen Yang Wei Wang Yu Sa Yuanming Feng 《Apoptosis : an international journal on programmed cell death》2018,23(5-6):290-298

This study explored the feasibility of predicting and classifying cells in different stages of apoptosis with a stain-free method based on diffraction images and supervised machine learning. Apoptosis was induced in human chronic myelogenous leukemia K562 cells by cis-platinum (DDP). A newly developed technique of polarization diffraction imaging flow cytometry (p-DIFC) was performed to acquire diffraction images of the cells in three different statuses (viable, early apoptotic and late apoptotic/necrotic) after cell separation through fluorescence activated cell sorting with Annexin V-PE and SYTOX® Green double staining. The texture features of the diffraction images were extracted with in-house software based on the gray-level co-occurrence matrix algorithm to generate datasets for cell classification with a supervised machine learning method. The method was also verified in a hydrogen peroxide-induced apoptosis model of HL-60 cells. Results show that an accuracy higher than 90% was achieved in independent test datasets from each cell type based on logistic regression with ridge estimators, which indicates that the p-DIFC system has great potential in predicting and classifying cells in different stages of apoptosis.
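The gray-level co-occurrence matrix (GLCM) named in the abstract counts how often a pixel of gray level i has a neighbour of gray level j at a fixed offset; texture features are then summaries of that matrix. The sketch below is a minimal, non-symmetric GLCM with one feature (contrast). It is not the authors' in-house software, and the offset, number of gray levels and toy image are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum of p[i][j] * (i - j)^2; larger means rougher texture."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


# Toy 3x3 image quantized to three gray levels (0, 1, 2).
image = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 2],
]
p = glcm(image, levels=3)   # horizontal neighbour, one pixel to the right
texture_contrast = contrast(p)
```

In a classification setting, several such features computed at different offsets would form the input vector for a model such as the ridge-penalized logistic regression mentioned in the abstract.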

19.

Toward the explainability, transparency, and universality of machine learning for behavioral classification in neuroscience

《Current opinion in neurobiology》2022

The use of rigorous ethological observation via machine learning techniques to understand brain function (computational neuroethology) is a rapidly growing approach that is poised to significantly change how behavioral neuroscience is commonly performed. With the development of open-source platforms for automated tracking and behavioral recognition, these approaches are now accessible to a wide array of neuroscientists despite variations in budget and computational experience. Importantly, this adoption has moved the field toward a common understanding of behavior and brain function through the removal of manual bias and the identification of previously unknown behavioral repertoires. Although less apparent, another consequence of this movement is the introduction of analytical tools that increase the explainability, transparency, and universality of the machine-based behavioral classifications both within and between research groups. Here, we focus on three main applications of such machine model explainability tools and metrics in the drive toward behavioral (i) standardization, (ii) specialization, and (iii) explainability. We provide a perspective on the use of explainability tools in computational neuroethology, and detail why this is a necessary next step in the expansion of the field. Specifically, as a possible solution in behavioral neuroscience, we propose the use of Shapley values via Shapley Additive Explanations (SHAP) as a diagnostic resource toward explainability of human annotation, as well as supervised and unsupervised behavioral machine learning analysis.
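To make the Shapley values mentioned above concrete, the sketch below computes them exactly from their definition for a tiny made-up example: each feature's value is its marginal contribution averaged over all orders in which features could be added. This is not the SHAP library, which relies on approximations to scale to real models; the feature names and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_score(features):
    """Hypothetical classifier score given the set of features made available."""
    score = 0.0
    if "speed" in features:
        score += 0.6
    if "distance" in features:
        score += 0.3
    if {"speed", "distance"} <= features:
        score += 0.1    # interaction bonus, split evenly by the Shapley value
    return score


phi = shapley_values(["speed", "distance"], toy_score)
```

The attributions sum to the difference between the score with all features and the score with none, which is the additivity property that makes the values usable as a per-feature explanation.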

20.

Comprehensive decision tree models in bioinformatics

Stiglic G Kocbek S Pernek I Kokol P 《PloS one》2012,7(3):e33812

Purpose

Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible.

Methods

This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by so called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model that is constrained exclusively by the dimensions of the produced decision tree.

Results

The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree.

Conclusions

The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics.