1.

Weakly supervised learning of information structure of scientific abstracts--is it accurate enough to benefit real-world tasks in biomedicine?

Guo Y Korhonen A Silins I Stenius U 《Bioinformatics (Oxford, England)》2011,27(22):3179-3185

MOTIVATION: Many practical tasks in biomedicine require accessing specific types of information in scientific literature; e.g. information about the methods, results or conclusions of the study in question. Several approaches have been developed to identify such information in scientific journal articles. The best of these have yielded promising results and proved useful for biomedical text mining tasks. However, relying on fully supervised machine learning (ML) and a large body of annotated data, existing approaches are expensive to develop and port to different tasks. A potential solution to this problem is to employ weakly supervised learning instead. In this article, we investigate a weakly supervised approach to identifying information structure according to a scheme called Argumentative Zoning (AZ). We apply four weakly supervised classifiers to biomedical abstracts and evaluate their performance both directly and in a real-life scenario in the context of cancer risk assessment. RESULTS: Our best weakly supervised classifier (based on the combination of active learning and self-training) performs well on the task, outperforming our best supervised classifier: it yields a high accuracy of 81% when just 10% of the labeled data is used for training. When cancer risk assessors are presented with the resulting annotated abstracts, they find relevant information in them significantly faster than when presented with unannotated abstracts. These results suggest that weakly supervised learning could be used to improve the practical usefulness of information structure for real-life tasks in biomedicine.
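
The abstract above names self-training as one ingredient of its best classifier but gives no implementation details. As a rough illustration only, the following Python sketch shows the general self-training loop on toy one-dimensional data. The nearest-centroid classifier, the `min_margin` confidence rule, and all function names are assumptions made for this example and are not taken from the paper; the active-learning half of the authors' method is not shown.

```python
from statistics import mean


def fit_centroids(labeled):
    """Fit a toy nearest-centroid model: one mean feature value per class."""
    classes = {label for _, label in labeled}
    return {c: mean(x for x, label in labeled if label == c) for c in classes}


def predict_with_margin(centroids, x):
    """Return the closest class and the distance gap to the runner-up class."""
    ranked = sorted((abs(x - centre), label) for label, centre in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels (hypothetical rule)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, uncertain = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                uncertain.append(x)
        if not confident:
            break  # nothing new can be labeled with enough confidence
        labeled.extend(confident)  # adopt the confident pseudo-labels
        pool = uncertain           # retry the uncertain items next round
    return fit_centroids(labeled), labeled, pool


if __name__ == "__main__":
    seed = [(0.0, "a"), (10.0, "b")]
    model, grown, leftover = self_train(seed, [1.0, 9.0, 5.2, 4.9])
    print(model, len(grown), leftover)
```

In this toy run the two points near the class centres are pseudo-labeled in the first round, while the two points near the decision boundary never reach the margin and stay unlabeled; a real system would typically hand such items to an annotator, which is where active learning would come in.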

2.

Prediction of protein subcellular localization (total citations: 6; self-citations: 0; citations by others: 6)

Yu CS Chen YC Lu CH Hwang JK 《Proteins》2006,64(3):643-651

Because the protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vectors derived from sequences; the second level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.
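
The abstract above leans on percent sequence identity (the 30% and 80-90% figures) without defining it. As a hedged illustration, the Python sketch below computes identity for two sequences that are already aligned to the same length. The function name, the gap character, and the choice of denominator (columns where neither sequence has a gap) are assumptions for this example; several other denominator conventions are in use, and the abstract does not say which one the authors applied.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumes the two strings come from a pairwise alignment and therefore
    have equal length; the denominator convention is one of several in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Four ungapped columns, three of them identical.
    print(percent_identity("ACDE-", "ACDFG"))
```

A benchmark could then be screened by flagging any pair whose identity exceeds a chosen cutoff, which is the kind of redundancy the abstract warns about.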

3.

Separability indexes and accuracy of neuro-fuzzy classification in Geographic Information Systems for assessment of coastal environmental vulnerability

《Ecological Informatics》2012

The aim of this study was the development, evaluation and analysis of a neuro-fuzzy classifier for a supervised and hard classification of coastal environmental vulnerability due to marine aquaculture using minimal training sets within a Geographic Information System (GIS). The neuro-fuzzy classification model NEFCLASS‐J was used to develop learning algorithms to create the structure (rule base) and the parameters (fuzzy sets) of a fuzzy classifier from a set of labeled data. The training sites were manually classified based on four categories of coastal environmental vulnerability through meetings and interviews with experts having field experience and specific knowledge of the environmental problems investigated. The inter-class separability estimations were performed on the training data set to assess the difficulty of the class separation problem under investigation. The two training data sets did not follow the assumptions of multivariate normality. For this reason Bhattacharyya and Jeffries–Matusita distances were used to estimate the probability of correct classification. Further evaluation and analysis of the quality of the classification achieved low values of quantity and allocation disagreement and a good overall accuracy. For each of the four classes the user and producer values for accuracy were between 77% and 100%. In conclusion, the use of a neuro-fuzzy classifier for a supervised and hard classification of coastal environmental vulnerability demonstrated an ability to derive an accurate and reliable classification using a minimal number of training sets.
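
The abstract above names the Bhattacharyya and Jeffries–Matusita distances as separability measures but does not say how they were estimated. The Python sketch below shows one common formulation for two discrete (histogram) distributions: the Bhattacharyya distance B = -ln(sum of sqrt(p_i * q_i)) and the Jeffries–Matusita distance JM = 2(1 - e^(-B)), which ranges from 0 (identical) to 2 (fully separable). Using histograms rather than a Gaussian closed form, and the function names, are assumptions for this example; some texts define JM as the square root of this quantity.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Bhattacharyya coefficient: overlap between the two distributions.
    coefficient = min(1.0, sum(math.sqrt(a * b) for a, b in zip(p, q)))
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)


def jeffries_matusita(p, q):
    """Jeffries-Matusita distance in the 0..2 convention: 2 * (1 - exp(-B))."""
    return 2.0 * (1.0 - math.exp(-bhattacharyya_distance(p, q)))


if __name__ == "__main__":
    class_a = [0.5, 0.5, 0.0, 0.0]
    class_b = [0.0, 0.0, 0.5, 0.5]
    print(jeffries_matusita(class_a, class_a))  # identical classes
    print(jeffries_matusita(class_a, class_b))  # disjoint classes
```

Because the JM distance saturates at 2, it is often preferred over the unbounded Bhattacharyya distance when comparing how separable several class pairs are.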

4.

Relative protein quantification by isobaric SILAC with immonium ion splitting (ISIS)

Colzani M Schütz F Potts A Waridel P Quadroni M 《Molecular & cellular proteomics : MCP》2008,7(5):927-937

Metabolic labeling techniques have recently become popular tools for the quantitative profiling of proteomes. Classical stable isotope labeling with amino acids in cell cultures (SILAC) uses pairs of heavy/light isotopic forms of amino acids to introduce predictable mass differences in protein samples to be compared. After proteolysis, pairs of cognate precursor peptides can be correlated, and their intensities can be used for mass spectrometry-based relative protein quantification. We present an alternative SILAC approach by which two cell cultures are grown in media containing isobaric forms of amino acids, labeled either with 13C on the carbonyl (C-1) carbon or 15N on backbone nitrogen. Labeled peptides from both samples have the same nominal mass and nearly identical MS/MS spectra but generate upon fragmentation distinct immonium ions separated by 1 amu. When labeled protein samples are mixed, the intensities of these immonium ions can be used for the relative quantification of the parent proteins. We validated the labeling of cellular proteins with valine, isoleucine, and leucine with coverage of 97% of all tryptic peptides. We improved the sensitivity for the detection of the quantification ions on a pulsing instrument by using a specific fast scan event. The analysis of a protein mixture with a known heavy/light ratio showed reliable quantification. Finally the application of the technique to the analysis of two melanoma cell lines yielded quantitative data consistent with those obtained by a classical two-dimensional DIGE analysis of the same samples. Our method combines the features of the SILAC technique with the advantages of isobaric labeling schemes like iTRAQ. We discuss advantages and disadvantages of isobaric SILAC with immonium ion splitting as well as possible ways to improve it.

5.

Estimating the localization spread function of static single-molecule localization microscopy images

《Biophysical journal》2022,121(15):2906-2920

Single-molecule localization microscopy (SMLM) permits the visualization of cellular structures an order of magnitude smaller than the diffraction limit of visible light, and an accurate, objective evaluation of the resolution of an SMLM data set is an essential aspect of the image processing and analysis pipeline. Here, we present a simple method to estimate the localization spread function (LSF) of a static SMLM data set directly from acquired localizations, exploiting the correlated dynamics of individual emitters and properties of the pair autocorrelation function evaluated in both time and space. The method is demonstrated on simulated localizations, DNA origami rulers, and cellular structures labeled by dye-conjugated antibodies, DNA-PAINT, or fluorescent fusion proteins. We show that experimentally obtained images have LSFs that are broader than expected from the localization precision alone, due to additional uncertainty accrued when localizing molecules imaged over time.

6.

Quantitative imaging of lymphocyte membrane protein reorganization and signaling

Kasson PM Huppa JB Krogsgaard M Davis MM Brunger AT 《Biophysical journal》2005,88(1):579-589

Changes in membrane protein localization are critical to establishing cell polarity and regulating cell signaling. Fluorescence microscopy of labeled proteins allows visualization of these changes, but quantitative analysis is needed to study this aspect of cell signaling in full mechanistic detail. We have developed a novel approach for quantitative assessment of membrane protein redistribution based on four-dimensional video microscopy of fluorescently labeled proteins. Our analytic system provides robust automated methods for cell surface reconstruction, cell shape tracking, cell-surface distance measurement, and cluster formation analysis. These methods permit statistical analyses and testing of mechanistic hypotheses regarding cell signaling. We have used this approach to measure antigen-dependent clustering of signaling molecules in CD4+ T lymphocytes, obtaining clustering velocities consistent with single-particle tracking data. Our system captures quantitative differences in clustering between signaling proteins with distinct biological functions. Our methods can be generalized to a range of cell-signaling phenomena and enable novel applications not feasible with single-particle studies, such as analysis of subcellular protein localization in live organ culture.

7.

Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

Ajay Basavanhally Satish Viswanath Anant Madabhushi 《PloS one》2015,10(5)

Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data as well as three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling for all three datasets.
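
The abstract above compares sampling strategies by the interquartile range (IQR) of their error rates. As a small illustration of that statistic only, the Python sketch below computes the IQR of a list of error rates with the standard library. The function name and the two example lists are hypothetical and are not data from the paper; quartile conventions also differ between tools, and this sketch uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles


def interquartile_range(values):
    """IQR = third quartile minus first quartile (default 'exclusive' method)."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


if __name__ == "__main__":
    # Hypothetical error rates from repeated runs of two sampling strategies.
    stable_runs = [0.20, 0.21, 0.21, 0.22, 0.22, 0.23, 0.23]
    noisy_runs = [0.12, 0.18, 0.21, 0.25, 0.29, 0.33, 0.40]
    print(interquartile_range(stable_runs))
    print(interquartile_range(noisy_runs))
```

A smaller IQR means the middle half of the repeated error estimates sits in a narrower band, which is the sense in which the abstract calls one approach less variable than the other.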

8.

Quantitative single cell monitoring of protein synthesis at subcellular resolution using fluorescently labeled tRNA

Barhoom S Kaur J Cooperman BS Smorodinsky NI Smilansky Z Ehrlich M Elroy-Stein O 《Nucleic acids research》2011,39(19):e129

We have developed a novel technique of using fluorescent tRNA for translation monitoring (FtTM). FtTM enables the identification and monitoring of active protein synthesis sites within live cells at submicron resolution through quantitative microscopy of transfected bulk uncharged tRNA, fluorescently labeled in the D-loop (fl-tRNA). The localization of fl-tRNA to active translation sites was confirmed through its co-localization with cellular factors and its dynamic alterations upon inhibition of protein synthesis. Moreover, fluorescence resonance energy transfer (FRET) signals, generated when fl-tRNAs, separately labeled as a FRET pair, occupy adjacent sites on the ribosome, quantitatively reflect levels of protein synthesis in defined cellular regions. In addition, FRET signals enable detection of intra-populational variability in protein synthesis activity. We demonstrate that FtTM allows quantitative comparison of protein synthesis between different cell types, monitoring effects of antibiotics and stress agents, and characterization of changes in spatial compartmentalization of protein synthesis upon viral infection.

9.

Three new Drosophila markers of intracellular membranes

LaJeunesse DR Buckner SM Lake J Na C Pirt A Fromson K 《BioTechniques》2004,36(5):784-8, 790

The need for cellular markers that permit a quick and accurate evaluation of a protein's subcellular localization has increased with the surge of new data generated by the Drosophila genome project. In this report, we present three ubiquitously expressed Drosophila transgenes that expressed a green fluorescent protein variant (enhanced yellow fluorescent protein) that has been targeted to different intracellular membrane targets: the Golgi apparatus, mitochondria, and endoplasmic reticulum. These markers serve as an internal standard for characterizing a protein's subcellular localization or as a means of tracking the dynamics of intracellular organelles during normal or abnormal cellular or developmental processes. We have also examined fixation artifacts using these constructs to illustrate the effects that fixation and permeabilization have on intracellular membrane organization.

10.

Learning cellular sorting pathways using protein interactions and sequence motifs

Lin TH Bar-Joseph Z Murphy RF 《Journal of computational biology》2011,18(11):1709-1722

Proper subcellular localization is critical for proteins to perform their roles in cellular functions. Proteins are transported by different cellular sorting pathways, some of which take a protein through several intermediate locations until reaching its final destination. The pathway a protein is transported through is determined by carrier proteins that bind to specific sequence motifs. In this article, we present a new method that integrates protein interaction and sequence motif data to model how proteins are sorted through these sorting pathways. We use a hidden Markov model (HMM) to represent protein sorting pathways. The model is able to determine intermediate sorting states and to assign carrier proteins and motifs to the sorting pathways. In simulation studies, we show that the method can accurately recover an underlying sorting model. Using data for yeast, we show that our model leads to accurate prediction of subcellular localization. We also show that the pathways learned by our model recover many known sorting pathways and correctly assign proteins to the path they utilize. The learned model identified new pathways and their putative carriers and motifs and these may represent novel protein sorting mechanisms. Supplementary results and software implementation are available from http://murphylab.web.cmu.edu/software/2010_RECOMB_pathways/.

11.

Single-strand DNA aptamers as probes for protein localization in cells.

Kristi K H Stanlis J Richard McIntosh 《The journal of histochemistry and cytochemistry》2003,51(6):797-808

The accurate localization of proteins in fixed cells is important for many studies in cell biology, but good fixation is often antagonistic to good immunolabeling, given the density of well-preserved cells and the size of most labeled antibody probes. We therefore explored the use of single-stranded oligonucleotides (aptamers), which can bind to proteins with very high affinity and specificity but which are only approximately 10 kD. To evaluate these probes for general protein localization, we sought an aptamer that binds to a widely used protein tag, the green fluorescent protein (GFP). Although this quest was not successful, we were able to solve several practical problems that will confront any such labeling effort, e.g., the rates at which oligonucleotides enter fixed cells of different kinds and the extent of nonspecific oligonucleotide binding to both mammalian and yeast cell structures. Because such localization methods would be of particular value for electron microscopy of optimally fixed material, we also explored the solubility of aptamers under conditions suitable for freeze-substitution fixation. We found that aptamers are sufficiently soluble in cold organic solvents to encourage the view that this approach may be useful for the localization of specific proteins in the context of cellular fine structure.

12.

Face recognition from a single image per person using deep architecture neural networks

Tian Zhuo 《Cluster computing》2016,19(1):73-77

Implementing an accurate face recognition system requires images in different variations, and if the database is large, we suffer from problems such as storage cost and slow recognition algorithms. On the other hand, in some applications there is only one image available per person for training the recognition model. In this article, we propose a neural network model, inspired by the bidirectional analysis and synthesis brain network, which can learn a nonlinear mapping between image space and components space. Using a deep neural network model, we have tried to separate pose components from person components. After setting apart these components, we can use them to synthesize virtual images of test data in different pose and lighting conditions. These virtual images are used to train a neural network classifier. The results showed that training the neural classifier with virtual images gives better performance than training the classifier with frontal view images.

13.

Cellular trafficking and photochemical internalization of cell penetrating peptide linked cargo proteins: a dual fluorescent labeling study (total citations: 1; self-citations: 0; citations by others: 1)

Gillmeister MP Betenbaugh MJ Fishman PS 《Bioconjugate chemistry》2011,22(4):556-566

Initial cellular uptake of cell penetrating peptide (CPP) linked macromolecules is usually endosomal, with passage from endosome to cytosol a major limitation to efficient delivery. To gain a better understanding of the passage of the CPP-linked proteins, we studied the uptake and localization of CPP-linked proteins that contained two different forms of fluorescent markers, GFP protein and chemically conjugated tetramethylrhodamine, in living cells. Rhodamine labeled TAT-GFP was internalized in multiple cell lines including HEK293, N18-RE-105, hippocampal slices, and human neural progenitor cells and showed predominantly endosomal localization of both fluorescent markers. Cytosolic localization of some rhodamine label was detected to suggest that some of the GFP label had exited from the endosome. However, quantification of the distribution of the rhodamine and GFP label indicated that the protein location was primarily endosomal and that the distribution of TAT-GFP was not significantly different than that of an exclusively endosomal localized exogenous protein (tetanus toxin fragment C - TTC). As a result, photochemical internalization (PCI) was evaluated and caused a significant quantitative redistribution of cellular fluorescence of rhodamine and GFP labels to demonstrate increased cytosolic delivery of GFP. While rhodamine-labeled TAT-GFP showed cytosolic delivery with exposure to specific wavelengths of fluorescent illumination, a similarly labeled GFP fusion protein containing the membrane binding domain of TTC did not mediate PCI in N18-RE-105 cells.

14.

Semi-supervised protein classification using cluster kernels (total citations: 2; self-citations: 0; citations by others: 2)

Weston J Leslie C Ie E Zhou D Elisseeff A Noble WS 《Bioinformatics (Oxford, England)》2005,21(15):3241-3247

MOTIVATION: Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data--examples with known 3D structures, organized into structural classes--whereas in practice, unlabeled data are far more plentiful. RESULTS: In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. AVAILABILITY: Source code is available at www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider MATLAB package is available at www.kyb.tuebingen.mpg.de/bs/people/spider. SUPPLEMENTARY INFORMATION: www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot.

15.

Systematic validation of antibody binding and protein subcellular localization using siRNA and confocal microscopy

Stadler C Hjelmare M Neumann B Jonasson K Pepperkok R Uhlén M Lundberg E 《Journal of Proteomics》2012,75(7):2236-2251

We have developed a platform for validation of antibody binding and protein subcellular localization data obtained from immunofluorescence using siRNA technology combined with automated confocal microscopy and image analysis. By combining the siRNA technology with automated sample preparation, automated imaging and quantitative image analysis, a high-throughput assay has been set up to enable confirmation of accurate protein binding and localization in a systematic manner. Here, we describe the analysis and validation of the subcellular location of 65 human proteins, targeted by 75 antibodies and silenced by 130 siRNAs. A large fraction (80%) of the subcellular locations, including locations of several previously uncharacterized proteins, could be confirmed by the significant down-regulation of the antibody signal after the siRNA silencing. A quantitative analysis was set up using automated image analysis to facilitate studies of targets found in more than one compartment. The results obtained using the platform demonstrate that siRNA silencing in combination with quantitative image analysis of antibody signals in different compartments of the cells is an attractive approach for ensuring accurate protein localization as well as antibody binding using immunofluorescence. With a large fraction of the human proteome still unexplored, we suggest this approach to be of great importance in the continued work of mapping the human proteome on a subcellular level.

16.

Semi-supervised analysis of gene expression profiles for lineage-specific development in the Caenorhabditis elegans embryo

Qi Y Missiuro PE Kapoor A Hunter CP Jaakkola TS Gifford DK Ge H 《Bioinformatics (Oxford, England)》2006,22(14):e417-e423

MOTIVATION: Gene expression profiling is a powerful approach to identify genes that may be involved in a specific biological process on a global scale. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of given cell types. However, it is difficult for traditional computational methods, including unsupervised and supervised learning methods, to detect relevant genes from a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expressions together while ignoring important prior biological knowledge. Supervised methods utilize training data from prior biological knowledge to classify gene expression. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods. RESULTS: We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments. 
AVAILABILITY: The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.

17.

Predicting protein sumoylation sites from sequence features

Teng S Luo H Wang L 《Amino acids》2012,43(1):447-455

Protein sumoylation is a post-translational modification that plays an important role in a wide range of cellular processes. Small ubiquitin-related modifier (SUMO) can be covalently and reversibly conjugated to the sumoylation sites of target proteins, many of which are implicated in various human genetic disorders. The accurate prediction of protein sumoylation sites may help biomedical researchers to design their experiments and understand the molecular mechanism of protein sumoylation. In this study, a new machine learning approach has been developed for predicting sumoylation sites from protein sequence information. Random forests (RFs) and support vector machines (SVMs) were trained with the data collected from the literature. Domain-specific knowledge in terms of relevant biological features was used for input vector encoding. It was shown that RF classifier performance was affected by the sequence context of sumoylation sites, and 20 residues with the core motif ΨKXE in the middle appeared to provide enough context information for sumoylation site prediction. The RF classifiers were also found to outperform SVM models for predicting protein sumoylation sites from sequence features. The results suggest that the machine learning approach gives rise to more accurate prediction of protein sumoylation sites than the other existing methods. The accurate classifiers have been used to develop a new web server, called seeSUMO (http://bioinfo.ggc.org/seesumo/), for sequence-based prediction of protein sumoylation sites.
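
The abstract above describes a 20-residue context window with the core motif ΨKXE in the middle. As a rough illustration of that windowing idea only, the Python sketch below scans a protein sequence for ΨKXE and cuts out the surrounding residues. The residue set used for Ψ (`ILVMF`), the function name, and the truncation of windows at sequence ends are assumptions for this example; the abstract does not specify them, and this scan is not the seeSUMO classifier itself.

```python
import re

# Assumption: residues treated as the hydrophobic position "psi" in this sketch.
PSI_RESIDUES = "ILVMF"
# Lookahead so that overlapping motif occurrences are all reported.
MOTIF = re.compile(rf"(?=([{PSI_RESIDUES}]K.E))")


def candidate_windows(sequence, window=20):
    """Return (lysine_position, motif, context) for each psi-K-X-E match.

    lysine_position is 1-based; context is up to `window` residues with the
    four-residue motif centred, truncated at the ends of the sequence.
    """
    flank = (window - 4) // 2
    hits = []
    for match in MOTIF.finditer(sequence):
        start = match.start()
        context = sequence[max(0, start - flank):start + 4 + flank]
        hits.append((start + 2, match.group(1), context))
    return hits


if __name__ == "__main__":
    toy_sequence = "A" * 10 + "LKAE" + "G" * 10  # hypothetical sequence
    for position, motif, context in candidate_windows(toy_sequence):
        print(position, motif, context)
```

Windows like these would then be encoded as feature vectors for a classifier; a motif match by itself only marks a candidate lysine, not a confirmed sumoylation site.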

18.

Automated analysis of high‐content microscopy data with deep learning

Jimmy Ba Yolanda Chong Brendan J Frey Charles Boone Brenda J Andrews 《Molecular systems biology》2017,13(4)

Existing computational pipelines for quantitative analysis of high‐content microscopy data rely on traditional machine learning approaches that fail to accurately classify more than a single dataset without substantial tuning and training, requiring extensive analysis. Here, we demonstrate that the application of deep learning to biological image data can overcome the pitfalls associated with conventional machine learning classifiers. Using a deep convolutional neural network (DeepLoc) to analyze yeast cell images, we show improved performance over traditional approaches in the automated classification of protein subcellular localization. We also demonstrate the ability of DeepLoc to classify highly divergent image sets, including images of pheromone‐arrested cells with abnormal cellular morphology, as well as images generated in different genetic backgrounds and in different laboratories. We offer an open‐source implementation that enables updating DeepLoc on new microscopy datasets. This study highlights deep learning as an important tool for the expedited analysis of high‐content microscopy data.

19.

Identification of 40LoVe, a Xenopus hnRNP D family protein involved in localizing a TGF-beta-related mRNA during oogenesis

Czaplinski K Köcher T Schelder M Segref A Wilm M Mattaj IW 《Developmental cell》2005,8(4):505-515

Asymmetric distribution of cellular components underlies many biological processes, and the localization of mRNAs within domains of the cytoplasm is one important mechanism of establishing and maintaining cellular asymmetry. mRNA localization often involves assembly of large ribonucleoproteins (RNPs) in the cytoplasm. Using an RNA affinity chromatography approach, we investigated localization RNP formation on the vegetal localization element (VLE) of the mRNA encoding Vg1, a Xenopus TGF-beta family member. We identified 40LoVe, an hnRNP D family protein, as a specific VLE binding protein from Xenopus oocytes. Interaction of 40LoVe with the VLE strictly correlates with the ability of the RNA to localize, and antibodies against 40LoVe inhibit vegetal localization in vivo in oocytes. Our results associate an hnRNP D protein with mRNA localization and have implications for several functions mediated by this important protein family.

20.

DQB: A novel dynamic quantitive classification model using artificial bee colony algorithm with application on gene expression profiles

Hala M. Alshamlan 《Saudi Journal of Biological Sciences》2018,25(5):932-946

In the medical domain, it is very important to develop a rule-based classification model, because such a model is comprehensible and accounts for its predictions. Moreover, it is desirable to know not only the classification decisions but also what leads to these decisions. In this paper, we propose a novel dynamic quantitative rule-based classification model, namely DQB, which integrates quantitative association rule mining and the Artificial Bee Colony (ABC) algorithm to provide users with more convenience in terms of understandability and interpretability via an accurate class quantitative association rule-based classifier model. As far as we know, this is the first attempt to apply the ABC algorithm in mining for quantitative rule-based classifier models. In addition, this is the first attempt to use quantitative rule-based classification models for classifying microarray gene expression profiles. Also, in this research we developed a new dynamic local search strategy named DLS, which improves the local search of the ABC algorithm. The performance of the proposed model has been compared with well-known quantitative-based classification methods and bio-inspired meta-heuristic classification algorithms, using six gene expression profiles for binary and multi-class cancer datasets. From the results, it can be concluded that a considerable increase in classification accuracy is obtained for DQB when compared to other available algorithms in the literature, and that it is able to provide an interpretable model for biologists. This confirms the significance of the proposed algorithm in constructing a rule-based classifier model, and accordingly shows that these rules capture high-quality and meaningful knowledge extracted from the training set, where all subsets of quantitative rules report close to 100% classification accuracy with a minimum number of genes.
It is remarkable that, to the best of our knowledge, several new genes were discovered that have not been seen in any past studies. Regarding applicability, based on the results acquired from microarray gene expression analysis, we can conclude that DQB can be adopted in different real-world applications with some modifications.