首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
MOTIVATION: Determining locations of protein expression is essential to understand protein function. Advances in green fluorescence protein (GFP) fusion proteins and automated fluorescence microscopy allow for rapid acquisition of large collections of protein localization images. Recognition of these cell images requires an automated image analysis system. Approaches taken by previous work concentrated on designing a set of optimal features and then applying standard machine-learning algorithms. In fact, trends of recent advances in machine learning and computer vision can be applied to improve the performance. One trend is the advances in multiclass learning with error-correcting output codes (ECOC). Another trend is the use of a large number of weak detectors with boosting for detecting objects in images of real-world scenes. RESULTS: We take advantage of these advances to propose a new learning algorithm, AdaBoost.ERC, coupled with weak and strong detectors, to improve the performance of automatic recognition of protein subcellular locations in cell images. We prepared two image data sets of CHO and Vero cells and downloaded a HeLa cell image data set in the public domain to evaluate our new method. We show that AdaBoost.ERC outperforms other AdaBoost extensions. We demonstrate the benefit of weak detectors by showing significant performance improvements over classifiers using only strong detectors. We also empirically test our method's capability of generalizing to heterogeneous image collections. Compared with previous work, our method performs reasonably well for the HeLa cell images. AVAILABILITY: CHO and Vero cell images, their corresponding feature sets (SSLF and WSLF), our new learning algorithm, AdaBoost.ERC, and Supplementary Material are available at http://aiia.iis.sinica.edu.tw/  相似文献   

2.
Some insights into protein structural class prediction   总被引:7,自引:0,他引:7  
Zhou GP  Assa-Munt N 《Proteins》2001,44(1):57-59
It has been quite clear that the success rate for predicting protein structural class can be improved significantly by using the algorithms that incorporate the coupling effect among different amino acid components of a protein. However, there is still a lot of confusion in understanding the relationship of these advanced algorithms, such as the least Mahalanobis distance algorithm, the component-coupled algorithm, and the Bayes decision rule. In this communication, a simple, rigorous derivation is provided to prove that the Bayes decision rule introduced recently for protein structural class prediction is completely the same as the earlier component-coupled algorithm. Meanwhile, it is also very clear from the derivative equations that the least Mahalanobis distance algorithm is an approximation of the component-coupled algorithm, also named as the covariant-discriminant algorithm introduced by Chou and Elrod in protein subcellular location prediction (Protein Engineering, 1999; 12:107-118). Clarification of the confusion will help use these powerful algorithms effectively and correctly interpret the results obtained by them, so as to conduce to the further development not only in the structural prediction area, but in some other relevant areas in protein science as well.  相似文献   

3.
For a protein, an important characteristic is its location or compartment in a cell. This is because a protein has to be located in its proper position in a cell to perform its biological functions. Therefore, predicting protein subcellular location is an important and challenging task in current molecular and cellular biology. In this paper, based on AdaBoost.ME algorithm and Chou's PseAAC (pseudo amino acid composition), a new computational method was developed to identify protein subcellular location. AdaBoost.ME is an improved version of AdaBoost algorithm that can directly extend the original AdaBoost algorithm to deal with multi-class cases without the need to reduce it to multiple two-class problems. In some previous studies the conventional amino acid composition was applied to represent protein samples. In order to take into account the sequence order effects, in this study we use Chou's PseAAC to represent protein samples. To demonstrate that AdaBoost.ME is a robust and efficient model in predicting protein subcellular locations, the same protein dataset used by Cedano et al. (Journal of Molecular Biology, 1997, 266: 594-600) is adopted in this paper. It can be seen from the computed results that the accuracy achieved by our method is better than those by the methods developed by the previous investigators.  相似文献   

4.
The development of high-throughput technology has generated a massive amount of high-dimensional data, and many of them are of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms are only applicable to the continuous data type. In this paper, we propose a novel method for sparse support vector machines (SVMs) with L_{p} (p ≪ 1) regularization. Efficient algorithms (LpSVM) are developed for learning the classifier that is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated through maximizing the area under the ROC curve (AUC) of the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. The software package in Matlab is available upon request.  相似文献   

5.
赵学彤  杨亚东  渠鸿竹  方向东 《遗传》2018,40(9):693-703
随着组学技术的不断发展,对于不同层次和类型的生物数据的获取方法日益成熟。在疾病诊治过程中会产生大量数据,通过机器学习等人工智能方法解析复杂、多维、多尺度的疾病大数据,构建临床决策支持工具,辅助医生寻找快速且有效的疾病诊疗方案是非常必要的。在此过程中,机器学习等人工智能方法的选择显得尤为重要。基于此,本文首先从类型和算法角度对临床决策支持领域中常用的机器学习等方法进行简要综述,分别介绍了支持向量机、逻辑回归、聚类算法、Bagging、随机森林和深度学习,对机器学习等方法在临床决策支持中的应用做了相应总结和分类,并对它们的优势和不足分别进行讨论和阐述,为临床决策支持中机器学习等人工智能方法的选择提供有效参考。  相似文献   

6.
Predicting subcellular localization with AdaBoost Learner   总被引:1,自引:0,他引:1  
Protein subcellular localization, which tells where a protein resides in a cell, is an important characteristic of a protein, and relates closely to the function of proteins. The prediction of their subcellular localization plays an important role in the prediction of protein function, genome annotation and drug design. Therefore, it is an important and challenging role to predict subcellular localization using bio-informatics approach. In this paper, a robust predictor, AdaBoost Learner is introduced to predict protein subcellular localization based on its amino acid composition. Jackknife cross-validation and independent dataset test were used to demonstrate that Adaboost is a robust and efficient model in predicting protein subcellular localization. As a result, the correct prediction rates were 74.98% and 80.12% for the Jackknife test and independent dataset test respectively, which are higher than using other existing predictors. An online server for predicting subcellular localization of proteins based on AdaBoost classifier was available on http://chemdata.shu. edu.cn/sl12.  相似文献   

7.
Jiang X  Wei R  Zhao Y  Zhang T 《Amino acids》2008,34(4):669-675
The knowledge of subnuclear localization in eukaryotic cells is essential for understanding the life function of nucleus. Developing prediction methods and tools for proteins subnuclear localization become important research fields in protein science for special characteristics in cell nuclear. In this study, a novel approach has been proposed to predict protein subnuclear localization. Sample of protein is represented by Pseudo Amino Acid (PseAA) composition based on approximate entropy (ApEn) concept, which reflects the complexity of time series. A novel ensemble classifier is designed incorporating three AdaBoost classifiers. The base classifier algorithms in three AdaBoost are decision stumps, fuzzy K nearest neighbors classifier, and radial basis-support vector machines, respectively. Different PseAA compositions are used as input data of different AdaBoost classifier in ensemble. Genetic algorithm is used to optimize the dimension and weight factor of PseAA composition. Two datasets often used in published works are used to validate the performance of the proposed approach. The obtained results of Jackknife cross-validation test are higher and more balance than them of other methods on same datasets. The promising results indicate that the proposed approach is effective and practical. It might become a useful tool in protein subnuclear localization. The software in Matlab and supplementary materials are available freely by contacting the corresponding author.  相似文献   

8.
Yasui Y  Pepe M  Hsu L  Adam BL  Feng Z 《Biometrics》2004,60(1):199-206
Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that a tissue that shows itself to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnosis certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning. The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.  相似文献   

9.
Prediction of protein classification is an important topic in molecular biology. This is because it is able to not only provide useful information from the viewpoint of structure itself, but also greatly stimulate the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, the LogitBoost, one of the boosting algorithms developed recently, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and is particularly superior in coping with noisy data. It was demonstrated that the LogitBoost outperformed the support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power in predicting protein structural classes as well as many other bio-macromolecular attributes will be further strengthened if the LogitBoost and some other existing algorithms can be effectively complemented with each other.  相似文献   

10.
We present an approach to predicting protein structural class that uses amino acid composition and hydrophobic pattern frequency information as input to two types of neural networks: (1) a three-layer back-propagation network and (2) a learning vector quantization network. The results of these methods are compared to those obtained from a modified Euclidean statistical clustering algorithm. The protein sequence data used to drive these algorithms consist of the normalized frequency of up to 20 amino acid types and six hydrophobic amino acid patterns. From these frequency values the structural class predictions for each protein (all-alpha, all-beta, or alpha-beta classes) are derived. Examples consisting of 64 previously classified proteins were randomly divided into multiple training (56 proteins) and test (8 proteins) sets. The best performing algorithm on the test sets was the learning vector quantization network using 17 inputs, obtaining a prediction accuracy of 80.2%. The Matthews correlation coefficients are statistically significant for all algorithms and all structural classes. The differences between algorithms are in general not statistically significant. These results show that information exists in protein primary sequences that is easily obtainable and useful for the prediction of protein structural class by neural networks as well as by standard statistical clustering algorithms.  相似文献   

11.
Structural class characterizes the overall folding type of a protein or its domain and the prediction of protein structural class has become both an important and a challenging topic in protein science. Moreover, the prediction itself can stimulate the development of novel predictors that may be straightforwardly applied to many other relational areas. In this paper, 10 frequently used sequence-derived structural and physicochemical features, which can be easily computed by the PROFEAT (Protein Features) web server, were taken as inputs of support vector machines to develop statistical learning models for predicting the protein structural class. More importantly, a strategy of merging different features, called best-first search, was developed. It was shown through the rigorous jackknife cross-validation test that the success rates by our method were significantly improved. We anticipate that the present method may also have important impacts on boosting the predictive accuracies for a series of other protein attributes, such as subcellular localization, membrane types, enzyme family and subfamily classes, among many others.  相似文献   

12.
A pseudo-random generator is an algorithm to generate a sequence of objects determined by a truly random seed which is not truly random. It has been widely used in many applications, such as cryptography and simulations. In this article, we examine current popular machine learning algorithms with various on-line algorithms for pseudo-random generated data in order to find out which machine learning approach is more suitable for this kind of data for prediction based on on-line algorithms. To further improve the prediction performance, we propose a novel sample weighted algorithm that takes generalization errors in each iteration into account. We perform intensive evaluation on real Baccarat data generated by Casino machines and random number generated by a popular Java program, which are two typical examples of pseudo-random generated data. The experimental results show that support vector machine and k-nearest neighbors have better performance than others with and without sample weighted algorithm in the evaluation data set.  相似文献   

13.
Zhang TL  Ding YS 《Amino acids》2007,33(4):623-629
Compared with the conventional amino acid composition (AA), the pseudo amino acid composition (PseAA) as originally introduced by Chou can incorporate much more information of a protein sequence; this remarkably enhances the power to use a discrete model for predicting various attributes of a protein. In this study, based on the concept of Chou's PseAA, a 46-D (dimensional) PseAA was formulated to represent the sample of a protein and a new approach based on binary-tree support vector machines (BTSVMs) was proposed to predict the protein structural class. BTSVMs algorithm has the capability in solving the problem of unclassifiable data points in multi-class SVMs. The results by both the 10-fold cross-validation and jackknife tests demonstrate that the predictive performance using the new PseAA (46-D) is better than that of AA (20-D), which is widely used in many algorithms for protein structural class prediction. The results obtained by the new approach are quite encouraging, indicating that it can at least play a complimentary role to many of the existing methods and is a useful tool for predicting many other protein attributes as well.  相似文献   

14.
Cai YD 《Proteins》2001,43(3):336-338
The paradox recently raised by Wang and Yuan (Proteins 2000;38:165-175) in protein structural class prediction is actually a misinterpretation of the data reported in the literature. The Bayes decision rule, which was deemed by Wang and Yuan to be the most powerful method for predicting protein structural classes based on the amino acid composition, and applied by these investigators to derive the upper limit of prediction rate for structural classes, is actually completely the same as the component-coupled algorithm proposed by previous investigators (Chou et al., Proteins 1998;31:97-103). Owing to lack of a complete or near-complete training data set, the upper limit rate thus derived by these investigators might be both invalid and misleading. Clarification of these points will further stimulate investigation of this interesting area.  相似文献   

15.
Studying similarities in protein molecules has become a fundamental activity in much of biology and biomedical research, for which methods such as multiple sequence alignments are widely used. Most methods available for such comparisons cater to studying proteins which have clearly recognizable evolutionary relationships but not to proteins that recognize the same or similar ligands but do not share similarities in their sequence or structural folds. In many cases, proteins in the latter class share structural similarities only in their binding sites. While several algorithms are available for comparing binding sites, there are none for deriving structural motifs of the binding sites, independent of the whole proteins. We report the development of SiteMotif, a new algorithm that compares binding sites from multiple proteins and derives sequence-order independent structural site motifs. We have tested the algorithm at multiple levels of complexity and demonstrate its performance in different scenarios. We have benchmarked against 3 current methods available for binding site comparison and demonstrate superior performance of our algorithm. We show that SiteMotif identifies new structural motifs of spatially conserved residues in proteins, even when there is no sequence or fold-level similarity. We expect SiteMotif to be useful for deriving key mechanistic insights into the mode of ligand interaction, predict the ligand type that a protein can bind and improve the sensitivity of functional annotation.  相似文献   

16.
Determining structural similarities between proteins is an important problem since it can help identify functional and evolutionary relationships. In this paper, an algorithm is proposed to align two protein structures. Given the protein backbones, the algorithm finds a rigid motion of one backbone onto the other such that large substructures are matched. The algorithm uses a representation of the backbones that is independent of their relative orientations in space and applies dynamic programming to this representation to compute an initial alignment, which is then refined iteratively. Experiments indicate that the algorithm is competitive with two well-known algorithms, namely DALI and LOCK.  相似文献   

17.
Rong Liu  Jianjun Hu 《Proteins》2013,81(11):1885-1899
Accurate prediction of DNA‐binding residues has become a problem of increasing importance in structural bioinformatics. Here, we presented DNABind, a novel hybrid algorithm for identifying these crucial residues by exploiting the complementarity between machine learning‐ and template‐based methods. Our machine learning‐based method was based on the probabilistic combination of a structure‐based and a sequence‐based predictor, both of which were implemented using support vector machines algorithms. The former included our well‐designed structural features, such as solvent accessibility, local geometry, topological features, and relative positions, which can effectively quantify the difference between DNA‐binding and nonbinding residues. The latter combined evolutionary conservation features with three other sequence attributes. Our template‐based method depended on structural alignment and utilized the template structure from known protein–DNA complexes to infer DNA‐binding residues. We showed that the template method had excellent performance when reliable templates were found for the query proteins but tended to be strongly influenced by the template quality as well as the conformational changes upon DNA binding. In contrast, the machine learning approach yielded better performance when high‐quality templates were not available (about 1/3 cases in our dataset) or the query protein was subject to intensive transformation changes upon DNA binding. Our extensive experiments indicated that the hybrid approach can distinctly improve the performance of the individual methods for both bound and unbound structures. DNABind also significantly outperformed the state‐of‐art algorithms by around 10% in terms of Matthews's correlation coefficient. The proposed methodology could also have wide application in various protein functional site annotations. DNABind is freely available at http://mleg.cse.sc.edu/DNABind/ . Proteins 2013; 81:1885–1899. © 2013 Wiley Periodicals, Inc.  相似文献   

18.
Autonomous learning techniques are based on experience acquisition. In most realistic applications, experience is time-consuming: it implies sensor reading, actuator control and algorithmic update, constrained by the learning system dynamics. The information crudeness upon which classical learning algorithms operate make such problems too difficult and unrealistic. Nonetheless, additional information for facilitating the learning process ideally should be embedded in such a way that the structural, well-studied characteristics of these fundamental algorithms are maintained. We investigate in this article a more general formulation of the Q-learning method that allows for a spreading of information derived from single updates towards a neighbourhood of the instantly visited state and converges to optimality. We show how this new formulation can be used as a mechanism to safely embed prior knowledge about the structure of the state space, and demonstrate it in a modified implementation of a reinforcement learning algorithm in a real robot navigation task.  相似文献   

19.
Autonomous learning techniques are based on experience acquisition. In most realistic applications, experience is time-consuming: it implies sensor reading, actuator control and algorithmic update, constrained by the learning system dynamics. The information crudeness upon which classical learning algorithms operate make such problems too difficult and unrealistic. Nonetheless, additional information for facilitating the learning process ideally should be embedded in such a way that the structural, well-studied characteristics of these fundamental algorithms are maintained. We investigate in this article a more general formulation of the Q-learning method that allows for a spreading of information derived from single updates towards a neighbourhood of the instantly visited state and converges to optimality. We show how this new formulation can be used as a mechanism to safely embed prior knowledge about the structure of the state space, and demonstrate it in a modified implementation of a reinforcement learning algorithm in a real robot navigation task.  相似文献   

20.
Three-dimensional reconstruction from electron micrographs requires the selection of many single-particle projection images; more than 10 000 are generally required to obtain 5- to 10-A structural resolution. Consequently, various automatic detection algorithms have been developed and successfully applied to large symmetric protein complexes. This paper presents a new automated particle recognition and pickup procedure based on the three-layer neural network that has a large application range than other automated procedures. Its use for both faint and noisy electron micrographs is demonstrated. The method requires only 200 selected particles as learning data and is able to detect images of proteins as small as 200 kDa.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号