1.
2.
One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of the compartments that organize them in the cellular environment. Knowledge of the subcellular locations of proteins can provide key hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, determining the localization of an uncharacterized protein in a living cell purely by experiment is both time-consuming and expensive. With the avalanche of newly found protein sequences emerging in the post-genomic era, we face a critical challenge: how to develop an automated method that identifies their subcellular locations quickly and reliably, so that they can be used in a timely manner for basic research and drug discovery. In view of this, an ensemble classifier was developed by fusing many basic individual classifiers through a voting system. Each of these basic classifiers was trained in a different dimension of the amphiphilic pseudo amino acid composition (Chou [2005] Bioinformatics 21: 10-19). As a demonstration, predictions were performed with the fusion classifier for proteins among the following 14 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. The overall success rates obtained via the resubstitution test, jackknife test, and independent dataset test were all significantly higher than those of the existing classifiers. It is anticipated that the novel ensemble classifier may also become a very useful vehicle for classifying other attributes of proteins according to their sequences, such as membrane protein type, enzyme family/sub-family, G-protein coupled receptor (GPCR) type, and structural class, among many others.
The fusion ensemble classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
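The fusion of base classifiers through a voting system can be illustrated with a minimal sketch. The abstract does not state the exact voting rule or how the base classifiers are built, so plain majority voting, the tie-breaking rule, and the name `fuse_by_vote` are assumptions made only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: a base classifier maps a protein sequence to a location label.
Classifier = Callable[[str], str]


def fuse_by_vote(classifiers: Sequence[Classifier], sequence: str) -> str:
    """Return the label predicted by the largest number of base classifiers.

    Ties go to the label that received its first vote earliest
    (an assumption; the abstract gives no tie-breaking rule).
    """
    if not classifiers:
        raise ValueError("at least one base classifier is required")
    votes = [clf(sequence) for clf in classifiers]
    counts = Counter(votes)
    best = max(counts.values())
    # Walk the votes in order so that ties resolve deterministically.
    for label in votes:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: votes is non-empty")


if __name__ == "__main__":
    # Toy stand-ins for trained base classifiers.
    toy = [lambda s: "nucleus", lambda s: "cytoplasm", lambda s: "nucleus"]
    print(fuse_by_vote(toy, "MKVLAAGIVG"))
```

In the setting of the abstract, each base classifier would be trained on a different dimension of the amphiphilic pseudo amino acid composition; the toy lambdas above only stand in for such trained models.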
3.
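Prediction of Membrane Protein Folding Types Based on a Fuzzy Support Vector Machine (total citations: 1; self-citations: 0; citations by others: 1)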
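Existing methods that use support vector machines (SVMs) to predict membrane protein folding types do not make full use of protein sequence features, and they leave unclassifiable regions when handling multi-class protein classification problems. To address these two issues, amino acid composition and dipeptide composition features were extracted from protein sequences, and weighted multi-order correlation coefficient features of amino acid residue indices were computed; the three types of features were fused into the input feature vector of the classifier. A fuzzy SVM (FSVM) algorithm was adopted to classify data that a conventional SVM cannot separate. Tests on a non-redundant dataset show that, under the same classification algorithm, the improved feature extraction method outperforms existing feature extraction methods, and that, under the same feature extraction method, FSVM outperforms the conventional SVM. The strategy combining the two achieved a prediction accuracy of 96.6% on an independent dataset test, outperforming several existing prediction methods, and can serve as an effective tool for predicting the folding types of membrane proteins and other proteins.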
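Two of the three feature types named in this abstract, amino acid composition and dipeptide composition, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the residue ordering is an arbitrary choice, and the third feature type (weighted multi-order correlation coefficients of residue indices) is omitted because the abstract does not define the indices or the weights.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def _check(seq: str) -> str:
    """Upper-case the sequence and reject non-standard residues."""
    seq = seq.upper()
    bad = set(seq) - set(AMINO_ACIDS)
    if bad:
        raise ValueError(f"non-standard residues: {sorted(bad)}")
    return seq


def amino_acid_composition(seq: str) -> List[float]:
    """20 values: the fraction of the sequence made up by each residue."""
    seq = _check(seq)
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq: str) -> List[float]:
    """400 values: the fraction of overlapping adjacent pairs of each kind."""
    seq = _check(seq)
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]


def fused_features(seq: str) -> List[float]:
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

A vector produced by `fused_features` could then be passed to any classifier; the fuzzy SVM step itself is not reproduced here.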
4.
Protein phosphorylation is important for regulation of most biological functions and up to 50% of all proteins are thought to be modified by protein kinases. Increased knowledge about potential phosphorylation of a protein may increase our understanding of the molecular processes in which it takes part. Despite the importance of protein phosphorylation, identification of phosphoproteins and localization of phosphorylation sites is still a major challenge in proteomics. However, high-throughput methods for identification of phosphoproteins are being developed, in particular within the fields of bioinformatics and mass spectrometry. In this review, we present a toolbox of current technology applied in phosphoproteomics including computational prediction, chemical approaches and mass spectrometry-based analysis, and propose an integrated strategy for experimental phosphoproteomics.
5.
Yeonuk Kim, Mark S. Johnson, Sara H. Knox, T. Andrew Black, Higo J. Dalmagro, Minseok Kang, Joon Kim, Dennis Baldocchi. Global Change Biology, 2020, 26(3): 1499-1518
Methane flux (FCH4) measurements using the eddy covariance technique have increased over the past decade. FCH4 measurements commonly include data gaps, as is the case with CO2 and energy fluxes. However, gap‐filling FCH4 data is more challenging than gap‐filling other fluxes because of the unique characteristics of FCH4, including multidriver dependency, variability across multiple timescales, nonstationarity, spatial heterogeneity of flux footprints, and lagged influence of biophysical drivers. Some researchers have applied a marginal distribution sampling (MDS) algorithm, a standard gap‐filling method for other fluxes, to FCH4 datasets, and others have applied artificial neural networks (ANN) to resolve the challenging characteristics of FCH4. However, there is still no consensus regarding FCH4 gap‐filling methods due to limited comparative research. We are not aware of any applications of machine learning (ML) algorithms beyond ANN to FCH4 datasets. Here, we compare the performance of MDS and three ML algorithms (ANN, random forest [RF], and support vector machine [SVM]) using multiple combinations of ancillary variables. In addition, we applied principal component analysis (PCA) as an input to the algorithms to address multidriver dependency of FCH4 and reduce the internal complexity of the algorithmic structures. We applied this approach to five benchmark FCH4 datasets from both natural and managed systems located in temperate and tropical wetlands and rice paddies. Results indicate that PCA improved the performance of MDS compared to traditional inputs. ML algorithms performed better when using all available biophysical variables compared to using PCA‐derived inputs. Overall, RF was found to outperform other techniques for all sites. We found that gap‐filling uncertainty is much larger than measurement uncertainty in the accumulated CH4 budget.
Therefore, the approach used for FCH4 gap filling can have important implications for characterizing annual ecosystem‐scale methane budgets, the accuracy of which is important for evaluating natural and managed systems and their interactions with global change processes.
6.
We have used a 692-case dataset, collected retrospectively by a single observer, to develop decision support systems for the cytodiagnosis of fine needle aspirates of breast lesions. In this study, we use a 322-case dataset that was prospectively collected by multiple observers in a working clinical environment to test two predictive systems, using logistic regression and the multilayer perceptron (MLP) type of neural network. Ten observed features and the patient age were used as input features. The systems were developed using a training set and test set from the single observer dataset and then applied to the multiple observer dataset. For the independent test cases from the single observer dataset, with a threshold set for no false positives on the training set, logistic regression produced a sensitivity of 82% (95% confidence interval 73-91) and a predictive value of a positive result (PV+) of 98% (95-99); the values for the MLP were 79% (69-89) and 100%, respectively. However, the performance on the prospective multiple observer dataset was much worse, with a sensitivity of 72% (65-80) and PV+ of 97% (94-99) for logistic regression, and 67% (60-75) and 91% (85-97) for the MLP. These results suggest that there is considerable interobserver variability for the defined features and that this system is unsuitable for further development in the clinical environment unless this problem can be overcome.
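The evaluation procedure in this abstract, where a threshold is chosen so that the training set has no false positives and sensitivity and PV+ are then measured on other cases, can be sketched as below. The function names and the toy scores are hypothetical; the sketch assumes each case has a model score in which higher means more likely malignant, and it does not reproduce the confidence intervals reported in the abstract.

```python
from typing import Sequence, Tuple


def threshold_for_no_false_positives(scores: Sequence[float],
                                     labels: Sequence[int]) -> float:
    """Return the highest score among negative (label 0) training cases.

    Calling a case positive only when its score is strictly above this
    value gives zero false positives on the training set.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("training set contains no negative cases")
    return max(negatives)


def sensitivity_and_ppv(scores: Sequence[float], labels: Sequence[int],
                        threshold: float) -> Tuple[float, float]:
    """Compute sensitivity TP/(TP+FN) and PV+ TP/(TP+FP) at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp + fn == 0:
        raise ValueError("no positive cases to evaluate")
    sensitivity = tp / (tp + fn)
    # PV+ is undefined when nothing is called positive; report NaN then.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy data only, not taken from the study.
    train_scores, train_labels = [0.1, 0.4, 0.35, 0.8, 0.9, 0.3], [0, 0, 1, 1, 1, 0]
    test_scores, test_labels = [0.5, 0.2, 0.7, 0.45], [0, 1, 1, 1]
    cut = threshold_for_no_false_positives(train_scores, train_labels)
    print(sensitivity_and_ppv(test_scores, test_labels, cut))
```

A threshold fixed on one observer's data need not give zero false positives on data collected by others, which is one way the drop in PV+ on the multiple observer dataset can arise.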