首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
As one of the most common post-translational modifications, ubiquitination regulates the quantity and function of a variety of proteins. Experimental and clinical investigations have also suggested the crucial roles of ubiquitination in several human diseases. The complicated sequence context of human ubiquitination sites revealed by proteomic studies highlights the need of developing effective computational strategies to predict human ubiquitination sites. Here we report the establishment of a novel human-specific ubiquitination site predictor through the integration of multiple complementary classifiers. Firstly, a Support Vector Machine (SVM) classier was constructed based on the composition of k-spaced amino acid pairs (CKSAAP) encoding, which has been utilized in our previous yeast ubiquitination site predictor. To further exploit the pattern and properties of the ubiquitination sites and their flanking residues, three additional SVM classifiers were constructed using the binary amino acid encoding, the AAindex physicochemical property encoding and the protein aggregation propensity encoding, respectively. Through an integration that relied on logistic regression, the resulting predictor termed hCKSAAP_UbSite achieved an area under ROC curve (AUC) of 0.770 in 5-fold cross-validation test on a class-balanced training dataset. When tested on a class-balanced independent testing dataset that contains 3419 ubiquitination sites, hCKSAAP_UbSite has also achieved a robust performance with an AUC of 0.757. Specifically, it has consistently performed better than the predictor using the CKSAAP encoding alone and two other publicly available predictors which are not human-specific. Given its promising performance in our large-scale datasets, hCKSAAP_UbSite has been made publicly available at our server (http://protein.cau.edu.cn/cksaap_ubsite/).  相似文献   

2.
Pluripotent stem cells are able to self-renew, and to differentiate into all adult cell types. Many studies report data describing these cells, and characterize them in molecular terms. Machine learning yields classifiers that can accurately identify pluripotent stem cells, but there is a lack of studies yielding minimal sets of best biomarkers (genes/features). We assembled gene expression data of pluripotent stem cells and non-pluripotent cells from the mouse. After normalization and filtering, we applied machine learning, classifying samples into pluripotent and non-pluripotent with high cross-validated accuracy. Furthermore, to identify minimal sets of best biomarkers, we used three methods: information gain, random forests and a wrapper of genetic algorithm and support vector machine (GA/SVM). We demonstrate that the GA/SVM biomarkers work best in combination with each other; pathway and enrichment analyses show that they cover the widest variety of processes implicated in pluripotency. The GA/SVM wrapper yields best biomarkers, no matter which classification method is used. The consensus best biomarker based on the three methods is Tet1, implicated in pluripotency just recently. The best biomarker based on the GA/SVM wrapper approach alone is Fam134b, possibly a missing link between pluripotency and some standard surface markers of unknown function processed by the Golgi apparatus.  相似文献   

3.
The complete genome of severe acute respiratory syndrome coronavirus (SARS-CoV) reveals the existence of putative proteins unique to SARS-CoV. Identification of their function facilitates a mechanistic understanding of SARS infection and drug development for its treatment. The sequence of the majority of these putative proteins has no significant similarity to those of known proteins, which complicates the task of using sequence analysis tools to probe their function. Support vector machines (SVM), useful for predicting the functional class of distantly related proteins, is employed to ascribe a possible functional class to SARS-CoV proteins. Testing results indicate that SVM is able to predict the functional class of 73% of the known SARS-CoV proteins with available sequences and 67% of 18 other novel viral proteins. A combination of the sequence comparison method BLAST and SVMProt can further improve the prediction accuracy of SMVProt such that the functional class of two additional SARS-CoV proteins is correctly predicted. Our study suggests that the SARS-CoV genome possibly contains a putative voltage-gated ion channel, structural proteins, a carbon-oxygen lyase, oxidoreductases acting on the CH-OH group of donors, and an ATP-binding cassette transporter. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi .  相似文献   

4.
《Fly》2013,7(3):164-171
The availability of complete genome sequence information for diverse organisms including model genetic organisms has ushered in a new era of protein sequence comparisons making it possible to search for commonalities among entire proteomes using the Basic Local Alignment Search Tool (BLAST). Although the identification and analysis of proteins shared by humans and model organisms has proven an invaluable tool to understanding gene function, the sets of proteins unique to a given model organism's proteome have remained largely unexplored. We have constructed a searchable database that allows biologists to identify proteins unique to a given proteome. The Negative Proteome Database (NPD) is populated with pair-wise protein sequence comparisons between each of the following proteomes: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Dictyostelium discoideum, Chlamydomonus reinhardti, Escherichia coli K12, Arabidopsis thaliana and Methanoscarcina acetivorans. Our analysis of negative proteome datasets using the NPD has thus far revealed 107 proteins in humans that may be involved in motile cilia function, 1628 potential pesticide target proteins in flies, 659 proteins shared by flies and humans that are not represented in the less neurologically complex worm proteome, and 180 nuclear encoded human disease associated proteins that are absent from the fly proteome. The NPD is the only online resource where users can quickly perform complex negative and positive comparisons of model organism proteomes. We anticipate that the NPD and the adaptable algorithm which can readily be used to duplicate this analysis on custom sets of proteomes will be an invaluable tool in the investigation of organism specific protein sets.  相似文献   

5.
A number of fungal and oomycete plant pathogens of major economic importance feed on their hosts by means of haustoria, which they place inside living plant cells. The underlying mechanisms are poorly understood, partly due to difficulty in preparing haustoria. We have therefore developed a procedure for isolating haustoria from the barley powdery mildew fungus (Blumeria graminis f.sp. hordei, Bgh). We subsequently aimed to understand the molecular mechanisms of haustoria through a study of their proteome. Extracted proteins were digested using trypsin, separated by LC, and analysed by MS/MS. Searches of a custom Bgh EST sequence database and the NCBI‐NR fungal protein database, using the MS/MS data, identified 204 haustoria proteins. The majority of the proteins appear to have roles in protein metabolic pathways and biological energy production. Surprisingly, pyruvate decarboxylase (PDC), involved in alcoholic fermentation and commonly abundant in fungi and plants, was absent in our Bgh proteome data set. A sequence encoding this enzyme was also absent in our EST sequence database. Significantly, BLAST searches of the recently available Bgh genome sequence data also failed to identify a sequence encoding this enzyme, strongly indicating that Bgh does not have a gene for PDC.  相似文献   

6.
Gundry RL  Burridge PW  Boheler KR 《Proteomics》2011,11(20):3947-3961
Stem cells represent obvious choices for regenerative medicine and are invaluable for studies of human development and drug testing. The proteomic landscape of pluripotent stem cells (PSCs), in particular, is not yet clearly defined; consequently, this field of research would greatly benefit from concerted efforts designed to better characterize these cells. In this concise review, we provide an overview of stem cell potency, highlight the types and practical implications of heterogeneity in PSCs and provide a detailed analysis of the current view of the pluripotent proteome in a unique resource for this rapidly evolving field. Our goal in this review is to provide specific insights into the current status of the known proteome of both mouse and human PSCs. This has been accomplished by integrating published data into a unified PSC proteome to facilitate the identification of proteins, which may be informative for the stem cell state as well as to reveal areas where our current view is limited. These analyses provide insight into the challenges faced in the proteomic analysis of PSCs and reveal one area--the cell surface subproteome--that would especially benefit from enhanced research efforts.  相似文献   

7.
Reiter LT  Do LH  Fischer MS  Hong NA  Bier E 《Fly》2007,1(3):164-171
The availability of complete genome sequence information for diverse organisms including model genetic organisms has ushered in a new era of protein sequence comparisons making it possible to search for commonalities among entire proteomes using the Basic Local Alignment Search Tool (BLAST). Although the identification and analysis of proteins shared by humans and model organisms has proven an invaluable tool to understanding gene function, the sets of proteins unique to a given model organism's proteome have remained largely unexplored. We have constructed a searchable database that allows biologists to identify proteins unique to a given proteome. The Negative Proteome Database (NPD) is populated with pair-wise protein sequence comparisons between each of the following proteomes: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Dictyostelium discoideum, Chlamydomonus reinhardti, Escherichia coli K12, Arabidopsis thaliana and Methanoscarcina acetivorans. Our analysis of negative proteome datasets using the NPD has thus far revealed 107 proteins in humans that may be involved in motile cilia function, 1628 potential pesticide target proteins in flies, 659 proteins shared by flies and humans that are not represented in the less neurologically complex worm proteome, and 180 nuclear encoded human disease associated proteins that are absent from the fly proteome. The NPD is the only online resource where users can quickly perform complex negative and positive comparisons of model organism proteomes. We anticipate that the NPD and the adaptable algorithm which can readily be used to duplicate this analysis on custom sets of proteomes will be an invaluable tool in the investigation of organism specific protein sets.  相似文献   

8.
Identifying protein–protein interactions (PPIs) is critical for understanding the cellular function of the proteins and the machinery of a proteome. Data of PPIs derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction dataset for predicting PPIs. A sequence-based method is proposed by combining correlation coefficient (CC) transformation and support vector machine (SVM). CC transformation not only adequately considers the neighboring effect of protein sequence but describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset MIPS Core and a gold standard negatives (non-interacting) dataset GO-NEG of yeast Saccharomyces cerevisiae were mined to objectively evaluate the above method and attenuate the bias. The SVM model combined with CC transformation yielded the best performance with a high accuracy of 87.94% using gold standard positives and gold standard negatives datasets. The source code of MATLAB and the datasets are available on request under smgsmg@mail.ustc.edu.cn.  相似文献   

9.
Glycosylation is one of the most abundant and an important post-translational modification of proteins. Glycosylated proteins (glycoproteins) are involved in various cellular biological functions like protein folding, cell-cell interactions, cell recognition and host-pathogen interactions. A large number of eukaryotic glycoproteins also have therapeutic and potential technology applications. Therefore, characterization and analysis of glycosites (glycosylated residues) in these proteins is of great interest to biologists. In order to cater these needs a number of in silico tools have been developed over the years, however, a need to get even better prediction tools remains. Therefore, in this study we have developed a new webserver GlycoEP for more accurate prediction of N-linked, O-linked and C-linked glycosites in eukaryotic glycoproteins using two larger datasets, namely, standard and advanced datasets. In case of standard datasets no two glycosylated proteins are more similar than 40%; advanced datasets are highly non-redundant where no two glycosites’ patterns (as defined in methods) have more than 60% similarity. Further, based on our results with several algorihtms developed using different machine-learning techniques, we found Support Vector Machine (SVM) as optimum tool to develop glycosite prediction models. Accordingly, using our more stringent and non-redundant advanced datasets, the SVM based models developed in this study achieved a prediction accuracy of 84.26%, 86.87% and 91.43% with corresponding MCC of 0.54, 0.20 and 0.78, for N-, O- and C-linked glycosites, respectively. The best performing models trained on advanced datasets were then implemented as a user-friendly web server GlycoEP (http://www.imtech.res.in/raghava/glycoep/). Additionally, this server provides prediction models developed on standard datasets and allows users to scan sequons in input protein sequences.  相似文献   

10.
MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Zlv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, a 3-phosphoglycerate-kinase sequences selected from archaean, bacterial and eukaryotic species as well as low and high-complexity sequence segments of the human proteome, CBMs values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.  相似文献   

11.
Gao J  Faraggi E  Zhou Y  Ruan J  Kurgan L 《PloS one》2012,7(6):e40104
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of the B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some method that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred, method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction at 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which is significantly and slightly better than the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.  相似文献   

12.
MOTIVATION: Mutagenicity is among the toxicological end points that pose the highest concern. The accelerated pace of drug discovery has heightened the need for efficient prediction methods. Currently, most available tools fall short of the desired degree of accuracy, and can only provide a binary classification. It is of significance to develop a discriminative and informative model for the mutagenicity prediction. RESULTS: Here we developed a mutagenic probability prediction model addressing the problem, based on datasets covering a large chemical space. A novel molecular electrophilicity vector (MEV) is first devised to represent the structure profile of chemical compounds. An extended support vector machine (SVM) method is then used to derive the posterior probabilistic estimation of mutagenicity from the MEVs of the training set. The results show that our model gives a better performance than TOPKAT (http://www.accelrys.com) and other previously published methods. In addition, a confidence level related to the prediction can be provided, which may help people make more flexible decisions on chemical ordering or synthesis. AVAILABILITY: The binary program (ZGTOX_1.1) based on our model and samples of input datasets on Windows PC are available at http://dddc.ac.cn/adme upon request from the authors.  相似文献   

13.
To improve the utility of increasingly large numbers of available unannotated and initially poorly annotated genomic sequences for proteome analysis, we demonstrate that effective protein identification can be made on a large and unannotated genome. The strategy developed is to translate the unannotated genome sequence into amino acid sequence encoding putative proteins in all six reading frames, to identify peptides by tandem mass spectrometry (MS/MS), to localize them on the genome sequence, and to preliminarily annotate the protein via a similarity search by BLAST. These tasks have been optimized and automated. Optimization to obtain multiple peptide matches in effect extends the searchable region and results in more robust protein identification. The viability of this strategy is demonstrated with the identification of 223 cilia proteins in the unicellular eukaryotic model organism Tetrahymena thermophila, whose initial genomic sequence draft was released in November 2003. To the best of our knowledge, this is the first demonstration of large-scale protein identification based on such a large, unannotated genome. Of the 223 cilia proteins, 84 have no similarity to proteins in NCBI's nonredundant (nr) database. This methodology allows identifying the locations of the genes encoding these novel proteins, which is a necessary first step to downstream functional genomic experimentation.  相似文献   

14.
Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence. Then they are combined into a single predictor using support vector machine. What is more important, the domain detection is first taken as an imbal- anced data learning problem. A novel undersampling method is proposed on distance-based maximal entropy in the feature space of Support Vector Machine (SVM). The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in the machine learning system on general im- balanced datasets.  相似文献   

15.
The identification of catalytic residues is an essential step in functional characterization of enzymes. We present a purely structural approach to this problem, which is motivated by the difficulty of evolution-based methods to annotate structural genomics targets that have few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z scoring. Special attention is paid to the class imbalance problem that stems from the overwhelming number of non-catalytic residues in enzymes compared to catalytic residues. This problem is tackled by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non-catalytic residues; (2) under-sampling non-catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more than errors in learning non-catalytic residues. Tested on four enzyme datasets, one specifically designed by us to mimic the structural genomics scenario and three previously evaluated datasets, our structure-based classifier is never inferior to similar structure-based classifiers and comparable to classifiers that use both structural and evolutionary features. In addition to the evaluation of the performance of catalytic residue identification, we also present detailed case studies on three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, our own-designed database, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.  相似文献   

16.
The identification of virulent proteins in any de-novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenome of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both the above tasks is the identification of virulent proteins since a significant proportion of genomic and metagenomic proteins are novel and yet unannotated. The currently available tools which carry out the identification of virulent proteins provide limited accuracy and cannot be used on large datasets. Therefore, we have developed an MP3 standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 is developed using an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out highly fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.  相似文献   

17.
Lo SL  Cai CZ  Chen YZ  Chung MC 《Proteomics》2005,5(4):876-884
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparison of the results derived from the use of real protein sequences with that derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty for separating two sets of real protein sequences than that for separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training a SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. The prediction results are consistent with observations, suggesting that real sequence is more practically useful in development of SVM classification system for facilitating protein-protein interaction prediction.  相似文献   

18.
19.
20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号