Similar literature
1.
MOTIVATION: A method for recognizing the three-dimensional fold from the protein amino acid sequence based on a combination of hidden Markov models (HMMs) and secondary structure prediction was recently developed for proteins in the Mainly-Alpha structural class. Here, this methodology is extended to Mainly-Beta and Alpha-Beta class proteins. Compared to other fold recognition methods based on HMMs, this approach is novel in that only secondary structure information is used. Each HMM is trained from known secondary structure sequences of proteins having a similar fold. Secondary structure prediction is performed for the amino acid sequence of a query protein. The predicted fold of a query protein is the fold described by the model fitting the predicted sequence the best. RESULTS: After model cross-validation, the success rate on 44 test proteins covering the three structural classes was found to be 59%. On seven fold predictions performed prior to the publication of experimental structure, the success rate was 71%. In conclusion, this approach manages to capture important information about the fold of a protein embedded in the length and arrangement of the predicted helices, strands and coils along the polypeptide chain. When a more extensive library of HMMs representing the universe of known structural families is available (work in progress), the program will allow rapid screening of genomic databases and sequence annotation when fold similarity is not detectable from the amino acid sequence. AVAILABILITY: FORESST web server at http://absalpha.dcrt.nih.gov:8008/ for the library of HMMs of structural families used in this paper. FORESST web server at http://www.tigr.org/ for a more extensive library of HMMs (work in progress). CONTACT: valedf@tigr.org; munson@helix.nih.gov; garnier@helix.nih.gov
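The selection step described in this abstract (pick the fold whose model fits the predicted secondary-structure string best) can be sketched in Python. This is only an illustration under stated assumptions: the two-state models, their probabilities, and the names `helix_rich` and `strand_rich` are invented, and the forward algorithm is used here as one standard way to score fit, not necessarily the scoring used by FORESST.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(seq, start, trans, emit):
    """Log P(seq | model) for a discrete HMM, via the forward algorithm.

    All probabilities are assumed to be strictly positive.
    """
    states = list(start)
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for symbol in seq[1:]:
        alpha = {
            s: log_sum_exp([alpha[p] + math.log(trans[p][s]) for p in states])
            + math.log(emit[s][symbol])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))

# Hypothetical toy models over the alphabet H (helix), E (strand), C (coil).
shared = {
    "start": {"a": 0.5, "b": 0.5},
    "trans": {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}},
}
models = {
    "helix_rich": {**shared, "emit": {"a": {"H": 0.80, "E": 0.05, "C": 0.15},
                                      "b": {"H": 0.10, "E": 0.10, "C": 0.80}}},
    "strand_rich": {**shared, "emit": {"a": {"H": 0.05, "E": 0.80, "C": 0.15},
                                       "b": {"H": 0.10, "E": 0.10, "C": 0.80}}},
}

predicted = "CHHHHHHCCHHHHHC"  # made-up predicted secondary structure
scores = {
    name: forward_log_likelihood(predicted, m["start"], m["trans"], m["emit"])
    for name, m in models.items()
}
best = max(scores, key=scores.get)  # fold whose model fits best
```

Working in log space avoids underflow on long sequences; a real fold library would hold one trained model per structural family rather than two hand-written ones.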

2.
Information on protein subcellular localization is vitally important for an in-depth understanding of the intricate pathways that regulate biological processes at the cellular level. With the rapidly increasing number of newly found protein sequences in the Post-Genomic Age, many automated methods have been developed to help annotate their subcellular locations in a timely manner. However, very few of them were developed using protein-protein interaction (PPI) network information. In this paper, we introduce a new concept called "tethering potential" by which the PPI information can be effectively fused into the formulation for protein samples. Based on such a network frame, a new predictor called Yeast-PLoc has been developed for identifying budding yeast proteins among their 19 subcellular location sites. Meanwhile, a purely sequence-based approach, called the "hybrid-property" method, is integrated into Yeast-PLoc as a fall-back to deal with those proteins without sufficient PPI information. The overall success rate by the jackknife test on the 4,683 yeast proteins in the training dataset was 70.25%. Furthermore, the success rate of Yeast-PLoc on an independent dataset was remarkably higher than those of some other existing predictors, indicating that incorporating the PPI information is quite promising. As a user-friendly web-server, Yeast-PLoc is freely accessible at http://yeastloc.biosino.org/.
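The jackknife test quoted in this abstract is leave-one-out cross-validation: each protein is held out in turn, a prediction is made from all the others, and the success rate is the fraction of held-out proteins assigned correctly. A minimal Python sketch, assuming a toy 1-nearest-neighbour classifier and invented 2-D feature vectors (neither is the method or data of Yeast-PLoc):

```python
import math

def nearest_label(query, samples):
    """Return the label of the sample closest to `query` (Euclidean distance)."""
    best = min(samples, key=lambda s: math.dist(query, s[0]))
    return best[1]

def jackknife_success_rate(samples):
    """Leave-one-out test: predict each sample from all the others."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_label(features, rest) == label:
            hits += 1
    return hits / len(samples)

# Hypothetical feature vectors with made-up location labels.
toy = [
    ((0.0, 0.1), "nucleus"),
    ((0.1, 0.0), "nucleus"),
    ((0.2, 0.2), "nucleus"),
    ((1.0, 1.1), "cytoplasm"),
    ((1.1, 0.9), "cytoplasm"),
    ((0.3, 0.3), "cytoplasm"),  # deliberately placed near the other class
]
rate = jackknife_success_rate(toy)
```

Because every sample is tested exactly once and never against itself, the result is deterministic for a given dataset, which is one reason this test is popular for comparing predictors.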

3.
By incorporating information on gene ontology, functional domain, and sequential evolution, a new predictor called Gneg-mPLoc was developed. It can be used to identify Gram-negative bacterial proteins among the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. It can also deal with the case in which a query protein simultaneously exists in more than one location. Compared with the original predictor, Gneg-PLoc, the new predictor is much more powerful and flexible. For a newly constructed stringent benchmark dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (location), the overall jackknife success rate achieved by Gneg-mPLoc was 85.5%, which was more than 14% higher than the corresponding rate of Gneg-PLoc. As a user-friendly web-server, Gneg-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/.

4.
To understand the molecular basis of glycosyltransferases' (GTFs) catalytic mechanism, extensive structural information is required. Here, fold recognition methods were employed to assign 3D protein shapes (folds) to the currently known GTF sequences, available in public databases such as GenBank and Swissprot. First, GTF sequences were retrieved and classified into clusters, based on sequence similarity only. Intracluster sequence similarity was chosen sufficiently high to ensure that the same fold is found within a given cluster. Then, a representative sequence from each cluster was selected to compose a subset of GTF sequences. The members of this reduced set were processed by three different fold recognition methods: 3D-PSSM, FUGUE, and GeneFold. Finally, the results from different fold recognition methods were analyzed and compared to sequence-similarity search methods (i.e., BLAST and PSI-BLAST). It was established that the folds of about 70% of all currently known GTF sequences can be confidently assigned by fold recognition methods, a value which is higher than the fold identification rate based on sequence comparison alone (48% for BLAST and 64% for PSI-BLAST). The identified folds were submitted to 3D clustering, and we found that most of the GTF sequences adopt the typical GTF A or GTF B folds. Our results indicate a lack of evidence that new GTF folds (i.e., folds other than GTF A and B) exist. Based on cases where fold identification was not possible, we suggest several sequences as the most promising targets for a structural genomics initiative focused on the GTF protein family.

5.
Lin WZ  Fang JA  Xiao X  Chou KC 《PloS one》2011,6(9):e24756
DNA-binding proteins play crucial roles in various cellular processes. Developing high-throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating into the general form of pseudo amino acid composition the features extracted from protein sequences via the "grey model", and by adopting the random forest operation engine, we propose a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding or non-DNA-binding proteins based on their amino acid sequence information alone. The overall success rate of iDNA-Prot was 83.96%, obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset. In addition to achieving a high success rate, the computational time for iDNA-Prot is remarkably shorter than that of the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high-throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.

6.
MOTIVATION: The success of the consensus approach to the protein structure prediction problem has led to development of several different consensus methods. Most of them only rely on a structural comparison of a number of different models. However, there are other types of information that might be useful such as the score from the server and structural evaluation. RESULTS: Pcons5 is a new and improved version of the consensus predictor Pcons. Pcons5 integrates information from three different sources: the consensus analysis, structural evaluation and the score from the fold recognition servers. We show that Pcons5 is better than the previous version of Pcons and that it performs better than using only the consensus analysis. In addition, we also present a version of Pmodeller based on Pcons5, which performs significantly better than Pcons5. AVAILABILITY: Pcons5 is the first Pcons version available as a standalone program from http://www.sbc.su.se/~bjorn/Pcons5. It should be easy to implement in local meta-servers.

7.
李军锋  李海峰  宋艳画  孙燕  张家骅 《遗传》2005,27(5):797-800
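A simple method for preparing DNA templates from single oocytes and early embryos, KOH/DTT-Triton X lysis, was established, and its PCR amplification efficiency was compared with that of the TE-proteinase K method. When single oocytes, 2-cell embryos, 8-cell embryos, morulae, or blastocysts were treated by KOH/DTT-Triton X lysis and used directly as DNA templates for PCR amplification of mitochondrial DNA fragments, the overall PCR success rate for three primer pairs was 100% (70/70), whereas the overall PCR success rate for single oocytes treated by the TE-proteinase K method was 92.9% (65/70); the difference was significant (P<0.05). The PCR false-positive rate for templates prepared by both methods was 0. The KOH/DTT-Triton X lysis method designed in this study is an effective way to prepare DNA templates from single early embryos: a clear target DNA band is obtained after a single round of PCR amplification, which meets the needs of genetic testing of early embryos.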

8.
Wang X  Li GZ 《PloS one》2012,7(5):e36317
Subcellular locations of proteins are important functional attributes. An effective and efficient subcellular localization predictor is necessary for rapidly and reliably annotating subcellular locations of proteins. Most existing subcellular localization methods can only deal with single-location proteins. In fact, proteins may simultaneously exist at, or move between, two or more different subcellular locations. To better reflect the characteristics of multiplex proteins, it is highly desirable to develop new methods for dealing with them. In this paper, a new predictor called Euk-ECC-mPLoc has been developed that can deal with systems containing both singleplex and multiplex eukaryotic proteins. It introduces a powerful multi-label learning approach that exploits correlations between subcellular locations and hybridizes gene ontology with dipeptide composition information. It can be used to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centrosome, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole. Experimental results on a stringent benchmark dataset of eukaryotic proteins by the jackknife cross-validation test show that the average success rate and overall success rate obtained by Euk-ECC-mPLoc were 69.70% and 81.54%, respectively, indicating that our approach is quite promising. In particular, the success rates achieved by Euk-ECC-mPLoc for small subsets were remarkably improved, indicating that it holds a high potential for stimulating the development of the area.
As a user-friendly web-server, Euk-ECC-mPLoc is freely accessible to the public at the website http://levis.tongji.edu.cn:8080/bioinfo/Euk-ECC-mPLoc/. We believe that Euk-ECC-mPLoc may become a useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying subcellular locations of eukaryotic proteins.

9.
Predicting subcellular localization of human proteins is a challenging problem, particularly when query proteins may have a multiplex character, i.e., simultaneously exist at, or move between, two or more different subcellular location sites. In a previous study, we developed a predictor called “Hum-mPLoc” to deal with the multiplex problem for the human protein system. However, Hum-mPLoc has the following shortcomings. (1) The accession number of a query protein is required as input in order to obtain a higher expected success rate through the higher-level prediction pathway; but many proteins, such as synthetic and hypothetical proteins as well as newly discovered proteins not yet deposited into databanks, do not have accession numbers. (2) Neither functional domain nor sequential evolution information was taken into account in Hum-mPLoc, and hence its power may be reduced accordingly. In view of this, a top-down strategy to address these shortcomings has been implemented. The new predictor thus obtained is called Hum-mPLoc 2.0, for which the accession number is no longer needed as input. Moreover, both the functional domain information and the sequential evolution information have been fused into the predictor by an ensemble classifier. As a consequence, the prediction power has been significantly enhanced. The web server of Hum-mPLoc 2.0 is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/hum-multi-2/.

10.
Hu L  Huang T  Shi X  Lu WC  Cai YD  Chou KC 《PloS one》2011,6(1):e14556

Background

With the huge amount of uncharacterized protein sequences generated in the post-genomic age, it is highly desirable to develop effective computational methods for quickly and accurately predicting their functions. The information thus obtained would be very useful for both basic research and drug development in a timely manner.

Methodology/Principal Findings

Although many efforts have been made in this regard, most of them were based on either sequence similarity or protein-protein interaction (PPI) information. However, the former often fails if a query protein has little or no sequence similarity to any protein of known function, while the latter has a similar problem if the relevant PPI information is not available. In view of this, a new approach is proposed by hybridizing the PPI information and the biochemical/physicochemical features of protein sequences. The overall first-order success rates by the new predictor for the functions of mouse proteins on the training set and test set were 69.1% and 70.2%, respectively, and the success rate covered by the results of the top-4 order from a total of 24 orders was 65.2%.

Conclusions/Significance

The results indicate that the new approach is quite promising and may open a new avenue for addressing this difficult and complicated problem.

11.
Being the largest family of cell surface receptors, G-protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. The functions of many GPCRs are unknown, and it is both time-consuming and expensive to determine their ligands and signaling pathways. This poses a critical challenge: how to develop an automated method for classifying the family of GPCRs so as to help in classifying drugs and expedite the process of drug discovery. Owing to their highly divergent nature, it is difficult to predict the classification of GPCRs by means of conventional sequence alignment approaches. To cope with this situation, the CD (Covariant Discriminant) predictor was introduced to predict the families of GPCRs. The overall success rate thus obtained by the jackknife test for 1238 GPCRs classified into three main families, i.e., class A-"rhodopsin like", class B-"secretin like", and class C-"metabotrophic/glutamate/pheromone", was over 97%. The high success rate suggests that the CD predictor holds very high potential to become a useful tool for understanding the actions of drugs that target GPCRs and designing new medications with fewer side effects and greater efficacy.

12.
The preparation of probability distribution maps is the first important step in risk assessment and wildfire management. Here we employed Weights-of-Evidence (WOE) Bayesian modeling to investigate the spatial relationship between historical fire events in the Chaharmahal-Bakhtiari Province of Iran, using a wide range of binary predictor variables (i.e., presence or absence of a variable characteristic or condition) that represent topography, climate, and human activities. Model results were used to produce distribution maps of wildfire probability. Our modeling approach is based on the assumption that the probabilities reflect the observed proportions of the total landscape area occupied by the corresponding events (i.e., fire incident or no fire) and conditions (i.e., classes) of predictor variables. To assess the effect of each predictor variable on model outputs, we excluded each variable in turn during calculations. The results were validated and compared by the receiver operating characteristic (ROC) using both success rate and prediction rate curves. Seventy percent of fire events were used for the former, while the remainder was used for the latter. The validation results showed that the area under the curves (AUC) for success and prediction rates of the model that included all thirteen predictor variables that represent topography, climate, and human influences were 84.6 and 80.4%, respectively. The highest AUC for success and prediction rates (86.8 and 84.6%) were achieved when the altitude variable was excluded from the analysis. We found slightly decreased AUC values when the slope-aspect and proximity to settlements variables were excluded. These findings clearly demonstrate that the probability of a fire is strongly dependent upon the topographic characteristics of landscapes and, perhaps more importantly, human infrastructure and associated human activities. 
The results from this study may be useful for land use planning, decision-making for wildfire management, and the allocation of fire resources prior to the start of the main fire season.
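For a single binary predictor, the Weights-of-Evidence calculation named in this abstract reduces to two log-ratios of conditional proportions. The Python sketch below uses the standard textbook definitions of W+ and W- with invented cell counts; the function name and the numbers are hypothetical, and all four cells of the contingency table are assumed to be non-zero.

```python
import math

def weights_of_evidence(n_total, n_fire, n_present, n_present_and_fire):
    """Return (W+, W-, contrast) for one binary predictor from cell counts."""
    n_no_fire = n_total - n_fire
    n_absent_and_fire = n_fire - n_present_and_fire
    n_present_no_fire = n_present - n_present_and_fire
    n_absent_no_fire = n_no_fire - n_present_no_fire
    # W+ compares how often the predictor is present with vs. without fire.
    w_plus = math.log(
        (n_present_and_fire / n_fire) / (n_present_no_fire / n_no_fire)
    )
    # W- does the same for the predictor being absent.
    w_minus = math.log(
        (n_absent_and_fire / n_fire) / (n_absent_no_fire / n_no_fire)
    )
    return w_plus, w_minus, w_plus - w_minus

# Made-up landscape: 10,000 cells, 200 with fire, predictor present in 2,000
# cells, 120 of which burned.
w_plus, w_minus, contrast = weights_of_evidence(10_000, 200, 2_000, 120)
```

A positive W+ together with a negative W- means fires are over-represented where the predictor is present; the contrast (W+ minus W-) summarizes the strength of that association.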

13.
MOTIVATION: What constitutes a baseline level of success for protein fold recognition methods? As fold recognition benchmarks are often presented without any thought to the results that might be expected from a purely random set of predictions, an analysis of fold recognition baselines is long overdue. Given varying amounts of basic information about a protein, ranging from the length of the sequence to a knowledge of its secondary structure, to what extent can the fold be determined by intelligent guesswork? Can simple methods that make use of secondary structure information assign folds more accurately than purely random methods, and could these methods be used to construct viable hierarchical classifications? EXPERIMENTS PERFORMED: A number of rapid automatic methods which score similarities between protein domains were devised and tested. These methods ranged from those that incorporated no secondary structure information, such as measuring absolute differences in sequence lengths, to more complex alignments of secondary structure elements. Each method was assessed for accuracy by comparison with the Class Architecture Topology Homology (CATH) classification. Methods were rated against both a random baseline fold assignment method as a lower control and FSSP as an upper control. Similarity trees were constructed in order to evaluate the accuracy of optimum methods at producing a classification of structure. RESULTS: Using a rigorous comparison of methods with CATH, the random fold assignment method set a lower baseline of 11% true positives allowing for 3% false positives, and FSSP set an upper benchmark of 47% true positives at 3% false positives. The optimum secondary structure alignment method used here achieved 27% true positives at 3% false positives. Using a less rigorous Critical Assessment of Structure Prediction (CASP)-like sensitivity measurement, the random assignment achieved 6%, FSSP 59%, and the optimum secondary structure alignment method 32%.
Similarity trees produced by the optimum method illustrate that these methods cannot be used alone to produce a viable protein structural classification system. CONCLUSIONS: Simple methods that use perfect secondary structure information to assign folds cannot produce an accurate protein taxonomy; however, they do provide useful baselines for fold recognition. In terms of a typical CASP assessment, our results suggest that approximately 6% of targets with folds in the databases could be assigned correctly by random guessing, and as many as 32% could be recognised by trivial secondary structure comparison methods, given knowledge of their correct secondary structures.
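Figures such as "27% true positives at 3% false positives" come from sweeping a score threshold and reading off the true-positive rate at the point where the false-positive rate reaches a fixed cap. The abstract does not spell out its exact procedure, so the Python sketch below is one plausible reading: the function name, the toy scores, and the assumptions that scores are distinct and that both classes are non-empty are all illustrative.

```python
def tpr_at_fpr(scored, max_fpr):
    """Highest true-positive rate reachable while the false-positive rate
    stays at or below `max_fpr`.

    `scored` is a list of (score, is_true_match) pairs; higher scores mean
    more confident matches.
    """
    positives = sum(1 for _, ok in scored if ok)
    negatives = len(scored) - positives
    tp = fp = 0
    best = 0.0
    # Lower the threshold one prediction at a time, most confident first.
    for _, ok in sorted(scored, key=lambda x: x[0], reverse=True):
        if ok:
            tp += 1
        else:
            fp += 1
        if fp / negatives > max_fpr:
            break
        best = tp / positives
    return best

# Made-up similarity scores for eight domain pairs (True = same fold).
pairs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
         (0.5, False), (0.4, False), (0.3, True), (0.2, False)]
strict = tpr_at_fpr(pairs, 0.0)    # no false positives allowed
relaxed = tpr_at_fpr(pairs, 0.25)  # up to 25% of negatives may be accepted
```

Comparing methods at the same false-positive cap, as the abstract does for the random baseline, the secondary structure method, and FSSP, keeps the comparison fair even when the methods' raw scores are on different scales.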

14.
The analysis of biological information from protein sequences is important for the study of cellular functions and interactions, and protein fold recognition plays a key role in the prediction of protein structures. Unfortunately, the prediction of protein fold patterns is challenging due to the existence of compound protein structures. Here, we processed the latest release of the Structural Classification of Proteins (SCOP, version 1.75) database and exploited novel techniques to increase the accuracy of protein fold classification. The techniques proposed in this paper include ensemble classification and a hierarchical framework. In the first layer, similar or redundant sequences are deleted in two ways, and a set of base classifiers, fused by various selection strategies, divides the input into seven classes. In the second layer, an analogous ensemble method is adopted to predict all protein folds. To our knowledge, this is the first time all protein folds have been detected hierarchically. Our method achieved a success rate of 74.21%, which is much higher than the results obtained with previous methods (ranging from 45.6% to 70.5%). When applied to the second layer of classification, the prediction accuracy was between 23.13% and 46.05%. This value, although not remarkably high, is encouraging compared to the relatively low counts of proteins from most fold recognition programs. The web server Hierarchical Protein Fold Prediction (HPFP) is available at http://datamining.xmu.edu.cn/software/hpfp.

15.
The cell cytosol is crowded with macromolecules such as proteins, nucleic acids, and membranes. The consequences of such crowding remain unclear. How is the rate of a typical enzymatic reaction, involving a freely diffusing enzyme and substrate, affected by the presence of macromolecules of different sizes, shapes, and concentrations? Here, we mimic cytosolic crowding in vitro, using dextrans and Ficolls, for the first time in a variety of sizes ranging from 15 to 500 kDa, in a concentration range of 0–30% w/w. Alkaline phosphatase–catalyzed hydrolysis of p-nitrophenyl phosphate (PNPP) was chosen as the model reaction. A pronounced decrease in the rate with increasing fractional volume occupancy of dextran is observed for larger dextrans (200 and 500 kDa), in contrast to smaller dextrans (15–70 kDa). Our results indicate that, at 20% w/w, smaller dextrans (15–70 kDa) reduce the initial rate moderately (1.4- to 2.4-fold slowing), while larger dextrans (>200 kDa) slow the reaction considerably (>5-fold). Ficolls (70 and 400 kDa) slow the reaction moderately (1.3- to 2.3-fold). The influence of smaller dextrans was accounted for by a combination of an increase in viscosity as sensed by PNPP and a minor offsetting increase in enzyme activity due to crowding. Larger dextrans apparently reduce the frequency of enzyme-substrate encounters. The reduced influence of Ficolls is attributed to their compact and quasispherical shape, much unlike the dextrans. © 2006 Wiley Periodicals, Inc. Biopolymers 83: 477–486, 2006

16.
By introducing the "multi-layer scale", as well as hybridizing gene ontology information with sequential evolution information, a novel predictor called iLoc-Gpos has been developed for predicting the subcellular localization of Gram-positive bacterial proteins with both single-location and multiple-location sites. To facilitate comparison, the same stringent benchmark dataset used to estimate the accuracy of Gpos-mPLoc was adopted to demonstrate the power of iLoc-Gpos. The dataset contains 519 Gram-positive bacterial proteins classified into the following four subcellular locations: (1) cell membrane, (2) cell wall, (3) cytoplasm, and (4) extracell; none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (subcellular location). The overall success rate of iLoc-Gpos by the jackknife test on this stringent benchmark dataset was over 93%, which is about 11% higher than that of Gpos-mPLoc. As a user-friendly web-server, iLoc-Gpos is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Gpos or http://www.jci-bioinfo.cn/iLoc-Gpos. A step-by-step guide is provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Gpos web-server also accepts batch job submission, which is not available in the existing version of the Gpos-mPLoc web-server.

17.
Shan Y  Wang G  Zhou HX 《Proteins》2001,42(1):23-37
A homology-based structure prediction method ideally gives both a correct fold assignment and an accurate query-template alignment. In this article we show that the combination of two existing methods, PSI-BLAST and threading, leads to significant enhancement in the success rate of fold recognition. The combined approach, termed COBLATH, also yields much higher alignment accuracy than found in previous studies. It consists of two-way searches both by PSI-BLAST and by threading. In the PSI-BLAST portion, a query is used to search for hits in a library of potential templates and, conversely, each potential template is used to search for hits in a library of queries. In the threading portion, the scoring function is the sum of a sequence profile and a 6×6 substitution matrix between predicted query and known template secondary structure and solvent exposure. "Two-way" in threading means that the query's sequence profile is used to match the sequences of all potential templates and the sequence profiles of all potential templates are used to match the query's sequence. When tested on a set of 533 nonhomologous proteins, COBLATH was able to assign folds for 390 (73%). Among these 390 queries, 265 (68%) had root-mean-square deviations (RMSDs) of less than 8 Å between predicted and actual structures. Such a high success rate and accuracy make COBLATH an ideal tool for structural genomics.
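The RMSD figure quoted in this abstract measures how far predicted atom positions lie from the actual ones. A minimal Python sketch of the basic formula follows, assuming the two coordinate lists are already paired atom by atom and already superposed; the abstract does not say which atoms or which superposition procedure COBLATH used, and the coordinates below are invented.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) points, in the same units as the input (e.g. angstroms).

    No superposition is performed: the structures are assumed to be
    already aligned in space.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Two made-up atom positions per structure.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
actual = [(0.0, 0.0, 2.0), (1.0, 2.0, 0.0)]
deviation = rmsd(predicted, actual)
```

In practice an optimal rigid-body superposition is normally applied before this calculation, so the value computed here is an upper bound on the superposed RMSD for the same atom pairing.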

18.
To understand the networks in living cells, it is indispensable to identify protein-protein interactions on a genomic scale. Unfortunately, it is both time-consuming and expensive to do so based solely on experiments, owing to the overwhelming complexity of the problem. Therefore, developing computational techniques for predicting protein-protein interactions would be of significant value in this regard. By fusing the approach based on gene ontology with the approach of pseudo-amino acid composition, a predictor called "GO-PseAA" was established to deal with this problem. As a showcase, prediction was performed on 6323 protein pairs from yeast. To avoid redundancy and homology bias, none of the protein pairs investigated has ≥40% sequence identity with any other. The overall success rate obtained by jackknife cross-validation was 81.6%, indicating that the GO-PseAA predictor is very promising for predicting protein-protein interactions from protein sequences, and might become a useful vehicle for studying network biology in the postgenomic era.

19.
20.
Getz G  Vendruscolo M  Sachs D  Domany E 《Proteins》2002,46(4):405-415
We present an automated procedure to assign CATH and SCOP classifications to proteins whose FSSP score is available. CATH classification is assigned down to the topology level, and SCOP classification is assigned to the fold level. Because the FSSP database is updated weekly, this method makes it possible to update also CATH and SCOP with the same frequency. Our predictions have a nearly perfect success rate when ambiguous cases are discarded. These ambiguous cases are intrinsic in any protein structure classification that relies on structural information alone. Hence, we introduce the "twilight zone for structure classification." We further suggest that to resolve these ambiguous cases, other criteria of classification, based also on information about sequence and function, must be used.
