首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 8 毫秒
1.
2.
Kawabata T  Nishikawa K 《Proteins》2000,41(1):108-122
A number of automatic protein structure comparison methods have been proposed; however, their similarity score functions are often decided by the researchers' intuition and trial-and-error, and not by theoretical background. We propose a novel theory to evaluate protein structure similarity, which is based on the Markov transition model of evolution. Our similarity score between structures i and j is defined as log P(j --> i)/P(i), where P(j --> i) is the probability that structure j changes to structure i during the evolutionary process, and P(i) is the probability that structure i appears by chance. This is a reasonable definition of structure similarity, especially for finding evolutionarily related (homologous) similarity. The probability P(j --> i) is estimated by the Markov transition model, which is similar to the Dayhoff's substitution model between amino acids. To estimate the parameters of the model, homologous protein structure pairs are collected using sequence similarity, and the numbers of structure transitions within the pairs are counted. Next these numbers are transformed to a transition probability matrix of the Markov transition. Transition probabilities for longer time are obtained by multiplying the probability matrix by itself several times. In this study, we generated three types of structure similarity scores: an environment score, a residue-residue distance score, and a secondary structure elements (SSE) score. Using these scores, we developed the structure comparison program, Matras (MArkovian TRAnsition of protein Structure). It employs a hierarchical alignment algorithm, in which a rough alignment is first obtained by SSEs, and then is improved with more detailed functions. We attempted an all-versus-all comparison of the SCOP database, and evaluated its ability to recognize a superfamily relationship, which was manually assigned to be homologous in the SCOP database. A comparison with the FSSP database shows that our program can recognize more homologous similarity than FSSP. We also discuss the reliability of our method, by studying the disagreement between structural classifications by Matras and SCOP.  相似文献   

3.
The quality of protein function predictions relies on appropriate training of protein classification methods. Performance of these methods can be affected when only a limited number of protein samples are available, which is often the case in divergent protein families. Whereas profile hidden Markov models and PSI-BLAST presented significant performance decrease in such cases, alignment-free partial least-squares classifiers performed consistently better even when used to identify short fragmented sequences.  相似文献   

4.
Qu Y  Guo JT  Olman V  Xu Y 《Nucleic acids research》2004,32(2):551-561
Residual dipolar coupling (RDC) represents one of the most exciting emerging NMR techniques for protein structure studies. However, solving a protein structure using RDC data alone is still a highly challenging problem. We report here a computer program, RDC-PROSPECT, for protein structure prediction based on a structural homolog or analog of the target protein in the Protein Data Bank (PDB), which best aligns with the 15N–1H RDC data of the protein recorded in a single ordering medium. Since RDC-PROSPECT uses only RDC data and predicted secondary structure information, its performance is virtually independent of sequence similarity between a target protein and its structural homolog/analog, making it applicable to protein targets beyond the scope of current protein threading techniques. We have tested RDC-PROSPECT on all 15N–1H RDC data (representing 43 proteins) deposited in the BioMagResBank (BMRB) database. The program correctly identified structural folds for 83.7% of the target proteins, and achieved an average alignment accuracy of 98.1% residues within a four-residue shift.  相似文献   

5.

Background  

This paper considers the problem of identifying pathways through metabolic networks that relate to a specific biological response. Our proposed model, HME3M, first identifies frequently traversed network paths using a Markov mixture model. Then by employing a hierarchical mixture of experts, separate classifiers are built using information specific to each path and combined into an ensemble prediction for the response.  相似文献   

6.
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.  相似文献   

7.
MOTIVATION: Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffrey's prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification. RESULTS: The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar, however the BlogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also demonstrates better estimates of conditional probability than the RVM, which are of great importance in medical applications, with similar computational expense. AVAILABILITY: A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/  相似文献   

8.
A reliable and accurate identification of the type of tumors is crucial to the proper treatment of cancers. In recent years, it has been shown that sparse representation (SR) by l1-norm minimization is robust to noise, outliers and even incomplete measurements, and SR has been successfully used for classification. This paper presents a new SR-based method for tumor classification using gene expression data. A set of metasamples are extracted from the training samples, and then an input testing sample is represented as the linear combination of these metasamples by l1-regularized least square method. Classification is achieved by using a discriminating function defined on the representation coefficients. Since l1-norm minimization leads to a sparse solution, the proposed method is called metasample-based SR classification (MSRC). Extensive experiments on publicly available gene expression data sets show that MSRC is efficient for tumor classification, achieving higher accuracy than many existing representative schemes.  相似文献   

9.
Protein classification artificial neural system.   总被引:2,自引:0,他引:2       下载免费PDF全文
A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large amount of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10-100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to the gene recognition in large sequencing projects.  相似文献   

10.
11.
Classification of newly determined protein structures is important in understanding their function and mechanism of action. Currently available methods employ a global structure alignment strategy and are computationally expensive. We propose a two-step methodology with a quick screen to significantly reduce the number of candidate structures followed by global structure alignment of the query structure with the reduced set. We represent a protein structure as a sequence of local structures, codified in the form of geometric invariants. Geometric invariants are quantities that remain unchanged under transformations such as translation and rotation. Protein structures represented as multi-attribute sequences are aligned via dynamic programming to identify close neighbors of the query structure. The query structure is then compared with this reduced dataset using conventional structure comparison methods to predict its functional class. For a typical protein structure, the screening method was able to reduce the protein data bank to mere 200 proteins while preserving structurally closest neighbor in the reduced set. This has resulted in 30 to 60 fold improvement in the execution time. We present the results of leave-one-out classification experiment on ASTRAL-95 domains and comparison with SCOP classification hierarchy.  相似文献   

12.
Georg Kuenze  Jens Meiler 《Proteins》2019,87(12):1341-1350
Computational methods that produce accurate protein structure models from limited experimental data, for example, from nuclear magnetic resonance (NMR) spectroscopy, hold great potential for biomedical research. The NMR-assisted modeling challenge in CASP13 provided a blind test to explore the capabilities and limitations of current modeling techniques in leveraging NMR data which had high sparsity, ambiguity, and error rate for protein structure prediction. We describe our approach to predict the structure of these proteins leveraging the Rosetta software suite. Protein structure models were predicted de novo using a two-stage protocol. First, low-resolution models were generated with the Rosetta de novo method guided by nonambiguous nuclear Overhauser effect (NOE) contacts and residual dipolar coupling (RDC) restraints. Second, iterative model hybridization and fragment insertion with the Rosetta comparative modeling method was used to refine and regularize models guided by all ambiguous and nonambiguous NOE contacts and RDCs. Nine out of 16 of the Rosetta de novo models had the correct fold (global distance test total score > 45) and in three cases high-resolution models were achieved (root-mean-square deviation < 3.5 å). We also show that a meta-approach applying iterative Rosetta + NMR refinement on server-predicted models which employed non-NMR-contacts and structural templates leads to substantial improvement in model quality. Integrating these data-assisted refinement strategies with innovative non-data-assisted approaches which became possible in CASP13 such as high precision contact prediction will in the near future enable structure determination for large proteins that are outside of the realm of conventional NMR.  相似文献   

13.

Background  

This paper presents a simple method to increase the sensitivity of protein family comparisons by incorporating secondary structure (SS) information. We build upon the effective information theory approach towards profile-profile comparison described in [Yona & Levitt 2002]. Our method augments profile columns using PSIPRED secondary structure predictions and assesses statistical similarity using information theoretical principles.  相似文献   

14.
15.
MOTIVATION: Probabilistic graphical models have been developed in the past for the task of protein classification. In many cases, classifications obtained from the Gene Ontology have been used to validate these models. In this work we directly incorporate the structure of the Gene Ontology into the graphical representation for protein classification. We present a method in which each protein is represented by a replicate of the Gene Ontology structure, effectively modeling each protein in its own 'annotation space'. Proteins are also connected to one another according to different measures of functional similarity, after which belief propagation is run to make predictions at all ontology terms. RESULTS: The proposed method was evaluated on a set of 4879 proteins from the Saccharomyces Genome Database whose interactions were also recorded in the GRID project. Results indicate that direct utilization of the Gene Ontology improves predictive ability, outperforming traditional models that do not take advantage of dependencies among functional terms. Average increase in accuracy (precision) of positive and negative term predictions of 27.8% (2.0%) over three different similarity measures and three subontologies was observed. AVAILABILITY: C/C++/Perl implementation is available from authors upon request.  相似文献   

16.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.  相似文献   

17.
Frech C  Chen N 《PloS one》2010,5(10):e13409
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.  相似文献   

18.
19.
For bioscientists studying protein structure and function, the Protein Family Alignment Annotation Tool (Pfaat) is a useful and simple program for annotating collections of proteins. This open-source software includes methods for viewing and aligning protein families, and for annotating sequence structure and residues with known functions. It offers new options to aid the study of proteins, and an extensible annotation tool for bioinformatics developers.  相似文献   

20.
Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with large number of occurrences of common genera, our methods give orderings, whose correlation with geochronologic ages is 0.95.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号