首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Given sufficient large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly, nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts independently on the number of aligned sequences, residue separation or secondary structure type, but is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, in comparison to state of the art methods using machine learning, PconsC2 is superior for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.  相似文献   

2.
We report the results of residue-residue contact prediction of a new pipeline built purely on the learning of coevolutionary features in the CASP13 experiment. For a query sequence, the pipeline starts with the collection of multiple sequence alignments (MSAs) from multiple genome and metagenome sequence databases using two complementary Hidden Markov Model (HMM)-based searching tools. Three profile matrices, built on covariance, precision, and pseudolikelihood maximization respectively, are then created from the MSAs, which are used as the input features of a deep residual convolutional neural network architecture for contact-map training and prediction. Two ensembling strategies have been proposed to integrate the matrix features through end-to-end training and stacking, resulting in two complementary programs called TripletRes and ResTriplet, respectively. For the 31 free-modeling domains that do not have homologous templates in the PDB, TripletRes and ResTriplet generated comparable results with an average accuracy of 0.640 and 0.646, respectively, for the top L/5 long-range predictions, where 71% and 74% of the cases have an accuracy above 0.5. Detailed data analyses showed that the strength of the pipeline is due to the sensitive MSA construction and the advanced strategies for coevolutionary feature ensembling. Domain splitting was also found to help enhance the contact prediction performance. Nevertheless, contact models for tail regions, which often involve a high number of alignment gaps, and for targets with few homologous sequences are still suboptimal. Development of new approaches where the model is specifically trained on these regions and targets might help address these problems.  相似文献   

3.
Protein sequences have evolved to fold into functional structures, resulting in families of diverse protein sequences that all share the same overall fold. One can harness protein family sequence data to infer likely contacts between pairs of residues. In the current study, we combine this kind of inference from coevolutionary information with a coarse‐grained protein force field ordinarily used with single sequence input, the Associative memory, Water mediated, Structure and Energy Model (AWSEM), to achieve improved structure prediction. The resulting Associative memory, Water mediated, Structure and Energy Model with Evolutionary Restraints (AWSEM‐ER) yields a significant improvement in the quality of protein structure prediction over the single sequence prediction from AWSEM when a sufficiently large number of homologous sequences are available. Free energy landscape analysis shows that the addition of the evolutionary term shifts the free energy minimum to more native‐like structures, which explains the improvement in the quality of structures when performing predictions using simulated annealing. Simulations using AWSEM without coevolutionary information have proved useful in elucidating not only protein folding behavior, but also mechanisms of protein function. The success of AWSEM‐ER in de novo structure prediction suggests that the enhanced model opens the door to functional studies of proteins even when no experimentally solved structures are available.  相似文献   

4.
Lim Heo  Michael Feig 《Proteins》2020,88(5):637-642
Protein structure prediction has long been available as an alternative to experimental structure determination, especially via homology modeling based on templates from related sequences. Recently, models based on distance restraints from coevolutionary analysis via machine learning to have significantly expanded the ability to predict structures for sequences without templates. One such method, AlphaFold, also performs well on sequences where templates are available but without using such information directly. Here we show that combining machine-learning based models from AlphaFold with state-of-the-art physics-based refinement via molecular dynamics simulations further improves predictions to outperform any other prediction method tested during the latest round of CASP. The resulting models have highly accurate global and local structures, including high accuracy at functionally important interface residues, and they are highly suitable as initial models for crystal structure determination via molecular replacement.  相似文献   

5.
A probabilistic graphical model is proposed in order to detect the coevolution between different sites in biological sequences. The model extends the continuous-time Markov process of sequence substitution for single nucleic or amino acids and imposes general constraints regarding simultaneous changes on the substitution rate matrix. Given a multiple sequence alignment for each molecule of interest and a phylogenetic tree, the model can predict potential interactions within or between nucleic acids and proteins. Initial validation of the model is carried out using tRNA and 16S rRNA sequence data. The model accurately identifies the secondary interactions of tRNA as well as several known tertiary interactions. In addition, results on 16S rRNA data indicate this general and simple coevolutionary model outperforms several other parametric and nonparametric methods in predicting secondary interactions. Furthermore, the majority of the putative predictions exhibit either direct contact or proximity of the nucleotide pairs in the 3-dimensional structure of the Thermus thermophilus ribosomal small subunit. The results on RNA data suggest a general model of coevolution might be applied to other types of interactions between protein, DNA, and RNA molecules.  相似文献   

6.
Predicting protein structure from primary sequence is one of the ultimate challenges in computational biology. Given the large amount of available sequence data, the analysis of co-evolution, i.e., statistical dependency, between columns in multiple alignments of protein domain sequences remains one of the most promising avenues for predicting residues that are contacting in the structure. A key impediment to this approach is that strong statistical dependencies are also observed for many residue pairs that are distal in the structure. Using a comprehensive analysis of protein domains with available three-dimensional structures we show that co-evolving contacts very commonly form chains that percolate through the protein structure, inducing indirect statistical dependencies between many distal pairs of residues. We characterize the distributions of length and spatial distance traveled by these co-evolving contact chains and show that they explain a large fraction of observed statistical dependencies between structurally distal pairs. We adapt a recently developed Bayesian network model into a rigorous procedure for disentangling direct from indirect statistical dependencies, and we demonstrate that this method not only successfully accomplishes this task, but also allows contacts with weak statistical dependency to be detected. To illustrate how additional information can be incorporated into our method, we incorporate a phylogenetic correction, and we develop an informative prior that takes into account that the probability for a pair of residues to contact depends strongly on their primary-sequence distance and the amount of conservation that the corresponding columns in the multiple alignment exhibit. We show that our model including these extensions dramatically improves the accuracy of contact prediction from multiple sequence alignments.  相似文献   

7.
The prediction of novel pre-microRNA (miRNA) from genomic sequence has received considerable attention recently. However, the majority of studies have focused on the human genome. Previous studies have demonstrated that sensitivity (correctly detecting true miRNA) is sustained when human-trained methods are applied to other species, however they have failed to report the dramatic drop in specificity (the ability to correctly reject non-miRNA sequences) in non-human genomes. Considering the ratio of true miRNA sequences to pseudo-miRNA sequences is on the order of 1:1000, such low specificity prevents the application of most existing tools to non-human genomes, as the number of false positives overwhelms the true predictions. We here introduce a framework (SMIRP) for creating species-specific miRNA prediction systems, leveraging sequence conservation and phylogenetic distance information. Substantial improvements in specificity and precision are obtained for four non-human test species when our framework is applied to three different prediction systems representing two types of classifiers (support vector machine and Random Forest), based on three different feature sets, with both human-specific and taxon-wide training data. The SMIRP framework is potentially applicable to all miRNA prediction systems and we expect substantial improvement in precision and specificity, while sustaining sensitivity, independent of the machine learning technique chosen.  相似文献   

8.
Coevolution between protein residues is normally interpreted as direct contact. However, the evolutionary record of a protein sequence contains rich information that may include long-range functional couplings, couplings that report on homo-oligomeric states or even conformational changes. Due to the complexity of the sequence space and the lack of structural information on various members of a protein family, it has been difficult to effectively mine the additional information encoded in a multiple sequence alignment (MSA). Here, taking advantage of the recent release of the AlphaFold (AF) database we attempt to identify coevolutionary couplings that cannot be explained simply by spatial proximity. We propose a simple computational method that performs direct coupling analysis on a MSA and searches for couplings that are not satisfied in any of the AF models of members of the identified protein family. Application of this method on 2012 protein families suggests that ~12% of the total identified coevolving residue pairs are spatially distant and more likely to be disordered than their contacting counterparts. We expect that this analysis will help improve the quality of coevolutionary distance restraints used for structure determination and will be useful in identifying potentially functional/allosteric cross-talk between distant residues.  相似文献   

9.
Phylogenetically enhanced statistical tools for RNA structure prediction   总被引:4,自引:0,他引:4  
MOTIVATION: Methods that predict the structure of molecules by looking for statistical correlation have been quite effective. Unfortunately, these methods often disregard phylogenetic information in the sequences they analyze. Here, we present a number of statistics for RNA molecular-structure prediction. Besides common pair-wise comparisons, we consider a few reasonable statistics for base-triple predictions, and present an elaborate analysis of these methods. All these statistics incorporate phylogenetic relationships of the sequences in the analysis to varying degrees, and the different nature of these tests gives a wide choice of statistical tools for RNA structure prediction. RESULTS: Starting from statistics that incorporate phylogenetic information only as independent sequence evolution models for each position of a multiple alignment, and extending this idea to a joint evolution model of two positions, we enhance the usual purely statistical methods (e.g. methods based on the Mutual Information statistic) with the use of phylogenetic information available in the sequences. In particular, we present a joint model based on the HKY evolution model, and consequently a X(2) test of independence for two positions. A significant part of this work is devoted to some mathematical analysis of these methods. We tested these statistics on regions of 16S and 23S rRNA, and tRNA.  相似文献   

10.
A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. Such typical invariant function is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, whose analysis already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems such as: assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project.  相似文献   

11.
Comprehensive sampling of genomic biodiversity is fast becoming a reality for some genomic regions and complete organelle genomes. Genomic biodiversity is defined as large genomic sequences from many species, and here some recent work is reviewed that demonstrates the potential benefits of genomic biodiversity for molecular evolutionary analysis and phylogenetic reconstruction. This work shows that using likelihood-based approaches, taxon addition can dramatically improve phylogenetic reconstruction. Features or dynamics of the evolutionary process are much more easily inferred with large numbers of taxa, and large numbers are essential for discriminating differences in evolutionary patterns between sites. Accurate prediction of site-specific patterns can improve phylogenetic reconstruction by an amount equivalent to quadrupling sequence length. Genomic biodiversity is particularly central to research relating patterns of evolution, adaptation and coevolution to structural and functional features of proteins. Research on detecting coevolution between amino acid residues in proteins demonstrates a clear need for much greater numbers of closely related taxa to better discriminate site-specific patterns of interaction, and to allow more detailed analysis of coevolutionary interactions between subunits in protein complexes. It is argued that parsing out coevolutionary and other context-dependent substitution probabilities is essential for discriminating between coevolution and adaptation, and for more realistically modelling the evolution of proteins. Also reviewed is research that argues for increasing the efficiency of acquiring genomic biodiversity, and suggests that this might be done by simultaneously shotgun cloning and sequencing genomic mixtures from many species. Increased efficiency is a prerequisite if genomic biodiversity levels are to rapidly increase by orders of magnitude, and thus lead to dramatically improved understanding of interactions between protein structure, function and sequence evolution.  相似文献   

12.
Lo SL  Cai CZ  Chen YZ  Chung MC 《Proteomics》2005,5(4):876-884
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparison of the results derived from the use of real protein sequences with that derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty for separating two sets of real protein sequences than that for separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training a SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. The prediction results are consistent with observations, suggesting that real sequence is more practically useful in development of SVM classification system for facilitating protein-protein interaction prediction.  相似文献   

13.
The surprising fact that global statistical properties computed on a genomewide scale may reveal species information has first been observed in studies of dinucleotide frequencies. Here we will look at the same phenomenon with a totally different statistical approach. We show that patterns in the short-range statistical correlations in DNA sequences serve as evolutionary fingerprints of eukaryotes. All chromosomes of a species display the same characteristic pattern, markedly different from those of other species. The chromosomes of a species are sorted onto the same branch of a phylogenetic tree due to this correlation pattern. The average correlation between nucleotides at a distance k is quantified in two independent ways: (i) by estimating it from a higher-order Markov process and (ii) by computing the mutual information function at a distance k. We show how the quality of phylogenetic reconstruction depends on the range of correlation strengths and on the length of the underlying sequence segment. This concept of the correlation pattern as a phylogenetic signature of eukaryote species combines two rather distant domains of research, namely phylogenetic analysis based on molecular observation and the study of the correlation structure of DNA sequences.  相似文献   

14.
J. G. Lawrence  D. L. Hartl 《Genetics》1992,131(3):753-760
Inconsistencies in taxonomic relationships implicit in different sets of nucleic acid sequences potentially result from horizontal transfer of genetic material between genomes. A nonparametric method is proposed to determine whether such inconsistencies are statistically significant. A similarity coefficient is calculated from ranked pairwise identities and evaluated against a distribution of similarity coefficients generated from resampled data. Subsequent analyses of partial data sets, obtained by the elimination of individual taxa, identify particular taxa to which the significance may be attributed, and can sometimes help in distinguishing horizontal genetic transfer from inconsistencies due to convergent evolution or variation in evolutionary rate. The method was successfully applied to data sets that were not found to be significantly different with existing methods that use comparisons of phylogenetic trees. The new statistical framework is also applicable to the inference of horizontal transfer from restriction fragment length polymorphism distributions and protein sequences.  相似文献   

15.
It has long been suspected that analysis of correlated amino acid substitutions should uncover pairs or clusters of sites that are spatially proximal in mature protein structures. Accordingly, methods based on different mathematical principles such as information theory, correlation coefficients and maximum likelihood have been developed to identify co-evolving amino acids from multiple sequence alignments. Sets of pairs of sites whose behaviour is identified by these methods as correlated are often significantly enriched in pairs of spatially proximal residues. However, relatively high levels of false-positive predictions typically render such methods, in isolation, of little use in the ab initio prediction of protein structure. Misleading signal (or problems with the estimation of significance levels) can be caused by phylogenetic correlations between homologous sequences and from correlation due to factors other than spatial proximity (for example, correlation of sites which are not spatially close but which are involved in common functional properties of the protein). In recent years, several workers have suggested that information from correlated substitutions should be combined with other sources of information (secondary structure, solvent accessibility, evolutionary rates) in an attempt to reduce the proportion of false-positive predictions. We review methods for the detection of correlated amino acid substitutions, compare their relative performance in contact prediction and predict future directions in the field.  相似文献   

16.
17.
Maximum entropy-based inference methods have been successfully used to infer direct interactions from biological datasets such as gene expression data or sequence ensembles. Here, we review undirected pairwise maximum-entropy probability models in two categories of data types, those with continuous and categorical random variables. As a concrete example, we present recently developed inference methods from the field of protein contact prediction and show that a basic set of assumptions leads to similar solution strategies for inferring the model parameters in both variable types. These parameters reflect interactive couplings between observables, which can be used to predict global properties of the biological system. Such methods are applicable to the important problems of protein 3-D structure prediction and association of gene–gene networks, and they enable potential applications to the analysis of gene alteration patterns and to protein design.  相似文献   

18.
We present an approach to predicting protein structural class that uses amino acid composition and hydrophobic pattern frequency information as input to two types of neural networks: (1) a three-layer back-propagation network and (2) a learning vector quantization network. The results of these methods are compared to those obtained from a modified Euclidean statistical clustering algorithm. The protein sequence data used to drive these algorithms consist of the normalized frequency of up to 20 amino acid types and six hydrophobic amino acid patterns. From these frequency values the structural class predictions for each protein (all-alpha, all-beta, or alpha-beta classes) are derived. Examples consisting of 64 previously classified proteins were randomly divided into multiple training (56 proteins) and test (8 proteins) sets. The best performing algorithm on the test sets was the learning vector quantization network using 17 inputs, obtaining a prediction accuracy of 80.2%. The Matthews correlation coefficients are statistically significant for all algorithms and all structural classes. The differences between algorithms are in general not statistically significant. These results show that information exists in protein primary sequences that is easily obtainable and useful for the prediction of protein structural class by neural networks as well as by standard statistical clustering algorithms.  相似文献   

19.
Previously proposed methods for protein secondary structure prediction from multiple sequence alignments do not efficiently extract the evolutionary information that these alignments contain. The predictions of these methods are less accurate than they could be, because of their failure to consider explicitly the phylogenetic tree that relates aligned protein sequences. As an alternative, we present a hidden Markov model approach to secondary structure prediction that more fully uses the evolutionary information contained in protein sequence alignments. A representative example is presented, and three experiments are performed that illustrate how the appropriate representation of evolutionary relatedness can improve inferences. We explain why similar improvement can be expected in other secondary structure prediction methods and indeed any comparative sequence analysis method.  相似文献   

20.
We provide a new automated statistical method for DNA barcoding based on a Bayesian phylogenetic analysis. The method is based on automated database sequence retrieval, alignment, and phylogenetic analysis using a custom-built program for Bayesian phylogenetic analysis. We show on real data that the method outperforms Blast searches as a measure of confidence and can help eliminate 80% of all false assignment based on best Blast hit. However, the most important advance of the method is that it provides statistically meaningful measures of confidence. We apply the method to a re-analysis of previously published ancient DNA data and show that, with high statistical confidence, most of the published sequences are in fact of Neanderthal origin. However, there are several cases of chimeric sequences that are comprised of a combination of both Neanderthal and modern human DNA.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号