Similar literature
1.
Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.

2.
The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

3.

Background  

Studies of the structure-function relationship in proteins for which no 3D structure is available are often based on inspection of multiple sequence alignments. Many functionally important residues of proteins can be identified because they are conserved during evolution. However, residues that vary can also be critically important if their variation is responsible for diversity of protein function and improved phenotypes. If too few sequences are studied, the support for hypotheses on the role of a given residue will be weak, but analysis of large multiple alignments is too complex for simple inspection. When a large body of sequence and functional data are available for a protein family, mature data mining tools, such as machine learning, can be applied to extract information more easily, sensitively and reliably. We have undertaken such an analysis of voltage-gated potassium channels, a transmembrane protein family whose members play indispensable roles in electrically excitable cells.

4.
In recent years, numerous biocomputational tools have been designed to extract functional and evolutionary information from multiple sequence alignments (MSAs) of proteins and genes. Most biologists working actively on the characterization of proteins from a single or family perspective use the MSA analysis to retrieve valuable information about amino acid conservation and the functional role of residues in query protein(s). In MSAs, adjustment of alignment parameters is a key point to improve the quality of MSA output. However, this issue is frequently underestimated and/or misunderstood by scientists and there is no in-depth knowledge available in this field. This brief review focuses on biocomputational approaches complementary to MSA to help distinguish functional residues in protein families. These additional analyses involve issues ranging from phylogenetic to statistical, which address the detection of amino acids pivotal for protein function at any level. In recent years, a large number of tools has been designed for this very purpose. Using some of these relevant, useful tools, we have designed a practical pipeline to perform in silico studies with a view to improving the characterization of family proteins and their functional residues. This review-guide aims to present biologists a set of specially designed tools to study proteins. These tools are user-friendly as they use web servers or easy-to-handle applications. Such criteria are essential for this review as most of the biologists (experimentalists) working in this field are unfamiliar with these biocomputational analysis approaches.

5.
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.

6.
Correlated mutation analysis (CMA) has been used to investigate protein functional sites. However, CMA has suffered from low signal-to-noise ratio caused by meaningless phylogenetic signals or structural constraints. We present a new method, Structure-based Correlated Mutation Analysis (SCMA), which encodes coevolution scores into a protein structure network. A path-based network model is adapted to describe information transfer between residues, and the statistical significance is estimated by network shuffling. This model intrinsically assumes that residues in physical contact have a more reliable coevolution score than distant residues, and that coevolution in distant residues likely arises from a series of contacting and coevolving residues. In addition, coevolutionary coupling is statistically controlled to remove the structural effects. When applied to the rhodopsin structure, the SCMA method identified a much higher percentage of functional residues than the typical coevolution score (61% vs. 22%). In addition, statistically significant residues are used to construct the coevolved residue-residue subnetwork. The network has one highly connected node (retinal bound Lys296), indicating that Lys296 can induce and regulate most other coevolved residues in a variety of locations. The coevolved network consists of a few modular clusters which have distinct functional roles. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.

7.
Protein sequences have evolved to fold into functional structures, resulting in families of diverse protein sequences that all share the same overall fold. One can harness protein family sequence data to infer likely contacts between pairs of residues. In the current study, we combine this kind of inference from coevolutionary information with a coarse‐grained protein force field ordinarily used with single sequence input, the Associative memory, Water mediated, Structure and Energy Model (AWSEM), to achieve improved structure prediction. The resulting Associative memory, Water mediated, Structure and Energy Model with Evolutionary Restraints (AWSEM‐ER) yields a significant improvement in the quality of protein structure prediction over the single sequence prediction from AWSEM when a sufficiently large number of homologous sequences are available. Free energy landscape analysis shows that the addition of the evolutionary term shifts the free energy minimum to more native‐like structures, which explains the improvement in the quality of structures when performing predictions using simulated annealing. Simulations using AWSEM without coevolutionary information have proved useful in elucidating not only protein folding behavior, but also mechanisms of protein function. The success of AWSEM‐ER in de novo structure prediction suggests that the enhanced model opens the door to functional studies of proteins even when no experimentally solved structures are available.

8.
Aquaporins (AQPs) are members of the Major Intrinsic Protein (MIP) family that can transport water or glycerol, as well as other compounds. The rationale for substrate selectivity at the structural level is still incompletely understood. The information present in multiple sequence alignments (MSAs) can help identify both structural and functional features, especially the complex networks of interactions responsible for water or glycerol selectivity. Herein, we have used the method of Statistical Coupling Analysis (SCA) to identify co-evolving pairs of residues in two separate groups of sequences predicted to correspond to water or glycerol transporters. Differentially co-evolved pairs between the two groups were tested by their efficacy in correctly classifying a training set of MSAs, and binary classifiers were built with these pairs. Up to 50% of the residues found in hundreds of binary classifiers corresponded to only ten positions in the MSA of aquaporins. Most of these residues are close to the lining of the aquaporin pore and have been identified previously as important for selectivity. Therefore, this method can shed light on the residues that are important for substrate selectivity of aquaporins and other proteins. SCA requires a very large sequence dataset with relatively low homology amongst its members, and these requirements are met by aquaporins.

9.
Lee Y, Mick J, Furdui C, Beamer LJ. PLoS ONE 2012, 7(6): e38114
Coevolution analyses identify residues that co-vary with each other during evolution, revealing sequence relationships unobservable from traditional multiple sequence alignments. Here we describe a coevolutionary analysis of phosphomannomutase/phosphoglucomutase (PMM/PGM), a widespread and diverse enzyme family involved in carbohydrate biosynthesis. Mutual information and graph theory were utilized to identify a network of highly connected residues with high significance. An examination of the most tightly connected regions of the coevolutionary network reveals that most of the involved residues are localized near an interdomain interface of this enzyme, known to be the site of a functionally important conformational change. The roles of four interface residues found in this network were examined via site-directed mutagenesis and kinetic characterization. For three of these residues, mutation to alanine reduces enzyme specificity to ~10% or less of wild-type, while the other has ~45% activity of wild-type enzyme. An additional mutant of an interface residue that is not densely connected in the coevolutionary network was also characterized, and shows no change in activity relative to wild-type enzyme. The results of these studies are interpreted in the context of structural and functional data on PMM/PGM. Together, they demonstrate that a network of coevolving residues links the highly conserved active site with the interdomain conformational change necessary for the multi-step catalytic reaction. This work adds to our understanding of the functional roles of coevolving residue networks, and has implications for the definition of catalytically important residues.
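
As a rough illustration of the mutual-information step mentioned in this abstract, the sketch below computes plain textbook mutual information between two columns of a toy alignment. The abstract does not state which estimator, sequence weighting, or gap treatment the authors used, so the function name `mutual_information` and the alignment `toy_msa` are hypothetical, and no corrections of any kind are applied here.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    Assumption: every sequence counts equally and gap characters are
    treated like any other symbol; real analyses usually do more.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 is fully conserved, column 3 varies independently of column 0.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

print(mutual_information(toy_msa, 0, 1))  # covarying pair
print(mutual_information(toy_msa, 0, 3))  # independent pair
```

In this toy case the covarying pair scores 1 bit and the independent pair scores 0, while a fully conserved column carries no mutual information with any other column. Building a residue network, as the abstract describes, would then amount to keeping the pairs whose score passes some significance threshold.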

10.
The protein kinase catalytic domain is one of the most abundant domains across all branches of life. Although kinases share a common core function of phosphoryl-transfer, they also have wide functional diversity and play varied roles in cell signaling networks, and for this reason are implicated in a number of human diseases. This functional diversity is primarily achieved through sequence variation, and uncovering the sequence-function relationships for the kinase family is a major challenge. In this study we use a statistical inference technique inspired by statistical physics, which builds a coevolutionary “Potts” Hamiltonian model of sequence variation in a protein family. We show how this model has sufficient power to predict the probability of specific subsequences in the highly diverged kinase family, which we verify by comparing the model’s predictions with experimental observations in the Uniprot database. We show that the pairwise (residue-residue) interaction terms of the statistical model are necessary and sufficient to capture higher-than-pairwise mutation patterns of natural kinase sequences. We observe that previously identified functional sets of residues have much stronger correlated interaction scores than are typical.

11.
12.
During the course of our large-scale genome analysis, a conserved domain, currently detectable only in the genomes of Drosophila melanogaster, Caenorhabditis elegans and Anopheles gambiae, has been identified. The function of this domain is currently unknown and no function annotation is provided for this domain in the publicly available genomic, protein family and sequence databases. The search for homologues of this domain in the non-redundant sequence database using PSI-BLAST resulted in identification of a distant relationship between this family and the alkaline phosphatase-like superfamily, which includes families of aryl sulfatase, N-acetylgalactosamine-4-sulfatase, alkaline phosphatase and 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (iPGM). Fold recognition procedures showed that this new domain could adopt a 3-D fold similar to that of this superfamily. Most of the phosphatases and sulfatases of this superfamily are characterized by functional residues Ser and Cys, respectively, in the topologically equivalent positions. This functionally important site aligns with Ser/Thr in the members of the new family. Additionally, a set of residues responsible for a metal binding site in phosphatases and sulfatases is conserved in the new family. The in-depth analysis suggests that the new family could possess phosphatase activity.

13.
Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions, etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high-throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions that reflect functional specificities or that modulate a protein's function in a given cellular context are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.

14.
Prediction of protein residue contacts with a PDB-derived likelihood matrix (total citations: 8; self-citations: 0; citations by others: 8)
Proteins with similar folds often display common patterns of residue variability. A widely discussed question is how these patterns can be identified and deconvoluted to predict protein structure. In this respect, correlated mutation analysis (CMA) has shown considerable promise. CMA compares multiple members of a protein family and detects residues that remain constant or mutate in tandem. Often this behavior points to structural or functional interdependence between residues. CMA has been used to predict pairs of amino acids that are distant in the primary sequence but likely to form close contacts in the native three-dimensional structure. Until now these methods have used evolutionary or biophysical models to score the fit between residues. We wished to test whether empirical methods, derived from known protein structures, would provide useful predictive power for CMA. We analyzed 672 known protein structures, derived contact likelihood scores for all possible amino acid pairs, and used these scores to predict contacts. We then tested the method on 118 different protein families for which structures have been solved to atomic resolution. The mean performance was almost seven times better than random prediction. Used in concert with secondary structure prediction, the new CMA method could supply restraints for predicting still undetermined structures.

15.
Communication between distant sites often defines the biological role of a protein: amino acid long-range interactions are as important in binding specificity, allosteric regulation and conformational change as residues directly contacting the substrate. Maintaining the functional and structural coupling of long-range interacting residues requires coevolution of these residues. Networks of interaction between coevolved residues can be reconstructed, and from the networks, one can possibly derive insights into functional mechanisms for the protein family. We propose a combinatorial method for mapping conserved networks of amino acid interactions in a protein which is based on the analysis of a set of aligned sequences, the associated distance tree and the combinatorics of its subtrees. The degree of coevolution of all pairs of coevolved residues is identified numerically, and networks are reconstructed with a dedicated clustering algorithm. The method drops the constraints on high sequence divergence limiting the range of applicability of the statistical approaches previously proposed. We apply the method to four protein families where we show an accurate detection of functional networks and the possibility to treat sets of protein sequences of variable divergence.

16.
Viruses often encode proteins with multiple functions due to their compact genomes. Existing approaches to identify functional residues largely rely on sequence conservation analysis. Inferring functional residues from sequence conservation can produce false positives, in which the conserved residues are functionally silent, or false negatives, where functional residues are not identified since they are species-specific and therefore non-conserved. Furthermore, the tedious process of constructing and analyzing individual mutations limits the number of residues that can be examined in a single study. Here, we developed a systematic approach to identify the functional residues of a viral protein by coupling experimental fitness profiling with protein stability prediction using the influenza virus polymerase PA subunit as the target protein. We identified a significant number of functional residues that were influenza type-specific and were evolutionarily non-conserved among different influenza types. Our results indicate that type-specific functional residues are prevalent and may not otherwise be identified by sequence conservation analysis alone. More importantly, this technique can be adapted to any viral (and potentially non-viral) protein where structural information is available.

17.
A portion of the RNA genome of beet yellows closterovirus (BYV) has been sequenced encompassing a complete long open reading frame (ORF) potentially encoding a 65 kDa protein. The sequence of this putative protein was strikingly similar to those of HSP70-related heat shock proteins. The counterparts of all the eight segments strongly conserved in HSP70s could be confidently identified in the BYV 65 kDa protein. It is suggested that some of these segments might be the ATP-binding site(s) and that, similarly to the heat shock proteins, the 65 kDa is probably ATP-binding. Generally, however, the divergence between the 65 kDa sequence and the sequences of the HSP70s was much more pronounced than that between any two members of the latter family, allowing a clearer delineation of clusters of conserved residues that might be crucial for protein function. It is suggested that these observations will be helpful in functional dissection of the proteins of the HSP70 family. Analysis of the sequence of a portion of the ORF found upstream from the 65 kDa ORF showed that the C-terminal domain of the encoded protein could be an RNA-dependent RNA polymerase closely related to those of tricornaviruses, a family of RNA plant viruses with three component genomes.

18.
Detecting homology of distantly related proteins with consensus sequences (total citations: 15; self-citations: 0; citations by others: 15)
A simple protocol is described that is suitable for the detection of distantly related members of a protein family. In this procedure, similarity to a consensus sequence is used to distinguish chance similarity from similarity due to common ancestry. The consensus sequence is constructed from the sequences of established members of a protein family and it incorporates features characteristic of the protein fold of this family: conserved residues, the pattern of variable and conserved segments, preferred location of gaps etc. The database is searched with the consensus sequence, using the unitary matrix or log odds matrix for scoring the alignments, with variable gap penalty. The advantage of the method is that it weights key residues, ignores sequence similarity in variable segments (thus partially eliminating "background noise" coming from chance similarity), distinguishes gaps disrupting conserved segments from those occurring in positions known to be tolerant of gap events. The utility of the method was demonstrated in the case of the protein family homologous with the internal repeats of complement B as well as the internal repeats identified in fibroblast proteoglycan PG40. The consensus sequence method succeeded in finding some new members of these protein families that could not be detected by earlier methods of sequence comparison.
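
To make the idea of a consensus sequence concrete, here is a minimal sketch that derives one from a small, already aligned family. It is not the protocol of this paper, which also encodes gap preferences and uses scoring matrices during the database search. The function `consensus`, the `min_fraction` threshold, the `x` placeholder for variable positions, and the sequences in `family` are all assumptions made for illustration.

```python
from collections import Counter


def consensus(msa, min_fraction=0.6, variable="x"):
    """Build a simple consensus string from equal-length aligned sequences.

    A column contributes its most common residue when that residue is not
    a gap and occurs in at least `min_fraction` of the sequences; otherwise
    the column is marked as variable.
    """
    n = len(msa)
    result = []
    for column in zip(*msa):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / n >= min_fraction:
            result.append(residue)
        else:
            result.append(variable)
    return "".join(result)


# Hypothetical aligned family members ("-" marks a gap).
family = ["CKDGS", "CRDGT", "CKEGS", "CKD-A"]

print(consensus(family))                    # conserved positions kept
print(consensus(family, min_fraction=1.0))  # only invariant positions kept
```

With the default threshold the first four columns are reported as conserved and the last as variable. Raising `min_fraction` to 1.0 keeps only the invariant first column, which shows how the threshold controls which positions count as key residues and which are ignored as variable segments.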

19.
Understanding how proteins evolve to provide both exquisite specificity and proficient activity is a fundamental problem in biology that has implications for protein function prediction and protein engineering. To study this problem, we analyzed the evolution of structure and function in the o-succinylbenzoate synthase/N-acylamino acid racemase (OSBS/NAAAR) family, part of the mechanistically diverse enolase superfamily. Although all characterized members of the family catalyze the OSBS reaction, this family is extraordinarily divergent, with some members sharing <15% identity. In addition, a member of this family, Amycolatopsis OSBS/NAAAR, is promiscuous, catalyzing both dehydration and racemization. Although the OSBS/NAAAR family appears to have a single evolutionary origin, no sequence or structural motifs unique to this family could be identified; all residues conserved in the family are also found in enolase superfamily members that have different functions. Based on their species distribution, several uncharacterized proteins similar to Amycolatopsis OSBS/NAAAR appear to have been transmitted by lateral gene transfer. Like Amycolatopsis OSBS/NAAAR, these might have additional or alternative functions to OSBS because many are from organisms lacking the pathway in which OSBS is an intermediate. In addition to functional differences, the OSBS/NAAAR family exhibits surprising structural variations, including large differences in orientation between the two domains. These results offer several insights into protein evolution. First, orthologous proteins can exhibit significant structural variation, and specificity can be maintained with little conservation of ligand-contacting residues. Second, the discovery of a set of proteins similar to Amycolatopsis OSBS/NAAAR supports the hypothesis that new protein functions evolve through promiscuous intermediates. Finally, a combination of evolutionary, structural, and sequence analyses identified characteristics that might prime proteins, such as Amycolatopsis OSBS/NAAAR, for the evolution of new activities.

20.