Similar articles
Found 20 similar articles; search took 218 ms
1.
Statistical methods have been developed for finding local patterns, also called motifs, in multiple protein sequences. The aligned segments may imply functional or structural core regions. However, the existing methods often have difficulty aligning multiple proteins when sequence residue identities are low (e.g., less than 25%). In this article, we develop a Bayesian model and Markov chain Monte Carlo (MCMC) methods for identifying subtle motifs in protein sequences. Specifically, a motif is defined not only in terms of specific sites characterized by amino acid frequency vectors, but also as a combination of secondary characteristics such as hydrophobicity and polarity. MCMC methods are proposed to search for a motif pattern with high posterior probability under the new model. A special MCMC algorithm is developed, involving transitions between state spaces of different dimensions. The proposed methods were supported by a simulation study and then tested on two real datasets: a group of helix-turn-helix proteins and one set from the CATH Protein Structure Classification Database. Statistical comparisons showed that the new approach worked better than a typical Gibbs sampling approach based only on an amino acid model.

2.
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
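
As a rough illustration of the rank-frequency analysis described in this abstract, the sketch below counts overlapping fixed-length "words" in a few sequences and estimates the slope of the log-log rank-frequency plot. The abstract does not say how SCSs are delimited, so the fixed word length `k`, the function names, and the toy sequences are all assumptions made for illustration only.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping length-k substring ("word") across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return [(rank, f) for rank, f in enumerate(freqs, start=1)]


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Under Zipf's law the slope is close to -1; a smaller magnitude
    means a flatter distribution.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input, made up for illustration; these are not real proteins.
toy = ["MKTAYIAKQR", "MKTAYLAKQR", "GKTAYIAKGG"]
pairs = rank_frequency(kmer_counts(toy, k=3))
slope = loglog_slope(pairs)
```

On real data one would probably restrict the fit to the linear range of the plot, since the abstract notes that the tails deviate from a pure power law.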

3.
We present a comprehensive evaluation of a new structure mining method called PB-ALIGN. It is based on the encoding of protein structure as a 1D sequence of a combination of 16 short structural motifs or protein blocks (PBs). PBs are short motifs capable of representing most of the local structural features of a protein backbone. Using a derived PB substitution matrix and a simple dynamic programming algorithm, PB sequences are aligned in the same way as amino acid sequences to yield a structure alignment. Alignment of these local features as a sequence of symbols enables fast detection of structural similarities between two proteins. The ability of the method to characterize and align regions beyond regular secondary structures, for example, N and C caps of helices and loops connecting regular structures, puts it a step ahead of existing methods, which rely strongly on secondary structure elements. PB-ALIGN achieved an efficiency of 85% in extracting the true fold from a large database of 7259 SCOP domains and identified true superfamily members in 82% of cases. In comparison with 13 existing structure comparison/mining methods, PB-ALIGN emerged as the best on the general ability test dataset and was on par with methods like YAKUSA and CE on the nontrivial test dataset. Furthermore, the proposed method performed well when compared to a flexible structure alignment method like FATCAT and outperformed it in processing speed (less than 45 s per database scan). This work also establishes a reliable cut-off value for the demarcation of similar folds. Finally, it shows that global alignment scores of unrelated structures using PBs follow an extreme value distribution. PB-ALIGN is freely available on a web server called Protein Block Expert (PBE) at http://bioinformatics.univ-reunion.fr/PBE/.
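
The "simple dynamic programming algorithm" mentioned above is not specified further in the abstract. A minimal sketch, assuming it resembles standard Needleman-Wunsch global alignment with a linear gap penalty, might look like this. The scoring function `toy_score` is a placeholder for the derived PB substitution matrix, which the abstract does not give, and the gap penalty is likewise an assumption.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # substitution
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


def toy_score(x, y):
    # Hypothetical stand-in for the PB substitution matrix.
    return 2 if x == y else -1


# PB sequences are strings over 16 letters (conventionally a to p).
print(global_alignment_score("mmmnop", "mmnop", toy_score))
```

Because PB strings are ordinary symbol sequences, any sequence-alignment routine can be reused once a suitable substitution matrix is supplied, which is presumably what makes the approach fast.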

4.
Short motifs are known to play diverse roles in proteins, such as mediating interactions with other molecules, binding to membranes, or conducting a specific biological function. Standard approaches currently employed to detect short motifs in proteins search for enrichment of amino acid motifs considering mostly the sequence information. Here, we present a new approach to search for common motifs (protein signatures) which share both physicochemical and structural properties, looking simultaneously at different features. Our method takes as input an amino acid sequence and translates it to a new alphabet that reflects its intrinsic structural and chemical properties. Using the MEME search algorithm, we identified the protein signatures within subsets of proteins which encompass common sequence and structural information. We demonstrated that we can detect enriched structural motifs, such as the amphipathic helix, from large datasets of linear sequences, as well as predict common structural properties (such as disorder, surface accessibility, or secondary structures) of known functional motifs. Finally, we applied the method to the yeast protein interactome and identified novel putative interacting motifs. We propose that our approach can be applied for de novo protein function prediction given either sequence or structural information. Proteins 2013; © 2012 Wiley Periodicals, Inc.
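
The translation step described above can be pictured as a lookup from residues to property classes. The abstract does not define the actual alphabet, so the four classes below (`h`, `p`, `b`, `a`) and the residue grouping are purely hypothetical; the sketch only shows the mechanics of recoding a sequence before passing it to a motif finder.

```python
# Hypothetical four-letter alphabet; the real one is not given in the abstract.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "b": "KRH",       # basic
    "a": "DE",        # acidic
}
RESIDUE_TO_CLASS = {res: cls for cls, residues in GROUPS.items() for res in residues}


def translate(sequence, unknown="x"):
    """Recode an amino acid sequence into the reduced property alphabet."""
    return "".join(RESIDUE_TO_CLASS.get(res, unknown) for res in sequence.upper())


print(translate("AKDES"))
```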

5.
In recent years, the number of sequenced RNAs has increased, leading to the development of new RNA databases. Predicting RNA structure from multiple alignments is therefore an important step toward understanding RNA function. Since RNA secondary structures are often conserved in evolution, developing methods to identify covarying sites in an alignment can be essential for discovering structural elements. Structure Logo is a technique based on entropy and mutual information measures for analyzing RNA sequences from an alignment. We propose an efficient Structure Logo approach to analyze conservations and correlations in a set of Cardioviral RNA sequences. The entropy and mutual information content were measured to examine the conservations and correlations, respectively. The conserved secondary structure motifs were predicted on the basis of the conservation and correlation analyses. Our predicted motifs were similar to the ones observed in the viral RNA structure database, and the correlations between bases also corresponded to the secondary structure in the database.
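
The two quantities named in this abstract can be sketched directly from their textbook definitions: Shannon entropy of one alignment column measures conservation, and mutual information between two columns measures covariation. The function names and the toy alignment below are assumptions for illustration; the abstract does not describe the authors' exact implementation (for example, how gaps or small-sample corrections are handled).

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the symbols found at position i of every aligned sequence."""
    return [seq[i] for seq in alignment]


def entropy(symbols):
    """Shannon entropy in bits; 0 means a perfectly conserved column."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_i, col_j):
    """I(i; j) = H(i) + H(j) - H(i, j), in bits."""
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Toy alignment: columns 0 and 3 covary as Watson-Crick pairs,
# columns 1 and 2 are fully conserved.
toy = ["GAAC", "CAAG", "AAAU", "UAAA"]
print(entropy(column(toy, 1)))
print(mutual_information(column(toy, 0), column(toy, 3)))
```

High mutual information between two variable columns is the usual signal that the positions are base-paired, which is how correlation analysis points to secondary structure.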

6.
7.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features, but contradictory results have been obtained, due in large part to the absence of objective criteria for constructing sequence profiles and for quantitatively comparing alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins: serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information, and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in regions where tertiary interactions occur and that are relatively conserved in structure.
Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship than that the similarities have arisen from the structural requirements of a given fold.

8.
Lee DW, Kim JK, Lee S, Choi S, Kim S, Hwang I. The Plant Cell 2008, 20(6): 1603-1622
The N-terminal transit peptides of nuclear-encoded plastid proteins are necessary and sufficient for their import into plastids, but the information encoded by these transit peptides remains elusive, as they have a high sequence diversity and lack consensus sequences or common sequence motifs. Here, we investigated the sequence information contained in transit peptides. Hierarchical clustering of the transit peptides of 208 plastid proteins showed that the transit peptide sequences fall into multiple sequence subgroups. We selected representative proteins from seven of these subgroups and confirmed that their transit peptide sequences are highly dissimilar. Protein import experiments revealed that each protein contained transit peptide-specific sequence motifs critical for protein import into chloroplasts. Bioinformatics analysis identified sequence motifs that were conserved among members of the identified subgroups. The sequence motifs identified by the two independent approaches were nearly identical or overlapped significantly. Furthermore, the accuracy of predicting a chloroplast protein was greatly increased by grouping the transit peptides into multiple sequence subgroups. Based on these data, we propose that the transit peptides are composed of multiple sequence subgroups that contain distinctive sequence motifs for chloroplast targeting.

9.
We have collected a set of 44 Arabidopsis proteins with similarity to the USPA (universal stress protein A of Escherichia coli) domain of bacteria. The USPA domain is found either in small proteins, or it makes up the N-terminal portion of a larger protein, usually a protein kinase. Phylogenetic tree analysis based upon a multiple sequence alignment of the USPA domains shows that these domains of protein kinases 1.3.1 and 1.3.2 form distinct groups, as do the protein kinases 1.4.1. This indicates that their USPA domain structures have diverged appreciably and suggests that they may subserve distinct cellular functions. Two USPA fold classes have been proposed: one based on Methanococcus jannaschii MJ0577 (1MJH) that binds ATP, and the other based on the Haemophilus influenzae universal stress protein (1JMV), highly similar to E. coli UspA, which does not bind ATP. A set of common residues involved in ATP binding in 1MJH and conserved in similar bacterial sequences is also found in a distinct cluster of Arabidopsis sequences. Threading analysis, which examines aspects of secondary and tertiary structure, confirms this Arabidopsis sequence cluster as highly similar to 1MJH. This structural approach can distinguish between the characteristic fold differences of 1MJH-like and 1JMV-like bacterial proteins and was used to assign the complete set of candidate Arabidopsis proteins to one of these fold classes. It is clear that all the plant sequences have arisen from a 1MJH-like ancestor.

10.
11.
RxLR effectors represent one of the largest and most diverse effector families in oomycete plant pathogens. These effectors have attracted enormous attention since they can be delivered inside the plant cell and manipulate host immunity. With the exceptions of a signal peptide and the following RxLR-dEER and C-terminal W/Y/L motifs identified from the sequences themselves, nearly no functional domains have been found. Recently, protein structures of several RxLRs were revealed to comprise alpha-helical bundle repeats. However, approximately half of all RxLRs lack obvious W/Y/L motifs, which are associated with helical structures. In this study, secondary structure prediction of the putative RxLR proteins was performed. We found that the C-terminus of the majority of these RxLR proteins, irrespective of the presence of W/Y/L motifs, contains abundant short alpha-helices. Since large-scale experimental determination of protein structures has been difficult to date, the results of the current study extend our understanding of the secondary structures of oomycete RxLR effectors from individual members to the entire family. Moreover, we identified fewer alpha-helix-rich proteins in the secretomes of several oomycete and fungal organisms in which RxLRs have not been identified, providing additional evidence that these organisms are unlikely to harbor RxLR-like proteins. Therefore, these results provide additional information that will aid further studies on the evolution and functional mechanisms of RxLR effectors.

12.
MOTIVATION: Non-coding RNA genes and RNA structural regulatory motifs play important roles in gene regulation and other cellular functions. They are often characterized by specific secondary structures that are critical to their functions and are often conserved in phylogenetically or functionally related sequences. Predicting common RNA secondary structures in multiple unaligned sequences remains a challenge in bioinformatics research. METHODS AND RESULTS: We present a new sampling-based algorithm to predict common RNA secondary structures in multiple unaligned sequences. Our algorithm finds the common structure between two sequences by probabilistically sampling aligned stems based on stem conservation calculated from intrasequence base pairing probabilities and intersequence base alignment probabilities. It iteratively updates these probabilities based on sampled structures and subsequently recalculates stem conservation using the updated probabilities. The iterative process terminates upon convergence of the sampled structures. We extend the algorithm to multiple sequences by a consistency-based method, which iteratively incorporates and reinforces consistent structure information from pairwise comparisons into consensus structures. The algorithm has no limitation on predicting pseudoknots. In extensive testing on real sequence data, our algorithm outperformed other leading RNA structure prediction methods in both sensitivity and specificity at a reasonably fast speed. It also generated better structural alignments than other programs for sequences across a wide range of identities, alignments which more accurately represent the conservation of RNA secondary structure. AVAILABILITY: The algorithm is implemented in a C program, RNA Sampler, which is available at http://ural.wustl.edu/software.html

13.
Nair R, Rost B. Proteins 2003, 53(4): 917-930
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods have employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structure taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.

14.
Proteins of related functions are often similar in sequence, reflecting a common phylogenetic origin. Proteins with no known homology are probably diversified proteins, too distantly related to known sequences in databases to retain significant similarity. All proteins, however, probably share common ancestries if one moves far enough back in evolution; therefore, given the huge accumulation of protein sequences in current databases, it could be expected that some proteins with no obvious sequence resemblance to any other share some residues that could represent footprints of ancient common ancestries. To identify such putative footprints, we have searched for short stretches of amino acids present in a given protein sequence that are also found in a significant number of non-related proteins in the database. The significantly high frequency of occurrence of these patterns in the database would support a common evolutionary source, and a diversity of non-related proteins that contain the pattern would express their ancient origin. Using this strategy, significant patterns were found in actual exons, but not in randomized amino acid sequences, nor in translated sequences of noncoding DNA, suggesting that this strategy actually leads to the identification of patterns with a biological significance. These significant patterns are not randomly positioned along the sequences analyzed, but they tend to accumulate within specific regions, producing a profile of discrete domains. In some well-known proteins analyzed in this study, some of these domains are coincident with known motifs. Thus, the procedure described in this paper could be useful for identifying ancient patterns and domains in protein sequences, some of which could also have a functional or structural significance.

15.
16.
ABSTRACT: BACKGROUND: Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and characterising secondary structures such as beta-sheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. Beta-residue motifs are beta-sheet substructures that can be represented as graphs and queried using existing graph indexing methods; however, these approaches are designed for general graphs, do not incorporate the inherent structural constraints of beta-sheets, and require computationally expensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow beta-residue motifs to be queried in a three-dimensional context but at significant computational cost. RESULTS: We developed a new method for querying beta-residue motifs, called BetaSearch, which leverages the natural planar constraints of beta-sheets by indexing them as 2D matrices, thus avoiding much of the computational complexity involved in structural and graph querying. BetaSearch demonstrates faster filtering, verification, and overall query time than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves 33 and 240 times speedups over index-based and pairwise alignment-based approaches, respectively. Furthermore, we have presented case studies to demonstrate its capability of motif matching in sequentially dissimilar proteins and described a method for using BetaSearch to predict beta-strand pairing. CONCLUSIONS: We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing high-volume exploratory querying of possible protein substructural motifs or conformations.
BetaSearch was used to identify a nearly identical beta-residue motif between an entirely synthetic protein (Top7) and a naturally occurring protein (Charcot-Leyden crystal protein), as well as to identify structural similarities between the biotin-binding domains of avidin, streptavidin and the lipocalin gamma subunit of human C8. AVAILABILITY: The web interface, source code, and datasets for BetaSearch can be accessed from http://www.csse.unimelb.edu.au/~hohkhkh1/betasearch.

17.
Discriminating outer membrane proteins (OMPs) from other folding types of globular and membrane proteins is an important problem, both for predicting their secondary and tertiary structures and for detecting outer membrane proteins in genomic sequences. In this work, we have systematically analyzed the distribution of amino acid residues in the sequences of globular and outer membrane proteins with several motifs, such as A*B, A**B, etc. We observed that the motifs E*L, A*K and L*E occur frequently in globular proteins, while S*S, N*S and R*D occur predominantly in OMPs. We have devised a statistical method based on frequently occurring motifs in globular proteins and OMPs and obtained an accuracy of 96% and 82% for correctly identifying OMPs and excluding globular proteins, respectively. Further, we noticed that the motifs of transmembrane helical (TMH) proteins are different from those of OMPs. While I*A, I*L and L*I are preferred in TMH proteins, S*S, N*S and N*N occur predominantly in OMPs. The information about the occurrence of A*B motifs in TMH proteins and OMPs could discriminate them with an accuracy of 80% for excluding OMPs and 100% for identifying OMPs. The influence of protein size and structural class on discrimination is discussed.
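
The A*B and A**B motifs above are residue pairs separated by one or two arbitrary positions. A minimal counting sketch is shown below; the function name and the toy sequence are assumptions, and the statistical rule the authors use to turn these counts into a classifier is not described in the abstract, so it is not reproduced here.

```python
from collections import Counter


def gapped_pair_counts(sequence, gap=1):
    """Count motifs of the form A*B: residue A, `gap` arbitrary residues, residue B."""
    counts = Counter()
    step = gap + 1  # distance between the two fixed residues
    for i in range(len(sequence) - step):
        motif = sequence[i] + "*" * gap + sequence[i + step]
        counts[motif] += 1
    return counts


# Toy sequence, made up for illustration.
counts = gapped_pair_counts("SASNSS", gap=1)
print(counts.most_common(3))
```

Comparing such counts, normalized by sequence length, between two protein classes is one plausible way to find the class-specific motifs the abstract reports.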

18.
We present an algorithm to detect protein sub-structural motifs from primary sequence. The input to the algorithm is a set of multiple aligned protein sequences. It uses wavelet transforms to decompose protein sequences represented numerically by different indices (such as polarity, accessible surface area or electron-ion interaction potentials of the amino acids). The numerical representation of a protein sequence has significant correlation with its biological activity; thus common motifs are expected to be observable in the wavelet spectrum. The decomposed signals are then up-sampled, and similarity search techniques are used to identify similar regions across all the proteins at multiple scales. Results indicate that wavelet transform techniques are a promising approach for rapid motif detection.
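
To make the pipeline above concrete, the sketch below encodes a sequence numerically and applies a multi-level Haar decomposition. Both choices are assumptions: the abstract does not name the wavelet family, and the Kyte-Doolittle hydropathy scale is swapped in here as a familiar index in place of the polarity, surface area, and potential indices the abstract lists. The up-sampling and similarity search steps are omitted.

```python
# Kyte-Doolittle hydropathy values, used here as an example index only.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue to its numerical index value."""
    return [HYDROPATHY[res] for res in sequence]


def haar_step(signal):
    """One level of an unnormalized Haar transform: pairwise means and half-differences.

    A trailing unpaired sample (odd-length input) is dropped.
    """
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return approx, detail


def haar_decompose(signal, levels):
    """Repeatedly split the approximation; return it with the per-level details."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


approx, details = haar_decompose(encode("AILKMFPS"), levels=2)
```

Each detail level captures variation at a different length scale, which is what lets similar regions be compared "at multiple scales" as the abstract puts it.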

19.
Secondary structure models are an important step for aligning sequences, understanding probabilities of nucleotide substitutions, and evaluating the reliability of phylogenetic reconstructions. A set of conserved sequence motifs is derived from comparative sequence analysis of 184 invertebrate and vertebrate taxa (including many taxa from the same genera, families, and orders) with reference to a secondary structure model for domain III of animal mitochondrial small subunit (12S) ribosomal RNA. A template is presented to assist with secondary structure drawing. Our model is similar to previous models but is more specific to mitochondrial DNA, fitting both invertebrate and vertebrate groups, including taxa with markedly different nucleotide compositions. The second half of the domain III sequence can be difficult to align precisely, even when secondary structure information is considered. This is especially true for comparisons of anciently diverged taxa, but well-conserved motifs assist in determining biologically meaningful alignments. Patterns of conservation and variability in both paired and unpaired regions make differential phylogenetic weighting in terms of "stems" and "loops" unsatisfactory. We emphasize looking carefully at the sequence data before and during analyses, and advocate the use of conserved motifs and other secondary structure information for assessing sequencing fidelity.

20.
Hu YJ. Nucleic Acids Research 2002, 30(17): 3886-3893
Given a set of homologous or functionally related RNA sequences, the consensus motifs may represent the binding sites of RNA regulatory proteins. Unlike DNA motifs, RNA motifs are more conserved in structure than in sequence. Knowing the structural motifs can help us gain a deeper insight into regulatory activity. There have been various studies of RNA secondary structure prediction, but most of them are not focused on finding motifs from sets of functionally related sequences. Although recent research shows some new approaches to RNA motif finding, they are limited to finding relatively simple structures, e.g. stem-loops. In this paper, we propose a novel genetic programming approach to RNA secondary structure prediction. It is capable of finding more complex structures than stem-loops. To demonstrate the performance of our new approach and to keep our comparative study consistent, we first tested it on the same data sets previously used to verify the current prediction systems. To show the flexibility of our new approach, we also tested it on a data set that contains pseudoknot motifs, which most current systems cannot identify. A web-based user interface for the prediction system is available at http://bioinfo.cis.nctu.edu.tw/service/gprm/.
