首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
2.
Finding conserved motifs in genomic sequences represents one of essential bioinformatic problems. However, achieving high discovery performance without imposing substantial auxiliary constraints on possible motif features remains a key algorithmic challenge. This work describes BAMBI-a sequential Monte Carlo motif-identification algorithm, which is based on a position weight matrix model that does not require additional constraints and is able to estimate such motif properties as length, logo, number of instances and their locations solely on the basis of primary nucleotide sequence data. Furthermore, should biologically meaningful information about motif attributes be available, BAMBI takes advantage of this knowledge to further refine the discovery results. In practical applications, we show that the proposed approach can be used to find sites of such diverse DNA-binding molecules as the cAMP receptor protein (CRP) and Din-family site-specific serine recombinases. Results obtained by BAMBI in these and other settings demonstrate better statistical performance than any of the four widely-used profile-based motif discovery methods: MEME, BioProspector with BioOptimizer, SeSiMCMC and Motif Sampler as measured by the nucleotide-level correlation coefficient. Additionally, in the case of Din-family recombinase target site discovery, the BAMBI-inferred motif is found to be the only one functionally accurate from the underlying biochemical mechanism standpoint. C++ and Matlab code is available at http://www.ee.columbia.edu/~guido/BAMBI or http://genomics.lbl.gov/BAMBI/.  相似文献   

3.
Large portions of higher eukaryotic proteomes are intrinsically disordered, and abundant evidence suggests that these unstructured regions of proteins are rich in regulatory interaction interfaces. A major class of disordered interaction interfaces are the compact and degenerate modules known as short linear motifs (SLiMs). As a result of the difficulties associated with the experimental identification and validation of SLiMs, our understanding of these modules is limited, advocating the use of computational methods to focus experimental discovery. This article evaluates the use of evolutionary conservation as a discriminatory technique for motif discovery. A statistical framework is introduced to assess the significance of relatively conserved residues, quantifying the likelihood a residue will have a particular level of conservation given the conservation of the surrounding residues. The framework is expanded to assess the significance of groupings of conserved residues, a metric that forms the basis of SLiMPrints (short linear motif fingerprints), a de novo motif discovery tool. SLiMPrints identifies relatively overconstrained proximal groupings of residues within intrinsically disordered regions, indicative of putatively functional motifs. Finally, the human proteome is analysed to create a set of highly conserved putative motif instances, including a novel site on translation initiation factor eIF2A that may regulate translation through binding of eIF4E.  相似文献   

4.
MOTIVATION: Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems. RESULTS: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. As well, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acids sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures. AVAILABILITY: Gemoda is freely available at http://web.mit.edu/bamel/gemoda  相似文献   

5.
Intrinsically disordered proteins (IDPs) exist without the presence of a stable tertiary structure in isolation. These proteins are often involved in molecular recognition processes via their disordered binding regions that can recognize partner molecules by undergoing a coupled folding and binding process. The specific properties of disordered binding regions give way to specific, yet transient interactions that enable IDPs to play central roles in signaling pathways and act as hubs of protein interaction networks. An alternative model of protein-protein interactions with largely overlapping functional properties is offered by the concept of linear interaction motifs. This approach focuses on distilling a short consensus sequence pattern from proteins with a common interaction partner. These motifs often reside in disordered regions and are considered to mediate the interaction roughly independent from the rest of the protein. Although a connection between linear motifs and disordered binding regions has been established through common examples, the complementary nature of the two concepts has yet to be fully explored. In many cases the sequence based definition of linear motifs and the structural context based definition of disordered binding regions describe two aspects of the same phenomenon. To gain insight into the connection between the two models, prediction methods were utilized. We combined the regular expression based prediction of linear motifs with the disordered binding region prediction method ANCHOR, each specialized for either model to get the best of both worlds. The thorough analysis of the overlap of the two methods offers a bioinformatics tool for more efficient binding site prediction that can serve a wide range of practical implications. At the same time it can also shed light on the theoretical connection between the two co-existing interaction models.  相似文献   

6.
7.
Eight amino acid permease genes from the protozoan parasite Leishmania donovani (AAPLDs) were cloned, sequenced, and shown to be expressed in promastigotes. Seven of these belong to the amino acid transporter-1 and one to the amino acid polyamino-choline superfamilies. Using these sequences as well as known and characterized amino acid permease genes from all kingdoms, a training set was established and used to search for motifs, using the MEME motif discovery tool. This study revealed two motifs that are specific to the genus Leishmania, four to the family trypanosomatidae, and a single motif that is common between trypanosomatidae and mammalian systems A1 and N. Interestingly, most of these motifs are clustered in two regions of 50-60 amino acids. Blast search analyses indicated a close relationship between the L. donovani and Trypanosoma brucei amino acid permeases. The results of this work describe the cloning of the first amino acid permease genes in parasitic protozoa and contribute to the understanding of amino acid permease evolution in these organisms. Furthermore, the identification of genus-specific motifs in these proteins might be useful to better understand parasite physiology within its hosts.  相似文献   

8.
MOTIVATION: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach adds sequentially a new motif to a mixture model by performing a combined scheme of global and local search for appropriately initializing its parameters. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed-up the global searching procedure. The proposed method compares favorably over the well-known MEME approach and treats successfully several drawbacks of MEME. RESULTS: Experimental results indicate that the algorithm is advantageous in identifying larger groups of motifs characteristic of biological families with significant conservation. In addition, it offers better diagnostic capabilities by building more powerful statistical motif-models with improved classification accuracy.  相似文献   

9.
10.
Prediction of short linear protein binding regions   总被引:1,自引:0,他引:1  
Short linear motifs in proteins (typically 3-12 residues in length) play key roles in protein-protein interactions by frequently binding specifically to peptide binding domains within interacting proteins. Their tendency to be found in disordered segments of proteins has meant that they have often been overlooked. Here we present SLiMPred (short linear motif predictor), the first general de novo method designed to computationally predict such regions in protein primary sequences independent of experimentally defined homologs and interactors. The method applies machine learning techniques to predict new motifs based on annotated instances from the Eukaryotic Linear Motif database, as well as structural, biophysical, and biochemical features derived from the protein primary sequence. We have integrated these data sources and benchmarked the predictive accuracy of the method, and found that it performs equivalently to a predictor of protein binding regions in disordered regions, in addition to having predictive power for other classes of motif sites such as polyproline II helix motifs and short linear motifs lying in ordered regions. It will be useful in predicting peptides involved in potential protein associations and will aid in the functional characterization of proteins, especially of proteins lacking experimental information on structures and interactions. We conclude that, despite the diversity of motif sequences and structures, SLiMPred is a valuable tool for prioritizing potential interaction motifs in proteins.  相似文献   

11.
Kim S  Wang Z  Dalkilic M 《Proteins》2007,66(3):671-681
The motif prediction problem is to predict short, conserved subsequences that are part of a family of sequences, and it is a very important biological problem. Gibbs is one of the first successful motif algorithms and it runs very fast compared with other algorithms, and its search behavior is based on the well-studied Gibbs random sampling. However, motif prediction is a very difficult problem and Gibbs may not predict true motifs in some cases. Thus, the authors explored a possibility of improving the prediction accuracy of Gibbs while retaining its fast runtime performance. In this paper, the authors considered Gibbs only for proteins, not for DNA binding sites. The authors have developed iGibbs, an integrated motif search framework for proteins that employs two previous techniques of their own: one for guiding motif search by clustering sequences and another by pattern refinement. These two techniques are combined to a new double clustering approach to guiding motif search. The unique feature of their framework is that users do not have to specify the number of motifs to be predicted when motifs occur in different subsets of the input sequences since it automatically clusters input sequences into clusters and predict motifs from the clusters. Tests on the PROSITE database show that their framework improved the prediction accuracy of Gibbs significantly. Compared with more exhaustive search methods like MEME, iGibbs predicted motifs more accurately and runs one order of magnitude faster.  相似文献   

12.

Background  

Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to largely reduce the mining cost.  相似文献   

13.
MOTIVATION: The automatic identification of over-represented motifs present in a collection of sequences continues to be a challenging problem in computational biology. In this paper, we propose a self-organizing map of position weight matrices as an alternative method for motif discovery. The advantage of this approach is that it can be used to simultaneously characterize every feature present in the dataset, thus lessening the chance that weaker signals will be missed. Features identified are ranked in terms of over-representation relative to a background model. RESULTS: We present an implementation of this approach, named SOMBRERO (self-organizing map for biological regulatory element recognition and ordering), which is capable of discovering multiple distinct motifs present in a single dataset. Demonstrated here are the advantages of our approach on various datasets and SOMBRERO's improved performance over two popular motif-finding programs, MEME and AlignACE. AVAILABILITY: SOMBRERO is available free of charge from http://bioinf.nuigalway.ie/sombrero SUPPLEMENTARY INFORMATION: http://bioinf.nuigalway.ie/sombrero/additional.  相似文献   

14.
Short and long disordered regions of proteins have different preference for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of Disprot annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to Disprot and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Mathews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP 9 blind prediction and is one of the top servers according to the official ranking. In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies.  相似文献   

15.
Abstract

Short and long disordered regions of proteins have different preference for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of Disprot annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to Disprot and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Mathews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP 9 blind prediction and is one of the top servers according to the official ranking. In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies. The server and databases are available at http://sparks.informatics.iupui.edu/.  相似文献   

16.
17.
18.
Finding subtle motifs by branching from sample strings   总被引:1,自引:0,他引:1  
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al. 1993, Science, 262, 208-214) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (ISMB-94), 28-36) applies the EM algorithm to selected sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms. Availability: http://www-cse.ucsd.edu/groups/bioinformatics/software.html  相似文献   

19.
20.
Natively unstructured or disordered regions appear to be abundant in eukaryotic proteins. Many such regions have been found alongside small linear binding motifs. We report a Monte Carlo study that aims to elucidate the role of disordered regions adjacent to such binding motifs. The coarse-grained simulations show that small hydrophobic peptides without disordered flanks tend to aggregate under conditions where peptides embedded in unstructured peptide sequences are stable as monomers or as part of small micelle-like clusters. Surprisingly, the binding free energy of the motif is barely decreased by the presence of disordered flanking regions, although it is sensitive to the loss of entropy of the motif itself upon binding. This latter effect allows for reversible binding of the signalling motif to the substrate. The work provides insights into a mechanism that prevents the aggregation of signalling peptides, distinct from the general mechanism of protein folding, and provides a testable hypothesis to explain the abundance of disordered regions in proteins.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号