Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Detection of functional DNA motifs via statistical over-representation   Total citations: 14, self-citations: 0, citations by others: 14

2.
Subtle motifs: defining the limits of motif finding algorithms   总被引:4,自引:0,他引:4  
MOTIVATION: What constitutes a subtle motif? Intuitively, it is a motif that is almost indistinguishable, in the statistical sense, from random motifs. This question has important practical consequences: consider, for example, a biologist who is generating a sample of upstream regulatory sequences with the goal of finding a regulatory pattern that is shared by these sequences. If the sequences are too short, then one risks losing some of the regulatory patterns that are located further upstream. Conversely, if the sequences are too long, the motif becomes too subtle and one is then likely to encounter random motifs which are at least as significant statistically as the regulatory pattern itself. In practical terms one would like to recognize the sequence length threshold, or the twilight zone, beyond which the motifs are in some sense too subtle. RESULTS: The paper defines the motif twilight zone where every motif finding algorithm would be exposed to random motifs which are as significant as the one which is sought. We also propose an objective tool for evaluating the performance of subtle motif finding algorithms. Finally we apply these tools to evaluate the success of our MULTIPROFILER algorithm to detect subtle motifs.
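
The twilight zone described in this abstract can be illustrated with a rough estimate. The sketch below is not the paper's own definition. It assumes the widely used (l, d) planted-motif model with uniform, independent bases, and it approximates how many l-mers would be expected to occur, with at most d mismatches, in every one of t random sequences of length n. The function names and parameters are illustrative assumptions.

```python
from math import comb


def neighbor_probability(l, d):
    """Probability that a random l-mer lies within d mismatches of a fixed l-mer.

    Assumes uniform, independent bases (each base matches with probability 1/4).
    """
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(d + 1))


def expected_random_motifs(l, d, t, n):
    """Approximate expected number of l-mers that have an occurrence with at
    most d mismatches in each of t random sequences of length n.

    Treats the n - l + 1 windows of a sequence as independent, which is an
    approximation (overlapping windows are correlated).
    """
    p = neighbor_probability(l, d)
    per_sequence = 1 - (1 - p) ** (n - l + 1)  # at least one hit in one sequence
    return 4 ** l * per_sequence ** t


if __name__ == "__main__":
    # As the sequences get longer, the expected number of chance motifs grows;
    # once it approaches or exceeds 1, a planted motif is hard to tell apart.
    for n in (300, 600, 1000, 2000):
        print(n, expected_random_motifs(l=15, d=4, t=20, n=n))
```

Under these assumptions, the sequence length at which the estimate reaches about 1 gives an informal sense of where chance motifs start to compete with the true pattern.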

3.
Identifying non-coding RNA regions on the genome using computational methods is currently receiving a lot of attention. In general, it is essentially more difficult than the problem of detecting protein-coding genes because non-coding RNA regions have only weak statistical signals. On the other hand, most functional RNA families have conserved sequences and secondary structures which are characteristic of their molecular function in a cell. These are known as sequence motifs and consensus structures, respectively. In this paper, we propose an improved method which extends a pairwise structural alignment method for RNA sequences to handle position specific scoring matrices and hence to incorporate motifs into structural alignment of RNA sequences. To model sequence motifs, we employ position specific scoring matrices (PSSMs). Experimental results show that PSSMs enable us to find individual RNA families efficiently, especially if we have biological knowledge such as sequence motifs. K. Sato and K. Morita contributed equally to this work.
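
To make the role of a PSSM concrete, here is a minimal sketch of building a log-odds matrix from aligned motif instances and scanning a sequence with it. It covers sequence-only scoring, not the structural alignment extension the abstract describes. The pseudocount, the uniform background, and the function names are assumptions made for illustration.

```python
from math import log2

ALPHABET = "ACGU"


def build_pssm(instances, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per column) from equal-length motif instances."""
    length = len(instances[0])
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in instances]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = log2(freq / background)  # log-odds against the background
        pssm.append(scores)
    return pssm


def best_hit(pssm, sequence):
    """Return (start, score) of the highest-scoring window, or None if the
    sequence is shorter than the motif."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    pssm = build_pssm(["GGAC", "GGAC", "GGAU"])
    print(best_hit(pssm, "AAUGGACUU"))
```

In a structural alignment setting, per-position scores like these would be one term in the alignment score alongside the base-pairing terms.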

4.
Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.

5.
When investigators undertake searches of DNA databases, they normally discard large numbers of alignments that demonstrate very weak resemblances to each other, retaining only those that show statistically significant levels of resemblance. We show here that a great deal of information can be extracted from these weak alignments by examining them en masse. This is done by building three-dimensional similarity landscapes from the alignments, landscapes that reveal whether an unusual number of individually nonsignificant alignments tend to match up to a particular region of the query sequence being searched. The power of the search is increased by the use of libraries consisting entirely of introns or of exons. We show that (1) similarity landscapes with a variety of features can be generated from both intron and exon libraries, using introns or exons as query sequences; (2) the landscape features are real and not a statistical artifact; (3) well-known protein motifs used as query sequences can generate various landscape features; and (4) there is some evidence for resemblances between short regions of sequence carried by introns and exons. One possible interpretation of these results is that both introns and exons may have been built up during their evolution from short regions of sequence that as a result are now widely distributed throughout eukaryotic genomes. Such an interpretation would imply that these short regions have common ancestry. Alternatively, the wide sharing of short pieces of DNA may reflect regions with particular structural properties that have arisen through convergent evolution. The similarity-landscape approach can be used to detect such widespread structural motifs and sequence motifs in the genome that might be missed by less-global searches. 
It can also be used in conjunction with algorithms developed for detecting significant multiple alignments by isolating promising subsets of the databases that can be examined in more detail. Correspondence to: C. Wills

6.
MOTIVATION: RNA structure motifs contained in mRNAs have been found to play important roles in regulating gene expression. However, identification of novel RNA regulatory motifs using computational methods has not been widely explored. Effective tools for predicting novel RNA regulatory motifs based on genomic sequences are needed. RESULTS: We present a new method for predicting common RNA secondary structure motifs in a set of functionally or evolutionarily related RNA sequences. This method is based on comparison of stems (palindromic helices) between sequences and is implemented by applying graph-theoretical approaches. It first finds all possible stable stems in each sequence and compares stems pairwise between sequences by some defined features to find stems conserved across any two sequences. Then by applying a maximum clique finding algorithm, it finds all significant stems conserved across at least k sequences. Finally, it assembles in topological order all possible compatible conserved stems shared by at least k sequences and reports a number of the best assembled stem sets as the best candidate common structure motifs. This method does not require prior structural alignment of the sequences and is able to detect pseudoknot structures. We have tested this approach on some RNA sequences with known secondary structures, in which it is capable of detecting the real structures completely or partially correctly and outperforms other existing programs for similar purposes. AVAILABILITY: The algorithm has been implemented in C++ in a program called comRNA, which is available at http://ural.wustl.edu/softwares.html
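
The first step of the pipeline, enumerating candidate stems in one sequence, can be sketched as follows. This is not comRNA's code. The abstract does not state its stability criterion, so a minimum stem length and a minimum loop size stand in for it here, and G-U wobble pairs are allowed. All names and thresholds are assumptions, and the pairwise comparison and maximum-clique steps are not shown.

```python
# Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_stems(seq, min_len=3, min_loop=3):
    """Enumerate stems as (i, j, length), where bases i..i+length-1 pair with
    bases j..j-length+1 (0-based), leaving at least min_loop unpaired bases
    between the two strands. Stems that could be extended outward are skipped,
    so each reported stem starts at its outermost base pair.
    """
    stems = []
    n = len(seq)
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that sit inside a longer stem starting further out.
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            while (i + length < j - length - min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems


if __name__ == "__main__":
    print(find_stems("GGGAAAACCC"))
```

Stems found this way in each sequence would then be compared across sequences, with conserved stems becoming the vertices of the graph on which the clique search runs.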

7.
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.

8.
Compilation and analysis of intein sequences.   Total citations: 18, self-citations: 3, citations by others: 15
We have compiled a list of all the inteins (protein splicing elements) whose sequences have been published or were available from on-line sequence databases as of September 18, 1996. Analysis of the 36 available intein sequences refines the previously described intein motifs and reveals the presence of another intein motif, Block H. Furthermore, analysis of the new inteins reshapes our view of the conserved splice junction residues, since three inteins lack the intein penultimate His seen in prior examples. Comparison of intein sequences suggests that, in general, (i) inteins present in the same location within extein homologs from different organisms are very closely related to each other in paired sequence comparison or phylogenetic analysis and we suggest that they should be considered intein alleles; (ii) multiple inteins present in the same gene are no more similar to each other than to inteins present in different genes; (iii) phylogenetic analysis indicates that inteins are so divergent that trees with statistically significant branches cannot be generated except for intein alleles.

9.
10.
Characterizing and classifying regularities in protein structure is an important element in uncovering the mechanisms that regulate protein structure, function and evolution. Recent research concentrates on analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, accuracy of secondary protein structure prediction based on multiple sequence alignment drops significantly when low homology (twilight zone) sequences are considered. To this end, this paper addresses the problem of providing an alternative sequence representation that would improve the ability to distinguish secondary structures for the twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix and coil fragments. Classification of SFs allows the design of novel sequence representations and an investigation of which other factors and prediction algorithms may result in improved discrimination. Comprehensive experimental results show that statistically significant improvement in classification accuracy can be achieved by: (1) improving sequence representations, and (2) removing possible noise on the terminal residues in the SFs. Combining these two approaches reduces the error rate on average by 15% when compared to classification using standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

11.
We have developed a pattern comparative method for identifying functionally important motifs in protein sequences. The essence of most standard pattern comparative methods is a comparison of patterns occurring in different sequences using an optimized weight matrix. In contrast, our approach is based on a measure of similarity among all the candidate motifs within the same sequence. This method may prove to be particularly efficient for proteins encoding the same biochemical function, but with different primary sequences, and when tertiary structure information from one or more sequences is available. We have applied this method to a special class of zinc-binding enzymes known as endopeptidases.

12.
13.
Kagiampakis I, Jin H, Kim S, Vannucci M, LiWang PJ, Tsai J. Biochemistry, 2008, 47(40): 10637-10648
In the chemokine family, we characterize two examples of evolutionarily conserved unfavorable sequence motifs that affect quaternary structure. In contrast to the straightforward action of favorable sequences, these unfavorable motifs produce interactions disfavoring one outcome to indirectly promote another one but should not be confused with the broad sampling produced by negative selection and/or design. To identify such motifs, we developed a statistically validated computational method combining structure and phylogeny. This approach was applied in an analysis of the alternate forms of homodimerization exhibited in the chemokine family. While the chemokine family exhibits the same tertiary fold, members of certain subfamilies, including CXCL8, form a homodimer across the β1 strand whereas members of other subfamilies, including CCL4 and CCL2, form a homodimer on the opposite side of the chemokine fold. These alternate dimerization states suggest that CCL4 and CCL2 contain specific sequences that disfavor CXCL8 dimerization. Using our computational approach, we identified two evolutionarily conserved sequence motifs in the CC subfamilies: a drastic two-residue deletion (ΔRV) and a simple point mutation (V27R). Cloned into the CXCL8 background, these two motifs were experimentally proven to confer a monomeric state. NMR analyses indicate that these variants are structured in solution and retain the chemokine fold. Structurally, the motifs retain a chemokine tertiary fold while introducing unfavorable quaternary interactions that inhibit CXCL8 dimerization. In demonstrating the success of our computational method, our results argue that these unfavorable motifs have been evolutionarily conserved to specifically disfavor one dimerization state and, as a result, indirectly contribute to favoring another.

14.
MOTIVATION: Direct recognition, or direct readout, of DNA bases by a DNA-binding protein involves amino acids that interact directly with features specific to each base. Experimental evidence also shows that in many cases the protein achieves partial sequence specificity by indirect recognition, i.e., by recognizing structural properties of the DNA. (1) Could threading a DNA sequence onto a crystal structure of bound DNA help explain the indirect recognition component of sequence specificity? (2) Might the resulting pure-structure computational motif manifest itself in familiar sequence-based computational motifs? RESULTS: The starting structure motif was a crystal structure of DNA bound to the integration host factor protein (IHF) of E. coli. IHF is known to exhibit both direct and indirect recognition of its binding sites. (1) Threading DNA sequences onto the crystal structure showed statistically significant partial separation of 60 IHF binding sites from random and intragenic sequences and was positively correlated with binding affinity. (2) The crystal structure was shown to be equivalent to a linear Markov network, and so, to a joint probability distribution over sequences, computable in linear time. It was transformed algorithmically into several common pure-sequence representations, including (a) small sets of short exact strings, (b) weight matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and (e) phylogenetic trees. In all cases the pure-sequence motifs retained statistically significant partial separation of the IHF binding sites from random and intragenic sequences. Most exhibited positive correlation with binding affinity. The multiple alignment showed some conserved columns, and the phylogenetic tree partially mixed low-energy sequences with IHF binding sites but separated high-energy sequences. 
The conclusion is that deformation energy explains part of indirect recognition, which explains part of IHF sequence-specific binding.

15.
MOTIVATION: Identification of short conserved sequence motifs common to a protein family or superfamily can be more useful than overall sequence similarity in suggesting the function of novel gene products. Locating motifs still requires expert knowledge, as automated methods using stringent criteria may not differentiate subtle similarities from statistical noise. RESULTS: We have developed a novel automatic method, based on patterns of conservation of 237 physical-chemical properties of amino acids in aligned protein sequences, to find related motifs in proteins with little or no overall sequence similarity. As an application, our web-server MASIA identified 12 property-based motifs in the apurinic/apyrimidinic endonuclease (APE) family of DNA-repair enzymes of the DNase-I superfamily. Searching with these motifs located distantly related representatives of the DNase-I superfamily, such as Inositol 5'-polyphosphate phosphatases in the ASTRAL40 database, using a Bayesian scoring function. Other proteins containing APE motifs had no overall sequence or structural similarity. However, all were phosphatases and/or had a metal ion binding active site. Thus our automated method can identify discrete elements in distantly related proteins that define local structure and aspects of function. We anticipate that our method will complement existing ones to functionally annotate novel protein sequences from genomic projects. AVAILABILITY: MASIA WEB site: http://www.scsb.utmb.edu/masia/masia.html SUPPLEMENTARY INFORMATION: The dendrogram of 42 APE sequences used to derive motifs is available on http://www.scsb.utmb.edu/comp_biol.html/DNA_repair/publication.html

16.
MOTIVATION: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs. RESULTS: Tests on artificially generated sequences show that NMF can faithfully reproduce both positioning and content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting datasets. Specifically, an analysis of mRNA 3'-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements.
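
As a rough illustration of the NMF idea, the sketch below factorizes a small non-negative matrix V into W and H using the standard multiplicative updates for squared error, and reports the residual sum of squares (RSS). It is not the authors' implementation. Treating V as a matrix of k-mer counts by position is an assumption, as are the function names, the random initialization, and the toy data.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rss(v, w, h):
    """Residual sum of squares between V and the reconstruction W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) as W (n x rank) times H (rank x m)
    with multiplicative updates; eps guards against division by zero."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Toy matrix built from two non-negative patterns (hypothetical data).
    v = [[1, 2, 0], [0, 1, 3], [1, 3, 3], [2, 4, 0]]
    for rank in (1, 2, 3):
        w, h = nmf(v, rank)
        print(rank, rss(v, w, h))
```

Plotting RSS against rank and looking for the point where further increases stop paying off is one informal way to read the abstract's remark about estimating the number of motifs.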

17.
MOTIVATION: Short linear peptide motifs mediate protein-protein interaction and cell compartment targeting, and represent the sites of post-translational modification. The identification of functional motifs by conventional sequence searches, however, is hampered by the short length of the motifs, which results in a large number of hits of which only a small portion is functional. RESULTS: We have developed a procedure for the identification of functional motifs, which scores pattern conservation in homologous sequences by taking explicitly into account the sequence similarity to the query sequence. For a further improvement of this method, sequence filters have been optimized to mask those sequence regions containing few or no linear motifs. The performance of this approach was verified by measuring its ability to identify 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. Compared to a random selection procedure, the joint application of sequence filters and the novel scoring scheme resulted in a 9-fold enrichment of validated functional motifs on the first rank. In addition, only half as many hits need to be investigated to recover 75% of the functional instances in our dataset. Therefore, this motif-scoring approach should be helpful to guide experiments because it allows focusing on those short linear peptide motifs that have a high probability of being functional.

18.
In the postgenomic era it is essential that protein sequences are annotated correctly in order to help in the assignment of their putative functions. Over 1300 proteins in current protein sequence databases are predicted to contain a PAS domain based upon amino acid sequence alignments. One of the problems with the current annotation of the PAS domain is that this domain exhibits limited similarity at the amino acid sequence level. It is therefore essential, when using proteins with low-sequence similarities, to apply profile hidden Markov model searches for the PAS domain-containing proteins, as for the PFAM database. From recent 3D X-ray and NMR structures, however, PAS domains appear to have a conserved 3D fold as shown here by structural alignment of the six representative 3D-structures from the PDB database. Large-scale modelling of the PAS sequences from the PFAM database against the 3D-structures of these six structural prototypes was performed. All 3D models generated (> 5700) were evaluated using prosaii. We conclude from our large-scale modelling studies that the PAS and PAC motifs (which are separately defined in the PFAM database) are directly linked and that these two motifs form the PAS fold. The existing subdivision in PAS and PAC motifs, as used by the PFAM and SMART databases, appears to be caused by major differences in sequences in the region connecting these two motifs. This region, as has been shown by Gardner and coworkers for human PAS kinase (Amezcua, C.A., Harper, S.M., Rutter, J. & Gardner, K.H. (2002) Structure 10, 1349-1361, [1]), is very flexible and adopts different conformations depending on the bound ligand. Some PAS sequences present in the PFAM database did not produce a good structural model, even after realignment using a structure-based alignment method, suggesting that these representatives are unlikely to have a fold resembling any of the structural prototypes of the PAS domain superfamily.

19.
20.
Predictive motifs derived from cytosine methyltransferases.   Total citations: 36, self-citations: 51, citations by others: 36
Thirteen bacterial DNA methyltransferases that catalyze the formation of 5-methylcytosine within specific DNA sequences possess related structures. Similar building blocks (motifs), containing invariant positions, can be found in the same order in all thirteen sequences. Five of these blocks are highly conserved while a further five contain weaker similarities. One block, which has the most invariant residues, contains the proline-cysteine dipeptide of the proposed catalytic site. A region in the second half of each sequence is unusually variable both in length and sequence composition. Those methyltransferases that exhibit significant homology in this region share common specificity in DNA recognition. The five highly conserved motifs can be used to discriminate the known 5-methylcytosine forming methyltransferases from all other methyltransferases of known sequence, and from all other identified proteins in the PIR, GenBank and EMBL databases. These five motifs occur in a mammalian methyltransferase responsible for the formation of 5-methylcytosine within CG dinucleotides. By searching the unidentified open reading frames present in the GenBank and EMBL databases, two potential 5-methylcytosine forming methyltransferases have been found.
