Similar literature
 Found 20 similar documents; search took 15 ms
1.
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs, and that it models motifs with a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
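The idea of classifying sequences with a full PWM and a score threshold can be sketched as follows. This is a minimal illustration, not MotifSpec itself: the function names, the uniform background of 0.25, the pseudocount, and the fixed (rather than learned) threshold are all assumptions made for the example, and sequences are assumed to contain only A, C, G, and T.

```python
import math

def pwm_log_odds(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into a log-odds position weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_window_score(pwm, sequence):
    """Return the highest PWM score over all windows of the sequence."""
    width = len(pwm)
    scores = [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]
    # Sequences shorter than the motif have no window and score -infinity.
    return max(scores) if scores else float("-inf")

def classify(pwm, sequences, threshold):
    """Label a sequence True when its best window reaches the threshold."""
    return [best_window_score(pwm, seq) >= threshold for seq in sequences]
```

In a discriminative setting the threshold would be chosen to best separate bound from unbound sequences; here it is simply passed in by the caller.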

2.
Quantifying similarity between motifs   Cited by: 2 (self-citations: 0, citations by others: 2)
A common question within the context of de novo motif discovery is whether a newly discovered, putative motif resembles any previously discovered motif in an existing database. To answer this question, we define a statistical measure of motif-motif similarity, and we describe an algorithm, called Tomtom, for searching a database of motifs with a given query motif. Experimental simulations demonstrate the accuracy of Tomtom's E values and its effectiveness in finding similar motifs.

3.
4.
MOTIVATION: The sequence specificity of DNA-binding proteins is typically represented as a position weight matrix in which each base position contributes independently to relative affinity. Assessment of the accuracy and broad applicability of this representation has been limited by the lack of extensive DNA-binding data. However, new microarray techniques, in which preferences for all possible K-mers are measured, enable a broad comparison of both motif representation and methods for motif discovery. Here, we consider the problem of accounting for all of the binding data in such experiments, rather than only the highest affinity binding data. We introduce RankMotif++, an algorithm designed for finding motifs whenever sequences are associated with a semi-quantitative measure of protein-DNA-binding affinity. RankMotif++ learns motif models by maximizing the likelihood of a set of binding preferences under a probabilistic model of how sequence binding affinity translates into binding preference observations. Because RankMotif++ makes few assumptions about the relationship between binding affinity and the semi-quantitative readout, it is applicable to a wide variety of experimental assays of DNA-binding preference. RESULTS: By several criteria, RankMotif++ predicts binding affinity better than two widely used motif finding algorithms (MDScan, MatrixREDUCE) or more recently developed algorithms (PREGO, Seed and Wobble), and its performance is comparable to a motif model that separately assigns affinities to 8-mers. Our results validate the PWM model and provide an approximation of the precision and recall that can be expected in a genomic scan. AVAILABILITY: RankMotif++ is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
The taxonomy of thermophilic, endospore-forming bacteria has attracted great interest over the past few years. Although a number of taxonomic markers were previously evaluated, their sequences in Geobacillus were too conserved, and more variable markers need to be identified. Repetitive DNA is one of the promising variable targets for developing taxon-specific genotyping and identification schemes in bacteria. The aim of our study was to evaluate the possibility of using repetitive DNA in the taxonomy of Geobacillus. In this paper, we report the analysis of perfect tandem repeats of geobacilli. We focused on the long repeats (with a motif length of ≥20 nucleotides). This choice was based on the assumption that these motifs can be used for the construction of oligonucleotides (primers and probes). Thirty-three Geobacillus genus-specific motifs were identified in our work: fifteen were species-specific, fifteen were species cluster-specific, and three were genus-specific but not species- or species cluster-specific. Some of the motifs were used for the construction of primer pairs. The primers were validated by PCR. Out of 12 designed primer pairs, 11 were genus-specific and 4 were species-specific. Species-specific primers were successfully constructed for the phylogenetically defined species Geobacillus thermodenitrificans and Geobacillus toebii.
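A perfect tandem repeat is a motif followed immediately by one or more exact copies of itself. The brute-force sketch below shows one way such repeats with a minimum motif length could be located. It is an illustration under stated assumptions, not the tool used in the study: the function name and the `min_copies` parameter are hypothetical, the scan is quadratic in sequence length, and a repeat with period L may also be reported at multiples of L.

```python
def perfect_tandem_repeats(sequence, min_motif_len=20, min_copies=2):
    """Return (start, motif, copies) for perfect tandem repeats in the sequence."""
    n = len(sequence)
    results = []
    for motif_len in range(min_motif_len, n // min_copies + 1):
        start = 0
        while start + motif_len * min_copies <= n:
            motif = sequence[start:start + motif_len]
            copies = 1
            # Extend while the next block is an exact copy of the motif.
            while sequence[start + copies * motif_len:
                           start + (copies + 1) * motif_len] == motif:
                copies += 1
            if copies >= min_copies:
                results.append((start, motif, copies))
                start += copies * motif_len  # skip past the reported repeat
            else:
                start += 1
    return results
```

For genome-scale input a suffix-array or dedicated repeat finder would be the practical choice; the loop above only makes the definition concrete.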

6.
7.
8.
Mining frequent stem patterns from unaligned RNA sequences   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: In the detection of non-coding RNAs, it is often necessary to identify the secondary structure motifs from a set of putative RNA sequences. Most of the existing algorithms aim to provide the best motif or a few good motifs, but biologists often need to inspect all the possible motifs thoroughly. RESULTS: Our method RNAmine employs a graph theoretic representation of RNA sequences and detects all the possible motifs exhaustively using a graph mining algorithm. The motif detection problem boils down to finding frequently appearing patterns in a set of directed and labeled graphs. In the tasks of common secondary structure prediction and local motif detection from long sequences, our method compared favorably, in both accuracy and efficiency, with state-of-the-art methods such as CMFinder. AVAILABILITY: The software is available upon request.

9.
10.
11.
12.
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes (that is, the motifs themselves) are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N²n, where n is the average length of the sequences and N their number. An application to the identification of promoter and regulatory consensus sequences in bacterial genomes is shown.
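The structured-motif model (boxes, one substitution limit per box, and a distance interval between successive boxes) can be made concrete with a small matcher. Note that this sketch only checks where a known structured motif occurs; it does not infer unknown boxes and does not use a suffix tree, so it is not the paper's algorithm. All names are hypothetical, and substitution limits are given as absolute counts.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_structured_motif(sequence, boxes, max_subs, gaps):
    """Return start positions where every box matches within its substitution
    limit and successive boxes are separated by an allowed distance.

    boxes    -- list of p box strings
    max_subs -- list of p substitution limits, one per box
    gaps     -- list of p - 1 (min_gap, max_gap) tuples between successive boxes
    """
    def match_from(box_index, position):
        box = boxes[box_index]
        window = sequence[position:position + len(box)]
        if len(window) < len(box) or hamming(window, box) > max_subs[box_index]:
            return False
        if box_index == len(boxes) - 1:
            return True
        min_gap, max_gap = gaps[box_index]
        end = position + len(box)
        # Try every allowed spacing before the next box.
        return any(match_from(box_index + 1, end + gap)
                   for gap in range(min_gap, max_gap + 1))

    return [start for start in range(len(sequence)) if match_from(0, start)]
```

For example, a bacterial promoter-like pattern could be described with boxes `["TTGACA", "TATAAT"]`, one allowed substitution per box, and a spacing of 16 to 18 nucleotides.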

13.
14.
15.
The motif DGYW/WRCH (Mh) and its frequently discussed simplified derivative GYW/WRC (Mhs) are involved in immunoglobulin (Ig) hypermutation. Both these motifs appear to be markedly shorter than the corresponding conventionally predicted minima of valid sequence lengths (MVSL). The same conclusion concerning both Mh and Mhs can also be obtained in the combined case including a less strict semi-empirically defined w-value and one nucleotide length tolerance related to MVSL. Such disagreement indicates considerably low information content in Mh and Mhs when evaluating these motifs as alphabetical structures (words). This fact raises the question of which structures are actually recognized (presumably longer than Mh and Mhs). Interestingly, both Mh and Mhs dimers or pairs of closely located Mh or Mhs achieve confirmation of length validity in the case of w=0.05, thus suggesting double-motif recognition as one of the statistically consistent explanations. This possibility is also in agreement with the results of our model sequence study of mRNA derived from variable Ig gene sequences (rIgV) with respect to the most frequently occurring structures formed by motif overlaps in all model sequence sets. On the other hand, additional superior occurrence of motif pairs at a structurally important distance of a single DNA thread was found in the conserved domain (cd00099) related sequences of Elasmobranchii origin and less markedly in the corresponding human rIgV, but not in a randomly selected human subset of rIgV. The data are discussed with respect to statistical evaluation and structural properties of hypermutation motifs or the competent enzyme, i.e. activation-induced cytidine deaminase.
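DGYW and WRCH are written in IUPAC ambiguity codes (D = A/G/T, Y = C/T, W = A/T, R = A/G, H = A/C/T), and WRCH is the reverse complement of DGYW. A minimal sketch of locating both forms on one strand is shown below; the function names are hypothetical and the approach (regular expressions with lookahead so that overlapping occurrences are kept) is an illustration, not the statistical method of the study.

```python
import re

# IUPAC codes needed for the hypermutation motifs, as regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]",
         "D": "[AGT]", "H": "[ACT]"}

def iupac_to_regex(motif):
    """Compile an IUPAC motif; the lookahead reports overlapping occurrences."""
    return re.compile("(?=(" + "".join(IUPAC[ch] for ch in motif) + "))")

def find_hotspots(sequence, motifs=("DGYW", "WRCH")):
    """Return sorted (start, motif, matched_text) tuples for each occurrence."""
    hits = []
    for motif in motifs:
        for match in iupac_to_regex(motif).finditer(sequence):
            hits.append((match.start(), motif, match.group(1)))
    return sorted(hits)
```

Because the motifs are only four letters long and degenerate, many matches are expected by chance in any sequence, which is consistent with the low information content discussed in the abstract.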

16.
Recent studies have shown that RNA structural motifs play essential roles in RNA folding and interaction with other molecules. Computational identification and analysis of RNA structural motifs remains a challenging task. Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variations. Other structural motif identification methods consider only nested canonical base-pairing structures and cannot be used to identify complex RNA structural motifs that often consist of various non-canonical base pairs due to uncommon hydrogen bond interactions. In this article, we present a novel RNA structural alignment method for RNA structural motif identification, RNAMotifScan, which takes into consideration the isosteric (both canonical and non-canonical) base pairs and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan are demonstrated by searching for kink-turn, C-loop, sarcin-ricin, reverse kink-turn and E-loop motifs against a 23S rRNA (PDBid: 1S72), which is well characterized for the occurrences of these motifs. Finally, we search for these motifs in the RNA structures of the entire Protein Data Bank and estimate their abundances. RNAMotifScan is freely available at our supplementary website (http://genome.ucf.edu/RNAMotifScan).

17.

Background  

DNA signatures are distinct short nucleotide sequences that provide valuable information for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require the user to set input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, guidance on how to select proper factor values is rare, especially when an unfamiliar DNA database is used. In most cases, biologists select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, including those that meet the user's requirements for the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures.
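The relationship between a discovery condition (l, d) and its implicit signatures can be illustrated with a brute-force enumeration. The abstract does not define a signature precisely, so the sketch assumes one common definition: a window is a signature with tolerance d if every other window of the same length differs from it by more than d mismatches. The function names and the single-string database are likewise assumptions made for the example.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_signature(database, start, length, tolerance):
    """True when the window at `start` differs from every other window of the
    same length by more than `tolerance` mismatches (assumed definition)."""
    pattern = database[start:start + length]
    for other in range(len(database) - length + 1):
        if other != start and mismatches(pattern, database[other:other + length]) <= tolerance:
            return False
    return True

def implicit_signatures(database, max_length, min_tolerance):
    """Enumerate (start, length, tolerance) triples with length <= max_length
    and tolerance >= min_tolerance that satisfy is_signature."""
    found = []
    for length in range(1, max_length + 1):
        for start in range(len(database) - length + 1):
            for tolerance in range(min_tolerance, length):
                if is_signature(database, start, length, tolerance):
                    found.append((start, length, tolerance))
                else:
                    break  # a larger tolerance is only harder to satisfy
    return found
```

Running the enumeration once over all lengths up to l and tolerances from d upward replaces the repeated trial-and-error runs described above, at a cost that is only practical for small inputs in this naive form.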

18.
A deep understanding of protein structure benefits from the use of a variety of classification strategies that enhance our ability to effectively describe local patterns of conformation. Here, we use a clustering algorithm to analyze 76,533 all-trans segments from protein structures solved at 1.2 Å resolution or better to create a purely φ,ψ-based comprehensive empirical categorization of common conformations adopted by two adjacent φ,ψ pairs (i.e., (φ,ψ)₂ motifs). The clustering algorithm works in an origin-shifted four-dimensional space based on the two φ,ψ pairs to yield a parameter-dependent list of (φ,ψ)₂ motifs, in order of their prominence. The results are remarkably distinct from and complementary to the standard hydrogen-bond-centered view of secondary structure. New insights include an unprecedented level of precision in describing the φ,ψ angles of both previously known and novel motifs, ordering of these motifs by their population density, a data-driven recommendation that the standard Cα(i)…Cα(i+3) < 7 Å criterion for defining turns be changed to 6.5 Å, identification of β-strand and turn capping motifs, and identification of conformational capping by residues in polypeptide II conformation. We further document that the conformational preferences of a residue are substantially influenced by the conformation of its neighbors, and we suggest that accounting for these dependencies will improve protein modeling accuracy. Although the CUEVAS-4D(r10?14) ‘parts list’ presented here is only an initial exploration of the complex (φ,ψ)₂ landscape of proteins, it shows that there is value to be had from this approach, and it opens the door to more in-depth characterizations at the (φ,ψ)₂ level and at higher dimensions.
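Clustering two adjacent φ,ψ pairs means measuring distances between points in a four-dimensional space of angles, where 180° and -180° are the same conformation. The sketch below handles that periodicity with a wraparound angular difference, which is a substitute for the origin shifting mentioned in the abstract, not a reproduction of it. The function names and the neighbor-counting radius are hypothetical.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def motif_distance(m1, m2):
    """Euclidean distance between two (phi1, psi1, phi2, psi2) tuples,
    using the wraparound difference on each of the four angles."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(m1, m2)))

def count_neighbors(center, points, radius):
    """Number of points within `radius` degrees of `center` in 4D angle space."""
    return sum(1 for p in points if motif_distance(center, p) <= radius)
```

Counting neighbors within a fixed radius gives a simple local population density, the kind of quantity a density-based clustering would use to rank motifs by prominence.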

19.
20.

Background  

Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually widely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to greatly reduce the mining cost.
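The shape of a W-pattern (compact pattern blocks separated by large, variable gaps) can be illustrated by compiling one into a regular expression and counting how many sequences contain it. This sketch only matches a given pattern; it does not mine patterns and is not WildSpan. The function names, the use of `.` as the intra-block wildcard, and the explicit (min, max) gap ranges are assumptions made for the example.

```python
import re

def w_pattern_regex(blocks, gaps):
    """Compile pattern blocks separated by variable-length gaps.

    blocks -- block strings; '.' inside a block is a single-residue wildcard
    gaps   -- (min_gap, max_gap) tuples, one between each pair of blocks
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between successive blocks")

    def block_regex(block):
        return "".join("." if ch == "." else re.escape(ch) for ch in block)

    pieces = [block_regex(blocks[0])]
    for (min_gap, max_gap), block in zip(gaps, blocks[1:]):
        pieces.append(".{%d,%d}" % (min_gap, max_gap))
        pieces.append(block_regex(block))
    return re.compile("".join(pieces))

def support(pattern, sequences):
    """Number of sequences that contain at least one occurrence of the pattern."""
    return sum(1 for seq in sequences if pattern.search(seq))
```

A pattern such as blocks `["C.C", "H..H"]` with a gap of 5 to 30 residues keeps each block tight while allowing the two blocks to sit far apart in the sequence.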
