Similar articles
 Found 20 similar articles (search time: 865 ms)
1.
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. A structured motif may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box), and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes, that is, the motifs themselves, are unknown at the start of the algorithm; finding them is precisely what the algorithms are meant to do. A suffix tree is used to find such motifs. The algorithms are efficient enough to infer site consensus sequences, such as promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream of all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N²n, where n is the average length of the sequences and N their number. An application to the identification of promoter and regulatory consensus sequences in bacterial genomes is shown.
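
The structured-motif model in the abstract above (p boxes, one substitution limit per box, and p - 1 distance intervals) can be illustrated with a small Python sketch. This is not the suffix-tree algorithm from the paper: it is a naive scanner that only checks whether a given candidate motif occurs in one sequence. The names StructuredMotif, max_subs, gaps, and occurrences are hypothetical, the substitution rate is assumed to be a maximum mismatch count per box, and the distance is assumed to be measured from the end of one box to the start of the next.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StructuredMotif:
    """Hypothetical container for a candidate structured motif."""
    boxes: List[str]             # candidate contents of the p boxes
    max_subs: List[int]          # allowed substitutions, one per box
    gaps: List[Tuple[int, int]]  # (min, max) distance between successive boxes


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_at(motif: StructuredMotif, seq: str, start: int, box_index: int = 0) -> bool:
    """True if box `box_index` and all later boxes can be placed from `start`."""
    box = motif.boxes[box_index]
    end = start + len(box)
    if end > len(seq):
        return False
    if hamming(box, seq[start:end]) > motif.max_subs[box_index]:
        return False
    if box_index == len(motif.boxes) - 1:
        return True
    lo, hi = motif.gaps[box_index]
    # Try every allowed distance to the next box (assumed end-to-start).
    return any(occurs_at(motif, seq, end + d, box_index + 1)
               for d in range(lo, hi + 1))


def occurrences(motif: StructuredMotif, seq: str) -> List[int]:
    """Start positions at which the whole structured motif occurs in `seq`."""
    return [i for i in range(len(seq)) if occurs_at(motif, seq, i)]


# Two boxes, one substitution allowed in each, 16 to 18 bases apart.
motif = StructuredMotif(["TTGACA", "TATAAT"], [1, 1], [(16, 18)])
sequence = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(occurrences(motif, sequence))  # -> [2]
```

The scan costs time proportional to the sequence length times the number of allowed distances for every candidate tested, which is the kind of repeated work an index such as a suffix tree is meant to avoid.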

2.
3.
This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences.
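
The position frequency matrix models mentioned in the abstract above can be illustrated with a short Python sketch. It shows only the matrix model and a log-likelihood scan, not the evolutionary search or clustering described in the paper. The function names, the pseudocount of 1.0, and the example sites are assumptions made for illustration, and input is assumed to contain only A, C, G, and T.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def position_frequency_matrix(sites: List[str],
                              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column base frequencies estimated from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        # The pseudocount keeps every frequency above zero so logs are defined.
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def best_window(matrix: List[Dict[str, float]], seq: str) -> Tuple[int, float]:
    """Start and log-likelihood of the highest-scoring window in `seq`."""
    width = len(matrix)
    best_start, best_score = -1, float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(math.log(matrix[i][seq[start + i]]) for i in range(width))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical aligned sites used only to demonstrate the calls.
pfm = position_frequency_matrix(["TATAAT", "TATAAT", "TACAAT"])
start, score = best_window(pfm, "GGCGTATAATGCC")
print(start)  # -> 4
```

In a search such as the one the abstract describes, a matrix like this would be a candidate solution and a score of this kind could contribute to its fitness; the exact fitness function used by the authors is not given in the abstract.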

4.
MOTIVATION: The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is a key feature of modern motif finders. Addressing this problem is extremely important, not only because these motifs can accurately model biological phenomena but also because their extraction is highly dependent upon the appropriate selection of numerous search parameters. Currently available combinatorial algorithms have proved to be highly efficient in exhaustively enumerating motifs (including complex motifs) that fulfill certain extraction criteria. However, one major problem with these methods is the large number of parameters that need to be specified. RESULTS: We propose a new algorithm, MUSA (Motif finding using an UnSupervised Approach), that can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method relies on a biclustering algorithm that operates on a matrix of co-occurrences of small motifs. The performance of this method is independent of the composite structure of the motifs being sought, making few assumptions about their characteristics. The MUSA algorithm was applied to two datasets involving the bacterium Pseudomonas putida KT2440. The first one was composed of 70 sigma(54)-dependent promoter sequences and the second dataset included 54 promoter sequences of up-regulated genes in response to phenol, as suggested by quantitative proteomics. The results obtained indicate that this approach is very effective at identifying complex motifs of biological significance. AVAILABILITY: The MUSA algorithm is available upon request from the authors, and will be made available via a Web-based interface.

5.
In this paper we present a branch and bound algorithm for local gapless multiple sequence alignment (motif alignment) and its implementation. The algorithm uses both score-based bounding and a novel bounding technique based on the "consistency" of the alignment. A sequence order independent search tree is used in conjunction with a technique for avoiding redundant calculations inherent in the structure of the tree. This is the first program to exploit the fact that the motif alignment problem is easier for short motifs. Indeed, for a short fixed motif width, the running time of the algorithm is asymptotically linear in the size of the input. We tested the performance of the program on a dataset of 300 E. coli promoter sequences and a dataset of 85 lipocalin protein sequences. For a motif width of 4, the optimal alignment of the entire set of sequences can be found. For the more natural motif width of 6, the program can align 21 sequences of length 100, more than twice the number of sequences which can be aligned by the best previous exact algorithm. The algorithm can relax the constraint of requiring each sequence to be aligned, and align 105 of the 300 promoter sequences with a motif width of 6. For the lipocalin dataset, we introduce a technique for reducing the effective alphabet size with a minimal loss of useful information. With this technique, we show that the program can find meaningful motifs in a reasonable amount of time by optimizing the score over three motif positions.

6.
The -10 and -35 regions of E. coli promoter sequences are separated by a spacer region which has a consensus length of 17 base-pairs. This region is thought to contribute to promoter function by correctly positioning the two conserved regions. We have performed a statistical evaluation of 224 spacer sequences and found that spacers which deviate from the 17 base-pair consensus length have nonrandom sequences in their upstream ends. Spacer regions which are shorter than 17 base-pairs in length have a significantly higher than expected frequency of purine-purine and pyrimidine-pyrimidine homo-dinucleotides at the six upstream positions. Spacer regions which are longer than 17 base-pairs in length have a significantly higher than expected frequency of purine-pyrimidine and pyrimidine-purine hetero-dinucleotides at these positions. This suggests that the nature of the purine-pyrimidine sequence at the upstream end of spacer regions affects promoter function in a manner which is related to the spacer length. We examine the spacer sequences as a function of spacer length and discuss some possible explanations for the observed relationship between sequence and length.
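
The homo- and hetero-dinucleotide classes used in the abstract above can be made concrete with a brief Python sketch. It classifies each dinucleotide step by purine (A, G) or pyrimidine (C, T) content and tallies the steps within the six upstream positions of a spacer. The function names are hypothetical, counting overlapping steps is an assumption (the abstract does not state how positions were counted), and the statistical test against expected frequencies is not reproduced here.

```python
from typing import Dict

PURINES = set("AG")  # pyrimidines are C and T


def classify_dinucleotide(pair: str) -> str:
    """'homo' for purine-purine or pyrimidine-pyrimidine, else 'hetero'.

    Assumes the input contains only A, C, G, and T; any other character
    would be treated as a pyrimidine by this simple check.
    """
    a, b = pair.upper()
    return "homo" if (a in PURINES) == (b in PURINES) else "hetero"


def upstream_dinucleotide_counts(spacer: str, window: int = 6) -> Dict[str, int]:
    """Tally overlapping dinucleotide classes in the first `window` bases."""
    region = spacer[:window]
    counts = {"homo": 0, "hetero": 0}
    for i in range(len(region) - 1):
        counts[classify_dinucleotide(region[i:i + 2])] += 1
    return counts


# Hypothetical spacer sequences, not taken from the 224 studied.
print(upstream_dinucleotide_counts("AGGACTTTTTTTTTTTT"))  # -> {'homo': 4, 'hetero': 1}
print(upstream_dinucleotide_counts("ACGTACGGGGGGGGGGG"))  # -> {'homo': 0, 'hetero': 5}
```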

7.
8.
PO149, a new member of pollen pectate lyase-like gene family from alfalfa   (Total citations: 5; self-citations: 0; citations by others: 5)
PO149 is a low-copy-number gene expressed in the late stages of pollen development. The promoter region contains no similarities in DNA sequence to those of other pollen-specific genes, except for a tobacco sequence (AAATGA), which occurs four times in this alfalfa gene and much further upstream than in tobacco. Four distinct TATA boxes were detected in the promoter, with the distal and proximal TATA boxes being separated by a spacer of 269 nucleotides. Hairpin loop structures were found in the 5′- and 3′-untranslated regions of PO149 mRNA. The coding region of PO149 is interrupted by two introns and encodes a putative prepeptide of 450 amino acids with homology to pollen pectate lyase-like proteins and pollen allergens. The coding region also contains sequences characteristic of both a signal peptide and a nuclear localization signal.

9.
The resolution potential of internal transcribed spacer 2 (ITS2) at deeper levels remains controversial. In this study, 105 ITS2 sequences of 55 species in Calyptratae were analyzed to examine the phylogenetic utility of the spacer above the subfamily level and to further understand its evolutionary characteristics. We predicted the secondary structure of each sequence using the minimum-energy algorithm and constructed two data matrices for phylogenetic analysis. The ITS2 regions of Calyptratae display strong A-T bias and slight variation in length. The tandem and dispersed repeats embedded in the spacers possibly resulted from replication slippage or transposition. Most foldings conformed to the four-domain model. Sequence comparison in combination with the secondary structures revealed six conserved motifs. Covariation analysis from the conserved motifs indicated that the secondary structure restrains the sequence evolution of the spacer. The deep-level phylogeny derived from the ITS2 data largely agreed with the phylogenetic hypotheses from morphologic and other molecular evidence. Our analyses suggest that the accordant resolutions generated from different analyses can be used to infer deep-level phylogenetic relations.

10.
11.
12.
13.
14.
15.
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.

16.
An Expectation Maximization algorithm for identification of DNA binding sites is presented. The approach predicts the location of binding regions while allowing variable-length spacers within the sites. In addition to predicting the most likely spacer length for a set of DNA fragments, the method identifies individual sites that differ in spacer size. No alignment of DNA sequences is necessary. The method is illustrated by application to 231 Escherichia coli DNA fragments known to contain promoters with variable spacings between their consensus regions. Maximum-likelihood tests of the differences between the spacing classes indicate that the consensus regions of the spacing classes are not distinct. Further tests suggest that several positions within the spacing region may contribute to promoter specificity.

17.
We aligned published sequences for the U3 region of 35 type C mammalian retroviruses. The alignment reveals that certain sequence motifs within the U3 region are strikingly conserved. A number of these motifs correspond to previously identified sites. In particular, we found that the enhancer region of most of the viruses examined contains a binding site for leukemia virus factor b, a viral corelike element, the consensus motif for nuclear factor 1, and the glucocorticoid response element. Most viruses containing more than one copy of enhancer sequences include these binding sites in both copies of the repeat. We consider this set of binding sites to constitute a framework for the enhancers of this set of viruses. Other highly conserved motifs in the U3 region include the retrovirus inverted repeat sequence, a negative regulatory element, and the CCAAT and TATA boxes. In addition, we identified two novel motifs in the promoter region that were exceptionally highly conserved but have not been previously described.

18.
19.
20.
