首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
MOTIVATION: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs. RESULTS: Tests on artificially generated sequences show that NMF can faithfully reproduce both positioning and content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting datasets. Specifically, an analysis of mRNA 3'-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements.  相似文献   

2.
3.
4.
Many methods have been described to predict the subcellular location of proteins from sequence information. However, most of these methods either rely on global sequence properties or use a set of known protein targeting motifs to predict protein localization. Here, we develop and test a novel method that identifies potential targeting motifs using a discriminative approach based on hidden Markov models (discriminative HMMs). These models search for motifs that are present in a compartment but absent in other, nearby, compartments by utilizing an hierarchical structure that mimics the protein sorting mechanism. We show that both discriminative motif finding and the hierarchical structure improve localization prediction on a benchmark data set of yeast proteins. The motifs identified can be mapped to known targeting motifs and they are more conserved than the average protein sequence. Using our motif-based predictions, we can identify potential annotation errors in public databases for the location of some of the proteins. A software implementation and the data set described in this paper are available from http://murphylab.web.cmu.edu/software/2009_TCBB_motif/.  相似文献   

5.
Short motifs are known to play diverse roles in proteins, such as in mediating the interactions with other molecules, binding to membranes, or conducting a specific biological function. Standard approaches currently employed to detect short motifs in proteins search for enrichment of amino acid motifs considering mostly the sequence information. Here, we presented a new approach to search for common motifs (protein signatures) which share both physicochemical and structural properties, looking simultaneously at different features. Our method takes as an input an amino acid sequence and translates it to a new alphabet that reflects its intrinsic structural and chemical properties. Using the MEME search algorithm, we identified the proteins signatures within subsets of protein which encompass common sequence and structural information. We demonstrated that we can detect enriched structural motifs, such as the amphipathic helix, from large datasets of linear sequences, as well as predicting common structural properties (such as disorder, surface accessibility, or secondary structures) of known functional‐motifs. Finally, we applied the method to the yeast protein interactome and identified novel putative interacting motifs. We propose that our approach can be applied for de novo protein function prediction given either sequence or structural information. Proteins 2013; © 2012 Wiley Periodicals, Inc.  相似文献   

6.
7.
8.
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or “words”. We first confirmed that the English language highly likely follows Zipf''s law, a special case of power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and “compressed” English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. A distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., “key words”) and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.  相似文献   

9.
10.
11.
12.
Rapid DNA mapping by fluorescent single molecule detection   总被引:2,自引:2,他引:0       下载免费PDF全文
DNA mapping is an important analytical tool in genomic sequencing, medical diagnostics and pathogen identification. Here we report an optical DNA mapping strategy based on direct imaging of individual DNA molecules and localization of multiple sequence motifs on the molecules. Individual genomic DNA molecules were labeled with fluorescent dyes at specific sequence motifs by the action of nicking endonuclease followed by the incorporation of dye terminators with DNA polymerase. The labeled DNA molecules were then stretched into linear form on a modified glass surface and imaged using total internal reflection fluorescence (TIRF) microscopy. By determining the positions of the fluorescent labels with respect to the DNA backbone, the distribution of the sequence motif recognized by the nicking endonuclease can be established with good accuracy, in a manner similar to reading a barcode. With this approach, we constructed a specific sequence motif map of lambda-DNA. We further demonstrated the capability of this approach to rapidly type a human adenovirus and several strains of human rhinovirus.  相似文献   

13.
Palindromic Units (PU or REP) were defined as DNA sequences of 40 nucleotides highly repeated on the genome of Escherichia coli and other Enterobacteriaceae. PU are found in clusters of up to six occurrences always localized in extragenic regions. By sorting the DNA sequences of the known PU containing regions into different classes, we show here for the first time that, besides the PU themselves, each PU clusters contains a number of other conserved sequence motifs. Seven such motifs were identified with the present list of PU regions. Remarkably, each PU cluster is exclusively composed of a mosaic combination of PU and of these other sequence motifs. We demonstrate directly by hybridization experiments that one of these motifs (called L) is indeed present at a large number of copies on the Escherichia coli chromosome and that its distribution follows the same species specificity as PU sequences themselves. We propose that the mosaic pattern of motif combination in PU clusters reveals a new type of bacterial genetic element which we propose to call BIME for Bacterial Interspersed Mosaic Element. The Escherichia coli genome contains about 500 BIME.  相似文献   

14.
15.
Assessing the exceptionality of network motifs.   总被引:1,自引:0,他引:1  
Getting and analyzing biological interaction networks is at the core of systems biology. To help understanding these complex networks, many recent works have suggested to focus on motifs which occur more frequently than expected in random. To identify such exceptional motifs in a given network, we propose a statistical and analytical method which does not require any simulation. For this, we first provide an analytical expression of the mean and variance of the count under any exchangeable random graph model. Then we approximate the motif count distribution by a compound Poisson distribution whose parameters are derived from the mean and variance of the count. Thanks to simulations, we show that the compound Poisson approximation outperforms the Gaussian approximation. The compound Poisson distribution can then be used to get an approximate p-value and to decide if an observed count is significantly high or not. Our methodology is applied on protein-protein interaction (PPI) networks, and statistical issues related to exceptional motif detection are discussed.  相似文献   

16.
Although in vitro selection technology is a versatile experimental tool for discovering novel synthetic RNA molecules, finding complex RNA molecules is difficult because most RNAs identified from random sequence pools are simple motifs, consistent with recent computational analysis of such sequence pools. Thus, enriching in vitro selection pools with complex structures could increase the probability of discovering novel RNAs. Here we develop an approach for engineering sequence pools that links RNA sequence space regions with corresponding structural distributions via a "mixing matrix" approach combined with a graph theory analysis. We define five classes of mixing matrices motivated by covariance mutations in RNA; these constructs define nucleotide transition rates and are applied to chosen starting sequences to yield specific nonrandom pools. We examine the coverage of sequence space as a function of the mixing matrix and starting sequence via clustering analysis. We show that, in contrast to random sequences, which are associated only with a local region of sequence space, our designed pools, including a structured pool for GTP aptamers, can target specific motifs. It follows that experimental synthesis of designed pools can benefit from using optimized starting sequences, mixing matrices, and pool fractions associated with each of our constructed pools as a guide. Automation of our approach could provide practical tools for pool design applications for in vitro selection of RNAs and related problems.  相似文献   

17.
18.
When investigators undertake searches of DNA databases, they normally discard large numbers of alignments that demonstrate very weak resemblances to each other, retaining only those that show statistically significant levels of resemblance. We show here that a great deal of information can be extracted from these weak alignments by examining them en masse. This is done by building three-dimensional similarity landscapes from the alignments, landscapes that reveal whether an unusual number of individually nonsignificant alignments tend to match up to a particular region of the query sequence being searched. The power of the search is increased by the use of libraries consisting entirely of introns or of exons. We show that (1) similarity landscapes with a variety of features can be generated from both intron and exon libraries, using introns or exons as query sequences; (2) the landscape features are real and not a statistical artifact; (3) well-known protein motifs used as query sequences can generate various landscape features; and (4) there is some evidence for resemblances between short regions of sequence carried by introns and exons. One possible interpretation of these results is that both introns and exons may have been built up during their evolution from short regions of sequence that as a result are now widely distributed throughout eukaryotic genomes. Such an interpretation would imply that these short regions have common ancestry. Alternatively, the wide sharing of short pieces of DNA may reflect regions with particular structural properties that have arisen through convergent evolution. The similarity-landscape approach can be used to detect such widespread structural motifs and sequence motifs in the genome that might be missed by less-global searches. It can also be used in conjunction with algorithms developed for detecting significant multiple alignments by isolating promising subsets of the databases that can be examined in more detail.Correspondence to: C. Wills  相似文献   

19.
Multipartite viruses contain more than one distinctive genome component, and the origin of multipartite viruses has been suggested to evolve from a non-segmented wild-type virus. To explore whether recombination also plays a role in the evolution of the genomes of multipartite viruses, we developed a systematic approach that employs motif-finding tools to detect conserved motifs from divergent genomic regions and applies statistical approaches to select high-confidence motifs. The information that this approach provides helps us understand the evolution of viruses. In this study, we compared our motif-based strategy with current alignment-based recombination-detecting methods and applied our methods to the analysis of multipartite single-stranded plant DNA viruses, including bipartite begomoviruses, Banana bunchy top virus (BBTV) (consisting of 6 genome components) and Faba bean necrotic yellows virus (FBNYV) (consisting of 8 genome components). Our analysis revealed that recombination occurred between genome components in some begomoviruses, BBTV and FBNYV. Our data also show that several unusual recombination events have contributed to the evolution of BBTV genome components. We believe that similar approaches can be applied to resolve the evolutionary history of other viruses.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号