Similar articles
 20 similar articles found (search time: 31 ms)
1.
Cis-acting short sequence motifs play important roles in alternative splicing. It is now possible to identify such sequence motifs as conserved sequence patterns in genome sequence alignments. Here, we report the systematic search for motifs in the neighboring introns of alternatively spliced exons by using comparative analysis of mammalian genome alignments. We identified 11 conserved sequence motifs that might be involved in the regulation of alternative splicing. These motifs are not only significantly overrepresented near alternatively spliced exons, but they also co-occur with each other, thus forming a network of cis-elements that is likely to be the basis for context-dependent regulation. Based on this finding, we applied the motif co-occurrence to predict alternatively skipped exons. We verified exon skipping in 29 cases out of 118 predictions (25%) by EST and mRNA sequences in the databases. For the predictions not verified by the database sequences, we confirmed exon skipping in 10 additional cases by using both RT–PCR experiments and the publicly available RNA-Seq data. These results indicate that even more alternative splicing events will be found with the progress of large-scale and high-throughput analyses for various tissue samples and developmental stages.

2.
3.
H Jörnvall 《FEBS letters》1999,456(1):85-88
Motifer is a software tool able to find, directly in nucleotide databases, very distant homologues to an amino acid query sequence. It focuses searches on a specific amino acid pattern, scoring the matching and intervening residues as specified by the user. The program has been developed for searching databases of expressed sequence tags (ESTs), but it is also well suited to searching genomic sequences. The query sequence can be a variable pattern with alternative amino acids or gaps, and the sequences searched can contain introns or sequencing errors with accompanying frame shifts. Other features include options to generate a searchable output, set the maximal sequencing error frequency, limit searches to given species, or exclude already known matches. Motifer can find sequence homologues that other search algorithms would deem unrelated or would not find because of sequencing errors or too large a number of other homologues. The ability of Motifer to find relatives to a given sequence is exemplified by searches for members of the transforming growth factor-beta family and for proteins containing a WW-domain. The functions aimed at enhancing EST searches are illustrated by the 'in silico' cloning of a novel cytochrome P450 enzyme.

4.
Li Z  Zhang Y 《Nucleic acids research》2005,33(7):2118-2128
The large number of currently available group I intron sequences in the public databases provides an opportunity for studying this large family of structurally complex catalytic RNA by large-scale comparative sequence analysis. In this study, the detailed secondary structures of 211 group I introns in the IE subgroup were manually predicted. The secondary structure-favored alignments showed that IE introns contain 14 conserved stems. The P13 stem formed by long-range base-pairing between P2.1 and P9.1 is conserved among IE introns. Sequence variations in the conserved core divide IE introns into three distinct minor subgroups, namely IE1, IE2 and IE3. Co-variation of the peripheral structural motifs with core sequences supports the view that the peripheral elements assist the folding of the core structure. Interestingly, host-specific structural motifs were found in IE2 introns inserted at the S516 position. Competitive base-pairing is found to be conserved at the junctions of all long-range paired regions, suggesting a possible mechanism of establishing long-range base-pairing during large RNA folding. These findings extend our knowledge of IE introns, indicating that comparative analysis can be a very good complement for deepening our understanding of RNA structure and function in the genomic era.

5.
A tool for searching pattern and fingerprint databases is described. Fingerprints are groups of motifs excised from conserved regions of sequence alignments and used for iterative database scanning. The constituent motifs are thus encoded as small alignments in which sequence information is maximised with each database pass; they therefore differ from regular-expression patterns, in which alignments are reduced to single consensus sequences. Different database formats have evolved to store these disparate types of information, namely the PROSITE dictionary of patterns and the PRINTS fingerprint database, but programs have not been available with the flexibility to search them both. We have developed a facility to do this: the system allows query sequences to be scanned against either PROSITE, the full PRINTS database, or against individual fingerprints. The results of fingerprint searches are displayed simultaneously in both text and graphical windows to render them more tangible to the user. Where structural coordinates are available, identified motifs may be visualised in a 3D context. The program runs on Silicon Graphics machines using GL graphics libraries and on machines with X servers supporting the PEX extension: its use is illustrated here by depicting the location of low-density lipoprotein-binding (LDL) motifs and leucine-rich repeats in a mosaic G-protein-coupled receptor (GPCR).

6.
In this study, we collected and analyzed DNA sequence data for 789 previously mapped RFLP probes from Sorghum bicolor (L.) Moench. DNA sequences, comprising 894 non-redundant contigs and end sequences, were searched against three GenBank databases, nucleotide (nt), protein (nr) and EST (dbEST), using BLAST algorithms. Matching ESTs were also searched against nt and nr. Translated DNA sequences were then searched against the conserved domain database (CDD) to determine if functional domains/motifs were congruent with the proteins identified in previous searches. More than half (500/894 or 56%) of the query sequences had significant matches in at least one of the GenBank searches. Overall, proteins identified for 148 sequences (17%) were consistent among all searches, of which 66 sequences (7%) contained congruent coding domains. The RFLP probe sequences were also evaluated for the presence of simple sequence repeats (SSRs) and 60 SSRs were developed and assayed in an array of sorghum germplasm comprising inbreds, landraces and wild relatives. Overall, these SSR loci had lower levels of polymorphism (D = 0.46, averaged over 51 polymorphic loci) compared with sorghum SSRs that were isolated by library hybridization screens (D = 0.69, averaged over 38 polymorphic loci). This result was probably due to the relatively small proportion of di-nucleotide repeat-containing markers (42% of the total SSR loci) obtained from the DNA sequence data. These di-nucleotide markers also contained shorter repeat motifs than those isolated from genomic libraries. Based on BLAST results, 24 SSRs (40%) were located within, or near, previously annotated or hypothetical genes. We determined the location of 19 of these SSRs relative to putative coding regions. In general, SSRs located in coding regions were less polymorphic (D = 0.07, averaged over three loci) than those from gene flanking regions, UTRs and introns (D = 0.49, averaged over 16 loci). The sequence information and SSR loci generated through this study will be valuable for application to sorghum genetics and improvement, including gene discovery, marker-assisted selection, diversity and pedigree analyses, comparative mapping and evolutionary genetic studies.
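The abstract above screens probe sequences for simple sequence repeats but does not say how the screen was done. A minimal sketch of one common approach, a regular-expression scan for perfect tandem repeats, is shown below. The function name `find_ssrs` and the thresholds (unit length 2 to 4 bases, at least 5 copies) are illustrative assumptions, not the authors' criteria.

```python
import re

def find_ssrs(seq, min_repeats=5, min_unit=2, max_unit=4):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    Hypothetical helper: thresholds are illustrative, not taken from the study.
    """
    seq = seq.upper()
    # A unit of min_unit..max_unit bases (shortest tried first), followed by
    # at least (min_repeats - 1) further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        # Skip homopolymer runs such as "AAAAAAAAAA" reported as unit "AA".
        if len(set(unit)) == 1:
            continue
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

example = "GGC" + "AG" * 8 + "TTC" + "CTT" * 5 + "A"
print(find_ssrs(example))
```

On the toy sequence, the scan reports one di-nucleotide and one tri-nucleotide repeat, which is the kind of distinction the abstract draws when it attributes lower polymorphism to the small share of di-nucleotide markers.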

7.
MOTIVATION: Short linear peptide motifs mediate protein-protein interactions and cell compartment targeting, and represent sites of post-translational modification. The identification of functional motifs by conventional sequence searches, however, is hampered by the short length of the motifs, which results in a large number of hits of which only a small portion is functional. RESULTS: We have developed a procedure for the identification of functional motifs, which scores pattern conservation in homologous sequences by taking explicitly into account the sequence similarity to the query sequence. For a further improvement of this method, sequence filters have been optimized to mask those sequence regions containing few or no linear motifs. The performance of this approach was verified by measuring its ability to identify 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. Compared to a random selection procedure, the joint application of sequence filters and the novel scoring scheme resulted in a 9-fold enrichment of validated functional motifs on the first rank. In addition, only half as many hits need to be investigated to recover 75% of the functional instances in our dataset. Therefore, this motif-scoring approach should be helpful to guide experiments because it allows focusing on those short linear peptide motifs that have a high probability of being functional.

8.
A new method to analyze the similarity between multiply aligned protein motifs (blocks) was developed. It identifies sets of consistently aligned blocks. These are found to be protein regions of similar function and structure that appear in different contexts. For example, the Rossmann fold ligand-binding region is found to be similar to TIM barrel and methylase regions, various protein families are predicted to have a TIM-barrel fold, and the structural relation between the ClpP protease and crotonase folds is identified from their sequence. Besides identifying local structure features, sequence similarity across short sequence regions (fewer than 20 amino acids) also predicts structure similarity of whole domains (folds) a few hundred amino acid residues long. Most of these relations could not be identified by other advanced sequence-to-sequence or sequence-to-multiple-alignment comparisons. We describe the method (termed CYRCA), present examples of our findings, and discuss their implications.

9.
Genes composed of tandem repetitive sequence motifs are abundant in nature and are enriched in eukaryotes. To investigate repeat protein gene formation mechanisms, we have conducted a large-scale analysis of their introns and exons. We find that a wide variety of repeat motifs exhibit a striking conservation of intron position and phase, and are composed of exons that encode one or two complete repeats. These results suggest a simple model of repeat protein gene formation from local duplications. This model is corroborated by amino acid sequence similarity patterns among neighboring repeats from various repeat protein genes. The distribution of one- and two-repeat exons indicates that intron-facilitated repeat motif duplication, in which the start and end points of duplication are located in consecutive intronic regions, significantly exceeds intron-independent duplication. These results suggest that introns have contributed to the greater abundance of repeat protein genes in eukaryotic versus prokaryotic organisms, a conclusion that is supported by taxonomic analysis.

10.
11.
Many alternative splicing events are regulated by pentameric and hexameric intronic sequences that serve as binding sites for splicing regulatory factors. We hypothesized that intronic elements that regulate alternative splicing are under selective pressure for evolutionary conservation. Using a Wobble Aware Bulk Aligner genomic alignment of Caenorhabditis elegans and Caenorhabditis briggsae, we identified 147 alternatively spliced cassette exons that exhibit short regions of high nucleotide conservation in the introns flanking the alternative exon. In vivo experiments on the alternatively spliced let-2 gene confirm that these conserved regions can be important for alternative splicing regulation. Conserved intronic element sequences were collected into a dataset and the occurrence of each pentamer and hexamer motif was counted. We compared the frequency of pentamers and hexamers in the conserved intronic elements to a dataset of all C. elegans intron sequences in order to identify short intronic motifs that are more likely to be associated with alternative splicing. High-scoring motifs were examined for upstream or downstream preferences in introns surrounding alternative exons. Many of the high-scoring nematode pentamer and hexamer motifs correspond to known mammalian splicing regulatory sequences, such as (T)GCATG, indicating that the mechanism of alternative splicing regulation is well conserved in metazoans. A comparison of the analysis of the conserved intronic elements, and analysis of the entire introns flanking these same exons, reveals that focusing on intronic conservation can increase the sensitivity of detecting putative splicing regulatory motifs. This approach also identified novel sequences whose role in splicing is under investigation and has allowed us to take a step forward in defining a catalog of splicing regulatory elements for an organism. In vivo experiments confirm that one novel high-scoring sequence from our analysis, (T)CTATC, is important for alternative splicing regulation of the unc-52 gene.
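The counting step described in the abstract above (tallying every pentamer or hexamer in the conserved elements and comparing its frequency with a background of all introns) can be sketched as follows. This is an illustration only: the function names, the pseudocount, and the use of a simple frequency ratio as the score are assumptions, since the abstract does not specify the scoring statistic.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(conserved, background, k, pseudocount=1.0):
    """Rank k-mers by (frequency in conserved set) / (frequency in background).

    The pseudocount (an assumption) avoids division by zero for k-mers that
    never occur in the background.
    """
    fg = kmer_counts(conserved, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k  # number of possible DNA k-mers
    scores = {}
    for kmer, count in fg.items():
        f_fg = (count + pseudocount) / (fg_total + pseudocount * n_kmers)
        f_bg = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        scores[kmer] = f_fg / f_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data, not from the study.
conserved = ["TTGCATGTT", "AAGCATGAA"]
background = ["ACGTACGTACGTACGT", "TTTTAAAACCCCGGGG"]
print(enrichment(conserved, background, k=5)[0][0])
```

With these toy inputs the top-ranked pentamer is the one shared by both conserved elements. A real analysis would also need a significance test, which is left out here.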

12.

Background  

The functional annotation of proteins relies on published information concerning their close and remote homologues in sequence databases. Evidence for remote sequence similarity can be further strengthened by a similar biological background of the query sequence and identified database sequences. However, few tools exist so far that provide a means to include functional information in sequence database searches.

13.
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50 million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species.

14.
Basic local alignment search tool   Total citations: 1594 (self-citations: 0, citations by others: 1594)
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
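To make the maximal segment pair (MSP) score mentioned above concrete, the sketch below computes it exactly by brute force: it walks every diagonal of the two sequences and keeps the best-scoring ungapped run. This is not the BLAST heuristic, which approximates the same quantity far faster; the function name and the +5/-4 match/mismatch scores are illustrative assumptions.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact maximal segment pair between a and b (ungapped, equal length).

    Returns (score, start_in_a, start_in_b, length). Brute force over all
    diagonals, O(len(a) * len(b)); shown for clarity, not speed.
    """
    best = (0, 0, 0, 0)
    for d in range(-(len(b) - 1), len(a)):
        # Starting cell of diagonal d.
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        run_start = (i, j)
        while i < len(a) and j < len(b):
            if run <= 0:
                # A non-positive prefix never helps: restart the segment here.
                run = 0
                run_start = (i, j)
            run += match if a[i] == b[j] else mismatch
            if run > best[0]:
                best = (run, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best

score, start_a, start_b, length = msp_score("TTACGTACGGA", "CCACGTACGCC")
print(score, "TTACGTACGGA"[start_a:start_a + length])
```

The per-diagonal scan is the maximum-subarray idea applied to match/mismatch scores, which is why the segment is restarted whenever the running score drops to zero or below.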

15.
FELINES (Finding and Examining Lots of Intron 'N' Exon Sequences) is a utility written to automate construction and analysis of high quality intron and exon sequence databases produced from EST (expressed sequence tag) to genomic sequence alignments. We demonstrated the various programs of the FELINES utility by creating intron and exon sequence databases for the fungal organism Schizosaccharomyces pombe from alignments of EST to genomic sequences. In addition, we analyzed our constructed S. pombe sequence databases and the well-established Saccharomyces cerevisiae intron database from Manuel Ares' Laboratory for conserved sequence motifs. FELINES was shown to be useful for characterizing branch sites, polypyrimidine tracts and 5' and 3' splice sites in the intron databases and exonic splicing enhancers (ESEs) in S. pombe exons. FELINES is available at http://www.genome.ou.edu/informatics.html.

16.
Complete structure of the chicken alpha 2(VI) collagen gene   Total citations: 4 (self-citations: 0, citations by others: 4)
Type VI collagen is a hybrid molecule consisting of a short triple helix flanked by two large globular domains. These globular domains are composed of several homologous repeats which show a striking similarity to the collagen-binding motifs found in von Willebrand factor. The alpha 2(VI) subunit contains three of these homologous repeats termed D1, D2 and D3. We have isolated and characterized the entire gene for chicken alpha 2(VI) collagen. This gene, which is present as a single copy in the chicken genome, is 26 kbp long and comprises 28 exons. All exons can be classified in three groups. (a) The triple-helical domain is encoded by 19 short exons (27-90 bp) separated by introns of phase class 0. These exons are multiples of 9 bp and encode an integral number of collagenous Gly-Xaa-Yaa triplets. (b) The homologous repeats D1-D3 are encoded by one or two very long exons each (153-1578 bp). These exons are separated by introns of phase class 1. (c) The homologous repeats and the collagen sequence are linked to each other by three short adapter segments which are each encoded by a single exon (21-46 bp). The modular nature of the polypeptide is thus clearly reflected by the mosaic structure of its gene. The size of the exons and the phase class of the introns suggest that the alpha 2(VI) gene evolved by duplication and shuffling of two different primordial exons, one of 9 bp encoding a collagen Gly-Xaa-Yaa triplet and one of 600 bp encoding the precursor of the homologous repeats.

17.
Landscape similarity search involves finding landscapes from among a large collection that are similar to a query landscape. An example of such a collection is a large land cover map subdivided into a grid of smaller local landscapes; a query is a local landscape of interest, and the task is to find other local landscapes within the map which are perceptually similar to the query. Landscape search, and the related task of pattern-based regionalization, requires a measure of similarity – a function which quantifies the level of likeness between two landscapes. The standard approach is to use the Euclidean distance between vectors of landscape metrics derived from the two landscapes, but no in-depth analysis of this approach has been conducted. In this paper we investigate the performance of different implementations of the standard similarity measure. Five different implementations are tested against each other and against a control similarity measure based on histograms of class co-occurrence features and the Jensen–Shannon divergence. Testing consists of a series of numerical experiments combined with visual assessments on a set of 400 3 km-scale landscapes. Based on the cases where visual assessment provides a definitive answer, we have determined that the standard similarity measure is sensitive to the way landscape metrics are normalized and, additionally, to whether weights aimed at controlling the relative contribution of landscape composition vs. configuration are used. The standard measure achieves the best performance when metrics are normalized using their extreme values extracted from all possible landscapes, not just the landscapes in the given collection, and when weights are assigned so the combined influence of composition metrics on the similarity value equals the combined influence of configuration metrics. We have also determined that the control similarity measure outperforms all implementations of the standard measure.
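The two kinds of measure compared in the abstract above can be sketched in a few lines. The first function is the Jensen–Shannon divergence between two histograms (the basis of the control measure); the second is a Euclidean distance between metric vectors after min-max normalization with optional weights (the standard measure). Function and parameter names are assumptions, and the `lo`/`hi` bounds are supplied by the caller because the abstract reports that their choice matters.

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs (range 0..1) between two equal-length histograms."""
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u == 0 contribute nothing; v > 0 wherever u > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def normalized_euclidean(x, y, lo, hi, weights=None):
    """Weighted Euclidean distance after scaling each metric to [0, 1].

    lo/hi are the extreme values used for normalization (assumed hi > lo);
    weights could balance composition against configuration metrics.
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = 0.0
    for xv, yv, l, h, w in zip(x, y, lo, hi, weights):
        diff = (xv - l) / (h - l) - (yv - l) / (h - l)
        total += w * diff * diff
    return math.sqrt(total)

print(jensen_shannon_divergence([1, 0], [0, 1]))
print(normalized_euclidean([0, 50], [10, 100], lo=[0, 0], hi=[10, 100]))
```

Passing bounds taken from all possible landscapes rather than from the collection at hand only changes the `lo` and `hi` arguments, which is why the normalization choice is easy to vary in experiments of this kind.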

18.
Protein sequence similarity searches using patterns as seeds.   Total citations: 18 (self-citations: 1, citations by others: 17)
Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
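The first stage described in the abstract above, finding other instances of the input pattern in a protein database, can be illustrated with a regular-expression scan. The sketch assumes a PROSITE-style pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeat counts); the abstract does not state the syntax, and the function names and toy database are hypothetical. The alignment-extension stage is not shown.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'G-x(2)-[ST]-{P}-K' to a regex."""
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional (n) or (n,m) count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return re.compile("".join(parts))

def find_seeds(pattern, database):
    """Return {name: [(start, matched_text), ...]} for sequences with hits."""
    regex = prosite_to_regex(pattern)
    seeds = {}
    for name, seq in database.items():
        hits = [(m.start(), m.group(0)) for m in regex.finditer(seq)]
        if hits:
            seeds[name] = hits
    return seeds

toy_db = {"p1": "MAGAASAKLL", "p2": "MGAASPKLL", "p3": "MKKLL"}
print(find_seeds("G-x(2)-[ST]-{P}-K", toy_db))
```

Each reported position would serve as an anchor from which a local alignment to the query is built; `finditer` returns non-overlapping matches, so overlapping pattern instances would need a lookahead-based scan instead.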

19.
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.

20.