Similar articles
1.

Background  

Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually widely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow the search space and increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols to clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to substantially reduce the mining cost.
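
The block-and-gap structure described above can be illustrated with a small sketch. This is not the WildSpan algorithm and it does no mining or pruning; it only checks whether an already known pattern of blocks separated by bounded gaps occurs in a set of sequences. The function names, the lowercase `x` wildcard inside blocks, and the `(min, max)` gap bounds are assumptions made for illustration, since the abstract does not say how the gap constraint is parameterized.

```python
import re

def w_pattern_to_regex(blocks, gaps):
    """Build a regex from pattern blocks and the gap ranges between them.

    blocks: list of strings; a lowercase 'x' is a single-residue wildcard
            (hypothetical notation for this sketch).
    gaps:   list of (min_len, max_len) tuples, one per pair of adjacent blocks.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between adjacent blocks")
    parts = [blocks[0].replace("x", ".")]
    for (low, high), block in zip(gaps, blocks[1:]):
        parts.append(".{%d,%d}" % (low, high))  # the large wildcard region
        parts.append(block.replace("x", "."))   # the next compact block
    return re.compile("".join(parts))

def support(pattern, sequences):
    """Count how many sequences contain at least one occurrence."""
    return sum(1 for seq in sequences if pattern.search(seq))

# Two short blocks separated by a gap of 5 to 10 residues.
pattern = w_pattern_to_regex(["CxxC", "HxH"], [(5, 10)])
toy_sequences = ["AACGGCAAAAAAHGHAA", "CGGCAHGH"]
print(pattern.pattern, support(pattern, toy_sequences))
```

Only the first toy sequence has the two blocks at an allowed distance, so the pattern is supported by one of the two sequences.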

2.
We determined the complete sequences of six size variants of the intergenic spacer (IGS) region from one individual of the malaria vector mosquito species, Anopheles sinensis. All six size variants observed in this study show almost the same basic primary structure, in which three repeat regions (A, B, and C) are interspersed with highly conserved nonrepeating sections. In contrast to the well-ordered subrepeating patterns found in A and C, the repeat region B displays extremely variable and complicated profiles in the number and arrangement of subrepeat units among different size classes. It is apparent that the prominent length differences in the repeat regions B and C are responsible for the intragenomic length variations of the IGS molecule observed in the present study. The high level of sequence homology and the regularly arranged repeating pattern of 11 to 14 bp motif sequences within the B repeat region suggest that these motif sequences may serve as a recombination site. Compared with those previously published for other mosquito species, the IGS of A. sinensis showed a unique structural format in its subrepeat patterns. This result suggests that the structure and sequence profiles of the IGS region could provide a convenient molecular marker for identifying morphologically complicated species complexes and for characterizing the genetic variation of populations. This suggestion is far from conclusive at present, but further genetic study should bring more compelling evidence on this issue.

3.
Microsatellite clustering may account for genetic maps that do not coalesce into the expected number of linkage groups. Microsatellite organization within the large genome of Pinus taeda (1C = 20,000 Mb) was determined by (1) testing whether repeat motifs were sequestered within the low-copy DNA kinetic component and (2) testing for repeat motif clusters within DNA fragments regardless of copy number. Within the low-copy kinetic component, either (AC)n or (AG)n repeat units were present in 32% of sequences. No repeat motifs were found in the total genome control. Clustered repeat motifs were frequent; the (ATG)n triplet repeat motif was located upstream from a CG-rich trinucleotide microsatellite in 26 out of 44 microsatellite sequences. Fourteen of the clustered (ATG)n sequences could be assembled into four microsatellite sequence families based on similarities in the flanking regions. Consistent with the DNA turnover model, family members shared similar flanking regions but differed in repeat motif composition and length.

4.
Liu F, Baggerman G, Schoofs L, Wets G. Peptides, 2006, 27(12): 3137-3153
Bioactive (neuro)peptides play critical roles in regulating most biological processes in animals. Peptides belonging to the same family are characterized by a typical sequence pattern that is conserved among the family's peptide members. Such a conserved pattern or motif usually corresponds to the functionally important part of the biologically active peptide. In this paper, all known bioactive (neuro)peptides annotated in the Swiss-Prot and TrEMBL protein databases are collected, and the pattern searching program Pratt is used to search these unaligned peptide sequences for conserved patterns. The obtained patterns are then refined by combining the information on amino acids at important functional sites collected from the literature. All the identified patterns are further tested by scanning them against the Swiss-Prot and TrEMBL protein databases. The diagnostic power of each pattern is validated by the fact that any annotated protein from Swiss-Prot and TrEMBL that contains one of the established patterns is indeed a known (neuro)peptide precursor. We discovered 155 novel peptide patterns in addition to the 56 established ones in the PROSITE database. All the patterns cover 110 peptide families. Fifty-five of these families are not characterized by the PROSITE signatures, and 12 are also not identified by other existing motif databases, such as Pfam and SMART. Using the newly identified peptide signatures as a search tool, we predicted 95 hypothetical proteins as putative peptide precursors.
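
Scanning a pattern against a protein database, as described above, comes down to matching a signature against each sequence. The sketch below assumes the patterns are written in PROSITE-style notation and converts a common subset of that notation (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) into a Python regular expression. The function name `prosite_to_regex` and the example patterns are hypothetical and are not taken from the paper.

```python
import re

# One pattern element: a core symbol plus an optional (n) or (n,m) repeat.
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (common subset only) to a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            element, suffix = element[:-1], "$"
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("cannot parse element: %r" % element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # '[ABC]' and single letters are already valid regex fragments.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

regex = prosite_to_regex("C-x(2)-[ST]-{P}-H")   # illustrative pattern only
print(regex)
print(bool(re.search(regex, "AACAASGHAA")), bool(re.search(regex, "AACAASPHAA")))
```

The second toy sequence fails to match because the residue before `H` is the excluded `P`.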

5.
Systematic and fully automated identification of protein sequence patterns. (Total citations: 4; self-citations: 0; citations by others: 4)
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. In addition, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
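
The comparison above is stated in terms of sensitivity and specificity. The abstract does not say exactly how these were computed, so the sketch below uses the standard definitions (true positive rate over family members, true negative rate over non-members) as an assumption. The function name, the regular expression, and the toy sequences are hypothetical.

```python
import re

def sensitivity_specificity(regex, family, non_family):
    """Score a pattern against known family members and non-members.

    sensitivity = TP / (TP + FN), measured over the family sequences
    specificity = TN / (TN + FP), measured over the non-family sequences
    """
    pattern = re.compile(regex)
    tp = sum(1 for seq in family if pattern.search(seq))
    fn = len(family) - tp
    fp = sum(1 for seq in non_family if pattern.search(seq))
    tn = len(non_family) - fp
    sensitivity = tp / (tp + fn) if family else float("nan")
    specificity = tn / (tn + fp) if non_family else float("nan")
    return sensitivity, specificity

# Toy data: the pattern hits 2 of 3 members and 1 of 4 non-members.
family = ["AAGDSGGAA", "TTGDSGGTT", "AAGASGGAA"]
non_family = ["AAAAAA", "CCGDSGGCC", "TTTTTT", "LLLLLL"]
sens, spec = sensitivity_specificity("GDSGG", family, non_family)
print(sens, spec)
```

Two patterns for the same family can then be compared pairwise on these two numbers, which is the kind of "better, identical, worse, or mixed" comparison the abstract reports.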

6.
Kaur T, Ong AH. Biochemical Genetics, 2011, 49(9-10): 562-575
This study describes the organization of the repetitive pattern in the mtDNA control region of Tomistoma schlegelii. Using newly designed primers, we detected length variations of approximately 50-100 bp among individuals, and only one individual showed a heteroplasmic band. Sequencing the region after CSB III revealed two main patterns: a repeat motif and a variable number tandem repeat (VNTR) pattern. The VNTR region, with a core unit of 104 bp, consisting of four motifs and a short AT chain, is implicated in the length variation seen among individuals of Tomistoma. A conserved motif seen in a family unit indicated that the repeat pattern was stably inherited from the maternal parent to all offspring. A combination of VNTR patterns specific to different crocodilians was seen in Tomistoma, and the overall secondary structure was shown to be similar to that in Crocodylus and Gavialis.

7.
8.
The sequences of four-alpha-helical bundle proteins are characterized by a pattern of hydrophilic and hydrophobic amino acids which is repeated every seven residues. At each position of the heptad repeat there are specific constraints on the amino acid properties which result from the topology of the tertiary motif. These constraints give rise to patterns of amino acid distribution which are distinct from those of other proteins. The distributions in each of the heptad positions have been determined by a statistical analysis of structural and sequence data derived from seven families of aligned protein sequences. The constitution of each position is dominated by a very small number of different amino acids, with the core positions consisting overwhelmingly of Leu and Ala. The positional preferences of the individual amino acids can be generally interpreted in terms of residue properties and topological constraints. The potential for four-alpha-helix bundle folding is reflected primarily in the pattern of residue occurrence in the heptad and not in the overall amino acid composition of the protein. Possible applications of this analysis in structure predictions, sequence alignments and in the rational design and engineering of four-alpha-helical bundle proteins are discussed.
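
The per-position statistics described above can be sketched as a simple tally of residues by heptad position. This is a minimal illustration and not the authors' analysis: it assumes each helix is supplied already in register (the first residue sits at position `a` unless `register_offset` says otherwise) and uses the conventional `a` to `g` labels. The function name and the toy sequence are hypothetical.

```python
from collections import Counter

HEPTAD = "abcdefg"  # conventional labels for the seven heptad positions

def heptad_composition(sequences, register_offset=0):
    """Tally amino acids at each heptad position.

    register_offset: index into 'abcdefg' of the first residue of every
    sequence (0 means the sequence starts at position 'a').
    """
    counts = {position: Counter() for position in HEPTAD}
    for seq in sequences:
        for i, residue in enumerate(seq):
            position = HEPTAD[(i + register_offset) % 7]
            counts[position][residue] += 1
    return counts

# Toy helix made of two identical heptads.
counts = heptad_composition(["LEELLKK" * 2])
for position in HEPTAD:
    print(position, counts[position].most_common(1))
```

With real aligned families, a position dominated by one or two residues (as the abstract reports for Leu and Ala at core positions) would show up as a very skewed counter.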

9.
Massively parallel sequencing (MPS) technology is capable of determining the sizes of short tandem repeat (STR) alleles as well as their individual nucleotide sequences. Thus, single nucleotide polymorphisms (SNPs) within the repeat regions of STRs and variations in the pattern of repeat units in a given repeat motif can be used to differentiate alleles of the same length. In this study, MPS was used to sequence 28 forensically relevant Y-chromosome STRs in a set of 41 DNA samples from the 3 major U.S. population groups (African Americans, Caucasians, and Hispanics). The resulting sequence data, which were analyzed with STRait Razor v2.0, revealed 37 unique allele sequence variants that have not been previously reported. Of these, 19 sequences were variations of documented sequences resulting from the presence of intra-repeat SNPs or alternative repeat unit patterns. Despite a limited sampling, two of the most frequently observed variants were found only in African American samples. The remaining 18 variants represented allele sequences for which there were no published data with which to compare. These findings illustrate the great potential of MPS with regard to increasing the resolving power of STR typing and emphasize the need for sample population characterization of STR alleles.

10.
Specific interactions of transmembrane helices play a pivotal role in the folding and oligomerization of integral membrane proteins. The helix-helix interfaces frequently depend on specific amino acid patterns. In this study, a heptad repeat pattern was randomized with all naturally occurring amino acids to uncover novel sequence motifs promoting transmembrane domain interactions. Self-interacting transmembrane domains were selected from the resulting combinatorial library by means of the ToxR/POSSYCCAT system. A comparison of the amino acid composition of high- and low-affinity sequences revealed that high-affinity transmembrane domains exhibit position-specific enrichment of histidine. Further, sequences containing His preferentially display Gly, Ser, and/or Thr residues at flanking positions and frequently contain a C-terminal GxxxG motif. Mutational analysis of selected sequences confirmed the importance of these residues in homotypic interaction. Probing heterotypic interaction indicated that His interacts in trans with hydroxylated residues. Reconstruction of minimal interaction motifs within the context of an oligo-Leu sequence confirmed that His is part of a hydrogen bonded cluster that is brought into register by the GxxxG motif. Notably, a similar motif contributes to self-interaction of the BNIP3 transmembrane domain.

11.
Myotonic dystrophy (DM), the most common form of muscular dystrophy in adults, can be caused by a mutation on either chromosome 19 (DM1) or 3 (DM2). In 2001, we demonstrated that DM2 is caused by a CCTG expansion in intron 1 of the zinc finger protein 9 (ZNF9) gene. To investigate the ancestral origins of the DM2 expansion, we compared haplotypes for 71 families with genetically confirmed DM2, using 19 short tandem repeat markers that we developed that flank the repeat tract. All of the families are white, with the majority of Northern European/German descent and a single family from Afghanistan. Several conserved haplotypes spanning >700 kb appear to converge into a single haplotype near the repeat tract. The common interval that is shared by all families with DM2 immediately flanks the repeat, extending up to 216 kb telomeric and 119 kb centromeric of the CCTG expansion. The DM2 repeat tract contains the complex repeat motif (TG)(n)(TCTG)(n)(CCTG)(n). The CCTG portion of the repeat tract is interrupted on normal alleles, but, as in other expansion disorders, these interruptions are lost on affected alleles. We examined haplotypes of 228 control chromosomes and identified a potential premutation allele with an uninterrupted (CCTG)(20) on a haplotype that was identical to the most common affected haplotype. Our data suggest that the predominant Northern European ancestry of families with DM2 resulted from a common founder and that the loss of interruptions within the CCTG portion of the repeat tract may predispose alleles to further expansion. To gain insight into the possible function of the repeat tract, we looked for evolutionary conservation. The complex repeat motif and flanking sequences within intron 1 are conserved among human, chimpanzee, gorilla, mouse, and rat, suggesting a conserved biological function.
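
The complex repeat motif (TG)(n)(TCTG)(n)(CCTG)(n) named above maps directly onto a regular expression. The sketch below counts the copies of each unit in the first occurrence of such a tract. It is only an illustration on synthetic strings, not the genotyping procedure used in the study, and the function name `repeat_unit_counts` is hypothetical.

```python
import re

# (TG)n (TCTG)n (CCTG)n, each unit present at least once, captured separately.
COMPLEX_REPEAT = re.compile(r"((?:TG)+)((?:TCTG)+)((?:CCTG)+)")

def repeat_unit_counts(seq):
    """Return (TG, TCTG, CCTG) copy numbers of the first tract, or None."""
    match = COMPLEX_REPEAT.search(seq.upper())
    if match is None:
        return None
    tg, tctg, cctg = match.groups()
    return len(tg) // 2, len(tctg) // 4, len(cctg) // 4

# Synthetic tract flanked by arbitrary bases.
toy = "AA" + "TG" * 3 + "TCTG" * 2 + "CCTG" * 4 + "AA"
print(repeat_unit_counts(toy))
```

Because the CCTG group only matches uninterrupted copies, an interrupted tract would be reported with the count up to the first interruption, which is one simple way to see the difference between interrupted and uninterrupted alleles.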

12.
13.
Subtle motifs: defining the limits of motif finding algorithms (Total citations: 4; self-citations: 0; citations by others: 4)
MOTIVATION: What constitutes a subtle motif? Intuitively, it is a motif that is almost indistinguishable, in the statistical sense, from random motifs. This question has important practical consequences: consider, for example, a biologist who is generating a sample of upstream regulatory sequences with the goal of finding a regulatory pattern that is shared by these sequences. If the sequences are too short, then one risks losing some of the regulatory patterns that are located further upstream. Conversely, if the sequences are too long, the motif becomes too subtle and one is then likely to encounter random motifs which are at least as significant statistically as the regulatory pattern itself. In practical terms, one would like to recognize the sequence length threshold, or the twilight zone, beyond which the motifs are in some sense too subtle. RESULTS: The paper defines the motif twilight zone, where every motif finding algorithm would be exposed to random motifs which are as significant as the one which is sought. We also propose an objective tool for evaluating the performance of subtle motif finding algorithms. Finally, we apply these tools to evaluate the success of our MULTIPROFILER algorithm in detecting subtle motifs.

14.
Protein motif extraction with neuro-fuzzy optimization (Total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: We attempt to improve the speed and flexibility of protein motif identification. The proposed algorithm is able to extract both rigid and flexible protein motifs. RESULTS: In this work, we present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences. This algorithm involves a statistical method to find short patterns with high frequency and then neural network training to optimize the final classification accuracies. Fuzzy logic is used to increase the flexibility of protein motifs. C2H2 Zinc Finger Protein and epidermal growth factor protein sequences are used to demonstrate the capability of the proposed algorithm in finding motifs. AVAILABILITY: This program is freely available for academic use by request.

15.
Tandem repeats occur frequently in biological sequences. They are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case in which an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. We construct a probabilistic generative model for the tandem repeats, where the sequence pattern is represented by a motif matrix. A Bayesian approach is adopted to compute this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution in order to infer both the motif matrix of tandem repeats and the location of repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic data and real data show that this new approach is powerful in detecting dispersed short tandem repeats. As far as we know, it is the first work to adopt RJMCMC algorithms in the detection of tandem repeats.

16.
17.
18.
Exact Tandem Repeats Analyzer 1.0 (E-TRA) combines sequence motif searches with keywords such as ‘organs’, ‘tissues’, ‘cell lines’ and ‘development stages’ for finding simple exact tandem repeats as well as non-simple repeats. E-TRA has several advanced repeat search parameters/options compared to other repeat finder programs, as it not only accepts GenBank, FASTA and expressed sequence tag (EST) sequence files, but also analyzes multiple files with multiple sequences. The minimum and maximum tandem repeat motif lengths that E-TRA finds vary from one to one thousand. Advanced user-defined parameters/options let researchers use different minimum motif repeat search criteria for varying motif lengths simultaneously. One of the most interesting features of genomes is the presence of relatively short tandem repeats (TRs). These repeated DNA sequences are found in both prokaryotes and eukaryotes, distributed almost at random throughout the genome. Some of the tandem repeats play important roles in the regulation of gene expression, whereas others do not have any known biological function as yet. Nevertheless, they have proven to be very beneficial in DNA profiling and genetic linkage analysis studies. To demonstrate the use of E-TRA, we used 5,465,605 human EST sequences derived from 18,814,550 GenBank EST sequences. Our results indicated that 12.44% (679,800) of the human EST sequences contained simple and non-simple repeat string patterns varying from one to 126 nucleotides in length. The results also revealed that human organs, tissues, cell lines and different developmental stages differed in number of repeats as well as repeat composition, indicating that the distribution of expressed tandem repeats among tissues or organs is not random, thus differing from the un-transcribed repeats found in genomes.
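
The idea of applying different minimum repeat counts to different motif lengths in one pass can be sketched with backreferences in a regular expression. This is not E-TRA's implementation or interface. The function name, the `min_repeats` dictionary, and the rule that skips motifs made of a shorter repeated unit are all assumptions for illustration, and each minimum is assumed to be at least 2.

```python
import re

def find_exact_tandem_repeats(seq, min_repeats):
    """Find exact tandem repeats in a DNA string.

    min_repeats: dict mapping motif length -> minimum number of copies
                 (each value assumed to be >= 2).
    Returns a list of (start, motif, copies) tuples.
    """
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n - 1 further exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are a shorter unit repeated (e.g. 'AA' is 'A' x 2).
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return hits

toy = "GGACACACACTTATGATGATGCC"
# Dinucleotide motifs need 4 copies, trinucleotide motifs need 3.
print(find_exact_tandem_repeats(toy, {2: 4, 3: 3}))
```

Using one threshold per motif length keeps short motifs, which repeat by chance more often, from flooding the output while still reporting longer motifs with fewer copies.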

19.
The recently identified Nimrod superfamily is characterized by the presence of a special type of EGF repeat, the NIM repeat, located right after a typical CCXGY/W amino acid motif. On the basis of structural features, nimrod genes can be divided into three types. The proteins encoded by Draper-type genes have an EMI domain at the N-terminal part and only one copy of the NIM motif, followed by a variable number of EGF-like repeats. The products of Nimrod B-type and Nimrod C-type genes (including the eater gene) have different kinds of N-terminal domains, and lack EGF-like repeats but contain a variable number of NIM repeats. Draper and Nimrod C-type (but not Nimrod B-type) proteins carry a transmembrane domain. Several members of the superfamily were claimed to function as receptors in phagocytosis and/or binding of bacteria, which indicates an important role in cellular immunity and the elimination of apoptotic cells. In this paper, the evolution of the Nimrod superfamily is studied with various methods on the level of genes and repeats. A hypothesis is presented in which the NIM repeat, along with the EMI domain, emerged by structural reorganizations at the end of an EGF-like repeat chain, suggesting a mechanism for the formation of novel types of repeats. The analyses revealed diverse evolutionary patterns in the sequences containing multiple NIM repeats. Although the repeats in the Nimrod B and Nimrod C proteins show characteristics of independent evolution, many internal NIM repeats in Eater sequences seem to have undergone concerted evolution. An analysis of the nimrod genes has been performed using phylogenetic and other methods, and an evolutionary scenario of the origin and diversification of the Nimrod superfamily is proposed. Our study presents an intriguing example of how the evolution of multigene families may contribute to the complexity of the innate immune response.

20.
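Microsatellite content and related characteristics of Eucalyptus EST sequences (Total citations: 6; self-citations: 0; citations by others: 6)
By analyzing 10,000 EST sequences from the genus Eucalyptus, a total of 1,775 microsatellite repeats were found on 1,499 of the sequences. EST sequences containing microsatellites accounted for about 15% of all sequences. In addition, the rate of length variation of the microsatellites in Eucalyptus EST sequences was negatively correlated with repeat unit length; microsatellite abundance was also negatively correlated with repeat unit length (except for trinucleotide repeats). In Eucalyptus EST sequences, microsatellites with trinucleotide repeat units were the most abundant. The over-enrichment of trinucleotide repeat microsatellites may be due to selection imposed by the genetic code. In terms of microsatellite abundance and length variation, Eucalyptus EST sequences and the annotated transcript sequences of the poplar (Populus trichocarpa) genome showed the same trends as repeat unit length changed, but the microsatellite frequency and the proportion of trinucleotide repeat microsatellites in Eucalyptus EST sequences were significantly lower, suggesting that genes containing microsatellites are very likely expressed at lower abundance than genes without microsatellites. Primers were designed for all the microsatellite loci found and tested by PCR; the results showed that the designed primers had a very high amplification success rate.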
