Similar literature
Found 20 similar documents; search took 15 ms
1.
MOTIVATION: Pair-wise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are based on a statistical approach. We wished to investigate the structural properties of LCRs in biological sequences and develop an algorithm for filtering them. RESULTS: We present an algorithm for detecting and masking LCRs in protein sequences to improve the quality of database searches. We developed the algorithm based on the complexity analysis of subsequences delimited by a pair of identical, repeating subsequences. Given a protein sequence, the algorithm first computes the suffix tree of the sequence. It then collects repeating subsequences from the tree. Finally, the algorithm iteratively tests whether each subsequence delimited by a pair of repeating subsequences meets a given criterion. Test results with 1000 proteins from 20 families in Pfam show that the repeating subsequences are a good indicator of low-complexity regions, and that the algorithm based on such structural information competes strongly with others. AVAILABILITY: http://bioinfo.knu.ac.kr/research/CARD/ CONTACT: swshin@bioinfo.knu.ac.kr
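The three steps in the abstract above (find repeating subsequences, take the span delimited by each pair, test the span against a criterion) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: a plain k-mer index stands in for the suffix tree, Shannon entropy of the residue composition stands in for the unspecified criterion, and the names `mask_low_complexity`, `k`, `max_span`, and `max_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, k: int = 2, max_span: int = 30,
                        max_entropy: float = 1.5) -> str:
    """Mask with 'x' every span delimited by two copies of the same k-mer
    whose composition entropy is at most max_entropy.

    Assumption: the k-mer index replaces the suffix tree, and the entropy
    cutoff replaces the criterion that the abstract leaves unspecified.
    """
    # Step 1: collect repeating subsequences (here: repeated k-mers).
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    masked = list(seq)
    for starts in positions.values():
        # Step 2: consecutive occurrences of one k-mer delimit a span.
        for a, b in zip(starts, starts[1:]):
            end = b + k  # span runs from the first copy to the end of the second
            # Step 3: test the delimited span against the criterion.
            if end - a <= max_span and shannon_entropy(seq[a:end]) <= max_entropy:
                masked[a:end] = "x" * (end - a)
    return "".join(masked)
```

Calling `mask_low_complexity("MKTAYIAKQRQQQQQQQQQQLVDEHG")` masks the ten-residue glutamine run and leaves the flanking residues untouched. A real implementation would use a suffix tree to collect repeats of all lengths in linear time, which the k-mer index does not do.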

2.
3.
The evolutionary expansion of CAG repeats in human triplet expansion disease genes is intriguing because of their deleterious phenotype. In the past, this expansion has been suggested to reflect a broad genomewide expansion of repeats, which would imply that mutational and evolutionary processes acting on repeats differ between species. Here, we tested this hypothesis by analyzing repeat- and flanking-sequence evolution in 28 repeat-containing genes that had been sequenced in humans and mice and by considering overall lengths and distributions of CAG repeats in the two species. We found no evidence that these repeats were longer in humans than in mice. We also found no evidence for preferential accumulation of CAG repeats in the human genome relative to mice from an analysis of the lengths of repeats identified in sequence databases. We then investigated whether sequence properties, such as base and amino acid composition and base substitution rates, showed any relationship to repeat evolution. We found that repeat-containing genes were enriched in certain amino acids, presumably as the result of selection, but that this did not reflect underlying biases in base composition. We also found that regions near repeats showed higher nonsynonymous substitution rates than the remainder of the gene and lower nonsynonymous rates in genes that contained a repeat in both the human and the mouse. Higher rates of nonsynonymous mutation in the neighborhood of repeats presumably reflect weaker purifying selection acting in these regions of the proteins, while the very low rate of nonsynonymous mutation in proteins containing a CAG repeat in both species presumably reflects a high level of purifying selection. Based on these observations, we propose that the mutational processes giving rise to polyglutamine repeats in human and murine proteins do not differ. 
Instead, we propose that the evolution of polyglutamine repeats in proteins results from an interplay between mutational processes and selection.

4.
Low copy repeats (LCRs) are stretches of duplicated DNA that are more than 1 kb in size and share a sequence similarity that exceeds 90%. Non-allelic homologous recombination (NAHR) between highly similar LCRs has been implicated in numerous genomic disorders. This study aimed at defining the impact of LCRs on the generation of balanced and unbalanced chromosomal rearrangements in mentally retarded patients. A cohort of 22 patients, preselected for the presence of submicroscopic imbalances, was analysed using submegabase resolution tiling path array CGH and the results were compared with a set of 41 patients with balanced translocations and breakpoints that were mapped to the BAC level by FISH. Our data indicate an accumulation of LCRs at breakpoints of both balanced and unbalanced rearrangements. LCRs with high sequence similarity in both breakpoint regions, suggesting NAHR as the most likely cause of rearrangement, were observed in 6/22 patients with chromosomal imbalances, but not in any of the balanced translocation cases studied. In case of chromosomal imbalances, the likelihood of NAHR seems to be inversely related to the size of the aberration. Our data also suggest the presence of additional mechanisms coinciding with or dependent on the presence of LCRs that may induce an increased instability at these chromosomal sites.

5.
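Using bioinformatics methods, the structural proteins of the Bombyx mori densovirus China (Zhenjiang) isolate were compared with those of other types of Bombyx mori densovirus in terms of physicochemical properties, structure, and function. The results show that the structural proteins of Bombyx mori densovirus are stable, hydrophilic proteins. The structural proteins of BmDNV-ZJ and BmDNV-2 may have fairly similar properties, whereas the structural protein sequences of BmDNV-ZJ and BmDNV-1 differ considerably in physicochemical parameters, internal sequence repeats, and folded regions, indicating that the structural proteins of these two densoviruses differ substantially in character, structure, and function. In addition, the structural protein sequences of BmDNV-ZJ and BmDNV-1 contain three different LCR functional regions. Molecular evolutionary cluster analysis groups the densovirus structural proteins into four major classes.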

6.
Staphylocoagulase detection is the hallmark of a Staphylococcus aureus infection. Ten different serotypes of staphylocoagulases have been reported to date. We determined the nucleotide sequences of seven staphylocoagulase genes (coa) and their surrounding regions to compare structures of all 10 staphylocoagulase serotypes, and we inferred their derivations. We found that all staphylocoagulases are comprised of six regions: signal sequence, D1 region, D2 region, central region, repeat region, and C-terminal sequence. Amino acids at both ends, 33 amino acids in the N terminal (the signal sequences and the seven N-terminal amino acids in the D1 region) and 5 amino acids in the C terminal, were exactly identical among the 10 serotypes. The central regions were conserved with identities between 80.6 and 94.1% and similarities between 82.8 and 94.6%. Repeat regions comprising tandem repeats of 27 amino acids with a 92% identity on average were polymorphic in the number of repeats. On the other hand, D1 regions other than the seven N-terminal amino acids and D2 regions were less homologous, with diverged identities from 41.5 to 84.5% and 47.0 to 88.9%, respectively, and similarities from 53.5 to 88.7% and 56.8 to 91.9%, respectively, although the predicted prothrombin-binding sites were conserved among them. In contrast, flanking regions of coa were highly homologous, with nucleotide identities of more than 97.1%. Phylogenetic relations among coa did not correlate with those among the flanking regions or housekeeping genes used for multilocus sequence typing. These data indicate that coa could be transmitted to S. aureus, while the less homologous regions in coa presumed to be responsible for different antigenicities might have evolved independently.

7.
MOTIVATION: Low-complexity or cryptically simple sequences are widespread in protein sequences but their evolution and function are poorly understood. To date methods for the detection of low complexity in proteins have been directed towards the filtering of such regions prior to sequence homology searches but not to the analysis of the regions per se. However, many of these regions are encoded by non-repetitive DNA sequences and may therefore result from selection acting on protein structure and/or function. RESULTS: We have developed a new tool, based on the SIMPLE algorithm, that facilitates the quantification of the amount of simple sequence in proteins and determines the type of short motifs that show clustering above a certain threshold. By modifying the sensitivity of the program simple sequence content can be studied at various levels, from highly organised tandem structures to complex combinations of repeats. We compare the relative amount of simplicity in different functional groups of yeast proteins and determine the level of clustering of the different amino acids in these proteins. AVAILABILITY: The program is available on request or online at http://www.biochem.ucl.ac.uk/bsm/SIMPLE.

8.

Background  

Blocks of duplicated genomic DNA sequence longer than 1000 base pairs are known as low copy repeats (LCRs). Identified by their sequence similarity, LCRs are abundant in the human genome, and are interesting because they may represent recent adaptive events, or potential future adaptive opportunities within the human lineage. Sequence analysis tools are needed, however, to decide whether these interpretations are likely, whether a particular set of LCRs represents nearly neutral drift creating junk DNA, or whether the appearance of LCRs reflects assembly error. Here we investigate an LCR family containing the sulfotransferase (SULT) 1A genes involved in drug metabolism, cancer, hormone regulation, and neurotransmitter biology as a first step for defining the problems that those tools must manage.

9.
An abundant class of secreted salivary polypeptides is characterized by the presence of identical and contiguous repeats of amino acid sequences within the polypeptide chains, and includes the proline-rich proteins. We discovered a new family of contiguous repeat polypeptides (CRPs) that is related to the proline-rich proteins but contains little proline. Analysis of salivary mRNAs and liver DNA by molecular cloning, DNA sequence determinations, and Northern and Southern blot hybridization revealed several closely related CRP mRNAs and at least 10 CRP-related genes. We further analyzed two CRP mRNAs of 850 and 920 nucleotides and the gene encoding the larger CRP mRNA. The two mRNAs contain the same 69-base repeats in their coding regions and are identical in their 5'- and 3'-untranslated tracts. However, they differ in the number of contiguous repeats (four versus five) and a segment at the 3' end of the coding region which encodes closely related but unique COOH termini of the CRPs. These structural features suggest a recent gene conversion. The CRP gene analyzed is divided into three exons that encode (i) 5'-untranslated tract and signal sequence, (ii) secreted polypeptide, and (iii) 3'-untranslated tract, respectively. CRP mRNA contains two open reading frames. The longer open reading frame encodes a CRP precursor with a signal sequence of 17 amino acids, four to five contiguous repeats of 23 amino acids, and a variable COOH region that begins with two segments related to the contiguous repeats. Immunochemical analysis of salivary gland slices with antisera raised against peptides corresponding to two regions of the larger open reading frame revealed intense staining only of the serous cells of the submandibular glands. 35S-Labeled oligonucleotides complementary to CRP mRNA specifically hybridized to the same cells.

10.
Tandem mass spectrometry fragments a large number of molecules of the same peptide sequence into charged molecules of prefix and suffix peptide subsequences and then measures mass/charge ratios of these ions. The de novo peptide sequencing problem is to reconstruct the peptide sequence from given tandem mass spectral data of k ions. By implicitly transforming the spectral data into an NC-spectrum graph G = (V, E) where |V| = 2k + 2, we can solve this problem in O(|V||E|) time and O(|V|^2) space using dynamic programming. For an ideal noise-free spectrum with only b- and y-ions, we improve the algorithm to O(|V| + |E|) time and O(|V|) space. Our approach can be further used to discover a modified amino acid in O(|V||E|) time. The algorithms have been implemented and tested on experimental data.
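To make the node count 2k + 2 concrete, here is a minimal sketch of a spectrum graph under simplifying assumptions that go beyond the abstract: singly charged ions, integer (nominal) residue masses, a b-ion at prefix mass + 1 and a y-ion at suffix mass + 19. Each observed ion contributes two candidate prefix-mass nodes (one per interpretation), and two endpoint nodes are added. The function names are hypothetical, and the path search below deliberately ignores the constraint that only one interpretation of each ion may be used, which is the part the published dynamic programme addresses.

```python
# Nominal residue masses in daltons; I and Q are omitted because they
# share nominal masses with L and K respectively.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
MASS_TO_RESIDUE = {mass: residue for residue, mass in RESIDUE_MASS.items()}


def spectrum_graph_nodes(ion_masses, total_residue_mass):
    """Return sorted candidate prefix masses: two per ion plus two endpoints."""
    nodes = {0, total_residue_mass}
    for m in ion_masses:
        nodes.add(m - 1)                         # ion read as a b-ion
        nodes.add(total_residue_mass - m + 19)   # ion read as a y-ion
    return sorted(nodes)


def reconstruct_peptide(ion_masses, total_residue_mass):
    """Find one path from mass 0 to the total mass whose steps are residues.

    Nodes are visited in increasing mass order, which is a topological
    order because every edge adds a positive residue mass.
    """
    nodes = spectrum_graph_nodes(ion_masses, total_residue_mass)
    reached = {0: ""}  # node -> one peptide prefix that reaches it
    for v in nodes:
        if v not in reached:
            continue
        for w in nodes:
            residue = MASS_TO_RESIDUE.get(w - v)
            if residue is not None and w not in reached:
                reached[w] = reached[v] + residue
    return reached.get(total_residue_mass)
```

For the peptide VTSE (total residue mass 416), the ions 100 (b1), 235 (y2), and 288 (b3) give eight nodes, and `reconstruct_peptide([100, 235, 288], 416)` recovers the sequence. This pairwise scan is quadratic in the number of nodes; it is meant to show the graph construction, not to match the running times quoted in the abstract.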

11.
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.

12.
Characterizing enzyme sequences and identifying their active sites is a very important task. The current experimental methods are too expensive and labor intensive to handle the rapidly accumulating protein sequence and structure data. Thus accurate, high-throughput in silico methods for identifying catalytic residues and predicting enzyme function are much needed. In this paper, we propose a novel sequence-based catalytic domain prediction method that combines sequence clustering with an information-theoretic approach. The first step is to perform a sequence clustering analysis of enzyme sequences from the same functional category (those with the same EC label). The clustering analysis is used to handle the problem of widely varying sequence similarity levels in enzyme sequences. It constructs a sequence graph in which nodes are enzyme sequences and edges connect pairs of sequences with a certain degree of sequence similarity, and it uses graph properties, such as biconnected components and articulation points, to generate sequence segments common to the enzyme sequences. The amino acid subsequences in the common shared regions are then aligned, and an information-theoretic approach called the aggregated column related scoring scheme is applied to highlight potential active sites in enzyme sequences. The aggregated information content scoring scheme is shown to highlight active-site residues effectively. The proposed method of combining the clustering and the aggregated information content scoring methods was successful in highlighting known catalytic sites in enzymes of Escherichia coli K12, as judged against the Catalytic Site Atlas database.
Our method is shown to be not only accurate in predicting potential active sites in the enzyme sequences but also computationally efficient, since the clustering approach uses two graph properties that can be computed in time linear in the number of edges in the sequence graph, and the computation of mutual information does not require much time. We believe that the proposed method can be useful for identifying active sites of enzyme sequences from many genome projects.
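The linear-time graph properties mentioned above can be illustrated with a short sketch of articulation-point detection by depth-first search (the standard low-link technique). This is not the authors' code: the function name and the edge-list input format are assumptions, and the edges are taken to be pairs of sequence identifiers whose similarity already exceeds some threshold chosen elsewhere.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the nodes whose removal disconnects the sequence graph.

    edges: iterable of (u, v) pairs, assumed here to be pairs of sequence
    identifiers that passed a similarity threshold.
    """
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    discovery, low, cut_nodes = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        discovery[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in discovery:
                # Back edge: u can reach an ancestor without its tree edge.
                low[u] = min(low[u], discovery[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # No back edge from v's subtree climbs above u.
                if parent is not None and low[v] >= discovery[u]:
                    cut_nodes.add(u)
        # The DFS root is a cut node only if it has several subtrees.
        if parent is None and children > 1:
            cut_nodes.add(u)

    for node in list(graph):
        if node not in discovery:
            dfs(node, None)
    return cut_nodes
```

For two triangles of sequences that share a single member, that shared sequence is the only articulation point, which matches the intuition that it bridges two otherwise separate similarity clusters. The recursion visits each edge a constant number of times; very large graphs would need an iterative version to stay within Python's recursion limit.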

13.
On the complexity measures of genetic sequences   Total citations: 7 (self-citations: 0, citations by others: 7)
MOTIVATION: It is well known that the regulatory regions of genomes are highly repetitive. They are rich in direct, symmetric and complemented repeats, and there is no doubt about the functional significance of these repeats. Among known measures of complexity, the Ziv-Lempel complexity measure reflects most adequately repeats occurring in the text. But this measure does not take into account isomorphic repeats. By isomorphic repeats we mean fragments that are identical (or symmetric) modulo some permutation of the alphabet letters. RESULTS: In this paper, two complexity measures of symbolic sequences are proposed that generalize the Ziv-Lempel complexity measure by taking into account any isomorphic repeats in the text (rather than just direct repeats as in Ziv-Lempel). The first of them, the complexity vector, is designed for small alphabets such as the alphabet of nucleotides. The second is based on a search for the longest isomorphic fragment in the history of sequence synthesis and can be used for alphabets of arbitrary cardinality. These measures have been used for recognition of structural regularities in DNA sequences. Some interesting structures related to the regulatory region of the human growth hormone are reported.
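As a point of reference for the baseline measure named above, the following sketch counts the components of the classical Lempel-Ziv (1976) parsing, which accounts for direct repeats only. It does not implement the isomorphic generalizations that the abstract proposes; the function name is an assumption, and the substring search is a simple quadratic-or-worse scan chosen for clarity.

```python
def lempel_ziv_complexity(seq: str) -> int:
    """Number of components in the Lempel-Ziv (1976) parsing of seq.

    Each component is the shortest substring starting at the current
    position that cannot be copied from the text preceding its last symbol.
    """
    i, components = 0, 0
    n = len(seq)
    while i < n:
        length = 1
        # Grow the candidate while it still occurs earlier in the text
        # (overlap with the candidate itself is allowed, as in LZ76).
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components
```

For the binary string `0001101001000101` the parsing is 0 · 001 · 10 · 100 · 1000 · 101, so the function returns 6. A homopolymer such as `AAAAAAAA` yields 2, while `ACGT` yields 4, which shows how repeats lower the measure.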

14.
The amino acid sequences of chick and slime mould alpha-actinin each contain four repeats of approximately 122 residues. These repeats are homologous to the 18-22 repeats, each of approximately 106 residues, found in the alpha and beta subunits of spectrin and fodrin, and to the multiple repeats of approximately 110 residues found in the Duchenne muscular dystrophy protein (dystrophin). The repeats correspond to the elongated rod-like portion of these molecules. We present a multiple sequence alignment of 21 repeats from this superfamily (8 alpha-actinin and 13 spectrin/fodrin), based on optimal pairwise alignments, from which a characteristic consensus pattern of amino acid types is deduced. Trp 46 is invariant in all but one repeat, and physicochemical classes of amino acids are conserved at 25 other positions. Secondary structure prediction on both the alpha-actinin and spectrin repeats taken together with the distribution of proline residues in the sequences, strongly suggest that each repeated domain consists of a four-helix structure. Our predictions differ significantly from previous three-helix models based on analyses of fewer sequences. To determine possible interdomain regions, sites of limited proteolysis of the native chick alpha-actinin dimer were determined and located in the amino acid sequence. The majority of these sites were in corresponding positions in different repeats within a segment predicted as a long helix. We propose a model, consistent with the overall dimensions of the rod-like portions of the molecules, in which these long, probably interrupted helices, link adjacent domains.

15.
Protein interaction networks display approximate scale-free topology, in which hub proteins that interact with a large number of other proteins determine the overall organization of the network. In this study, we aim to determine whether hubs are distinguishable from other networked proteins by specific sequence features. Proteins of different connectednesses were compared in the interaction networks of Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens with respect to the distribution of predicted structural disorder, sequence repeats, low complexity regions, and chain length. Highly connected proteins ("hub proteins") contained significantly more of, and a greater proportion of, these sequence features and tended to be longer overall as compared to less connected proteins. These sequence features provide two different functional means for realizing multiple interactions: (1) extended interaction surface and (2) flexibility and adaptability, providing a mechanism for the same region to bind distinct partners. Our view contradicts the prevailing view that scaling in protein interactomes arose from gene duplication and preferential attachment of equivalent proteins. We propose an alternative evolutionary network specialization process, in which certain components of the protein interactome improved their fitness for binding by becoming longer or accruing regions of disorder and/or internal repeats and have therefore become specialized in network organization.

16.
17.
Structural predictions for the central domain of dystrophin   Total citations: 10 (self-citations: 0, citations by others: 10)
The amino acid sequence of dystrophin indicates that the molecule has globular N- and C-terminal domains separated by a long central rod domain. The central rod contains multiple repeats, about 100 amino acids long and of variable length. These diverge sufficiently in sequence that, in previous studies, only 14 of the most similar repeats have been aligned and analysed in any detail. We show here that a heptad pattern of hydrophobic residues is preserved across all repeats. Using the heptad pattern together with a consensus sequence template, we identified and aligned 25 repeats in the dystrophin rod sequence. Each repeat consists of a constant-length core helix of 54 residues, coupled via a short linker to a weakly conserved variable-length helix, and then via a second linker to the next core. The variable-length helix appears truncated in repeats 10 and 13 and extended in repeats 4 and 20. The extension of repeat 20 is particularly interesting since it corresponds to a hotspot of dystrophy-inducing mutations. Detailed modelling suggests that the classical Speicher-Marchesi [(1984) Nature 311, 177-180] model for spectrin may not be appropriate to dystrophin without some modification. We propose that whilst the repeating structural motif in dystrophin is probably a bead of triple coiled coil, this bead is twice as massive as, and out of phase with, those proposed for spectrin. Our model raises the possibility that the rod domain of dystrophin may confer elasticity on the molecule. Deletions which truncate this region would then reduce the extensibility of the molecule without affecting actin crosslinking, consistent with their typically producing the relatively benign Becker phenotype of muscular dystrophy.

18.
MOTIVATION: Repeats are ubiquitous in genomes and play important roles in evolution. Transposable elements are a common kind of repeat. Transposon insertions can be nested and make the task of identifying repeats difficult. RESULTS: We develop a novel iterative algorithm, called Greedier, to find repeats in a target genome given a repeat library. Greedier distinguishes itself from existing methods by taking into account the fragmentation of repeats. Each iteration consists of two passes. In the first pass, it identifies the local similarities between the repeat library and the target genome. Greedier then builds graphs from this comparison output. In each graph, a vertex denotes a similar subsequence pair. Edges denote pairs of subsequences that can be connected to form higher similarities. In the second pass, Greedier traverses these graphs greedily to find matches to individual repeat units in the repeat library. It computes a fitness value for each such match denoting the similarity of that match. Matches with fitness values greater than a cutoff are removed, and the rest of the genome is stitched together. The similarity cutoff is then gradually reduced, and the iteration is repeated until no hits are returned from the comparison. Our experiments on the Arabidopsis and rice genomes show that Greedier identifies approximately twice as many transposon bases as those found by cross_match and WindowMasker. Moreover, Greedier masks far fewer false positive bases than either cross_match or WindowMasker. In addition to masking repeats, Greedier also reports potential nested transposon structures.

19.
Genome architecture catalyzes nonrecurrent chromosomal rearrangements   Total citations: 18 (self-citations: 0, citations by others: 18)
To investigate the potential involvement of genome architecture in nonrecurrent chromosome rearrangements, we analyzed the breakpoints of eight translocations and 18 unusual-sized deletions involving human proximal 17p. Surprisingly, we found that many deletion breakpoints occurred in low-copy repeats (LCRs); 13 were associated with novel large LCR17p structures, and 2 mapped within an LCR sequence (middle SMS-REP) within the Smith-Magenis syndrome (SMS) common deletion. Three translocation breakpoints involving 17p11 were found to be located within the centromeric alpha-satellite sequence D17Z1, three within a pericentromeric segment, and one at the distal SMS-REP. Remarkably, our analysis reveals that LCRs constitute >23% of the analyzed genome sequence in proximal 17p, an experimental observation two- to fourfold higher than predictions based on virtual analysis of the genome. Our data demonstrate that higher-order genomic architecture involving LCRs plays a significant role not only in recurrent chromosome rearrangements but also in translocations and unusual-sized deletions involving 17p.

20.
Genome sequencing revealed an extreme AT-rich genome and a profusion of asparagine repeats associated with low complexity regions (LCRs) in proteins of the malarial parasite Plasmodium falciparum. Despite their abundance, the function of these LCRs remains unclear. Because they occur in almost all families of plasmodial proteins, the occurrence of LCRs cannot be associated with any specific metabolic pathway; yet their accumulation must have given selective advantages to the parasite. Translation of these asparagine-rich LCRs demands extraordinarily high amounts of asparaginylated tRNAAsn. However, unlike other organisms, Plasmodium codon bias is not correlated to tRNA gene copy number. Here, we studied tRNAAsn accumulation as well as the catalytic capacities of the asparaginyl-tRNA synthetase of the parasite in vitro. We observed that asparaginylation in this parasite can be considered standard, which is expected to limit the availability of asparaginylated tRNAAsn in the cell and, in turn, slow down the ribosomal translation rate when decoding asparagine repeats. This observation strengthens our earlier hypothesis considering that asparagine rich sequences act as “tRNA sponges” and help cotranslational folding of parasite proteins. However, it also raises many questions about the mechanistic aspects of the synthesis of asparagine repeats and about their implications in the global control of protein expression throughout Plasmodium life cycle.
