Similar literature
Found 20 similar articles; search took 31 ms.
1.
We introduce a novel approach for the detection of possible mutations leading to a reading frame (RF) shift in a gene. Deletions and insertions in DNA coding regions are significant events for genes because an RF shift modifies an extensive region of the amino acid sequence encoded by a gene. The suggested method is based on the phenomenon of triplet periodicity (TP) in coding regions of genes and its relative resistance to substitutions in the DNA sequence. We attempted to extend 326 933 regions of continuous TP found in genes from the KEGG databank by considering possible insertions and deletions. In total, we found 824 genes where such an extension was possible and statistically significant. We then generated amino acid sequences according to the active (KEGG's) and hypothetically ancient RFs in order to find confirmation of a shift at the protein level. As a result, 64 sequences have protein similarities only for the ancient RF, 176 only for the active RF, 3 for both, and 581 have no protein similarity at all. We aimed to establish a lower bound for the number of genes in which a shift between RF and TP is possible. Further ways to increase the number of revealed RF shifts are discussed.
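The triplet periodicity this abstract relies on can be illustrated with a small sketch. The code below builds a 3×4 matrix of base counts per codon position and scores it with the mutual information between position and base. The function names and the choice of mutual information as the statistic are assumptions made for illustration; the abstract does not state which statistic the authors used.

```python
import math
from collections import Counter

BASES = "ACGT"

def triplet_matrix(seq, phase=0):
    """Count bases at each of the three codon positions, reading from `phase`."""
    counts = [Counter() for _ in range(3)]
    for i, base in enumerate(seq[phase:]):
        if base in BASES:
            counts[i % 3][base] += 1
    return [[counts[pos][b] for b in BASES] for pos in range(3)]

def mutual_information(matrix):
    """Mutual information (bits) between codon position and base identity."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(3)) for c in range(4)]
    mi = 0.0
    for r in range(3):
        for c in range(4):
            n = matrix[r][c]
            if n:
                mi += (n / total) * math.log2(n * total / (row_sums[r] * col_sums[c]))
    return mi
```

A strongly periodic window gives a high score, while a window with no positional preference gives a score near zero; a real analysis would also need a significance test, which this sketch omits.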

2.
As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search of causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and non-sense mutations. Frameshifts and non-sense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
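The idea of scoring a variant by the change in alignment similarity before and after the variation can be sketched as follows. This is not the PROVEAN implementation: the simple match/mismatch/gap scoring, the default penalty values, and the function names are all assumptions chosen to keep the example short.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def delta_score(query, variant, homolog):
    """Change in similarity to `homolog` caused by turning `query` into `variant`."""
    return (global_alignment_score(variant, homolog)
            - global_alignment_score(query, homolog))

# Hypothetical sequences: a substitution and an in-frame deletion of one residue.
homolog = "MKTAYIAK"
print(delta_score("MKTAYIAK", "MKTAWIAK", homolog))  # substitution
print(delta_score("MKTAYIAK", "MKTAIAK", homolog))   # single-residue deletion
```

Because the score is defined on whole alignments rather than on a single column, the same function handles substitutions, insertions, and deletions, which is the property the abstract emphasizes. A negative value means the variant is less similar to the homolog than the original query.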

3.
We have isolated almost full-length cDNA clones corresponding to human erythrocyte membrane sialoglycoproteins alpha (glycophorin A) and delta (glycophorin B). The predicted amino acid sequence of delta differs at two amino acid residues from the sequence determined by peptide sequencing. The sialoglycoprotein delta clone we have isolated contains an interrupting sequence within the region that gives rise to the cleaved N-terminal leader sequence for the protein and represents a product that is unlikely to be inserted into the erythrocyte membrane. Comparison of the cDNA sequences of alpha and delta shows very strong homology at the DNA level within the coding regions. The two mRNA sequences are closely related and differ by a number of clearly defined insertions and deletions.

4.
5.
A mathematical method has been developed to search for latent periodicity in protein amino-acid and other symbolic sequences using dynamic programming and random matrices. The method allows the detection of latent periodicity with insertions and deletions at positions that are unknown beforehand. The method has been applied to search for periodicity in the amino-acid sequences of several proteins and in the euro/dollar exchange rate since 2001. The presence of a long period with insertions and deletions in amino-acid sequences is shown. A period length of seven amino acids is observed in proteins that contain supercoiled regions (a coiled-coil structure), as well as period lengths of six, five, or more amino acids. Period lengths of 6 and 7 days, as well as 24 and 25 h, are observed in the analyzed financial time series; note that this periodicity is detectable only when insertions and deletions are allowed. The causes that underlie the occurrence of latent periodicity with insertions and deletions in amino-acid sequences and financial time series are discussed.

6.
Typically, protein spatial structures are more conserved in evolution than amino acid sequences. However, the recent explosion of sequence and structure information accompanied by the development of powerful computational methods led to the accumulation of examples of homologous proteins with globally distinct structures. Significant sequence conservation, local structural resemblance, and functional similarity strongly indicate evolutionary relationships between these proteins despite pronounced structural differences at the fold level. Several mechanisms such as insertions/deletions/substitutions, circular permutations, and rearrangements in beta-sheet topologies account for the majority of detected structural irregularities. The existence of evolutionarily related proteins that possess different folds brings new challenges to the homology modeling techniques and the structure classification strategies and offers new opportunities for protein design in experimental studies.

7.
8.
Aspergillus niger produces several polygalacturonases that, with other enzymes, are involved in the degradation of pectin. One of the two previously characterized genes coding for the abundant polygalacturonases I and II (PGI and PGII) found in a commercial pectinase preparation was used as a probe to isolate five more genes by screening a genomic DNA library in phage lambda EMBL4 using conditions of moderate stringency. The products of these genes were detected in the culture medium of Aspergillus nidulans transformants on the basis of activity measurements and Western-blot analysis using a polyclonal antibody raised against PGI. These transformants were, with one exception, constructed using phage DNA. A. nidulans transformants secreted high amounts of PGI and PGII in comparison to the previously characterized A. niger transformants and a novel polygalacturonase (PGC) was produced at high levels by A. nidulans transformed with the subcloned pgaC gene. This gene was sequenced and the protein-coding region was found to be interrupted by three introns; the different intron/exon organization of the three sequenced A. niger polygalacturonase genes can be explained by the gain or loss of two single introns. The pgaC gene encodes a putative 383-amino-acid prepro-protein that is cleaved after a pair of basic amino acids and shows approximately 60% amino acid sequence similarity to the other polygalacturonases in the mature protein. The N-terminal amino acid sequences of the A. niger polygalacturonases display characteristic amino acid insertions or deletions that are also observed in polygalacturonases of phytopathogenic fungi. In the upstream regions of the A. niger polygalacturonase genes, a sequence of ten conserved nucleotides comprising a CCAAT sequence was found, which is likely to represent a binding site for a regulatory protein as it shows a high similarity to the yeast CYC1 upstream activation site recognized by the HAP2/3/4 activation complex.

9.
Latent amino acid repeats seem to be widespread in genetic sequences and to reflect their structure, function, and evolution. We have recently identified latent periodicity in more than 150 protein families including protein kinases and various nucleotide-binding proteins. The latent repeats in these families were correlated to their structure and evolution. However, a majority of known protein families were not identified with our latent periodicity search algorithm. The presumed main reason for this was the inability of our techniques to identify periodicities interspersed with insertions and deletions. We designed a new latent periodicity search algorithm, which is capable of taking into account insertions and deletions. As a result, we identified many novel cases of latent periodicity peculiar to protein families. Possible origins of the periodic structure of these families are discussed. In summary, we presume that latent periodicity is present in a substantial portion of known protein families. The latent periodicity matrices and the results of Swiss-Prot scans are available from http://bioinf.narod.ru/del/.

10.
It is becoming increasingly apparent from complete genome sequences that 16S rRNA data, as currently interpreted, does not provide an unambiguous picture of bacterial phylogeny. In contrast, we have found that analysis of insertions and deletions in the amino acid sequences of cytochrome c2 has some advantages in establishing relationships and that this approach may have broad utility in acquiring a better understanding of bacterial relationships. The amino acid sequences of cytochromes c2 and c556 have been determined in whole or in part from four strains of Rhodobacter sulfidophilus. The cytochrome c2 contains three- and eight-residue insertions as well as a single-residue deletion in common with the large cytochromes c2 but in contrast to the small cytochromes c2 and mitochondrial cytochromes. In addition, the Rb. sulfidophilus protein shares a rare six- to seven-residue insertion with other Rhodobacter cytochromes c2. The cytochrome c556 is a low-spin class II cytochrome c homologous to the greater family of cytochromes c', which are usually high-spin. The similarity of cytochrome c556 to other species of class II cytochromes is consistent with the relationships deduced from comparisons of cytochromes c2. Thus, our results do not support placement of Rb. sulfidophilus in a separate genus, Rhodovulum, which was proposed primarily on the basis of 16S rRNA sequences. Instead, the Rhodobacter cytochromes c2 are distinct from those of other genera and species of purple bacteria and show a different pattern of relationships among species than reported for 16S rRNA.

11.
Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT Indel prediction algorithm for 3n indels is available at http://sift-dna.org/
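The distinction between frameshifting and 3n indels drawn above reduces to a length check. The sketch below is only that check; the function name and the allele-string representation are assumptions, and it says nothing about SIFT Indel's actual prediction rules.

```python
def classify_coding_indel(ref_allele, alt_allele):
    """Label a coding-region indel by whether it preserves the reading frame."""
    length_change = abs(len(alt_allele) - len(ref_allele))
    if length_change == 0:
        return "not a length-changing variant"
    if length_change % 3 == 0:
        return "3n indel"
    return "frameshifting indel"

print(classify_coding_indel("A", "ACGT"))  # three inserted bases
print(classify_coding_indel("ACG", "A"))   # two deleted bases
```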

12.
Insertions and deletions are responsible for gaps in aligned nucleotide sequences, but they have usually been ignored when the number of nucleotide substitutions was estimated. We compared six sets of nuclear and mitochondrial noncoding DNA sequences of primates and obtained estimates of the evolutionary rate of insertion and deletion. The maximum-parsimony principle was applied to locate insertions and deletions on a given phylogenetic tree. Deletions were about twice as frequent as insertions for nuclear DNA, and single-nucleotide insertions and deletions were the most frequent of all events. The rate of insertion and deletion was found to be rather constant among branches of the phylogenetic tree, and the rate (approximately 2.0/kb/Myr) for mitochondrial DNA was found to be much higher than that (approximately 0.2/kb/Myr) for nuclear DNA. The rates of nucleotide substitution were about 10 times higher than the rate of insertion and deletion for both nuclear and mitochondrial DNA.

13.
Crossassociation is a computer method of comparing protein sequences. It can help detect amino acid matches, deletions, insertions, and other similarities which would be hard to detect by eye. The method is to slide the sequences past each other one step at a time and to count the number of amino acids that match. At each overlap position, the program prints the percentage match and statistical significance measures of the matching. The null hypothesis for significance is the random arrangement of amino acids in the proportions found in the sequences under study. For most protein pairs, the expected proportion of matches is about 1/14. The method includes computation of three overall similarity measures between sequences which should have use in both evolutionary and taxonomic studies. The use of the method has been tested with actual and hypothetical sequences. Problems of recovering evolutionary relationships by this and related methods are discussed.
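The sliding comparison described above can be sketched in a few lines. The function names, the tuple layout, and the way the null-model match proportion is computed (product of per-sequence residue frequencies) are assumptions for illustration; the original program's significance measures are not reproduced here.

```python
from collections import Counter

def cross_association(a, b):
    """Slide b past a; return (offset, overlap, matches, percent) per position.

    `offset` is the index in `a` that lines up with b[0]. Both sequences
    must be non-empty.
    """
    results = []
    for offset in range(-(len(b) - 1), len(a)):
        start_a = max(0, offset)
        start_b = max(0, -offset)
        overlap = min(len(a) - start_a, len(b) - start_b)
        matches = sum(a[start_a + k] == b[start_b + k] for k in range(overlap))
        results.append((offset, overlap, matches, 100.0 * matches / overlap))
    return results

def expected_match_proportion(a, b):
    """Chance that random residues from a and b match, given their compositions."""
    freq_a, freq_b = Counter(a), Counter(b)
    return sum(freq_a[r] / len(a) * freq_b[r] / len(b) for r in freq_a)
```

Offsets whose match percentage clearly exceeds the expected proportion are candidates for real similarity; two strong offsets that differ by a few positions hint at an insertion or deletion between the matching segments.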

14.
15.
The target junction sequences of six independent Tn5 insertions into a 36-bp tandemly repeated DNA segment have been determined. In all instances Tn5 preferentially inserts near one end of the tandem repeat, but in four out of six cases the insertion is between different nucleotides. The target sequence shares some similarity (8 out of 11 bp) with the ends of Tn5. All six insertions are accompanied by duplication of 9 bp of target DNA. The data imply that, even though Tn5 appears to insert randomly on a macro scale, at the nucleotide sequence level insertion into target DNA, which has limited similarity to the Tn5 end reactive sequences, may be a preferred event.

16.
A new measure of subalignment similarity is introduced. Specifically, similarity s(l, c) is defined as the logarithm to the base p of the probability of finding c or fewer mismatches in a subalignment of length l, where p is the probability of a match. Previous algorithms cannot use this measure to find locally optimal subalignments because, unlike Needleman-Wunsch and Sellers similarities, this measure is nonlinear. A new pattern recognition algorithm is described for finding all locally optimal subalignments of two nucleotide sequences. The DD algorithm can use s(l, c) or any other reasonable similarity function to assess the relative interest of subalignments. The DD algorithm searches only the diagonal graph, which lacks insertions and deletions. This search strategy greatly decreases the computation time and does not require an arbitrary choice of gap cost. The paths of the resulting DD graph usually draw attention to likely locations for insertions and deletions. A heuristic formula is derived for estimating significance levels for s(l, c) in the context of the lengths of the two aligned sequences. The DD algorithm has been used to find interesting subalignments between the nucleotide sequences for human and murine interleukin 2.
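The similarity measure defined above can be computed directly if positions are assumed to match independently, so that the number of mismatches follows a binomial distribution. That independence assumption, the default p = 0.25 (four equally likely bases), and the function name are choices made for this sketch and are not stated in the abstract.

```python
import math

def similarity(l, c, p=0.25):
    """log base p of P(at most c mismatches in l positions), match probability p."""
    q = 1.0 - p  # probability of a mismatch at one position
    prob = sum(math.comb(l, k) * q**k * p**(l - k) for k in range(c + 1))
    return math.log(prob) / math.log(p)

print(similarity(10, 0))  # perfect match of length 10
print(similarity(10, 2))  # same length with up to two mismatches
```

Under this model a perfect match of length l scores exactly l, and allowing every position to mismatch scores 0, so the measure reads as an "equivalent perfect-match length". It is not additive along a path, which is the nonlinearity the abstract refers to.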

17.
The information decomposition (ID) method has been used for searching dinucleotide periodicities, including latent ones, in plant genomes. In nucleotide sequences of genomes of various plants from the GenBank database, 14 766 sequences with a periodicity of two nucleotides have been found at a high level of statistical significance. Classification of the periodicity matrices of the detected DNA sequences has yielded 141 classes of dinucleotide periodicity. Since ID does not detect periodicities with nucleotide deletions or insertions, modified profile analysis (MPA) has been applied to the obtained classes to reveal DNA sequences with dinucleotide periodicities containing nucleotide deletions and insertions. Combined use of ID and MPA has permitted the detection of 80 396 DNA sequences with dinucleotide periodicities in the genomes of various plants. The biological role of dinucleotide periodicity in the detected sequences is discussed.

18.
Insertions and deletions of nucleotides in the genes encoding the variable domains of antibodies are natural components of the hypermutation process, which may expand the available repertoire of hypervariable loop lengths and conformations. Although insertion of amino acids has also been utilized in antibody engineering, little is known about the functional consequences of such modifications. To investigate this further, we have introduced single-codon insertions and deletions as well as more complex modifications in the complementarity-determining regions of human antibody fragments with different specificities. Our results demonstrate that single amino acid insertions and deletions are generally well tolerated and permit production of stably folded proteins, often with retained antigen recognition, despite the fact that the thus modified loops carry amino acids that are disallowed at key residue positions in canonical loops of the corresponding length or are of a length not associated with a known canonical structure. We have thus shown that single-codon insertions and deletions can efficiently be utilized to expand structure and sequence space of the antigen-binding site beyond what is encoded by the germline gene repertoire.

19.
The concept of the phase shift of triplet periodicity (TP) was used for searching potential DNA insertions in genes from 17 bacterial genomes. A mathematical algorithm for detection of these insertions has been developed. This approach can detect potential insertions and deletions with lengths that are not multiples of three bases, especially insertions of relatively large DNA fragments (>100 bases). A new similarity measure between triplet matrices was employed to improve the sensitivity for detecting the TP phase shift. Sequences of 17,220 bacterial genes, each consisting of more than 1,200 bases, were analyzed, and the presence of a TP phase shift has been shown in ~16% of analyzed genes (2,809 genes), which is about 4 times more than that detected in our previous work. We propose that shifts of the TP phase may indicate shifts of the reading frame in genes after insertions of DNA fragments with lengths that are not multiples of three bases. A relationship between the phase shifts of TP and the frame shifts in genes is discussed.
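A TP phase shift can be illustrated by comparing the codon-position base profile of one window with rotated versions of the profile from another window. The abstract does not specify its similarity measure, so the sum of absolute frequency differences used below, along with all function names, is an assumption made for this sketch.

```python
from collections import Counter

BASES = "ACGT"

def position_profile(window):
    """Base frequencies at each of the three codon positions of a window."""
    counts = [Counter() for _ in range(3)]
    for i, base in enumerate(window):
        counts[i % 3][base] += 1
    return [[counts[pos][b] / max(1, sum(counts[pos].values())) for b in BASES]
            for pos in range(3)]

def profile_distance(p, q, shift):
    """Sum of absolute differences after rotating q's positions by `shift`."""
    return sum(abs(p[pos][b] - q[(pos + shift) % 3][b])
               for pos in range(3) for b in range(4))

def best_phase_shift(left, right):
    """Rotation (0, 1, or 2) of the right window's profile that best fits the left."""
    p, q = position_profile(left), position_profile(right)
    return min(range(3), key=lambda s: profile_distance(p, q, s))
```

Both windows are assumed to start at gene coordinates that are multiples of three. A result of 0 means the two windows share the same phase; 1 or 2 suggests that an insertion or deletion whose length is not a multiple of three lies between them.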

20.
Aita T, Husimi Y, Nishigaki K. Bio Systems, 2011, 106(2-3): 67-75
To measure the similarity or dissimilarity between two given biological sequences, several papers proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, we count the appearance frequencies of all the K-tuple words throughout each of two given sequences. Then, the two given sequences are transformed into their respective word-composition vectors. Next, the distance metrics, for example the angle between the two vectors, are calculated. A significant issue is to determine the optimal word size K. With a mathematical model of mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on the mutational events. We also considered the optimal word size (=resolution) from our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
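The three steps listed above (count K-tuple words, form composition vectors, compute the angle) can be sketched as follows. The function names are assumptions, raw counts are used rather than normalized frequencies (the angle is the same either way), and choosing K is left to the caller.

```python
import math
from collections import Counter

def word_composition(seq, k):
    """Counts of every overlapping K-tuple word in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def angle_between(seq_a, seq_b, k):
    """Angle in radians between the word-composition vectors of two sequences."""
    va, vb = word_composition(seq_a, k), word_composition(seq_b, k)
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must be at least k characters long")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

Identical sequences give an angle near 0, and sequences that share no K-tuple words give π/2. Larger K makes the measure more sensitive to single mutations, which is the trade-off behind the question of optimal word size.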
