Similar literature
 Found 20 similar articles; search took 15 ms
1.
2.
P McCaldon  P Argos 《Proteins》1988,4(2):99-122
We have examined oligopeptides with lengths ranging from 2 to 11 residues in protein sequences that show no obvious evolutionary relationship. All sequences in the Protein Identification Resource database were carefully classified by sensitive homology searches into superfamilies to obtain unbiased oligopeptide counts. The results, contrary to previous studies, show clear prejudices in protein sequences. The oligopeptide preferences were used to help decide the significance of sequence homologies and to improve the more general methods for detecting protein coding regions within nucleotide sequences.

3.
Summary: The size distribution of 411 randomly selected mammalian exons was investigated. This distribution was found to be unimodal with a frequency maximum of 120 bp. Detailed analysis of the distribution demonstrated that larger exons (>150 bp) have a high goodness of fit to the size distribution of open reading frames (ORFs) in a random sequence, i.e., $(61/64)^t$, in which $t$ is the number of triplets. Based on this observation, the general character of the total exon size distribution suggested that this could be defined by a theoretical distribution by superimposing a sigmoid function on the ORF generating function, i.e., $(61/64)^t \times f_s(t) \times E$, in which $f_s(t)$ is a sigmoid function and $E$ is a constant. We tested this distribution for fitness to the exon distribution using two sigmoid functions: $f_s(t) = (t)$ and $f_s(t) = Be^{kt}/(1 + Be^{kt})$. In both cases a very high goodness of fit was attained. It is concluded that exons have been generated from ORFs in random sequences, that ORFs larger than 150 bp have been selected, irrespective of size, as exons, and that a lower size limit exists below which the probability of an ORF being selected as an exon is very low. These results provide evidence at the molecular level to support the ideas that (1) larger exons have been selected from random ORFs without primary correlation to structural or functional properties at the protein level, (2) there exists a restriction on smaller ORFs to be selected as exons, and (3) the interrupted coding sequences found in eukaryotes represent the ancient form of gene organization that existed prior to the divergence of prokaryotes and eukaryotes.
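The distribution described in this abstract can be sketched numerically. The following is a minimal illustration, not the authors' fitting procedure: the function names and the parameter values `B`, `k`, and `E` are assumptions chosen by hand for demonstration, and only the logistic form of the sigmoid is shown.

```python
import math


def orf_survival(t: int) -> float:
    """Probability that t random triplets contain no stop codon: (61/64)^t."""
    return (61 / 64) ** t


def logistic(t: float, B: float, k: float) -> float:
    """Logistic sigmoid B*e^(kt) / (1 + B*e^(kt))."""
    x = B * math.exp(k * t)
    return x / (1 + x)


def exon_density(t: int, B: float, k: float, E: float = 1.0) -> float:
    """Unnormalized exon size density: ORF survival damped by the sigmoid."""
    return E * orf_survival(t) * logistic(t, B, k)


# Hypothetical parameters, picked by hand (not fitted to any data).
B, k = 0.001, 0.2
peak_triplets = max(range(1, 400), key=lambda t: exon_density(t, B, k))
print(f"mode at {peak_triplets} triplets ({3 * peak_triplets} bp)")
```

The sigmoid suppresses short ORFs while the geometric term suppresses long ones, which is what makes the product unimodal.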

4.
An automated algorithm is presented that delineates protein sequence fragments which display similarity. The method incorporates a selection of a number of local nonoverlapping sequence alignments with the highest similarity scores and a graph-theoretical approach to elucidate the consistent start and end points of the fragments comprising one or more ensembles of related subsequences. The procedure allows the simultaneous identification of different types of repeats within one sequence. A multiple alignment of the resulting fragments is performed and a consensus sequence derived from the ensemble(s). Finally, a profile is constructed from the multiple alignment to detect possible and more distant members within the sequence. The method tolerates mutations in the repeats as well as insertions and deletions. The sequence spans between the various repeats or repeat clusters may be of different lengths. The technique has been applied to a number of proteins where the repeating fragments have been derived from information additional to the protein sequences. © 1993 Wiley-Liss, Inc.

5.
Alignment of protein sequences by their profiles   Total citations: 7, self-citations: 0, citations by others: 7
The accuracy of an alignment between two protein sequences can be improved by including other detectably related sequences in the comparison. We optimize and benchmark such an approach that relies on aligning two multiple sequence alignments, each one including one of the two protein sequences. Thirteen different protocols for creating and comparing profiles corresponding to the multiple sequence alignments are implemented in the SALIGN command of MODELLER. A test set of 200 pairwise, structure-based alignments with sequence identities below 40% is used to benchmark the 13 protocols as well as a number of previously described sequence alignment methods, including heuristic pairwise sequence alignment by BLAST, pairwise sequence alignment by global dynamic programming with an affine gap penalty function by the ALIGN command of MODELLER, sequence-profile alignment by PSI-BLAST, Hidden Markov Model methods implemented in SAM and LOBSTER, pairwise sequence alignment relying on predicted local structure by SEA, and multiple sequence alignment by CLUSTALW and COMPASS. The alignment accuracies of the best new protocols were significantly better than those of the other tested methods. For example, the fraction of the correctly aligned residues relative to the structure-based alignment by the best protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 48%, 50%, 49%, 43%, and 43% for the other methods, respectively. The new method is currently applied to large-scale comparative protein structure modeling of all known sequences.

6.
Low-complexity sequences are extremely abundant in eukaryotic proteins for reasons that remain unclear. One hypothesis is that they contribute to the formation of novel coding sequences, facilitating the generation of novel protein functions. Here, we test this hypothesis by examining the content of low-complexity sequences in proteins of different age. We show that recently emerged proteins contain more low-complexity sequences than older proteins and that these sequences often form functional domains. These data are consistent with the idea that low-complexity sequences may play a key role in the emergence of novel genes.

7.
Signature sequences are contiguous patterns of amino acids 10-50 residues long that are associated with a particular structure or function in proteins. These may be of three types (by our nomenclature): superfamily signatures, remnant homologies, and motifs. We have performed a systematic search through a database of protein sequences to automatically and preferentially find remnant homologies and motifs. This was accomplished in three steps: 1. We generated a nonredundant sequence database. 2. We used BLAST3 (Altschul and Lipman, Proc. Natl. Acad. Sci. U.S.A. 87:5509-5513, 1990) to generate local pairwise and triplet sequence alignments for every protein in the database vs. every other. 3. We selected "interesting" alignments and grouped them into clusters. We find that most of the clusters contain segments from proteins which share a common structure or function. Many of them correspond to signatures previously noted in the literature. We discuss three previously recognized motifs in detail (FAD/NAD-binding, ATP/GTP-binding, and cytochrome b5-like domains) to demonstrate how the alignments generated by our procedure are consistent with previous work and make structural and functional sense. We also discuss two signatures (for N-acetyltransferases and glycerol-phosphate binding) which to our knowledge have not been previously recognized.

8.
Fliess A  Motro B  Unger R 《Proteins》2002,48(2):377-387
An important question in protein evolution is to what extent proteins may have undergone swaps (switches of domain or fragment order) during evolution. Such events might have occurred in several forms: swaps of short fragments, swaps of structural and functional motifs, or recombination of domains in multidomain proteins. This question is important for the theoretical understanding of the evolution of proteins, and has practical implications for using swaps as a design tool in protein engineering. In order to analyze the question systematically, we conducted a large scale survey of possible swaps and permutations among all pairs of proteins from the Swissprot database. A swap is defined as a specific kind of sequence mutation between two proteins in which two fragments that appear in both sequences have different relative order in the two sequences. For example, aXbYc and dYeXf are defined as a swap, where X and Y represent sequence fragments that switched their order. Identifying such swaps is difficult using standard sequence comparison packages. One of the main problems in the analysis stems from the fact that many sequences contain repeats, which may be identified as false-positive swaps. We have used two different approaches to detect pairs of proteins with swaps. The first approach is based on the predefined list of domains in Pfam. We identified all the proteins that share at least two domains and analyzed their relative order, looking for pairs in which the order of these domains was switched. We designed an algorithm to distinguish between real swaps and duplications. In the second approach, we used Blast to detect pairs of proteins that share several fragments. Then, we used an automatic procedure to select pairs that are likely to contain swaps. Those pairs were analyzed visually, using a graphical tool, to eliminate duplications.
Combining these approaches, about 140 different cases of swaps in the Swissprot database were found (after eliminating multiple pairs within the same family). Some of the cases have been described in the literature, but many are novel examples. Although each new example identified may be interesting to analyze, our main conclusion is that cases of swaps are rare in protein evolution. This observation is at odds with the common view that proteins are very modular to the point that modules (e.g., domains) can be shuffled between proteins with minimal constraints. Our study suggests that sequential constraints, i.e., the relative order between domains, are highly conserved.
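The swap definition in this abstract (aXbYc versus dYeXf) can be illustrated with a short sketch. This is not the authors' algorithm: the function name `find_swaps` is hypothetical, each protein is assumed to be given as an ordered list of domain or fragment labels, and repeats are handled by the crude assumption that any label occurring more than once in a protein is ignored, so that duplications are not reported as swaps.

```python
from itertools import combinations


def find_swaps(arch_a, arch_b):
    """Return pairs (X, Y) with X before Y in arch_a but Y before X in arch_b.

    Labels that occur more than once in either protein are skipped
    (assumption: a simple guard against false positives from repeats).
    """
    def unique_positions(arch):
        return {d: i for i, d in enumerate(arch) if arch.count(d) == 1}

    pos_a = unique_positions(arch_a)
    pos_b = unique_positions(arch_b)
    # Shared labels, listed in the order they appear in the first protein.
    shared = sorted(set(pos_a) & set(pos_b), key=pos_a.get)
    return [(x, y) for x, y in combinations(shared, 2) if pos_b[x] > pos_b[y]]


# The example from the abstract: X and Y have switched their order.
print(find_swaps(list("aXbYc"), list("dYeXf")))
```

A real pipeline would start from Pfam domain assignments or Blast hits rather than single-letter labels, and would need a more careful treatment of duplications than simply discarding repeated labels.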

9.
10.
ABSTRACT: Co-evolving positions within protein sequences have been used as spatial constraints to develop a computational approach for modeling membrane protein structures.

11.
High divergence in protein sequences makes the detection of distant protein relationships through homology-based approaches challenging. Grouping protein sequences into families, through similarities in either sequence or 3-D structure, facilitates improved recognition of protein relationships. In addition, strategically designed protein-like sequences have been shown to bridge distant structural domain families by serving as artificial linkers. In this study, we have augmented a search database of known protein domain families with such designed sequences, with the intention of providing functional clues to domain families of unknown structure. When assessed using representative query sequences from each family, we obtain a success rate of 94% in protein domain families of known structure. Further, we demonstrate that the augmented search space enabled fold recognition for 582 families with no structural information available a priori. Additionally, we were able to provide reliable functional relationships for 610 orphan families. We discuss the application of our method in predicting functional roles through select examples for DUF4922, DUF5131, and DUF5085. Our approach also detects new associations between families that were previously not known to be related, as demonstrated through new sub-groups of the RNA polymerase domain among three distinct RNA viruses. Taken together, designed sequences-augmented search databases direct the detection of meaningful relationships between distant protein families. In turn, they enable fold recognition and offer reliable pointers to potential functional sites that may be probed further through direct mutagenesis studies.

12.
Summary: We have sequenced cDNA clones representing each of the three distinct groups of storage proteins of the cotton seed. Characteristics of their mRNAs and derived proteins are given. Dot matrix analysis of the nucleotide and amino acid sequences shows that 2 of these groups of proteins have a great deal of vestigial homology at low stringency and should be considered subfamilies of a single storage protein gene family. The remaining group is quite distinct and should be considered a separate multigene family. It also can be divided into 2 subfamilies based on the presence or absence of glycosyl residues and other sequence differences. These proteins are processed to smaller species during embryogenesis, and all of the mature storage proteins of cotton can be traced back to these 2 gene families. In view of these relationships we propose that these 2 families be called the and globulins of cotton storage proteins, each comprised of an A and B subfamily.

13.
How to characterize short protein sequences to make an effective connection to their functions is an unsolved problem. Here we propose to map the physicochemical properties of each amino acid onto unit spheres so that each protein sequence can be represented quantitatively. We demonstrate the usefulness of this representation by applying it to the prediction of cell penetrating peptides. We show that its combination with traditional composition features yields the best performance across different datasets, among several methods compared. For the convenience of users, a web server has been established for automatic calculations of the proposed features at http://biophy.dzu.edu.cn/SNumD/ .

14.
The overall function of a multi‐domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment‐based methods commonly utilize domain‐level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of other domains in the protein, or of domain‐linker regions, when classifying multi‐domain proteins. An alignment‐free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi‐domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi‐domain protein sequences. In this article, CLAP‐based classification has been explored on 5 datasets of multi‐domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP‐based classification scheme resulted in a clustering similar to that obtained from an alignment‐based method. CLAP‐based clusters obtained for full‐length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi‐domain proteins could be classified effectively by considering full‐length sequences without a requirement of identification of domains in the sequence.

15.
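A six-dimensional representation of protein sequences is presented. The representation admits three different forms, which are used to construct the information entropy of distance matrices; the similarity between sequences is then compared using the Euclidean distance and the angle between the information-entropy vectors.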

16.
The aim of this paper is to give measurements indicative of evolutionary stages of the species. Two types of statistics of trinucleotides in coding regions are analysed for 27 species. The first one is the codon space, the nucleotide ratio for each of the three codon positions. We apply principal component analysis on this space and extract two principal components faithfully describing the original distribution of the codon space. The first principal component corresponds to the GC content. The second principal component classifies the species into three evolutionary groups, Archaea, Bacteria and Eukaryota. The second statistic is the real and theoretical frequency of amino acids. The real frequency of an amino acid in a coding sequence is its frequency in the translated protein. The theoretical frequency is the expected frequency calculated from the ratio of nucleotides. We introduce the discrepancy between these two frequencies as an index of non-randomness of nucleotides in the sequence. This index of non-randomness divides the species into two groups: eukaryotes having smaller non-randomness (i.e. being more random) and prokaryotes having higher non-randomness.
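The real versus theoretical amino acid frequencies described in this abstract can be sketched as follows. Several choices here are assumptions rather than the paper's definitions: the theoretical frequency uses the overall nucleotide composition of the sequence (not position-specific ratios), stop codons are excluded and the remainder renormalized, and the discrepancy is taken as a sum of absolute differences. The function names are hypothetical; the codon table is the standard genetic code, and the input is assumed to be uppercase DNA.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def observed_freq(cds):
    """Amino acid frequencies in the translation of an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    aas = [CODON_TABLE[c] for c in codons if CODON_TABLE.get(c, "*") != "*"]
    if not aas:
        return {}
    counts = Counter(aas)
    return {aa: n / len(aas) for aa, n in counts.items()}


def expected_freq(cds):
    """Frequencies expected if bases were drawn independently at the observed ratios."""
    base = Counter(cds)
    total = sum(base[b] for b in BASES)
    p = {b: base[b] / total for b in BASES}
    raw = Counter()
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            raw[aa] += p[codon[0]] * p[codon[1]] * p[codon[2]]
    norm = sum(raw.values())
    return {aa: v / norm for aa, v in raw.items()}


def non_randomness(cds):
    """Assumed index: sum of absolute differences between the two frequency sets."""
    obs, exp = observed_freq(cds), expected_freq(cds)
    return sum(abs(obs.get(aa, 0.0) - exp[aa]) for aa in exp)


print(round(non_randomness("ATGCATGCATGC"), 3))
```

Under this sketch a value near zero means the amino acid usage is close to what base composition alone would predict; larger values indicate stronger departure from randomness.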

17.
18.
In prokaryotic cells, 3′–5′ exonucleases can attenuate messenger RNA (mRNA) directionally from the direction of the 3′–5′ untranslated region (UTR), and thus improving the stability of mRNAs without influencing normal cell growth and metabolism is a key challenge for protein production and metabolic engineering. Herein, we significantly improved mRNA stability by using synthetic repetitive extragenic palindromic (REP) sequences as an effective mRNA stabilizer in two typical prokaryotic microbes, namely, Escherichia coli for the production of cyclodextrin glucosyltransferase (CGTase) and Corynebacterium glutamicum for the production of N-acetylglucosamine (GlcNAc). First, we performed a high-throughput screen to select 4 out of 380 REP sequences generated by randomizing 6 nonconservative bases in the REP sequence designed as the degenerate base “N.” Secondly, the REP sequence was inserted at several different positions after the stop codon of the CGTase-encoding gene. We found that mRNA stability was improved only when the space between the REP sequence and stop codon was longer than 12 base pairs (bp). Then, by reconstructing the spacer sequence and secondary structure of the REP sequence, a REP sequence with 8 bp in a stem-loop was obtained, and the CGTase activity increased from 210.6 to 291.5 U/ml. Furthermore, when this REP sequence was added to the 3′-UTR of glucosamine-6-phosphate N-acetyltransferase 1 ( GNA1), which is a gene encoding a key enzyme GNA1 in the GlcNAc synthesis pathway, the GNA1 activity was increased from 524.8 to 890.7 U/mg, and the GlcNAc titer was increased from 4.1 to 6.0 g/L in C. glutamicum. These findings suggest that the REP sequence plays an important function as an mRNA stabilizer in prokaryotic cells to stabilize its 3′-terminus of the mRNA by blocking the processing action of the 3′–5′ exonuclease. Overall, this study provides new insight for the high-efficiency overexpression of target genes and pathway fine-tuning in bacteria.  

19.
We describe the results of a procedure for maximizing the number of sequences that can be reliably linked to a protein of known three-dimensional structure. Unlike other methods, which try to increase sensitivity through the use of fold recognition software, we only use conventional sequence alignment tools, but apply them in a manner that significantly increases the number of relationships detected. We analyzed 11 genomes and found that, depending on the genome, between 23 and 32% of the ORFs had significant matches to proteins of known structure. In all cases, the aligned region consisted of either >100 residues or >50% of the smaller sequence. Slightly higher percentages could be attained if smaller motifs were also included. This is significantly higher than most previously reported methods, even those that have a fold-recognition component. We survey the biochemical and structural characteristics of the most frequently occurring proteins, and discuss the extent to which alignment methods can realistically assign function to gene products.

20.
We analyzed occurrences of bases in 20,352 introns, exons of 25,574 protein-coding genes, and among the three codon positions in the protein-coding sequences. The nucleotide sequences originated from the whole spectrum of organisms from bacteria to primates. The analysis revealed the following: (1) In most exons, adenine dominates over thymine. In other words, adenine and thymine are distributed in an asymmetric way between the exon and the complementary strand, and the coding sequence is mostly located in the adenine-rich strand. (2) Thymine dominates over adenine not only in the strand complementary to the exon but also in introns. (3) A general bias is further revealed in the distribution of adenine and thymine among the three codon positions in the exons, where adenine dominates over thymine in the second and mainly the first codon position while the reverse holds in the third codon position. The product (A1/T1) × (A2/T2) × (T3/A3) is smaller than one in only a few analyzed genes. Correspondence to: J. Kypr
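The product (A1/T1) × (A2/T2) × (T3/A3) mentioned in this abstract is easy to compute for a single coding sequence. The sketch below is only an illustration: the function name is hypothetical, the input is assumed to be an in-frame, uppercase DNA string, and returning `None` when a denominator count is zero is a design choice made here, not something the abstract specifies.

```python
def at_position_product(cds):
    """Compute (A1/T1) * (A2/T2) * (T3/A3) from A and T counts per codon position."""
    counts = [{"A": 0, "T": 0} for _ in range(3)]
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(usable):
        base = cds[i]
        if base in ("A", "T"):
            counts[i % 3][base] += 1
    (a1, t1), (a2, t2), (a3, t3) = [(c["A"], c["T"]) for c in counts]
    if 0 in (t1, t2, a3):
        return None  # ratio undefined for this sequence
    return (a1 / t1) * (a2 / t2) * (t3 / a3)


# Toy sequence with codons AAT, AAT, TTA.
print(at_position_product("AATAATTTA"))
```

A value above one corresponds to the bias reported in the abstract: adenine favored at the first two codon positions and thymine at the third.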
