Similar articles
 20 similar articles found
1.
A suite of tests to evaluate the statistical significance of protein sequence similarities is developed for use in data bank searches. The tests are based on the Wilbur-Lipman word-search algorithm, and take into account the sequence lengths and compositions, and optionally the weighting of amino acid matches. The method is extended to allow for the existence of a sequence insertion/deletion within the region of similarity. The accuracy of statistical distributions underlying the tests is validated using randomly generated sequences and real sequences selected at random from the data banks. A computer program to perform the tests is briefly described.

2.
A simple procedure is described for finding similarities between proteins using nucleotide sequence databases. The approach is illustrated by several examples of previously unknown correspondences with important biological implications: Drosophila elongation factor Tu is shown to be encoded by two genes that are differently expressed during development; a cluster of three Drosophila genes likely encode maltases; a flesh-fly fat body protein resembles the hypothesized Drosophila alcohol dehydrogenase ancestral protein; an unknown protein encoded at the multifunctional E. coli hisT locus resembles aspartate beta-semialdehyde dehydrogenase; and the E. coli tyrR protein is related to nitrogen regulatory proteins. These and other matches were discovered using a personal computer of the type available in most laboratories collecting DNA sequence data. As relatively few sequences were sampled to find these matches, it is likely that much of the existing data has not been adequately examined.

3.
To learn more about the evolutionary origins of Escherichia coli genes, we surveyed systematically for extended sequence similarities among the 1,264 amino acid sequences encoded by chromosomal genes of E. coli K-12 in SwissProt release 26 by using the FASTA program and imposing the following criteria: (i) alignment of segments at least 100 amino acids long and (ii) at least 20% amino acid identity. Altogether, 624 extended alignments meeting the two criteria were identified, corresponding to 577 protein sequences (45.6% of the 1,264 E. coli protein sequences) that had an extended alignment with at least one other E. coli protein sequence. To exclude alignments of questionable biological significance, we imposed a high threshold on the number of gaps allowed in each of the 624 extended alignments, giving us a subset of 464 proteins. The population of 464 alignments has the following characteristics expressed as median values of the group: 254 amino acids in the alignment, representing 86% of the length of the protein, 33% of the amino acids in the alignment being identical, and 1.1 gaps introduced per 100 amino acids of alignment. Where functions are known, nearly all pairs consist of functionally related proteins. This implies that the sequence similarity we detected has biological meaning and did not arise by chance. That a major fraction of E. coli proteins form extended alignments strongly suggests the predominance of duplication and divergence of ancestral genes in the evolution of E. coli genes. The range of degrees of similarity shows that some genes originated more recently than others. There is no evidence of genome doubling in the past, since map distances between genes of sequence-related proteins show no coherent pattern of favored separations.

4.
We discuss the statistical significance of local similarities found between DNA sequences, and illustrate the procedure with reference to the Queen and Korn algorithm. If the longest similarity found for two sequences has length L, this length is said to be significant at the 5% level if there is a probability of no more than 0.05 of finding a length of L or greater between a pair of sequences consisting of randomly chosen bases with the same overall base frequencies. The distribution of longest lengths is related to that of lengths from any particular pair of starting positions on the two sequences. For our implementation of the Queen and Korn algorithm, this latter distribution is constructed by combining the five different blocks of bases that may be added to extend a similarity. A table is given to assess the significance of longest similarities in sequences of length up to 1000 bases. Quite long similarities are expected to occur by chance alone. The critical values we calculate for assessing significance are preferable to expected numbers of similarities used by some commercial computer packages.
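
The 5% significance criterion described in this abstract can be illustrated with a small simulation. The sketch below is not the Queen and Korn algorithm and does not reproduce the authors' analytical distribution: it assumes that a "similarity" is simply the longest exact substring shared by two sequences, and it estimates the probability of a length of L or greater by sampling random sequences with given base frequencies. All function names and parameters are hypothetical.

```python
import random


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest exact substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous pair of positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def random_sequence(length: int, freqs: dict, rng: random.Random) -> str:
    """Random sequence whose bases are drawn independently with the given frequencies."""
    bases = list(freqs)
    weights = [freqs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def empirical_p_value(observed_len, len_a, len_b, freqs, trials=200, seed=0):
    """Fraction of random sequence pairs whose longest shared run is >= observed_len."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = random_sequence(len_a, freqs, rng)
        b = random_sequence(len_b, freqs, rng)
        if longest_common_run(a, b) >= observed_len:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p = empirical_p_value(observed_len=8, len_a=200, len_b=200, freqs=uniform)
    print("estimated P(longest run >= 8):", p)
```

Under this simplified model, an observed length would be called significant at the 5% level when the estimated probability is no more than 0.05. A simulation of this kind needs many trials to resolve small probabilities, which is one reason a tabulated distribution such as the one the abstract describes is more practical.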

5.
We have determined the nucleotide (nt) and deduced amino acid (aa) sequence of a unique 115-kDa Mycoplasma hyorhinis protein (P115) with an N-terminal region containing a highly conserved consensus sequence characteristic of nt-binding domains of several ATPase and GTPase enzymes. However, P115 lacked additional conserved features characteristic of some classes of nt-binding proteins. Based on the hydropathy profile of the deduced aa sequence, the absence of a leader peptide, its exclusive partitioning into the hydrophilic phase during Triton X-114 phase fractionation of M. hyorhinis, and immunofluorescence analysis indicating no surface-exposed domains, it was concluded that P115 is a cytoplasmic protein lacking intrinsic membrane interaction. M. hyorhinis P115 appears to be a species-specific protein, since it was not detected in any other mycoplasmal or bacterial species examined with specific antibody or genomic probes. Since genetic systems for direct mutational analysis are currently unavailable in this organism, sequence analysis provides critical information in establishing the possible function of this protein. Moreover, the nt sequence encoding P115 reported here supports a previously proposed model, based on synthesis of P115-related proteins in Escherichia coli, suggesting that multiple polypeptide products can be generated from mycoplasma genes by promiscuous translation initiation in this heterologous expression system.

6.
We have identified two single-copy genes from the model legume Medicago truncatula (MtENOD16 and 20) whose expression can be correlated with early stages of root nodulation and whose predicted coding sequences are partially homologous to both pea/vetch ENOD5 and soybean N315/ENOD55. Database searching and sequence alignment have defined the encoded early nodulins as a distinct sub-family of phytocyanin-related proteins, although the absence of key ligands implies that they are unlikely to bind copper. Molecular modelling based on known phytocyanin structure has been used to predict the 3-dimensional conformation of the principal globular domain of MtENOD16/20. Additional structural features common to both early nodulin and phytocyanin precursors include an N-terminal transit peptide, a highly variable (hydroxy)proline-rich sequence which probably undergoes extensive post-translational modification, and a hydrophobic C-terminal tail.

7.
A method is developed, based on word-searching, which provides a rapid test for the statistical significance of DNA sequence similarities for use in databank searching. The method makes allowance for the lengths and dinucleotide compositions of the sequences being compared. A way is also described to calculate the power of the test, i.e. the probability of detecting a given similarity as being statistically significant. The effects on the power of the test of the scoring method, word length, sequence length, and sequence composition are examined. A novel scoring method is shown to be superior to the method currently used in most word-searching algorithms. Received on August 3, 1988; accepted on December 12, 1988

8.
The complete amino acid sequence of human retinal S-antigen (48 kDa protein), a retinal protein involved in the visual process, has been determined by cDNA sequencing. The largest cDNA was 1590 base pairs (bp) and it contained an entire coding sequence. The similarity of nucleotide sequence between human and bovine is approximately 80%. The predicted amino acid sequence indicates that human S-antigen has 405 residues and its molecular mass is 45050 Da. The amino acid sequence homology between human and bovine is 81%. There is no overall sequence similarity between S-antigen and other proteins listed in the National Biomedical Research Foundation (NBRF) protein data base. However, local regions of sequence homology with alpha-transducin (T alpha) are apparent, including the putative rhodopsin binding and phosphoryl binding sites. In addition, human S-antigen has sequences identical to bovine uveitopathogenic sites, indicating that some types of human uveitis may in part be related to the animal model of experimental autoimmune uveitis (EAU).

9.
The concept of a flexible protein sequence pattern is defined. In contrast to conventional pattern matching, template or sequence alignment methods, flexible patterns allow residue patterns typical of a complete protein fold to be developed in terms of residue positions (elements), separated by gaps of defined range. An efficient dynamic programming algorithm is presented to enable the best alignment(s) of a pattern with a sequence to be identified. The flexible pattern method is evaluated in detail by reference to the globin protein family, and by comparison to alignment techniques that exploit single sequence, multiple sequence and secondary structural information. A flexible pattern derived from seven globins aligned on structural criteria successfully discriminates all 345 globins from non-globins in the Protein Identification Resource database. Furthermore, a pattern that uses helical regions from just human alpha-haemoglobin identified 337 globins compared to 318 for the best non-pattern global alignment method. Patterns derived from successively fewer, yet more highly conserved positions in a structural alignment of seven globins show that as few as 38 residue positions (25 buried hydrophobic, 4 exposed and 9 others) may be used to uniquely identify the globin fold. The study suggests that flexible patterns gain discriminating power both by discarding regions known to vary within the protein family, and by defining gaps within specific ranges. Flexible patterns therefore provide a convenient and powerful bridge between regular expression pattern matching techniques and more conventional local and global sequence comparison algorithms.
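
The idea of elements separated by gaps of defined range can be illustrated with an ordinary regular expression. The sketch below is a deliberate simplification and not the dynamic programming algorithm of the abstract: it only reports whether and where a pattern matches, without scoring alternative alignments. The function name, the pattern encoding, and the example pattern are all hypothetical.

```python
import re


def compile_flexible_pattern(elements, gaps):
    """Build a regex from pattern elements and the gap ranges between them.

    elements: list of strings, each listing the residues (one-letter codes)
              allowed at that pattern position, e.g. ["G", "AS", "V"].
    gaps:     list of (min, max) tuples, one per pair of consecutive elements,
              giving the allowed number of intervening residues.
    """
    if len(gaps) != len(elements) - 1:
        raise ValueError("need exactly one gap range between consecutive elements")
    parts = []
    for i, allowed in enumerate(elements):
        if not allowed.isalpha():
            raise ValueError("elements must contain only residue letters")
        parts.append("[" + allowed + "]")
        if i < len(gaps):
            lo, hi = gaps[i]
            # Any residues may fill the gap, but only within the stated range.
            parts.append(".{%d,%d}" % (lo, hi))
    return re.compile("".join(parts))


if __name__ == "__main__":
    # G, then 1-3 residues, then A or S, then exactly 2 residues, then V.
    pattern = compile_flexible_pattern(["G", "AS", "V"], [(1, 3), (2, 2)])
    match = pattern.search("MKGLLAPQVT")
    print(match.group(0) if match else None)
```

A regex answers only a yes/no question about each sequence. The method in the abstract instead identifies the best alignment(s) of the pattern with a sequence, which is what allows it to rank near-misses.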

10.
Relations between protein sequence and structure and their significance   Cited by: 1 (self-citations: 0; citations by others: 1)
The relation between amino acid sequence and local structure in proteins is investigated. The local structures considered are either the four classes of secondary structure (H, E, T and C) or four classes of local conformations defined using measures of conformational similarity based on distances between C alpha atoms. The classes are obtained by applying an automatic clustering procedure to short polypeptide fragments of uniform length from a database of 75 known protein structures. The thrust of our investigation consists of systematically searching the database for simple amino acid patterns of the type Gly-X-Ala-X-X-Val, where X denotes an arbitrary residue. Patterns that are nearly always associated with the same structure are retained. Finding many such associations, we then evaluate by a statistical approach how many among them are non-random and compare the results for different definitions of local structure. A similar comparison is made for the predictive value of retained associations, which is assessed using an internal test based on dividing the database into "learning" and "test" subsets. While we find that local structures defined by conformational similarity are not superior to secondary structure for prediction purposes, they help us gain insight into the factors that influence the predictive value of derived associations. A major conclusion is that the number of retained associations is in large excess over the number expected from a random correlation between sequence and structure, irrespective of how local conformation is defined. However, only a very small number of these associations can be earmarked as reliable using statistical criteria, due to the limited size of the database. We find, for instance, that the pattern Ala-Ala-X-X-Lys reliably characterizes helix, and the pattern Val-X-Val-X-X-X-Ala reliably characterizes extended structure and beta-strand. The possibility is discussed that these and other reliable associations correspond to regions of the polypeptide chain whose conformations are locally determined and that these regions may play a role in folding.
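
Patterns of the type Gly-X-Ala-X-X-Val map directly onto regular expressions over one-letter sequences. The following sketch shows one way to locate every occurrence of such a pattern in a sequence. It covers only the search step; the association with local structure and the statistical evaluation described in the abstract are not reproduced. The function names and the example sequence are hypothetical.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def pattern_to_regex(pattern: str) -> str:
    """Translate e.g. 'Ala-Ala-X-X-Lys' into the regex source 'AA..K'."""
    parts = []
    for token in pattern.split("-"):
        # X denotes an arbitrary residue, so it becomes a wildcard.
        parts.append("." if token == "X" else THREE_TO_ONE[token])
    return "".join(parts)


def find_pattern(pattern: str, sequence: str) -> list:
    """Start positions (0-based) of all occurrences, including overlapping ones."""
    # A lookahead matches without consuming characters, so overlaps are found.
    regex = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(find_pattern("Ala-Ala-X-X-Lys", "MAAQEKAAALKK"))
```

In a study like the one described, each hit position would then be looked up in the structure database to record the local structure class observed there.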

11.
We found a 2S storage albumin from the seed of tomato (Lycopersicon esculentum L. cv. Cherry) that cross-reacted with antiserum to the fruit lectin, and named it Lec2SA. According to its size and basicity, Lec2SA was classified into four isoforms. These isoforms have an M(r) of approximately 12,000, and are composed of a small subunit (M(r) 4,000) and a large subunit (M(r) 8,000) linked by disulfide bonds. The complete amino acid sequence of Lec2SA was determined. The small subunit was composed of 32 amino acids, whereas the large subunit contained 70 amino acids with a pyroglutamine as the N-terminal residue. The sequence of Lec2SA was similar to that of 2S albumins from different plants, such as Brazil nut and castor beans. Furthermore, a sequence similarity was found between the large subunit of Lec2SA and the peptide sequence from tomato lectin. Although these similarities were found, Lec2SA did not show hemagglutinating activity or sugar-chain-binding activity, indicating that Lec2SA lacks the carbohydrate-binding domain. These results suggest that tomato lectin is a chimeric lectin sharing the seed storage protein-like domain that is incorporated into the gene encoding tomato lectin through gene fusion.

12.
The class 1 protein is a major protein of the outer membrane of Neisseria meningitidis, and an important immunodeterminant in humans. The complete nucleotide sequence for the structural gene of a class 1 protein has been determined. The sequence predicts a protein of 374 amino acids, preceded by a typical signal peptide of 19 residues. The hydropathy profile of the predicted protein sequence resembles that of the Escherichia coli and gonococcal porins. The predicted protein sequence of the class 1 protein exhibits considerable structural similarity to the gonococcal porins PIA and PIB. Western blot studies also reveal immunologically conserved domains between the class 1 protein, PIA and PIB. A restriction fragment from the class 1 gene hybridizes to gonococcal genomic fragments in Southern blots. In addition to the class 1 gene coding region there is a large open reading frame on the opposite strand.

13.
14.
The profile method, for detecting distantly related proteins by sequence comparison, has been extended to incorporate secondary structure information from known X-ray structures. The sequence of a known structure is aligned to sequences of other members of a given folding class. From the known structure, the secondary structure (alpha-helix, beta-strand or "other") is assigned to each position of the aligned sequences. As in the standard profile method, a position-dependent scoring table, termed a profile, is calculated from the aligned sequences. However, rather than using the standard Dayhoff mutation table in calculating the profile, we use distinct amino acid mutation tables for residues in alpha-helices, beta-strands or other secondary structures to calculate the profile. In addition, we also distinguish between internal and external residues. With this new secondary structure-based profile method, we created a profile for eight-stranded, antiparallel beta barrels of the insecticyanin folding class. It is based on the sequences of retinol-binding protein, insecticyanin and beta-lactoglobulin. Scanning the sequence database with this profile, it was possible to detect the sequence of avidin. The structure of streptavidin is known, and it appears to be distantly related to the antiparallel beta barrels. Also detected is the sequence of complement component C8, which we therefore predict to be a member of this folding class.

15.
MOTIVATION: How critical is the sequence order information in predicting protein secondary structure segments? We tried to gain rough insight into this question through a theoretical approach using both a prediction algorithm and structural fragments from the Protein Data Bank (PDB). RESULTS: Using reverse protein sequences and PDB structural fragments, we theoretically estimated the significance of the order for protein secondary structure and prediction. On average: (1) 79% of protein sequence segments resulted in the same prediction in both normal and reverse directions, which indicated a relatively high conservation of secondary structure propensity in the reverse direction; (2) the reversed sequence prediction alone performed less accurately than the normal forward sequence prediction, but comparably well (2% difference); (3) the commonly predicted regions showed a slightly higher prediction accuracy (4%) than the normal sequence prediction; and (4) structural fragments which have counterparts in reverse direction in the same protein showed a comparable degree of secondary structure conservation (73% identity with reversed structures on average for pentamers). CONTACT: jong@biosophy.org; dietmann@ebi.ac.uk; heger@ebi.ac.uk; holm@ebi.ac.uk

16.
To improve the recognition of weak similarities between proteins, a method of aligning two sequence profiles is proposed. It is shown that exploring the sequence space in the vicinity of the sequence with unknown properties significantly improves the performance of sequence alignment methods. Consistent with previous observations, the recognition sensitivity and alignment accuracy obtained by a profile–profile alignment method can be as much as 30% higher compared to the sequence–profile alignment method. It is demonstrated that the choice of score function and the diversity of the test profile are very important factors for achieving the maximum performance of the method, whereas the optimum range of these parameters depends on the level of similarity to be recognized.

17.
The multicatalytic proteinase complex is a high molecular weight nonlysosomal proteinase which is composed of many different types of subunit. As part of a study of the possible relationships between subunits, polypeptides derived from the multicatalytic proteinase from rat liver have been subjected to N-terminal amino acid sequence analysis. Although several of the subunits are blocked at their N-termini, sequences have been obtained for 7 of the polypeptides. Each of the 7 sequences is unique but they show considerable sequence similarity, suggesting that the proteins are encoded by members of the same gene family.

18.
SUMMARY: Genalyzer is a software tool designed for the interactive visualization of sequence matches between DNA or protein sequences. It provides visualizations on different levels of granularity, from complete overviews via zoomed regions to alignments of particular matching substrings. Genalyzer can efficiently handle very large datasets, allowing the display of tens of thousands of matches between sequences of tens of millions of bases. AVAILABILITY: Genalyzer is available free of charge for non-commercial research institutions. For more details, see http://www.genalyzer.de

19.
A methodology is proposed to solve a difficult modeling problem related to the recently sequenced P39 protein. This sequence shares no similarity with any known 3D structure, but a fold is proposed by several threading tools. The difficulty in aligning the target sequence on one of the proposed template structures is overcome by combining the results of several available prediction methods and by refining a rational consensus between them. In silico validation of the obtained model and a preliminary cross-check with experimental features allow us to state that this borderline prediction is at least reasonable. This model raises relevant hypotheses on the main structural features of the protein and allows the design of site-directed mutations. Knowing the genetic context of the P39 reading frame, we are now able to suggest a function for the P39 protein: it would act as a periplasmic substrate-binding protein.

20.
A model for statistical significance of local similarities in structure   Cited by: 3 (self-citations: 0; citations by others: 3)
Structural biology can provide three-dimensional structures for proteins of unknown function. When sequence or structure comparisons fail to suggest a function, insights can come from discovery of functionally important local structural patterns. Existing methods to detect such patterns lack rigorous statistics needed for widespread application. Here, we derive a formula to calculate statistical significance of the root-mean-square deviation between atoms in such patterns. When combined with a database search method, our statistics permit true functional or structural patterns in different folds to be discerned from noise. The approach is highly complementary to fold comparison for providing functional clues for new structures, and is key for the detection of recurrences of any new pattern.
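
For reference, the root-mean-square deviation that this abstract builds on can be computed as below. This is a minimal sketch that assumes the two sets of atoms are already paired one-to-one and already superposed in a common frame. It performs no optimal superposition, and it does not reproduce the significance formula derived by the authors. The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes atom i in coords_a corresponds to atom i in coords_b and that both
    sets are already expressed in the same coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    pattern = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    candidate = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(pattern, candidate))
```

A raw RMSD value has no built-in scale for chance agreement, since small patterns match by accident more easily than large ones. That is the gap the significance formula in the abstract is meant to fill.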
