首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Freyhult E  Cui Y  Nilsson O  Ardell DH 《Biochimie》2007,89(10):1276-1288
There are at least 21 subfunctional classes of tRNAs in most cells that, despite a very highly conserved and compact common structure, must interact specifically with different cliques of proteins or cause grave organismal consequences. Protein recognition of specific tRNA substrates is achieved in part through class-restricted tRNA features called tRNA identity determinants. In earlier work we used TFAM, a statistical classifier of tRNA function, to show evidence of unexpectedly large diversity among bacteria in tRNA identity determinants. We also created a data reduction technique called function logos to visualize identity determinants for a given taxon. Here we show evidence that determinants for lysylated isoleucine tRNAs are not the same in Proteobacteria as in other bacterial groups including the Cyanobacteria. Consistent with this, the lysylating biosynthetic enzyme TilS lacks a C-terminal domain in Cyanobacteria that is present in Proteobacteria. We present here, using function logos, a map estimating all potential identity determinants generally operational in Cyanobacteria and Proteobacteria. To further isolate the differences in potential tRNA identity determinants between Proteobacteria and Cyanobacteria, we created two new data reduction visualizations to contrast sequence and function logos between two taxa. One, called Information Difference logos (ID logos), shows the evolutionary gain or retention of functional information associated to features in one lineage. The other, Kullback-Leibler divergence Difference logos (KLD logos), shows recruitments or shifts in the functional associations of features, especially those informative in both lineages. We used these new logos to specifically isolate and visualize the differences in potential tRNA identity determinants between Proteobacteria and Cyanobacteria. Our graphical results point to numerous differences in potential tRNA identity determinants between these groups. Although more differences in general are explained by shifts in functional association rather than gains or losses, the apparent identity differences in lysylated isoleucine tRNAs appear to have evolved through both mechanisms.  相似文献   

2.
Lin HN  Notredame C  Chang JM  Sung TY  Hsu WL 《PloS one》2011,6(12):e27872
Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or consistency-based approaches, more recently.In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset, and supported by evolutionary analysis. These words are then used to define a position specific substitution matrix that better reflects the biological significance of local similarity. The experiment results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non- similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity. SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.  相似文献   

3.
4.
We have examined conserved protein motifs in the non-coding, intergenic regions ("pseudomotif patterns") and surveyed their occurrence in the fly, worm, yeast and human genomes (chromosomes 21 and 22 only). To identify these patterns, we masked out annotated genes, pseudogenes and repeat regions from the raw genomic sequence and then compared the remaining sequence, in six-frame translation, against 1319 patterns from the PROSITE database. For each pseudomotif pattern, the absolute number of occurrences is not very informative unless compared against a statistical expectation; consequently, we calculated the expected occurrence of each pattern using a Poisson model and verified this with simulations. Using a p-value cut-off of 0.01, we found 67 pseudomotif patterns over-represented in fly intergenic regions, 34 in worm, 21 in human and six in yeast. These include the zinc finger, leucine zipper, nucleotide-binding motif and EGF domain. Many of the over-represented patterns were common to two or more organisms, but there were a few that were unique to specific ones. Furthermore, we found more over-represented patterns in the fly than in the worm, although the fly has fewer pseudogenes. This puzzling observation can be explained by a higher deletion rate in the fly genome. We also surveyed under-represented patterns, finding 23 in the fly, 12 in the worm, 18 in human and two in yeast. If intergenic sequences were truly random, we would expect an equal number of over and under-represented patterns. The fact that for each organism the number of over-represented patterns is greater than the number of under-represented ones implies that a fraction of the intergenic regions consist of ancient protein fragments that, due to accumulated disablements, have become unrecognizable by conventional techniques for gene and pseudogene identification. Moreover, we find that in aggregate the over-represented pseudomotif patterns occupy a substantial fraction of the intergenic regions. Further information is available at http://pseudogene.org  相似文献   

5.
6.
7.
Comparative modeling methods can consistently produce reliable structural models for protein sequences with more than 25% sequence identity to proteins with known structure. However, there is a good chance that also sequences with lower sequence identity have their structural components represented in structural databases. To this end, we present a novel fragment-based method using sets of structurally similar local fragments of proteins. The approach differs from other fragment-based methods that use only single backbone fragments. Instead, we use a library of groups containing sets of sequence fragments with geometrically similar local structures and extract sequence related properties to assign these specific geometrical conformations to target sequences. We test the ability of the approach to recognize correct SCOP folds for 273 sequences from the 49 most popular folds. 49% of these sequences have the correct fold as their top prediction, while 82% have the correct fold in one of the top five predictions. Moreover, the approach shows no performance reduction on a subset of sequence targets with less than 10% sequence identity to any protein used to build the library.  相似文献   

8.
In the study of in silico functional genomics, improving the performance of protein function prediction is the ultimate goal for identifying proteins associated with defined cellular functions. The classical prediction approach is to employ pairwise sequence alignments. However this method often faces difficulties when no statistically significant homologous sequences are identified. An alternative way is to predict protein function from sequence-derived features using machine learning. In this case the choice of possible features which can be derived from the sequence is of vital importance to ensure adequate discrimination to predict function. In this paper we have successfully selected biologically significant features for protein function prediction. This was performed using a new feature selection method (FrankSum) that avoids data distribution assumptions, uses a data independent measurement (p-value) within the feature, identifies redundancy between features and uses an appropriate ranking criterion for feature selection. We have shown that classifiers generated from features selected by FrankSum outperforms classifiers generated from full feature sets, randomly selected features and features selected from the Wrapper method. We have also shown the features are concordant across all species and top ranking features are biologically informative. We conclude that feature selection is vital for successful protein function prediction and FrankSum is one of the feature selection methods that can be applied successfully to such a domain.  相似文献   

9.
Practical limits of function prediction   总被引:15,自引:0,他引:15  
Devos D  Valencia A 《Proteins》2000,41(1):98-107
  相似文献   

10.
SUMMARY: The availability of advanced profile-profile comparison tools, such as PRC or HHsearch demands sophisticated visualization tools not presently available. We introduce an approach built upon the concept of HMM logos. The method illustrates the similarities of pairs of protein family profiles in an intuitive way. Two HMM logos, one for each profile, are drawn one upon the other. The aligned states are then highlighted and connected. AVAILABILITY: A web interface offering online creation of pairwise HMM logos is available at http://www.sanger.ac.uk/Software/analysis/logomat-p. Furthermore, software developers may download a Perl package that includes methods for creation of pairwise HMM logos locally. CONTACT: bsb@sanger.ac.uk.  相似文献   

11.
It is commonly believed that similarities between the sequences of two proteins infer similarities between their structures. Sequence alignments reliably recognize pairs of protein of similar structures provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30% and little is known then about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.  相似文献   

12.
Locating sequences compatible with a protein structural fold is the well‐known inverse protein‐folding problem. While significant progress has been made, the success rate of protein design remains low. As a result, a library of designed sequences or profile of sequences is currently employed for guiding experimental screening or directed evolution. Sequence profiles can be computationally predicted by iterative mutations of a random sequence to produce energy‐optimized sequences, or by combining sequences of structurally similar fragments in a template library. The latter approach is computationally more efficient but yields less accurate profiles than the former because of lacking tertiary structural information. Here we present a method called SPIN that predicts Sequence Profiles by Integrated Neural network based on fragment‐derived sequence profiles and structure‐derived energy profiles. SPIN improves over the fragment‐derived profile by 6.7% (from 23.6 to 30.3%) in sequence identity between predicted and wild‐type sequences. The method also reduces the number of residues in low complex regions by 15.7% and has a significantly better balance of hydrophilic and hydrophobic residues at protein surface. The accuracy of sequence profiles obtained is comparable to those generated from the protein design program RosettaDesign 3.5. This highly efficient method for predicting sequence profiles from structures will be useful as a single‐body scoring term for improving scoring functions used in protein design and fold recognition. It also complements protein design programs in guiding experimental design of the sequence library for screening and directed evolution of designed sequences. The SPIN server is available at http://sparks‐lab.org . Proteins 2014; 82:2565–2573. © 2014 Wiley Periodicals, Inc.  相似文献   

13.
MOTIVATION: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals. RESULTS: We have developed a method, that we call 'sister-scanning', for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an inter-species recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions.  相似文献   

14.
Protein design aims at designing new protein molecules of desired structure and functionality. One of the major obstacles to large-scale protein design are the extensive time and manpower requirements for experimental validation of designed sequences. Recent advances in protein structure prediction have provided potentials for an automated assessment of the designed sequences via folding simulations. We present a new protocol for protein design and validation. The sequence space is initially searched by Monte Carlo sampling guided by a public atomic potential, with candidate sequences selected by the clustering of sequence decoys. The designed sequences are then assessed by I-TASSER folding simulations, which generate full-length atomic structural models by the iterative assembly of threading fragments. The protocol is tested on 52 nonhomologous single-domain proteins, with an average sequence identity of 24% between the designed sequences and the native sequences. Despite this low sequence identity, three-dimensional models predicted for the first designed sequence have an RMSD of < 2 Å to the target structure in 62% of cases. This percentage increases to 77% if we consider the three-dimensional models from the top 10 designed sequences. Such a striking consistency between the target structure and the structural prediction from nonhomologous sequences, despite the fact that the design and folding algorithms adopt completely different force fields, indicates that the design algorithm captures the features essential to the global fold of the target. On average, the designed sequences have a free energy that is 0.39 kcal/(mol residue) lower than in the native sequences, potentially affording a greater stability to synthesized target folds.  相似文献   

15.
16.
We have used NMR spectroscopy and limited proteolysis to characterize the structural properties of the Parkinson's disease-related protein alpha-synuclein in lipid and detergent micelle environments. We show that the lipid or micelle surface-bound portion of the molecule adopts a continuously helical structure with a single break. Modeling alphaS as an ideal alpha-helix reveals a hydrophobic surface that winds around the helix axis in a right-handed fashion. This feature is typical of 11-mer repeat containing sequences that adopt right-handed coiled coil conformations. In order to bind a flat or convex lipid surface, however, an unbroken helical alphaS structure would need to adopt an unusual, slightly unwound, alpha11/3 helix conformation (three complete turns per 11 residues). The break we observe in the alphaS helix may allow the protein to avoid this unusual conformation by adopting two shorter stretches of typical alpha-helical structure. However, a quantitative analysis suggests the possibility that the alpha11/3 conformation may in fact exist in lipid-bound alphaS. We discuss how structural features of helical 11-mer repeats could play a role in the reversible lipid binding function of alpha-synuclein and generalize this argument to include the 11-mer repeat-containing apolipoproteins, which also require the ability to release readily from lipid surfaces. A search of protein sequence databases confirms that synuclein-like 11-mer repeats are present in other proteins that bind lipids reversibly and predicts such a role for a number of hypothetical proteins of unknown function.  相似文献   

17.
SUMMARY: LogoBar is a Java application to display protein sequence logos. In our software gaps are accounted for when calculating the information content present at each residue position in a multiple alignment. The resulting logo is displayed as a graph consisting of bars, although traditional letter representation is also possible. Amino acids are displayed from the bottom up with decreasing frequencies i.e. the most abundant residue is placed at the bottom of the logo. The bars can be color-coded according to user specifications. Gaps in the alignment are also displayed, either on top or at the bottom of the logo. Furthermore, residues can either be arranged according to their relative abundance or grouped according to user criteria to emphasize the conserved nature of particular positions. AVAILABILITY: LogoBar and further documentation is available at http://www.biosci.ki.se/groups/tbu/logobar/  相似文献   

18.
Many dissimilar protein sequences fold into similar structures. A central and persistent challenge facing protein structural analysis is the discrimination between homology and convergence for structurally similar domains that lack significant sequence similarity. Classic examples are the OB-fold and SH3 domains, both small, modular beta-barrel protein superfolds. The similarities among these domains have variously been attributed to common descent or to convergent evolution. Using a sequence profile-based phylogenetic technique, we analyzed all structurally characterized OB-fold, SH3, and PDZ domains with less than 40% mutual sequence identity. An all-against-all, profile-versus-profile analysis of these domains revealed many previously undetectable significant interrelationships. The matrices of scores were used to infer phylogenies based on our derivation of the relationships between sequence similarity E-values and evolutionary distances. The resulting clades of domains correlate remarkably well with biological function, as opposed to structural similarity, indicating that the functionally distinct sub-families within these superfolds are homologous. This method extends phylogenetics into the challenging "twilight zone" of sequence similarity, providing the first objective resolution of deep evolutionary relationships among distant protein families.  相似文献   

19.
The signal for the termination of protein synthesis in procaryotes.   总被引:24,自引:14,他引:10       下载免费PDF全文
The sequences around the stop codons of 862 Escherichia coli genes have been analysed to identify any additional features which contribute to the signal for the termination of protein synthesis. Highly significant deviations from the expected nucleotide distribution were observed, both before and after the stop codon. Immediately prior to UAA stop codons in E. coli there is a preference for codons of the form NAR (any base, adenine, purine), and in particular those that code for glutamine or the basic amino acids. In contrast, codons for threonine or branched nonpolar amino acids were under-represented. Uridine was over-represented in the nucleotide position immediately following all three stop codons, whereas adenine and cytosine were under-represented. This pattern is accentuated in highly expressed genes, but is not as marked in either lowly expressed genes or those that terminate in UAG, the codon specifically recognised by polypeptide chain release factor-1. These observations suggest that for the efficient termination of protein synthesis in E. coli, the 'stop signal' may be a tetranucleotide, rather than simply a tri-nucleotide codon, and that polypeptide chain release factor-2 recognises this extended signal. The sequence following stop codons was analysed in genes from several other procaryotes and bacteriophages. Salmonella typhimurium, Bacillus subtilis, bacteriophages and the methanogenic archaebacteria showed a similar bias to E. coli.  相似文献   

20.
This paper presents a top-down strategy to detect features in genomic sequences. The strategy's core is to exploit dictionary-based compression algorithms and analyse the content of the automatically generated dictionary. We classify the different over-represented segments and in the case study we correlate them to experimentally identified or theoretically forecasted biological features. A large spectrum analysis reveals that the only feature co-located with the a priori extracted segments is the torsional flexibility of DNA, while non-B DNA configurations are anti-localized and other features are mostly independent of the extracted sequences. This analysis unravels complex relationships between the linguistic structures investigated under our approach and some known biological features.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号