Similar articles
 Found 20 similar articles; search time 15 ms
1.
Homology detection and protein structure prediction are central themes in bioinformatics. Establishing relationships between protein sequences, or predicting their structure by sequence comparison, becomes unreliable when sequence similarity is low. Recent work demonstrates that the use of profiles improves homology detection and protein structure prediction. Profiles can be inferred from protein multiple alignments using different approaches. "Conservatism-of-Conservatism" is an effective profile analysis method for identifying structural features shared by proteins that have the same fold but no detectable sequence similarity. The information obtained from protein multiple alignments varies according to the amino acid classification employed to calculate the profile. In this work, we calculated entropy profiles from PSI-BLAST-derived multiple alignments and used different amino acid classifications summarizing almost 500 different attributes. These entropy profiles were converted into pseudocodes, which were compared using the FASTA program with an ad hoc matrix. We tested the performance of our method in identifying relationships between proteins with similar folds using a nonredundant subset of sequences with less than 40% identity. We then compared our results, using Coverage Versus Error per query curves, with those obtained by methods such as PSI-BLAST, COMPASS and HHSEARCH. Our method, named HIP (Homology Identification with Profiles), showed higher accuracy in detecting relationships between proteins with the same fold. The use of different amino acid classifications reflecting a large number of amino acid attributes improved the recognition of distantly related folds. We propose the use of pseudocodes representing profile information as a fast and powerful tool for homology detection, fold assignment and analysis of the evolutionary information contained in protein profiles.
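The idea of an entropy profile computed under an amino acid classification can be illustrated with a minimal Python sketch. It is not the HIP implementation: the six-class grouping `CLASSES`, the function names and the toy alignment are all assumptions made for illustration, and the abstract does not say which classifications or which entropy variant the authors used. The sketch maps each residue in an alignment column to its class and reports the Shannon entropy of the class frequencies, ignoring gaps.

```python
import math
from collections import Counter

# Hypothetical six-class grouping, for illustration only; it is NOT one of
# the classifications used in the paper.
CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
AA_TO_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def column_entropy(column):
    """Shannon entropy (bits) of the class labels in one alignment column.

    Gaps and unknown symbols are skipped; an empty column scores 0.0.
    """
    labels = [AA_TO_CLASS[aa] for aa in column if aa in AA_TO_CLASS]
    if not labels:
        return 0.0
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def entropy_profile(alignment):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment (assumed data): four aligned sequences, four columns.
alignment = ["AVKD", "LIKE", "AFRD", "V-KG"]
profile = entropy_profile(alignment)
```

Columns whose residues all fall in one class score zero even when the residues differ, which is why the choice of classification changes the information a profile carries.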

2.
Protein design experiments have shown that the use of specific subsets of amino acids can produce foldable proteins. This prompts the question of whether there is a minimal amino acid alphabet which could be used to fold all proteins. In this work we make an analogy between sequence patterns which produce foldable sequences and those which make it possible to detect structural homologs by aligning sequences, and use it to suggest the possible size of such a reduced alphabet. We estimate that reduced alphabets containing 10-12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that there is little loss of the information necessary to pick out structural homologs in a clustered protein sequence database when a suitable reduction of the amino acid alphabet from 20 to 10 letters is made, but that this information is rapidly degraded when further reductions in the alphabet are made.

3.
Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method that extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family and discards the part of the sequence information that is not conserved throughout the family; it is therefore more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of its N-terminus are extracted to train and test probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method performs better than methods based on amino acid sequences. The prediction results of the proposed method are also compared with Subloc on two redundancy-reduced datasets.

4.
A fragment of a human gene for pregnancy-specific beta 1-glycoprotein(s) (PS beta Gs), recently identified member(s) of the CEA family, has been cloned. Analyses of the nucleotide and deduced amino acid sequences revealed that it carried, in the 5' to 3' direction, exons IA, IB, IIA, IIB, C3, C1 and C2, the first four encoding peptides distinct from but highly similar to domains of PS beta Gs. The lack of a consensus 3' splice site sequence ahead of IB indicated that it was an abortive exon, which would explain the peculiar domain construction of PS beta Gs, i.e. N-IA-IIA-IIB-C1, 2 or 3. Apparently, the multiple C-terminal sequences of a PS beta G were generated by alternative splicing among the C1, C2 and C3 exons. Furthermore, sequences that partly overlapped the C exons were found to be similar to parts of the 3'-UTR of CEA and NCA, further indicating the close relationship of the CEA/NCA and PS beta G subfamily genes.

5.
6.
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called “twilight zone” problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (~15–30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database, that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the “twilight zone”. Proteins 2010. © 2010 Wiley-Liss, Inc.
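The general shape of a reduced tetramer composition profile can be sketched in Python under loudly stated assumptions: the four-letter grouping `GROUPS` below is hypothetical, whereas the paper derives its clusters by optimizing mutual information, a step not reproduced here. The sketch maps a sequence onto the reduced alphabet, slides a window of four letters along it, and normalizes the counts to frequencies.

```python
from collections import Counter

# Hypothetical reduction to four letters, for illustration only. The paper's
# clusters come from a mutual-information optimization not shown here.
GROUPS = {"h": "AVLIMCFWY", "p": "STNQGPH", "+": "KR", "-": "DE"}
REDUCE = {aa: letter for letter, members in GROUPS.items() for aa in members}

# Size of the unreduced tetramer space quoted in the abstract: 20^4.
FULL_TETRAMER_SPACE = 20 ** 4


def reduced_tetramer_profile(seq):
    """Relative frequency of each reduced tetramer in a protein sequence.

    Residues outside the 20 standard amino acids are dropped before the
    sequence is cut into overlapping windows of length four.
    """
    reduced = "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)
    counts = Counter(reduced[i:i + 4] for i in range(len(reduced) - 3))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# Toy sequence (assumed data); it reduces to "hh+-hh+-".
example_profile = reduced_tetramer_profile("AVKDLLKE")
```

With this toy grouping the tetramer space is 4^4 = 256 rather than 160,000; two sequences can then be compared through any distance between their frequency vectors, a choice the abstract leaves open.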

7.
Shih CH, Chang CM, Lin YS, Lo WC, Hwang JK. Proteins 2012, 80(6):1647-1657
The knowledge of conserved sequences in proteins is valuable in identifying functionally or structurally important residues. Generating the conservation profile of a sequence requires aligning families of homologous sequences and having knowledge of their evolutionary relationships. Here, we report that the conservation profile at the residue level can be quantitatively derived from a single protein structure with only backbone information. We found that the reciprocal packing density profiles of protein structures closely resemble their sequence conservation profiles. For a set of 554 nonhomologous enzymes, 74% (408/554) of the proteins have a correlation coefficient > 0.5 between these two profiles. Our results indicate that the three-dimensional structure, instead of being a mere scaffold for positioning amino acid residues, exerts such strong evolutionary constraints on the residues of the protein that its profile of sequence conservation essentially reflects that of its structural characteristics.

8.
9.
A catalogue of splice junction sequences.   Total citations: 763 (self-citations: 139; citations by others: 624)
Splice junction sequences from a large number of nuclear and viral genes encoding protein have been collected. The sequence (C/A)AG/GT(A/G)AGT was found to be a consensus of 139 exon-intron boundaries (or donor sequences) and (T/C)nN(C/T)AG/G was found to be a consensus of 130 intron-exon boundaries (or acceptor sequences). The possible role of splice junction sequences as signals for processing is discussed.
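A consensus of this kind can be turned into a simple scanner. The Python sketch below assumes the donor consensus is read as (C/A)AG/GT(A/G)AGT, with alternative bases in parentheses and the slash after AG marking the exon-intron boundary. The scoring rule (count of agreeing positions), the `min_score` threshold and the function names are illustrative assumptions, not anything described in the abstract.

```python
# Allowed bases at each of the nine donor positions, assuming the consensus
# (C/A)AG/GT(A/G)AGT; the exon-intron boundary lies after the third position.
DONOR = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def donor_score(site):
    """Number of positions (0 to 9) at which a 9-nt window fits the consensus."""
    if len(site) != len(DONOR):
        raise ValueError("site must be exactly 9 nucleotides long")
    return sum(base in allowed for base, allowed in zip(site.upper(), DONOR))


def candidate_donor_sites(seq, min_score=8):
    """Scan a DNA sequence for windows resembling the donor consensus.

    Returns (index, score) pairs, where index is the 0-based position of the
    first intron base (the G of GT). Requiring the GT dinucleotide and the
    default min_score are hypothetical choices made for this sketch.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(DONOR) + 1):
        window = seq[i:i + len(DONOR)]
        if window[3:5] != "GT":
            continue
        score = donor_score(window)
        if score >= min_score:
            hits.append((i + 3, score))
    return hits


hits = candidate_donor_sites("TTCAGGTAAGTCC")
```

Real donor sites rarely match every position, so a position-count score (or a weight matrix built from the collected junctions) is more forgiving than an exact pattern match.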

10.
Predicting accurate fragments from sequence has recently become a critical step for protein structure modeling, as protein fragment assembly techniques are presently among the most efficient approaches for de novo prediction. A key step in these approaches is, given the sequence of a protein to model, the identification of relevant fragments (candidate fragments) from a collection of the available 3D structures. These fragments can then be assembled to produce a model of the complete structure of the protein of interest. The search for candidate fragments is classically achieved by considering local sequence similarity using profile comparison, or threading approaches. In the present study, we introduce a new profile comparison approach that, instead of using amino acid profiles, is based on predicted structural alphabet profiles, which contain information related to the 3D local shapes associated with the sequences. We show that structural alphabet profile-profile comparison can be used efficiently to retrieve accurate structural fragments, and we introduce a new protocol for the detection of candidate fragments. It identifies fragments specific to each position of the sequence, with sizes varying between 6 and 27 amino acids. We find that it outperforms present state-of-the-art approaches in terms of (i) the accuracy of the fragments identified and (ii) the rate of true positives identified, while having a high coverage score. We illustrate the relevance of the approach on the complete target sets of the two previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) rounds, 9 and 10. A web server for the approach is freely available at http://bioserv.rpbs.univ-paris-diderot.fr/SAFrag.

11.
12.
Melo F, Marti-Renom MA. Proteins 2006, 63(4):986-995
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets, and their performance was assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using the standard alphabet of 20 amino acids as a reference frame. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may make it possible to increase the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs.

13.
A cDNA library for Myzus persicae has served to identify sequences coding for cuticular proteins (CPs) with RR-1 and RR-2 consensus. Two putative CPs showed a common RR-2 chitin binding domain (CBD) but differed in their C and N terminals. Two other predicted CPs showed a typical RR-1 CBD but differed in the size and sequence of the C and N terminals. An additional sequence, encoding a protein that showed terminal amino acid repeats similar to those of putative CPs from M. persicae but lacked the R & R consensus, was also described. A comparison of the sequences obtained from the cDNA library with those obtained from genomic DNA confirmed their identity as cuticular protein genes. The presence of introns was revealed in the Mpcp4 and Mpcp5 genes coding for CPs with an RR-1 consensus. Mpcp4 has a single large intron, while Mpcp5 has two shorter ones. Introns were not found in the Mpcp2 and Mpcp3 genes encoding CPs with RR-2 consensus. Differences were also noticed in the 3' UTR and 5' UTR of both the RR-1 and RR-2 CPs. CP genes were expressed in bacteria, and the resulting protein was identified as a CP by amino acid sequencing.

14.
Naturally occurring phytases with the level of thermostability required for application in animal feed have not been found thus far. We decided to construct consensus phytases de novo using primary protein sequence comparisons. A consensus enzyme based on 13 fungal phytase sequences had normal catalytic properties, but showed an unexpected 15-22 °C increase in unfolding temperature compared with each of its parents. As a first step towards understanding the molecular basis of the increased heat resistance, the crystal structure of the consensus phytase was determined and compared with that of Aspergillus niger phytase, which unfolds at much lower temperatures. In most cases, consensus residues were indeed expected, based on comparisons of both three-dimensional structures, to contribute more to phytase stabilization than non-consensus amino acids. For some consensus amino acids that structural comparisons predicted to destabilize the protein, mutational analysis was performed. Interestingly, these consensus residues in fact increased the unfolding temperature of the consensus phytase. In summary, for fungal phytases there is apparently an unexpected direct link between protein sequence conservation and protein stability.

15.
An iron-only hydrogenase was partially purified and characterized from the Desulfovibrio fructosovorans wild-type strain. The enzyme exhibits a molecular mass of 56 kDa and is composed of two distinct subunits, HydA and HydB (46 and 13 kDa, respectively). The N-terminal amino acid sequences of the two subunits of the enzyme were determined with the aim of designing degenerate oligonucleotides. Direct and inverse polymerase chain reaction techniques were used to clone the hydrogenase-encoding genes. A 9-nucleotide region located 75 bp upstream from the translational start codon of the D. fructosovorans hydA gene was found to be highly conserved. Analysis of the deduced amino acid sequence of these genes showed the presence of a signal sequence located in the small subunit, exhibiting the consensus sequence which is likely to be involved in the specific export mechanism of hydrogenases. Two ferredoxin-like motifs involved in the coordination of [4Fe-4S] clusters were identified in the N-terminal domain of the large subunit. The amino acid sequence of the [Fe] hydrogenase from D. fructosovorans was compared with the amino acid sequences of eight other hydrogenases (cytoplasmic and periplasmic). These enzymes share an overall 18% identity and 28% similarity. The identity reached 73% and 69% when the D. fructosovorans hydrogenase sequence was compared with the hydrogenase sequences from Desulfovibrio vulgaris Hildenborough and Desulfovibrio vulgaris oxamicus Monticello, respectively.

16.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of objective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins: serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information, and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure. Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship than that the similarities have arisen from the structural requirements of a given fold.

17.
The protein sequence world is considerably larger than the structure world. In consequence, numerous unrelated sequences may adopt similar 3D folds, and different kinds of amino acids may thus be found in similar 3D structures. By grouping the 20 amino acids into a smaller number of representative residues with similar features, the sequence world can be simplified. This clustering defines a reduced amino acid alphabet (reduced AAA). Numerous works have shown that protein 3D structures are composed of a limited number of building blocks, defining a structural alphabet. We previously identified such an alphabet, composed of 16 representative structural motifs (5 residues in length) called Protein Blocks (PBs). This alphabet makes it possible to translate the structure (3D) into a sequence of PBs (1D). Based on these two concepts, reduced AAA and PBs, we analyzed the distributions of the different kinds of amino acids and their equivalences in the structural context. Different reduced sets were considered. Recurrent amino acid associations were found in all the local structures, while others were specific to some local structures (PBs) (e.g. Cysteine, Histidine, Threonine and Serine for the alpha-helix Ncap). Some similar associations are found in other reduced AAAs, e.g. Ile with Val, or the hydrophobic aromatic residues Trp with Phe and Tyr. We also bring to light interesting alternative associations. This highlights the dependence on the information considered (sequence or structure). This approach, equivalent to a substitution matrix, could be useful for designing protein sequences with different features (for instance, adaptation to the environment) while largely preserving the 3D fold.

18.
Distant homologies between proteins are often discovered only after the three-dimensional structures of both proteins are solved. The sequence divergence for such proteins can be so large that simple comparison of their sequences fails to identify any similarity. A new generation of sensitive alignment tools uses averaged sequences of entire homologous families (profiles) to detect such homologies. Several algorithms, including the newest generation of BLAST algorithms and BASIC, an algorithm used in our group to assign fold predictions for proteins from several genomes, are compared with each other on a large set of structurally similar proteins with little sequence similarity. Proteins in the benchmark are classified according to their level of similarity, which allows us to demonstrate that most of the improvement of the new algorithms is achieved for proteins with strong functional similarities, with almost no progress in recognizing distant fold similarities. It is also shown that the details of profile calculation strongly influence its sensitivity in recognizing distant homologies. The most important choice is how to include information from diverging members of the family, avoiding false predictions while accounting for the entire sequence divergence within a family. PSI-BLAST takes a conservative approach, deriving a profile from core members of the family, providing a solid improvement with almost no false predictions. BASIC strives for better sensitivity by increasing the weight of divergent family members, paying the price in lower reliability. A new FFAS algorithm introduced here uses a new procedure for profile generation that takes into account all the relations within the family and matches BASIC sensitivity with PSI-BLAST-like reliability.

19.
Many surface proteins which are covalently linked to the cell wall of gram-positive bacteria have a consensus C-terminal motif, Leu-Pro-X-Thr-Gly (LPXTG). This sequence is cleaved, and the processed protein is attached to an amino group of a cross-bridge in the peptidoglycan by a specific enzyme called sortase. Using the type strain of Streptococcus suis, NCTC 10234, we found five genes encoding proteins that were homologous to sortases of other bacteria and determined the nucleotide sequences of the genetic regions. One gene, designated srtA, was linked to gyrA, as were the sortase and sortase-like genes of other streptococci. Three genes, designated srtB, srtC, and srtD, were tandemly clustered in a different location, where there were three segments of directly repeated sequences of approximately 110 bp in close vicinity. The remaining gene, designated srtE, was located separately on the chromosome with a pseudogene which may encode a transposase. The deduced amino acid sequences of the five Srt proteins showed 18 to 31% identity with the sortases of Streptococcus gordonii and Staphylococcus aureus, except that SrtA of S. suis had 65% identity with that of S. gordonii. Isogenic mutants deficient for srtA, srtBCD, or srtE were generated by allelic exchanges. The protein fraction which was released from partially purified cell walls by digestion with N-acetylmuramidase was profiled by two-dimensional gel electrophoresis. More than 15 of the protein spots were missing in the profile of the srtA mutant compared with that of the parent strain, and this phenotype was completely complemented by srtA cloned from S. suis. Four genes encoding proteins corresponding to such spots were identified and sequenced. The deduced translational products of the four genes possessed the LPXTG motif in their C-terminal regions. On the other hand, the protein spots that were missing in the srtA mutant appeared in the profiles of the srtBCD and srtE mutants. These results provide evidence that the cell wall sorting system involving srtA is also present in S. suis.
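Checking a deduced protein sequence for the LPXTG motif near its C-terminus is easy to sketch in Python with a regular expression. The 50-residue window, the function name and the test sequences are assumptions made for illustration; the abstract does not state how far from the C-terminus the motif may lie or how the authors searched for it.

```python
import re

# Leu-Pro-X-Thr-Gly, where X is any residue given as a one-letter code.
LPXTG = re.compile(r"LP[A-Z]TG")


def find_lpxtg(protein, c_terminal_window=50):
    """0-based start positions of LPXTG motifs near the C-terminus.

    Only matches that begin within the last `c_terminal_window` residues
    are reported; the default of 50 is a hypothetical choice for this sketch.
    """
    protein = protein.upper()
    start = max(0, len(protein) - c_terminal_window)
    # Positions returned by finditer(string, pos) refer to the full string.
    return [m.start() for m in LPXTG.finditer(protein, start)]


# Toy sequences (assumed data): motif near the C-terminus vs. at the N-terminus.
anchored = "M" + "A" * 60 + "LPETG" + "K" * 10
not_anchored = "LPETG" + "A" * 100

anchored_hits = find_lpxtg(anchored)
not_anchored_hits = find_lpxtg(not_anchored)
```

A motif hit alone does not show that a protein is a sortase substrate; in the study above that link comes from the spots missing in the srtA mutant.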

20.