

Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences are usually computationally expensive.

Koike R, Kinoshita K, Kidera A. Proteins 2007, 66(3): 655-663
Dynamic programming (DP) and its heuristic algorithms are the most fundamental methods for similarity searches of amino acid sequences. Their detection power has been improved by including supplemental information, such as homologous sequences in the profile method. Here, we describe a method, probabilistic alignment (PA), that gives improved detection power but, like the original DP, uses only a pair of amino acid sequences. Receiver operating characteristic (ROC) analysis demonstrated that the PA method is far superior to BLAST, and that its sensitivity and selectivity approach those of PSI-BLAST. Particularly for orphan proteins having few homologues in the database, PA exhibits much better performance than PSI-BLAST. On the basis of this observation, we applied the PA method to a homology search of two orphan proteins, Latexin and Resuscitation-promoting factor domain. Their molecular functions have been described based on structural similarities, but sequence homologues have not been identified by PSI-BLAST. PA successfully detected sequence homologues for the two proteins and confirmed that the observed structural similarities are the result of an evolutionary relationship.
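For orientation, the plain pairwise DP that the abstract takes as its baseline can be sketched as follows. This is not the PA method itself, whose details the abstract does not give; the function name and the match/mismatch/gap scores are illustrative assumptions, and a real search would use a substitution matrix such as BLOSUM62.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring scheme (match/mismatch/gap) is a hypothetical toy choice,
    not the one used by PA or BLAST.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the DP table are kept, so memory grows with the length of the second sequence rather than with the product of both lengths.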

STRUCTFAST is a novel profile-profile alignment algorithm capable of detecting weak similarities between protein sequences. The increased sensitivity and accuracy of the STRUCTFAST method are achieved through several unique features. First, the algorithm utilizes a novel dynamic programming engine capable of incorporating important information from a structural family directly into the alignment process. Second, the algorithm employs a rigorous analytical formula for profile-profile scoring to overcome the limitations of ad hoc scoring functions that require adjustable parameter training. Third, the algorithm employs Convergent Island Statistics (CIS) to compute the statistical significance of alignment scores independently for each pair of sequences. STRUCTFAST routinely produces alignments that meet or exceed the quality obtained by an expert human homology modeler, as evidenced by its performance in the latest CAFASP4 and CASP6 blind prediction benchmark experiments.

Summary: We present a large-scale implementation of the RANKPROP protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the RANKPROP algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, including propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation. Availability: The RANKPROP web server and source code are available at http://rankprop.gs.washington.edu Contact: iain{at}nec-labs.com; noble{at}gs.washington.edu Associate Editor: Burkhard Rost

The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so‐called “twilight zone” problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment‐free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20‐letter amino acid alphabet) into a more tractable number of reduced tetramers (~15–30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver‐operating characteristic measure, we demonstrate potentially significant improvement in using information‐optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the “twilight zone”. Proteins 2010. © 2010 Wiley‐Liss, Inc.
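The compression from 160,000 tetramers (20^4) to a few tens of reduced tetramers can be illustrated with a toy two-letter alphabet, which yields 2^4 = 16 reduced tetramers. The hydrophobic/polar grouping and the function names below are assumptions made for illustration; the abstract's clusters are optimized for mutual information and are not reproduced here.

```python
from collections import Counter
from itertools import product

# Hypothetical grouping: these residues map to "h", everything else to "p".
HYDROPHOBIC = set("AVLIMFWC")


def reduce_sequence(seq):
    """Recode a protein sequence in the two-letter alphabet {h, p}."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq.upper())


def tetramer_profile(seq):
    """Return the relative frequency of each of the 16 reduced tetramers."""
    reduced = reduce_sequence(seq)
    keys = ["".join(t) for t in product("hp", repeat=4)]
    # Slide a window of length 4 over the reduced sequence.
    counts = Counter(reduced[i:i + 4] for i in range(len(reduced) - 3))
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

Two sequences can then be compared by any distance between their 16-element profiles, without aligning them. Note that in this sketch non-standard residue codes such as X fall into the "p" group.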

The global connectivities in very large protein similarity networks contain traces of evolution among the proteins for detecting protein remote evolutionary relations or structural similarities. To investigate how well a protein network captures the evolutionary information, a key limitation is the intensive computation of pairwise sequence similarities needed to construct very large protein networks. In this article, we introduce label propagation on low-rank kernel approximation (LP-LOKA) for searching massively large protein networks. LP-LOKA propagates initial protein similarities in a low-rank graph by Nyström approximation without computing all pairwise similarities. With scalable parallel implementations based on distributed-memory using message-passing interface and Apache-Hadoop/Spark on cloud, LP-LOKA can search protein networks with one million proteins or more. In the experiments on Swiss-Prot/ADDA/CASP data, LP-LOKA significantly improved protein ranking over the widely used HMM-HMM or profile-sequence alignment methods utilizing large protein networks. It was observed that the larger the protein similarity network, the better the performance, especially on relatively small protein superfamilies and folds. The results suggest that computing massively large protein networks is necessary to meet the growing need of annotating proteins from newly sequenced species and that LP-LOKA is both scalable and accurate for searching massively large protein networks.
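The propagation step can be sketched on a small dense similarity matrix. This is a generic random-walk-with-restart form of label propagation and deliberately leaves out the Nyström low-rank approximation and the parallel implementation that LP-LOKA relies on; the function name, the `alpha` restart weight, and the iteration count are assumptions.

```python
def propagate(sim, query, alpha=0.5, iters=50):
    """Rank all nodes by similarity propagated from the query node.

    sim   : square list-of-lists of non-negative pairwise similarities
    query : index of the query protein
    alpha : weight of propagated scores versus the initial label (assumed)
    """
    n = len(sim)
    # Row-normalize so each node distributes its score among its neighbours.
    norm = []
    for row in sim:
        s = sum(row)
        norm.append([v / s if s else 0.0 for v in row])
    y = [1.0 if i == query else 0.0 for i in range(n)]  # initial label
    f = y[:]
    for _ in range(iters):
        # f_i = alpha * (score flowing into i) + (1 - alpha) * y_i
        f = [
            alpha * sum(norm[j][i] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f
```

On a three-node chain 0-1-2 queried at node 0, the scores decrease with graph distance from the query, so node 2 still receives a nonzero score through node 1 even though it has no direct similarity to the query.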

Bernsel A, Viklund H, Elofsson A. Proteins 2008, 71(3): 1387-1399
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.

Frenkel ZM, Trifonov EN. Proteins 2007, 67(2): 271-284
A new method is proposed to reveal apparent evolutionary relationships between protein fragments with similar 3D structures by finding "intermediate" sequences in the proteomic database. Instead of looking for homologies and intermediates for a whole protein domain, we build a chain of intermediate short sequences, which allows one to link similar structural modules of proteins belonging to the same or different families. Several such chains of intermediates can be combined into an evolutionary tree of structural protein modules. All calculations were made for protein fragments of 20 aa residues. Three evolutionary trees for different module structures are described. The aim of the paper is to introduce the new method and to demonstrate its potential for protein structural predictions. The approach also opens new perspectives for protein evolution studies.

Karlin D, Belshaw R. PLoS ONE 2012, 7(3): e31719
Paramyxovirinae are a large group of viruses that includes measles virus and parainfluenza viruses. The viral Phosphoprotein (P) plays a central role in viral replication. It is composed of a highly variable, disordered N-terminus and a conserved C-terminus. A second viral protein alternatively expressed, the V protein, also contains the N-terminus of P, fused to a zinc finger. We suspected that, despite their high variability, the N-termini of P/V might all be homologous; however, using standard approaches, we could previously identify sequence conservation only in some Paramyxovirinae. We now compared the N-termini using sensitive sequence similarity search programs, able to detect residual similarities unnoticeable by conventional approaches. We discovered that all Paramyxovirinae share a short sequence motif in their first 40 amino acids, which we called soyuz1. Despite its short length (11-16aa), several arguments allow us to conclude that soyuz1 probably evolved by homologous descent, unlike linear motifs. Conservation across such evolutionary distances suggests that soyuz1 plays a crucial role and experimental data suggest that it binds the viral nucleoprotein to prevent its illegitimate self-assembly. In some Paramyxovirinae, the N-terminus of P/V contains a second motif, soyuz2, which might play a role in blocking interferon signaling. Finally, we discovered that the P of related Mononegavirales contain similarly overlooked motifs in their N-termini, and that their C-termini share a previously unnoticed structural similarity suggesting a common origin. Our results suggest several testable hypotheses regarding the replication of Mononegavirales and suggest that disordered regions with little overall sequence similarity, common in viral and eukaryotic proteins, might contain currently overlooked motifs (intermediate in length between linear motifs and disordered domains) that could be detected simply by comparing orthologous proteins.

Structural alignments often reveal relationships between proteins that cannot be detected using sequence alignment alone. However, profile search methods based entirely on structural alignments have not been found to be effective in finding remote homologs. Here, we explore the role of structural information in remote homolog detection and sequence alignment. To this end, we develop a series of hybrid multidimensional alignment profiles that combine sequence, secondary and tertiary structure information into hybrid profiles. Sequence-based profiles are profiles whose position-specific scoring matrix is derived from sequence alignment alone; structure-based profiles are those derived from multiple structure alignments. We compare pure sequence-based profiles to pure structure-based profiles, as well as to hybrid profiles that use combined sequence-and-structure-based profiles, where sequence-based profiles are used in loop/motif regions and structural information is used in core structural regions. All of the hybrid methods offer significant improvement over simple profile-to-profile alignment. We demonstrate that both sequence-based and structure-based profiles contribute to remote homology detection and alignment accuracy, and that each contains some unique information. We discuss the implications of these results for further improvements in amino acid sequence and structural analysis.


The number of naturally occurring primary sequences of proteins is an infinitesimally small subset of the possible number of primary sequences that can be synthesized using 20 amino acids. Prevailing views ascribe this to slow and incremental mutational/selection evolutionary mechanisms. However, considering the large number of avenues available in the form of the diversity of emerging/evolving and/or disappearing living systems for exploring the primary sequence space over the evolutionary time scale of ~3.5 billion years, this remains a conjecture. Therefore, to investigate primary sequence space limitations, we carried out a systematic study for finding primary sequences absent in nature. We report the discovery of the smallest peptide sequence “Cysteine-Glutamine-Tryptophan-Tryptophan” that is not found in over half-a-million curated protein sequences in the Uniprot (Swiss-Prot) database. Additionally, we report a library of 83605 pentapeptides that are not found in any of the known protein sequences. Compositional analyses of these absent primary sequences yield a remarkably strong power relationship between the percentage occurrence of individual amino acids in all known protein sequences and their respective frequency of occurrence in the absent peptides, regardless of their specific position in the sequences. If random evolutionary mechanisms were responsible for limitations to the primary sequence space, then one would not expect any relationship between compositions of available and absent primary sequences. Thus, we conclusively show that stoichiometric constraints on amino acids limit the primary sequence space of proteins in nature. We discuss the possibly profound implications of our findings in both evolutionary and synthetic biology.

Communicated by Ramaswamy H. Sarma
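The search for absent peptides described in the abstract above amounts to enumerating all k-mers over the 20-letter alphabet and subtracting those observed in a sequence collection. A minimal sketch follows; the function names and the `alphabet` parameter are assumptions, and loading Swiss-Prot itself is left out.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def observed_kmers(sequences, k):
    """Collect every length-k substring that occurs in any sequence."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def absent_kmers(sequences, k, alphabet=AMINO_ACIDS):
    """List all k-mers over the alphabet that occur in none of the sequences."""
    seen = observed_kmers(sequences, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return [kmer for kmer in candidates if kmer not in seen]
```

With the full alphabet there are 20^4 = 160,000 candidate tetrapeptides and 20^5 = 3,200,000 candidate pentapeptides, so exhaustive enumeration stays cheap at the peptide lengths the abstract discusses.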

The N-terminal sequence (residues 1-101) of trypsin-link protein from cartilage proteoglycan complex is reported: it presents structural homologies with the poly-Ig receptor and immunoglobulin domains.

The development of remote homology detection methods is a challenging area in Bioinformatics. Sequence analysis-based approaches that address this problem have employed profiles, templates and Hidden Markov Models (HMMs). These methods often face limitations due to poor sequence similarities and non-uniform sequence dispersion in protein sequence space. Search procedures are often asymmetrical due to over- or under-representation of some protein families, and outliers often remain undetected. Intermediate sequences that share high similarities with more than one protein can help overcome such problems. Methods such as MulPSSM and Cascade PSI-BLAST that employ intermediate sequences achieve better coverage of members in searches. Others employ peptide modules or conserved patterns of motifs or residues and are effective in overcoming dependencies on high sequence similarity to establish homology by using conserved patterns in searches. We review some of these methods developed recently in India.

Many strains of Streptococcus pyogenes are known to express a receptor for IgA. The complete nucleotide sequence of the gene for such a receptor, protein Arp4, has been determined. The deduced amino acid sequence of 386 residues includes a signal sequence of 41 amino acids and a putative membrane anchor region, both of which are homologous to similar regions in other streptococcal surface proteins. The processed form of the IgA receptor has a length of 345 amino acids and a calculated molecular weight of 39544. The N-terminal sequence of the processed form is different from that previously found for a similar IgA receptor isolated from an S. pyogenes strain of type M60. The sequence of protein Arp4 shows extensive homology to the C-terminal half of streptococcal M proteins, but not to the streptococcal IgG receptor protein G or staphylococcal protein A. Apart from the membrane anchor, this homology includes a sequence of 119 amino acid residues containing three repeated units and a 54-residue sequence without repeats. The protein expressed in Escherichia coli is found in the periplasmic space, in which it constitutes the major protein. Protein Arp4 is the first example of a surface protein that has both immunoglobulin-binding capacity and structural features characteristic of M proteins.

A method for optimally locating gaps in the amino acid sequences of homologous proteins is presented. The method involves three steps: (1) demonstration that the sequences are indeed homologous, (2) location of regions where the homologous pairing is reasonably certain, and (3) location of gaps between these regions so as to minimize the total number of mutations required to account for the differences between the two sequences. The major virtues of this procedure are that the assertion of homology does not depend upon the prior introduction of gaps and that a genetic rather than a chemical test is the basis for asserting a genetic relationship. This project received support from grants from NSF (GB-7486) and NIH (NB 04545-06).

Summary: The previously reported nucleotide sequence of the spoOA coding region of Bacillus subtilis suggested that the protein is initiated with either of two possible initiation codons, ATG and GTG, 84 base pairs apart. To determine which codon is utilized as an initiator in B. subtilis, we constructed a fusion gene in which the promoter and NH2-terminal region of the spoOA gene were connected to the chloramphenicol acetyltransferase gene (cat gene). After introduction of the plasmid carrying the spoOA-cat fusion gene into B. subtilis cells, the fusion protein was purified by affinity chromatography. The sequence of NH2-terminal amino acids of the fusion protein was determined and the result established that the GTG codon is utilized as an initiator in B. subtilis. Comparison of the amino acid sequences revealed a marked homology between the spoOA (NH2-terminal half) and spoOF proteins. A less striking but significant homology was also found between the spoOA (COOH-terminal half) and spoOB proteins. This suggests the presence of a common functional domain structure for these proteins that are supposed to play key regulatory roles in sporulation.

Cloned DNA from the larval serum protein one (LSP-1) genes was hybridized to polytene chromosomes of D. melanogaster. The ratio of grains deposited over any two of the three LSP-1 genes with any one LSP-1 subunit probe was constant. Varying the gene dose of any one LSP-1 subunit relative to the others by up to sixfold gave a linear relationship of grain ratios to gene ratios. We show that these constant ratios closely reflect the extent of sequence homology between the genes as determined by heteroduplex mapping (Smith et al., 1981) and thermal denaturation studies. The results obtained demonstrate that the LSP-1 subunit genes are present in equal copies in the genome.
