Similar articles
20 similar articles found (search time: 15 ms)
1.
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so‐called “twilight zone” problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment‐free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20‐letter amino acid alphabet) into a more tractable number of reduced tetramers (~15–30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database, that share low pairwise sequence similarity. Using the receiver‐operating characteristic measure, we demonstrate potentially significant improvement in using information‐optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the “twilight zone”. Proteins 2010. © 2010 Wiley‐Liss, Inc.
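The idea of a reduced tetramer composition profile can be illustrated with a minimal Python sketch. The two-group alphabet below (hydrophobic versus polar) is a hypothetical stand-in, since the abstract does not list the information-optimized clusters; the function name `tetramer_profile` is likewise an assumption made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical 2-letter reduction (hydrophobic vs. polar). The paper's
# information-optimized clusters are not given in the abstract.
GROUPS = {"H": "AVLIMFWC", "P": "GSTPYNQDEKRH"}
REDUCE = {aa: group for group, members in GROUPS.items() for aa in members}


def tetramer_profile(seq, k=4):
    """Return relative frequencies of reduced k-mers (k=4: tetramers)."""
    # Map each residue to its group letter; skip non-standard residues.
    reduced = "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    # Fixed key order so that profiles of different proteins are comparable.
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}


profile = tetramer_profile("AVLIKKKK")  # reduced form: HHHHPPPP
```

With two groups the profile has 2^4 = 16 entries, against 20^4 = 160,000 for the full alphabet, which falls in the range of reduced tetramers quoted in the abstract.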

2.
3.
Li T, Fan K, Wang J, Wang W. Protein Engineering, 2003, 16(5): 323-330
It is well known that there are some similarities among various naturally occurring amino acids. Thus, the complexity in protein systems could be reduced by sorting these amino acids with similarities into groups, and protein sequences can then be simplified by reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet by which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information for the simplified protein sequence compared with the parent sequence using global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. The coverage in dataset SCOP40 is obtained for various levels of reduction of the amino acid types, where coverage is the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40. For the reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the alphabet of 20 types of amino acids, which implies that 10 types of amino acids may be the degree of freedom for characterizing the complexity in proteins.

4.
MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices available to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
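The simple Bonferroni correction mentioned in this abstract can be sketched as follows. This shows only the textbook correction, not the more accurate extreme-value-based correction the authors propose; the function names are hypothetical.

```python
def bonferroni_best_pvalue(pvalues):
    """Correct the smallest p-value obtained from m scoring matrices.

    Textbook Bonferroni: multiply the best p-value by the number of
    matrices tried and cap the result at 1.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("at least one p-value is required")
    return min(1.0, m * min(pvalues))


def bonferroni_evalue(evalue, m):
    """Scale an E-value by the number of scoring matrices tried."""
    return evalue * m


# Three matrices tried on the same query; the best raw p-value is 0.01.
corrected = bonferroni_best_pvalue([0.01, 0.2, 0.03])
```

Quoting the raw 0.01 here would overstate significance by up to a factor of three, which is the kind of overestimation the abstract warns about.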

5.
Efficient methods for multiple sequence alignment with guaranteed error bounds   Cited by: 11 (self-citations: 0, citations by others: 11)
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and give two computationally efficient multiple alignment methods (one for each measure) whose deviation from the optimal value is guaranteed to be less than a factor of two. This is the novel feature of these methods, but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obvious lower bound on the value of the optimal alignment.
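One well-known way to obtain a factor-of-two guarantee for a sum-of-pairs style measure is to align every string to a single "center" string, chosen to minimize the total pairwise distance to the others. Whether this matches the paper's methods is not stated in the abstract, so the Python sketch below is an assumption-laden illustration. The bound of the form 2 - 2/k is also an assumption; it is used here because it reproduces the quoted 1.33 for three strings.

```python
def edit_distance(a, b):
    """Unit-cost edit distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def choose_center(strings):
    """Index of the string with the smallest summed distance to the rest."""
    return min(range(len(strings)),
               key=lambda c: sum(edit_distance(strings[c], s) for s in strings))


def bound_factor(k):
    """Assumed worst-case ratio 2 - 2/k for k strings (about 1.33 for k=3)."""
    return 2 - 2 / k


center = choose_center(["ACGT", "ACGA", "TCGT"])
```

The sketch stops at choosing the center; merging the pairwise alignments into one multiple alignment is omitted for brevity.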

6.
Protein sequence alignments are more reliable the shorter the evolutionary distance. Here, we align distantly related proteins using many closely spaced intermediate sequences as stepping stones. Such transitive alignments can be generated between any two proteins in a connected set, whether they are direct or indirect sequence neighbors in the underlying library of pairwise alignments. We have implemented a greedy algorithm, MaxFlow, using a novel consistency score to estimate the relative likelihood of alternative paths of transitive alignment. In contrast to traditional profile models of amino acid preferences, MaxFlow models the probability that two positions are structurally equivalent and retains high information content across large distances in sequence space. Thus, MaxFlow is able to identify sparse and narrow active-site sequence signatures which are embedded in high-entropy sequence segments in the structure-based multiple alignment of large diverse enzyme superfamilies. In a challenging benchmark based on the urease superfamily, MaxFlow yields better reliability and double coverage compared to available sequence alignment software. This promises to increase information returns from functional and structural genomics, where reliable sequence alignment is a bottleneck to transferring the functional or structural characterization of model proteins to entire protein superfamilies.
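The core of a transitive alignment, chaining an A-to-B alignment with a B-to-C alignment through an intermediate sequence, can be sketched in Python as a composition of residue maps. This is only an illustration of the stepping-stone idea; it does not implement MaxFlow or its consistency score, and the dictionary representation is an assumption.

```python
def compose(a_to_b, b_to_c):
    """Chain two pairwise alignments given as {position: position} maps.

    A residue of A is aligned to C only if its partner in B is itself
    aligned to some residue of C; other residues drop out.
    """
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


# Hypothetical alignments, as 0-based residue indices.
a_to_b = {0: 1, 1: 2, 2: 4}
b_to_c = {1: 0, 2: 3}
a_to_c = compose(a_to_b, b_to_c)
```

Each composition step can only lose aligned positions, which is one reason a scoring scheme is needed to choose among alternative paths of intermediates.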

7.
8.
MSAT     
This article describes the development of a new method for multiple sequence alignment based on fold-level protein structure alignments, which provides an improvement in accuracy compared with the most commonly used sequence-only-based techniques. This method integrates the widely used, progressive multiple sequence alignment approach ClustalW with the Topology of Protein Structure (TOPS) topology-based alignment algorithm. The TOPS approach produces a structural alignment for the input protein set by using a topology-based pattern discovery program, providing a set of matched sequence regions that can be used to guide a sequence alignment using ClustalW. The resulting alignments are more reliable than a sequence-only alignment, as determined by 20-fold cross-validation with a set of 106 protein examples from the CATH database, distributed in seven superfold families. The method is particularly effective for sets of proteins that have similar structures at the fold level but low sequence identity. The aim of this research is to contribute towards bridging the gap between protein sequence and structure analysis, in the hope that this can be used to assist the understanding of the relationship between sequence, structure and function. The tool is available at http://balabio.dcs.gla.ac.uk/msat/.

9.
Antibodies that bind to protein surfaces of interest can be used to report the three-dimensional structure of the protein as follows: Proteins are composed of linear polypeptide chains that fold together in complex spatial patterns to create the native protein structure. These folded structures form binding sites for antibodies. Antibody binding sites are typically "assembled" on the protein surface from segments that are far apart in the primary amino acid sequence of the target proteins. Short amino acid probe sequences that bind to the active region of each antibody can be used as witnesses to the antibody epitope surface and these probes can be efficiently selected from random sequence peptide libraries. This paper presents a new method to align these antibody epitopes to discontinuous regions of the one-dimensional amino acid sequence of a target protein. Such alignments of the epitopes indicate how segments of the protein sequence must be folded together in space and thus provide long-range constraints for solving the 3-D protein structure. This new antibody-based approach is applicable to the large fraction of proteins that are refractory to current approaches for structure determination and has the additional advantage of requiring very small amounts of the target protein. The binding site of an antibody is a surface, not just a continuous linear sequence, so the epitope mapping alignment problem is outside the scope of classical string alignment algorithms, such as Smith-Waterman. We formalize the alignment problem that is at the heart of this new approach, prove that the epitope mapping alignment problem is NP-complete, and give some initial results using a branch-and-bound algorithm to map two real-life cases. Initial results for two validation cases are presented for a graph-based protein surface neighbor mapping procedure that promises to provide additional spatial proximity information for the amino acid residues on the protein surface.

10.
A set of "similarity-parameters" was calculated that reflects the influence of the proteinogenic amino acids on the structure of the protein backbone. The parameters were derived from a detailed analysis of the amino acid specific main-chain torsion angle distributions as they are found in proteins (highly resolved protein structures from the Brookhaven Protein Data Bank). The purpose of these parameters is threefold: (1) they should help in estimating the structural effect of an amino acid substitution during the design of new mutants in protein engineering; (2) in modeling by homology they should mark places in the protein where changes in the folding are expected; and (3) they should form a scoring matrix in protein sequence alignment superior to identity scoring. The usability of the "structure derived correlation matrix (SCM)" for these purposes is assessed and demonstrated for some examples in the paper.

11.
Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence‐based molecular phylogenetic studies. Here we examined how different alignment methods affect the phylogenetic trees that are inferred from the alignments. We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment and direct optimization. When taking into account branch support, implied alignments produced by direct optimization were found to show the most extreme behaviour (based on the alignment programs for which nearly equivalent alignment parameters could be set) in that they provided the strongest support for the correct tree in the simulations in which it was easy to resolve the correct tree and the strongest support for the incorrect tree in our long‐branch‐attraction simulations. When applied to alignment‐sensitive process partitions with different histories, direct optimization showed the strongest mutual influence between the process partitions when they were aligned and phylogenetically analysed together, which makes detecting recombination more difficult. Simultaneous alignment performed well relative to direct optimization and progressive pairwise alignment across all simulations. Rather than relying upon methods that integrate alignment and tree search into a single step without accounting for alignment uncertainty, as with implied alignments, we suggest that simultaneous alignment using the similarity criterion, within the context of information available on biological processes and function, be applied whenever possible for sequence‐based phylogenetic analyses.

12.
For applications such as comparative modelling one major issue is the reliability of sequence alignments. Reliable regions in alignments can be predicted using sub-optimal alignments of the same pair of sequences. Here we show that reliable regions in alignments can also be predicted from multiple sequence profile information alone. Alignments were created for a set of remotely related pairs of proteins using five different test methods. Structural alignments were used to assess the quality of the alignments and the aligned positions were scored using information from the observed frequencies of amino acid residues in sequence profiles pre-generated for each template structure. High-scoring regions of these profile-derived alignment scores were a good predictor of reliably aligned regions. These profile-derived alignment scores are easy to obtain and are applicable to any alignment method. They can be used to detect those regions of alignments that are reliably aligned and to help predict the quality of an alignment. For those residues within secondary structure elements, the regions predicted as reliably aligned agreed with the structural alignments for between 92% and 97.4% of the residues. In loop regions just under 92% of the residues predicted to be reliable agreed with the structural alignments. The percentage of residues predicted as reliable ranged from 32.1% for helix residues to 52.8% for strand residues. This information could also be used to help predict conserved binding sites from sequence alignments. Residues in the template that were identified as binding sites, that aligned to an identical amino acid residue and where the sequence alignment agreed with the structural alignment were in highly conserved, high scoring regions over 80% of the time. This suggests that many binding sites that are present in both target and template sequences are in sequence-conserved regions and that there is the possibility of translating reliability to binding site prediction.

13.
BLMT     
Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement for the features in an alignment to be consecutive is restrictive and "n-grams" (aka k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics for the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools will disregard matches because of the lack of statistical significance in finding short sequences. This article presents the integrated Biological Language Modeling Toolkit (BLMT) that allows efficient calculation of n-gram statistics for arbitrary sequence datasets. AVAILABILITY: BLMT can be downloaded from http://www.cs.cmu.edu/~blmt/source and installed for standalone use on any Unix platform or Unix shell emulation such as Cygwin on the Windows platform. Specific tools and usage details are described in a "readme" file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at http://flan.blm.cs.cmu.edu/.
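The remark that "the unit for a gram may be chosen arbitrarily" can be made concrete with a short Python sketch that counts n-grams over any sequence of units, for example single residues or whole codons. This is not the BLMT interface, which the abstract does not describe; the helper names `ngram_counts` and `codons` are hypothetical.

```python
from collections import Counter


def ngram_counts(units, n):
    """Count n-grams over an arbitrary sequence of units."""
    if n < 1:
        raise ValueError("n must be at least 1")
    units = list(units)
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def codons(dna):
    """Split a DNA string into complete codons, dropping a trailing remainder."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


# Unit = one nucleotide: 3-grams over the raw string.
nucleotide_trigrams = ngram_counts("ACGACG", 3)

# Unit = one codon: 2-grams over codons.
codon_bigrams = ngram_counts(codons("ATGGCCATGGCC"), 2)
```

Keys are tuples of units, so the same function serves character-level and token-level grams without ambiguity about where one unit ends.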

14.
MOTIVATION: Membrane domain prediction has recently been re-evaluated by several groups, suggesting that the accuracy of existing methods is still rather limited. In this work, we revisit this problem and propose novel methods for prediction of alpha-helical as well as beta-sheet transmembrane (TM) domains. The new approach is based on a compact representation of an amino acid residue and its environment, which consists of predicted solvent accessibility and secondary structure of each amino acid. A recently introduced method for solvent accessibility prediction trained on a set of soluble proteins is used here to indicate segments of residues that are predicted not to be accessible to water and, therefore, may be 'buried' in the membrane. While evolutionary profiles in the form of a multiple alignment are used to derive these simple 'structural profiles', they are not used explicitly for the membrane domain prediction and the overall number of parameters in the model is significantly reduced. This offers the possibility of a more reliable estimation of the free parameters in the model with a limited number of experimentally resolved membrane protein structures. RESULTS: Using cross-validated training on available sets of structurally resolved and non-redundant alpha and beta membrane proteins, we demonstrate that membrane domain prediction methods based on such a compact representation outperform approaches that utilize explicitly evolutionary profiles and multiple alignments. Moreover, using an external evaluation by the TMH Benchmark server we show that our final prediction protocol for the TM helix prediction is competitive with the state-of-the-art methods, achieving per-residue accuracy of approximately 89% and per-segment accuracy of approximately 80% on the set of high resolution structures used by the TMH Benchmark server. At the same time the observed rates of confusion with signal peptides and globular proteins are the lowest among the tested methods. 
The new method is available online at http://minnou.cchmc.org.

15.
A sensitive technique for protein sequence motif recognition based on neural networks has been developed. It involves three major steps. (1) At each appropriate alignment position of a set of N matched sequences, a set of N aligned oligopeptides is specified with preselected window length. N neural nets are subsequently and successively trained on N-1 amino acid spans after eliminating each ith oligopeptide. A test for recognition of each of the ith spans is performed. The average neural net recognition over N such trials is used as a measure of conservation for the particular windowed region of the multiple alignment. This process is repeated for all possible spans of given length in the multiple alignment. (2) The M most conserved regions are regarded as motifs and the oligopeptides within each are used to train intensively M individual neural networks. (3) The M networks are then applied in a search for related primary structures in a databank of known protein sequences. The oligopeptide spans in the database sequence with strongest neural net output for each of the M networks are saved and then scored according to the output signals and the proper combination that follows the expected N- to C-terminal sequence order. The motifs from the database with highest similarity scores can then be used to retrain the M neural nets, which can be subsequently utilized for further searches in the databank, thus providing even greater sensitivity to recognize distant familial proteins. This technique was successfully applied to the integrase, DNA-polymerase and immunoglobulin families.

16.
The biological role, biochemical function, and structure of uncharacterized protein sequences are often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy. Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity. Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure-derived alignments. Using this benchmark we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.
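Scoring an alignment by its shifts relative to a structure-derived reference can be sketched in Python as follows. The abstract does not give the benchmark's exact definition, so the mean absolute shift used here, and the dictionary representation of alignments, are assumptions made for illustration.

```python
def mean_abs_shift(test, reference):
    """Mean absolute shift of a test alignment against a reference.

    Both alignments map residue indices of sequence A to residue indices
    of sequence B. Only residues aligned in both are compared; returns
    None when there are none.
    """
    shared = [i for i in reference if i in test]
    if not shared:
        return None
    return sum(abs(test[i] - reference[i]) for i in shared) / len(shared)


# Hypothetical example: residue 1 is off by one, residue 2 by two.
reference = {0: 0, 1: 1, 2: 3}
test = {0: 0, 1: 2, 2: 5}
shift = mean_abs_shift(test, reference)
```

A shift of zero means the test alignment reproduces the reference at every shared position; unlike a plain fraction-correct score, larger values indicate how far the misaligned residues are displaced.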

17.
A general protein sequence alignment methodology for detecting a priori unknown common structural and functional regions is described. The method proposed in this paper is based on two basic requirements for a meaningful alignment. First, each sequence or segment of a sequence is characterized by a multivariate physicochemical profile. Second, the alignment is performed by considering all the sequences simultaneously, and the algorithm detects those regions that form a set of similar profiles. In order to test the structural meaning of the alignment obtained from the sequences, quantitative comparisons are performed with structurally conserved regions (SCR) determined from the X-ray structures of three serine proteases. Results suggest that the limits of the SCR may be predicted from the similarities between the physicochemical profiles of the sequences. The procedures are not completely automated. The final step requires a visual screening of alternative pathways in order to determine an optimal alignment.

18.
A package for the creation and processing of multiple sequence alignment is described. There is no limit on the lengths of the processed nucleotide or amino acid sequences, and the number of sequences in the alignment is also unlimited. The main groups of functions are: a semi-automatic alignment editor; a wide set of functions for technical processing of alignments; nucleotide alignment mapping and translation; and similarity search functions. A user-friendly interface and a set of generally used file actions provide a special operational subsystem for everyday tasks.

19.
20.
To learn more about the evolutionary origins of Escherichia coli genes, we surveyed systematically for extended sequence similarities among the 1,264 amino acid sequences encoded by chromosomal genes of E. coli K-12 in SwissProt release 26 by using the FASTA program and imposing the following criteria: (i) alignment of segments at least 100 amino acids long and (ii) at least 20% amino acid identity. Altogether, 624 extended alignments meeting the two criteria were identified, corresponding to 577 protein sequences (45.6% of the 1,264 E. coli protein sequences) that had an extended alignment with at least one other E. coli protein sequence. To exclude alignments of questionable biological significance, we imposed a high threshold on the number of gaps allowed in each of the 624 extended alignments, giving us a subset of 464 proteins. The population of 464 alignments has the following characteristics expressed as median values of the group: 254 amino acids in the alignment, representing 86% of the length of the protein, 33% of the amino acids in the alignment being identical, and 1.1 gaps introduced per 100 amino acids of alignment. Where functions are known, nearly all pairs consist of functionally related proteins. This implies that the sequence similarity we detected has biological meaning and did not arise by chance. That a major fraction of E. coli proteins form extended alignments strongly suggests the predominance of duplication and divergence of ancestral genes in the evolution of E. coli genes. The range of degrees of similarity shows that some genes originated more recently than others. There is no evidence of genome doubling in the past, since map distances between genes of sequence-related proteins show no coherent pattern of favored separations.
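The two selection criteria stated in this abstract (aligned segment of at least 100 amino acids, at least 20% identity) amount to a simple filter, sketched below in Python. The function name and argument layout are hypothetical, and the gap threshold is left out because the abstract does not give its value.

```python
def is_extended_alignment(aligned_length, identical_residues):
    """Apply the abstract's two criteria to one pairwise alignment."""
    return (aligned_length >= 100
            and identical_residues / aligned_length >= 0.20)


# An alignment near the reported medians: 254 residues, about 33% identical.
passes = is_extended_alignment(254, 84)

# Check of the reported share of proteins with an extended alignment.
share_percent = round(100 * 577 / 1264, 1)
```

The last line recomputes the proportion 577 of 1,264 as a consistency check on the percentage quoted in the text.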
