Similar literature
 20 similar articles found (search time: 31 ms)
1.
MOTIVATION: Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap. RESULTS: We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices. AVAILABILITY: The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/. Contact: brenner@compbio.berkeley.edu.
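
The two resampling schemes named in this abstract can be contrasted with a minimal sketch. It is not the authors' benchmark code: the function names and the toy per-query scores are assumptions made for illustration, and the sketch does not model the database structure that the abstract identifies as the source of bias. It only shows the operational difference: the standard bootstrap redraws items with replacement, while the Bayesian bootstrap keeps every item and redraws continuous weights.

```python
import random


def standard_bootstrap_means(scores, n_rep, rng):
    """Efron's bootstrap: resample items with replacement, then average."""
    n = len(scores)
    return [sum(rng.choices(scores, k=n)) / n for _ in range(n_rep)]


def bayesian_bootstrap_means(scores, n_rep, rng):
    """Bayesian bootstrap: keep every item, redraw Dirichlet(1,...,1) weights."""
    means = []
    for _ in range(n_rep):
        # Normalized Exp(1) draws are distributed as Dirichlet(1, ..., 1).
        raw = [rng.expovariate(1.0) for _ in scores]
        total = sum(raw)
        means.append(sum(w * s for w, s in zip(raw, scores)) / total)
    return means


def standard_error(values):
    """Sample standard deviation of the replicate means."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5


# Hypothetical per-query performance scores (illustrative values only).
toy_scores = [0.2, 0.5, 0.9, 0.4]
rng = random.Random(0)
print(standard_error(standard_bootstrap_means(toy_scores, 1000, rng)))
print(standard_error(bayesian_bootstrap_means(toy_scores, 1000, rng)))
```

Because the Bayesian variant never drops an item entirely, every replicate still contains every query, which is one plausible reason it behaves differently on structured benchmark sets.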

2.
MOTIVATION: No general theory guides the selection of gap penalties for local sequence alignment. We empirically determined the most effective gap penalties for protein sequence similarity searches with substitution matrices over a range of target evolutionary distances from 20 to 200 Point Accepted Mutations (PAMs). RESULTS: We embedded real and simulated homologs of protein sequences into a database and searched the database to determine the gap penalties that produced the best statistical significance for the distant homologs. The most effective penalty for the first residue in a gap (q+r) changes as a function of evolutionary distance, while the gap extension penalty for additional residues (r) does not. For these data, the optimal gap penalties for a given matrix scaled in 1/3 bit units (e.g. BLOSUM50, PAM200) are q = 25 - 0.1 × (target PAM distance) and r = 5. Our results provide an empirical basis for selection of gap penalties and demonstrate how optimal gap penalties behave as a function of the target evolutionary distance of the substitution matrix. These gap penalties can improve expectation values by at least one order of magnitude when searching with short sequences, and improve the alignment of proteins containing short sequences repeated in tandem.
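
The empirical rule quoted in this abstract is simple enough to write down directly. The sketch below assumes a matrix scaled in 1/3 bit units, as the abstract states, and restricts the input to the 20 to 200 PAM range that was studied; the function names are hypothetical.

```python
def empirical_gap_penalties(target_pam_distance):
    """Return (q, r) from the abstract's rule q = 25 - 0.1 * PAM, r = 5.

    Penalties are in 1/3 bit units. Distances outside the studied
    20-200 PAM range are rejected rather than extrapolated.
    """
    if not 20 <= target_pam_distance <= 200:
        raise ValueError("rule was determined only for 20-200 PAM")
    q = 25 - target_pam_distance / 10
    r = 5
    return q, r


def gap_cost(length, q, r):
    """Affine gap cost: the first residue costs q + r, each further one r."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * length


q, r = empirical_gap_penalties(200)   # e.g. a PAM200-like target distance
print(q, r, gap_cost(1, q, r), gap_cost(4, q, r))
```

Only q depends on the target distance, which mirrors the abstract's finding that the first-residue penalty changes with evolutionary distance while the extension penalty does not.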

3.
4.
With whole-genome sequences being completed at an increasing rate, it is important to develop and assess tools to analyze them. Following annotation of the protein content of a genome, one can compare sequences with previously characterized homologous genes to detect novel functions within specific proteins in the evolution of the newly sequenced genome. One common statistical method to detect such changes is to compare the ratios of nonsynonymous (K(a)) to synonymous (K(s)) nucleotide substitution rates. Here, the effects of several parameters that can influence this calculation (sequence reconstruction method, phylogenetic tree branch length weighting, GC content, and codon bias) are examined. Also, two new alternative measures of adaptive evolution, the point accepted mutations (PAM)/neutral evolutionary distance (NED) ratio and the sequence space assessment (SSA) statistic, are presented. All of these methods are compared using two sequence families: the recent divergence of leptin orthologs in primates, and the more ancient divergence of the deoxyribonucleoside kinase family. The examination of these and other measures to detect changes of gene function along branches of a phylogenetic tree will become increasingly important in the postgenomic era.

5.
Bernsel A  Viklund H  Elofsson A 《Proteins》2008,71(3):1387-1399
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.

6.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.
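
The relative entropy of a scoring matrix, which this abstract singles out as the decisive parameter, has a standard definition for log-odds matrices: H = Σ q_ij · log2(q_ij / (p_i · p_j)), where q_ij are the target pair frequencies and p_i the background frequencies. The sketch below uses a hypothetical two-letter alphabet with made-up frequencies purely to show the calculation; it is not the procedure or the data from the study.

```python
import math


def relative_entropy_bits(target_freqs, background):
    """Relative entropy H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), in bits.

    target_freqs maps ordered residue pairs (a, b) to q_ab (summing to 1);
    background maps each residue to its frequency p_a.
    """
    h = 0.0
    for (a, b), q in target_freqs.items():
        if q > 0:  # pairs with zero frequency contribute nothing
            h += q * math.log2(q / (background[a] * background[b]))
    return h


# Toy two-letter alphabet; all frequencies are illustrative assumptions.
background = {"A": 0.5, "B": 0.5}
target = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
print(round(relative_entropy_bits(target, background), 4))  # 0.2781
```

A matrix whose target frequencies equal the product of the background frequencies has zero relative entropy and carries no information for distinguishing related from unrelated pairs.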

7.
Zhang Z  Wang Y  Wang L  Gao P 《PloS one》2010,5(12):e14316

Background

In the process of protein evolution, sequence variations within protein families can cause changes in protein structures and functions. However, structures tend to be more conserved than sequences and functions. This leads to an intriguing question: what is the evolutionary mechanism by which sequence variations produce structural changes? To investigate this question, we focused on the most common types of sequence variations: amino acid substitutions and insertions/deletions (indels). Here their combined effects on protein structure evolution within protein families are studied.

Results

Sequence-structure correlation analysis on 75 homologous structure families (from SCOP) that contain 20 or more non-redundant structures shows that in most of these families there is, statistically, a bilinear correlation between the number of substitutions and indels and the degree of structure variation. Bilinear regression of percent sequence non-identity (PNI) and standardized number of gaps (SNG) versus RMSD was performed. The coefficients from the regression analysis could be used to estimate the structure changes caused by each unit of substitution (structural substitution sensitivity, SSS) and by each unit of indel (structural indel sensitivity, SIDS). An analysis of 52 families with high bilinear-fit multiple correlation coefficients and statistically significant regression coefficients showed that SSS is mainly constrained by disulfide bonds, which have almost no effect on SIDS.

Conclusions

Structural changes in homologous protein families could be rationally explained by a bilinear model combining amino acid substitutions and indels. These results may further improve our understanding of the evolutionary mechanisms of protein structures.
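
The bilinear regression described in this entry can be sketched as an ordinary least-squares fit of RMSD against PNI and SNG. The abstract does not say whether an intercept was fitted, so the intercept term here is an assumption, as are the function names and the synthetic data. The two slopes play the roles of SSS and SIDS.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_bilinear(pni, sng, rmsd):
    """Least-squares fit of rmsd = c + sss * pni + sids * sng.

    Returns (c, sss, sids). The intercept c is an assumption of this sketch.
    """
    rows = [(1.0, p, s) for p, s in zip(pni, sng)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, rmsd)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Synthetic structure pairs generated from known coefficients (illustrative).
pni = [10.0, 20.0, 30.0, 40.0, 50.0]
sng = [1.0, 0.0, 3.0, 2.0, 5.0]
rmsd = [0.5 + 0.02 * p + 0.3 * s for p, s in zip(pni, sng)]
print(fit_bilinear(pni, sng, rmsd))
```

On real families the fit would be judged by its multiple correlation coefficient and the significance of each slope, as the Results section of the entry describes.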

8.
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) is weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
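
The sub-sampling step of the Rand-shuffle idea can be sketched as follows. Only the subset size of three comes from the abstract. How the per-subset HMMER results were combined is not stated, so keeping the best (lowest) E-value per target is an assumption of this sketch, and the search itself is left out: the hit dictionaries below are hypothetical stand-ins for parsed search output.

```python
import random


def draw_random_subsets(sequence_ids, n_subsets, subset_size=3, rng=None):
    """Draw random subsets of the parent alignment, without repeats inside a subset."""
    rng = rng or random.Random()
    return [rng.sample(sequence_ids, subset_size) for _ in range(n_subsets)]


def combine_best_evalue(per_subset_hits):
    """Assumed combination rule: keep the lowest E-value seen for each target."""
    best = {}
    for hits in per_subset_hits:
        for target, evalue in hits.items():
            if target not in best or evalue < best[target]:
                best[target] = evalue
    return best


# Hypothetical parent alignment members and per-subset search results.
parent = ["seqA", "seqB", "seqC", "seqD", "seqE", "seqF"]
subsets = draw_random_subsets(parent, n_subsets=4, rng=random.Random(1))
hits = [{"t1": 1e-3, "t2": 0.5}, {"t1": 1e-5}, {"t3": 0.02}]
print(subsets)
print(combine_best_evalue(hits))
```

Each subset would be turned into its own profile and searched separately; the combination step then lets a target be detected if any subset finds it.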

9.
Nicholas HB  Deerfield DW  Ropelewski AJ 《BioTechniques》2000,28(6):1174-8, 1180, 1182 passim
We provide a detailed overview of the choices inherent in performing a sequence database search, including the choice of algorithm, substitution matrix and gap model. Each of these choices has implications that can be described as restrictions on the underlying model of sequence evolution, the expected degree of divergence between the query sequence and the database sequences (if one uses an evolutionary based matrix), as well as the sensitivity and selectivity of the search. We conclude with a series of recommendations for researchers performing these searches based on our experience and literature studies.

10.
11.
Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices. The model has the additional advantage of being easily implemented.

12.
MOTIVATION: The observed correlations between pairs of homologous protein sequences are typically explained in terms of a Markovian dynamic of amino acid substitution. This model assumes that every location on the protein sequence has the same background distribution of amino acids, an assumption that is incompatible with the observed heterogeneity of protein amino acid profiles and with the success of profile multiple sequence alignment. RESULTS: We propose an alternative model of amino acid replacement during protein evolution based upon the assumption that the variation of the amino acid background distribution from one residue to the next is sufficient to explain the observed sequence correlations of homologs. The resulting dynamical model of independent replacements drawn from heterogeneous backgrounds is simple and consistent, and provides a unified homology match score for sequence-sequence, sequence-profile and profile-profile alignment.

13.
Amino acid sequence determination is the most reliable and powerful tool to identify a protein or to classify a new one by comparison of its primary structure with already known sequences. A rapid and simple purification procedure is an essential prerequisite for routine sequence determination. Structural characterization of llama whey proteins was undertaken for evolutionary as well as economic purposes. N-terminal sequence analyses directly on an Immobilon polyvinylidene difluoride (PVDF) membrane, following Western blotting of both native and SDS-denatured llama whey proteins after polyacrylamide gel electrophoresis, revealed three different forms of glycosylated alpha-lactalbumin, and a protein with a high degree of homology with a camel whey protein of unknown function. Furthermore, by immunoblotting techniques, the electrophoretic band corresponding to serum albumin was identified.

14.
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.

15.
Jun Gao  Zhijun Li 《Biopolymers》2010,93(4):340-347
It is widely accepted that a protein's sequence determines its structure. The surprising finding that proteins of distant sequence can adopt similar 3D structures has raised interesting questions regarding underlying conserved properties that are essential for protein folding and stability. Uncovering the conserved properties may shed light on the folding mechanism of proteins and help with the development of computational tools for protein structure prediction. We compiled and analyzed a structure pair dataset of 66 high-resolution and low sequence identity (16–38%) soluble proteins. Structure deviation for each pair was confirmed by calculating its Cα SiMax value and comparing its potential energy per residue. Analysis of favorable inter-residue interactions for each structure pair indicated that the average number of inter-residue interactions within each structure represents a conserved feature of homologous structures of distant sequence. Detailed comparison of individual types of interactions showed that the average number of either hydrophobic or hydrogen bonding interactions remains unchanged for each structure pair. These findings should help improve the quality of homology models based on templates of low sequence identity, thus broadening the application of homology modeling techniques for protein studies. © 2009 Wiley Periodicals, Inc. Biopolymers 93: 340–347, 2010.

16.
The mouse VHIII subgroup is composed of four families which share sequence homology. We isolated a VH germ-line genomic clone, which cross-hybridizes with a cDNA probe from one of these families, derived from a myeloma secreting an antigalactan antibody. We report here the nucleotide sequence of the cross-hybridizing gene and show that very likely it has an anti-sheep red blood cell specificity. Comparison of its nucleotide sequence with those of the three other VHIII families shows that these genes share segmental homologies of variable lengths. This suggests that interchanges of sequence blocks between VH genes could be an important evolutionary mechanism for diversifying the germ-line repertoire. The strong homology (82%) with human VHIII genes suggests that efficient antibody sequences are strongly conserved. This conservation of homology is particularly striking when compared to the more limited homology (63%) between mouse and human C kappa genes.

17.
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.

18.
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
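
The core idea of an aggregated Markov process can be shown with a toy chain. The sketch below is not the codon model from the study: it assumes three hidden states (standing in for codons) that map onto two observed labels (standing in for amino acids), with a symmetric transition matrix chosen so that the stationary distribution is uniform. The hidden chain is Markovian, yet the observed two-step transition probability differs from what a Markov model fitted to the observed one-step probabilities would predict.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def aggregate(p, labels, stationary):
    """Collapse a hidden-state transition matrix onto observed labels.

    Hidden states sharing a label are weighted by their stationary frequency.
    """
    classes = sorted(set(labels))
    agg = {}
    for a in classes:
        members = [i for i, lab in enumerate(labels) if lab == a]
        weight = sum(stationary[i] for i in members)
        for b in classes:
            agg[(a, b)] = sum(
                stationary[i] * p[i][j]
                for i in members
                for j, lab in enumerate(labels) if lab == b
            ) / weight
    return agg


# Toy hidden chain: symmetric, so the stationary distribution is uniform.
P = [[0.9, 0.1, 0.0],
     [0.1, 0.5, 0.4],
     [0.0, 0.4, 0.6]]
LABELS = ["X", "X", "Y"]      # two hidden states encode the same observed label
PI = [1 / 3, 1 / 3, 1 / 3]

one_step = aggregate(P, LABELS, PI)
two_step = aggregate(matmul(P, P), LABELS, PI)

# What a Markov model on the observed labels would predict for two steps.
markov_prediction = (one_step[("X", "X")] * one_step[("X", "Y")]
                     + one_step[("X", "Y")] * one_step[("Y", "Y")])

print(round(two_step[("X", "Y")], 4), round(markov_prediction, 4))  # 0.24 0.28
```

The mismatch between the two printed values is the kind of apparent time dependence the abstract attributes to observing a codon-level process only through its encoded amino acids.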

19.
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.

20.
Peptidylglycine alpha-amidating monooxygenase (PAM: EC 1.14.17.3) is a bifunctional protein which catalyzes the COOH-terminal amidation of bioactive peptides; the NH2-terminal monooxygenase and mid-region lyase act in sequence to perform the peptide alpha-amidation reaction. Alternative splicing of the single PAM gene gives rise to mRNAs generating PAM proteins with and without a putative transmembrane domain, with and without a linker region between the two enzymes, and forms containing only the monooxygenase domain. The expression, endoproteolytic processing, storage, and secretion of this secretory granule-associated protein were examined after stable transfection of AtT-20 mouse pituitary cells with naturally occurring and truncated PAM proteins. The transfected proteins were examined using enzyme assays, subcellular fractionation, Western blotting, and immunocytochemistry. Western blots of crude membrane and soluble fractions of transfected cells demonstrated that all PAM proteins were endoproteolytically processed. When the linker region was present between the monooxygenase and lyase domains, monofunctional soluble enzymes were generated from bifunctional PAM proteins; without the linker region, bifunctional enzymes were generated. Soluble forms of PAM expressed in AtT-20 cells and soluble proteins generated through selective endoproteolysis of membrane-associated PAM were secreted in an active form into the medium; secretion of the transfected proteins and endogenous hormone were stimulated in parallel by secretagogues. PAM proteins were localized by immunocytochemistry in the perinuclear region near the Golgi apparatus and in secretory granules, with the greatest intensity of staining in the perinuclear region in cell lines expressing integral membrane forms of PAM. Monofunctional and bifunctional PAM proteins that were soluble or membrane-associated were all packaged into regulated secretory granules in AtT-20 cells.
