共查询到20条相似文献,搜索用时 60 毫秒
1.
Dirk Metzler 《Journal of computational biology》2006,13(4):882-896
We examine a Poisson heuristic for judging the significance of local sequence alignments with gaps. Model parameters are estimated directly from the sequences to be aligned, so that laborious prior simulation studies or database comparisons for the estimation of parameters describing the connection between score and E-value are unnecessary. Simulation studies give evidence that this method gives reasonable results even when the usual assumptions like the independence of sequence positions are violated. 相似文献
2.
Mott R 《Journal of molecular biology》2000,300(3):649-659
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e. gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of databank searches. The method is based on the theoretical ideas introduced by R. Mott and R. Tribe in 1999. Extensive simulation studies show that score-thresholds produced by the method are accurate to within +/-5 % 95 % of the time. We also investigate factors which effect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently, it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood. 相似文献
3.
A multitude of motif-finding tools have been published, which can generally be assigned to one of three classes: expectation-maximization, Gibbs-sampling or enumeration. Irrespective of this grouping, most motif detection tools only take into account similarities across ungapped sequence regions, possibly causing short motifs located peripherally and in varying distance to a 'core' motif to be missed. We present a new method, adding to the set of expectation-maximization approaches, that permits the use of gapped alignments for motif elucidation. Availability: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/mclip.jar. Supplementary information: http://bioinfoserver.rsbs.anu.edu.au/utils/mclip/info.php. 相似文献
4.
We present a novel maximum-likelihood-based algorithm for estimating the distribution of alignment scores from the scores of unrelated sequences in a database search. Using a new method for measuring the accuracy of p-values, we show that our maximum-likelihood-based algorithm is more accurate than existing regression-based and lookup table methods. We explore a more sophisticated way of modeling and estimating the score distributions (using a two-component mixture model and expectation maximization), but conclude that this does not improve significantly over simply ignoring scores with small E-values during estimation. Finally, we measure the classification accuracy of p-values estimated in different ways and observe that inaccurate p-values can, somewhat paradoxically, lead to higher classification accuracy. We explain this paradox and argue that statistical accuracy, not classification accuracy, should be the primary criterion in comparisons of similarity search methods that return p-values that adjust for target sequence length. 相似文献
5.
Making sense of score statistics for sequence alignments 总被引:1,自引:0,他引:1
The search for similarity between two biological sequences lies at the core of many applications in bioinformatics. This paper aims to highlight a few of the principles that should be kept in mind when evaluating the statistical significance of alignments between sequences. The extreme value distribution is first introduced, which in most cases describes the distribution of alignment scores between a query and a database. The effects of the similarity matrix and gap penalty values on the score distribution are then examined, and it is shown that the alignment statistics can undergo an abrupt phase transition. A few types of random sequence databases used in the estimation of statistical significance are presented, and the statistics employed by the BLAST, FASTA and PRSS programs are compared. Finally the different strategies used to assess the statistical significance of the matches produced by profiles and hidden Markov models are presented. 相似文献
6.
Stefan Wolfsheimer Bernd Burghardt Alexander K Hartmann 《Algorithms for molecular biology : AMB》2007,2(1):9
Background
The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case, where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant. 相似文献7.
8.
Jaroszewski L Li W Godzik A 《Protein science : a publication of the Protein Society》2002,11(7):1702-1713
A major bottleneck in comparative modeling is the alignment quality; this is especially true for proteins whose distant relationships could be reliably recognized only by recent advances in fold recognition. The best algorithms excel in recognizing distant homologs but often produce incorrect alignments for over 50% of protein pairs in large fold-prediction benchmarks. The alignments obtained by sequence-sequence or sequence-structure matching algorithms differ significantly from the structural alignments. To study this problem, we developed a simplified method to explicitly enumerate all possible alignments for a pair of proteins. This allowed us to estimate the number of significantly different alignments for a given scoring method that score better than the structural alignment. Using several examples of distantly related proteins, we show that for standard sequence-sequence alignment methods, the number of significantly different alignments is usually large, often about 10(10) alternatives. This distance decreases when the alignment method is improved, but the number is still too large for the brute force enumeration approach. More effective strategies were needed, so we evaluated and compared two well-known approaches for searching the space of suboptimal alignments. We combined their best features and produced a hybrid method, which yielded alignments that surpassed the original alignments for about 50% of protein pairs with minimal computational effort. 相似文献
9.
PROMALS: towards accurate multiple sequence alignments of distantly related proteins 总被引:7,自引:0,他引:7
MOTIVATION: Accurate multiple sequence alignments are essential in protein structure modeling, functional prediction and efficient planning of experiments. Although the alignment problem has attracted considerable attention, preparation of high-quality alignments for distantly related sequences remains a difficult task. RESULTS: We developed PROMALS, a multiple alignment method that shows promising results for protein homologs with sequence identity below 10%, aligning close to half of the amino acid residues correctly on average. This is about three times more accurate than traditional pairwise sequence alignment methods. PROMALS algorithm derives its strength from several sources: (i) sequence database searches to retrieve additional homologs; (ii) accurate secondary structure prediction; (iii) a hidden Markov model that uses a novel combined scoring of amino acids and secondary structures; (iv) probabilistic consistency-based scoring applied to progressive alignment of profiles. Compared to the best alignment methods that do not use secondary structure prediction and database searches (e.g. MUMMALS, ProbCons and MAFFT), PROMALS is up to 30% more accurate, with improvement being most prominent for highly divergent homologs. Compared to SPEM and HHalign, which also employ database searches and secondary structure prediction, PROMALS shows an accuracy improvement of several percent. AVAILABILITY: The PROMALS web server is available at: http://prodata.swmed.edu/promals/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. 相似文献
10.
The accuracy of the alignments of protein sequences depends on the score matrix and gap penalties used in performing the alignment. Most score functions are designed to find homologs in the various databases rather than to generate accurate alignments between known homologs. We describe the optimization of a score function for the purpose of generating accurate alignments, as evaluated by using a coordinate root-mean-square deviation (RMSD)-based merit function. We show that the resulting score matrix, which we call STROMA, generates more accurate alignments than other commonly used score matrices, and this difference is not due to differences in the gap penalties. In fact, in contrast to most of the other matrices, the alignment accuracies with STROMA are relatively insensitive to the choice of gap penalty parameters. 相似文献
11.
MOTIVATION: Recent studies have revealed the importance of considering quality scores of reads generated by next-generation sequence (NGS) platforms in various downstream analyses. It is also known that probabilistic alignments based on marginal probabilities (e.g. aligned-column and/or gap probabilities) provide more accurate alignment than conventional maximum score-based alignment. There exists, however, no study about probabilistic alignment that considers quality scores explicitly, although the method is expected to be useful in SNP/indel callers and bisulfite mapping, because accurate estimation of aligned columns or gaps is important in those analyses. RESULTS: In this study, we propose methods of probabilistic alignment that consider quality scores of (one of) the sequences as well as a usual score matrix. The method is based on posterior decoding techniques in which various marginal probabilities are computed from a probabilistic model of alignments with quality scores, and can arbitrarily trade-off sensitivity and positive predictive value (PPV) of prediction (aligned columns and gaps). The method is directly applicable to read mapping (alignment) toward accurate detection of SNPs and indels. Several computational experiments indicated that probabilistic alignments can estimate aligned columns and gaps accurately, compared with other mapping algorithms e.g. SHRiMP2, Stampy, BWA and Novoalign. The study also suggested that our approach yields favorable precision for SNP/indel calling. 相似文献
12.
13.
Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments 总被引:1,自引:0,他引:1
MOTIVATION: The development of powerful automatic methods for the comparison of protein sequences has become increasingly important. Profile-to-profile comparisons allow for the use of broader information about protein families, resulting in more sensitive and accurate comparisons of distantly related sequences. A key part in the comparison of two profiles is the method for the calculation of scores for the position matches. A number of methods based on various theoretical considerations have been proposed. We implemented several previously reported scoring functions as well as our own functions, and compared them on the basis of their ability to produce accurate short ungapped alignments of a given length. RESULTS: Our results suggest that the family of the probabilistic methods (log-odds based methods and prof_sim) may be the more appropriate choice for the generation of initial 'seeds' as the first step to produce local profile-profile alignments. The most effective scoring systems were the closely related modifications of functions previously implemented in the COMPASS and Picasso methods. 相似文献
14.
Toward an accurate theoretical framework for describing ensembles for proteins under strongly denaturing conditions 下载免费PDF全文
Our focus is on an appropriate theoretical framework for describing highly denatured proteins. In high concentrations of denaturants, proteins behave like polymers in a good solvent and ensembles for denatured proteins can be modeled by ignoring all interactions except excluded volume (EV) effects. To assay conformational preferences of highly denatured proteins, we quantify a variety of properties for EV-limit ensembles of 23 two-state proteins. We find that modeled denatured proteins can be best described as follows. Average shapes are consistent with prolate ellipsoids. Ensembles are characterized by large correlated fluctuations. Sequence-specific conformational preferences are restricted to local length scales that span five to nine residues. Beyond local length scales, chain properties follow well-defined power laws that are expected for generic polymers in the EV limit. The average available volume is filled inefficiently, and cavities of all sizes are found within the interiors of denatured proteins. All properties characterized from simulated ensembles match predictions from rigorous field theories. We use our results to resolve between conflicting proposals for structure in ensembles for highly denatured states. 相似文献
15.
Predrag Kukic Claudio Mirabello Giuseppe Tradigo Ian Walsh Pierangelo Veltri Gianluca Pollastri 《BMC bioinformatics》2014,15(1):1-15
Background
Protein inter-residue contact maps provide a translation and rotation invariant topological representation of a protein. They can be used as an intermediary step in protein structure predictions. However, the prediction of contact maps represents an unbalanced problem as far fewer examples of contacts than non-contacts exist in a protein structure. In this study we explore the possibility of completely eliminating the unbalanced nature of the contact map prediction problem by predicting real-value distances between residues. Predicting full inter-residue distance maps and applying them in protein structure predictions has been relatively unexplored in the past.Results
We initially demonstrate that the use of native-like distance maps is able to reproduce 3D structures almost identical to the targets, giving an average RMSD of 0.5Å. In addition, the corrupted physical maps with an introduced random error of ±6Å are able to reconstruct the targets within an average RMSD of 2Å. After demonstrating the reconstruction potential of distance maps, we develop two classes of predictors using two-dimensional recursive neural networks: an ab initio predictor that relies only on the protein sequence and evolutionary information, and a template-based predictor in which additional structural homology information is provided. We find that the ab initio predictor is able to reproduce distances with an RMSD of 6Å, regardless of the evolutionary content provided. Furthermore, we show that the template-based predictor exploits both sequence and structure information even in cases of dubious homology and outperforms the best template hit with a clear margin of up to 3.7Å. Lastly, we demonstrate the ability of the two predictors to reconstruct the CASP9 targets shorter than 200 residues producing the results similar to the state of the machine learning art approach implemented in the Distill server.Conclusions
The methodology presented here, if complemented by more complex reconstruction protocols, can represent a possible path to improve machine learning algorithms for 3D protein structure prediction. Moreover, it can be used as an intermediary step in protein structure predictions either on its own or complemented by NMR restraints. 相似文献16.
17.
18.
Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure 总被引:2,自引:0,他引:2
MOTIVATION: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons. RESULTS: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs. 相似文献
19.
MOTIVATION: Different automatic methods of sequence alignments are routinely used as a starting point for homology searches and function inference. Confidence in an alignment probability is one of the major fundamentals of massive automatic genome-scale pairwise comparisons, for clustering of putative orthologs and paralogs, sequenced genome annotation or multiple-genomic tree constructions. Extreme value distribution based on the Karlin-Altschul model, usually advised for large-scale comparisons are not always valid, particularly in the case of comparisons of non-biased with nucleotide-biased genomes (such that of Plasmodium falciparum). Z-values estimates based on Monte Carlo technics, can be calculated experimentally for any alignment output, whatever the method used. Empirically, a Z-value higher than approximately 8 is supposed reasonable to assess that an alignment score is significant, but this arbitrary figure was never theoretically justified. RESULTS: In this paper, we used the Bienaymé-Chebyshev inequality to demonstrate a theorem of the upper limit of an alignment score probability (or P-value). This theorem implies that a computed Z-value is a statistical test, a single-linkage clustering criterion and that 1/Z-value(2) is an upper limit to the probability of an alignment score whatever the actual probability law is. Therefore, this study provides the missing theoretical link between a Z-value cut-off used for an automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes). 相似文献
20.
Liu K Warnow TJ Holder MT Nelesen SM Yu J Stamatakis AP Linder CR 《Systematic biology》2012,61(1):90-106
Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed. 相似文献