Found 20 similar documents (search time: 31 ms)
1.
Richard Mott 《Bulletin of mathematical biology》1992,54(1):59-75
A method is described for estimating the distribution and hence testing the statistical significance of sequence similarity
scores obtained during a data-bank search. Maximum-likelihood is used to fit a model to the scores, avoiding any costly simulation
of random sequences. The method is applied in detail to the Smith-Waterman algorithm when gaps are allowed, and is shown to
give results very similar to those obtained by simulation.
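
The maximum-likelihood fitting mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not name the fitted model, so the Gumbel (extreme value) form used here is an assumption, borrowed from other entries in this list that describe alignment scores as following that distribution. The function names `fit_gumbel_ml` and `gumbel_p_value` are hypothetical, and this is not presented as the paper's actual procedure.

```python
import math


def fit_gumbel_ml(scores, iterations=200):
    """Maximum-likelihood estimate of Gumbel location (mu) and scale (beta).

    Assumption: the scores follow a Gumbel distribution. The source abstract
    does not state which model is fitted.
    """
    n = len(scores)
    mean = sum(scores) / n
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        raise ValueError("scores must not all be identical")

    def g(beta):
        # ML condition for the scale: beta = mean - weighted_mean, with
        # weights exp(-x / beta). Shifting by the lowest score keeps every
        # exponential in [0, 1], which avoids overflow.
        weights = [math.exp(-(x - lowest) / beta) for x in scores]
        weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
        return beta - mean + weighted_mean

    # g is negative for a tiny beta and positive at beta = score range,
    # so bisection on this bracket converges to a root.
    lo, hi = 1e-9 * (highest - lowest), highest - lowest
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)

    # Location follows in closed form once the scale is known.
    mean_weight = sum(math.exp(-(x - lowest) / beta) for x in scores) / n
    mu = lowest - beta * math.log(mean_weight)
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(S >= score) under a Gumbel distribution with the given parameters."""
    z = (score - mu) / beta
    if z < -700:  # exp(-z) would overflow; the tail probability is 1 here
        return 1.0
    return -math.expm1(-math.exp(-z))
```

Fitting the observed database-search scores directly, as sketched here, avoids generating and aligning random sequences; the p-value of a new score then follows from the fitted parameters.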
2.
Protein sequence comparison: methods and significance (total citations: 1, self-citations: 0, citations by others: 1)
3.
Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches (total citations: 2, self-citations: 1, citations by others: 1)
Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
4.
SUMMARY: MaxBench is a web-based system available for evaluating the results of sequence and structure comparison methods, based on the SCOP protein domain classification. The system makes it easy for developers to both compare the overall performance of their methods to standard algorithms and investigate the results of individual comparisons. AVAILABILITY: http://www.sanger.ac.uk/Users/lp1/MaxBench/
5.
Background
Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families.
6.
Background
The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.
7.
W R Taylor 《Protein engineering》1988,2(2):77-86
8.
Beginning with the concept of near-optimal sequence alignments, we can assign a probability that each element in one sequence is paired in an alignment with each element in another sequence. This involves a sum over the set of all possible pairwise alignments. The method employs a designed hidden Markov model (HMM) and the rigorous forward and forward-backward algorithms of Rabiner. The approach can use any standard sequence-element-to-element probabilistic similarity measures and affine gap penalty functions. This allows the positional alignment statistical significance to be obtained as a function of such variables. A measure of the probabilistic relationship between any single sequence and a set of sequences can be directly obtained. In addition, the employed HMM with the Viterbi algorithm provides a simple link to the standard dynamic programming optimal alignment algorithms.
9.
General methods of sequence comparison (total citations: 9, self-citations: 0, citations by others: 9)
Michael S. Waterman 《Bulletin of mathematical biology》1984,46(4):473-500
Mathematical methods for comparison of nucleic acid sequences are reviewed. There are two major methods of sequence comparison:
dynamic programming and a method referred to here as the regions method. The problem types discussed are comparison of two
sequences, location of long matching segments, efficient database searches and comparison of several sequences.
This work was supported by a grant from the System Development Foundation.
10.
Aleksandar Poleksic 《BMC bioinformatics》2009,10(1):112
Background
In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive.
11.
Gan HH Perlow RA Roy S Ko J Wu M Huang J Yan S Nicoletta A Vafai J Sun D Wang L Noah JE Pasquali S Schlick T 《Biophysical journal》2002,83(5):2781-2791
Current analyses of protein sequence/structure relationships have focused on expected similarity relationships for structurally similar proteins. To survey and explore the basis of these relationships, we present a general sequence/structure map that covers all combinations of similarity/dissimilarity relationships and provide novel energetic analyses of these relationships. To aid our analysis, we divide protein relationships into four categories: expected/unexpected similarity (S and S(?)) and expected/unexpected dissimilarity (D and D(?)) relationships. In the expected similarity region S, we show that trends in the sequence/structure relation can be derived based on the requirement of protein stability and the energetics of sequence and structural changes. Specifically, we derive a formula relating sequence and structural deviations to a parameter characterizing protein stiffness; the formula fits the data reasonably well. We suggest that the absence of data in region S(?) (high structural but low sequence similarity) is due to unfavorable energetics. In contrast to region S, region D(?) (high sequence but low structural similarity) is well-represented by proteins that can accommodate large structural changes. Our analyses indicate that there are several categories of similarity relationships and that protein energetics provide a basis for understanding these relationships.
12.
MOTIVATION: Searching a protein sequence database for homologs is a
powerful tool for discovering the structure and function of a sequence. Two
new methods for searching sequence databases have recently been described:
Probabilistic Smith-Waterman (PSW), which is based on Hidden Markov models
for a single sequence using a standard scoring matrix, and a new version of
BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments.
RESULTS: This paper compares and contrasts the effectiveness of these
methods with three older methods (Smith-Waterman: SSEARCH, FASTA and
BLASTP). The analysis indicates that the new methods are useful, and often
offer improved accuracy. These tools are compared using a curated (by Bill
Pearson) version of the annotated portion of PIR 39. Three different
statistical criteria are utilized: equivalence number, minimum errors and
the receiver operating characteristic. For complete-length protein query
sequences from large families, PSW's accuracy is superior to that of the
other methods, but its accuracy is poor when used with partial-length query
sequences. False negatives are twice as common as false positives
irrespective of the search methods if a family-specific threshold score
that minimizes the total number of errors (i.e. the most favorable
threshold score possible) is used. Thus, sensitivity, not selectivity, is
the major problem. Among the analyzed methods using default parameters, the
best accuracy was obtained from SSEARCH and PSW for complete-length
proteins, and the two BLAST programs, plus SSEARCH, for partial-length
proteins.
13.
14.
We test models for the evolution of helical regions of RNA sequences, where the base pairing constraint leads to correlated compensatory substitutions occurring on either side of the pair. These models are of three types: 6-state models include only the four Watson-Crick pairs plus GU and UG; 7-state models include a single mismatch state that combines all of the 10 possible mismatches; 16-state models treat all mismatch states separately. We analyzed a set of eubacterial ribosomal RNA sequences with a well-established phylogenetic tree structure. For each model, the maximum-likelihood values of the parameters were obtained. The models were compared using the Akaike information criterion, the likelihood-ratio test, and Cox's test. With a high significance level, models that permit a nonzero rate of double substitutions performed better than those that assume zero double substitution rate. Some models assume symmetry between GC and CG, between AU and UA, and between GU and UG. Models that relaxed this symmetry assumption performed slightly better, but the tests did not all agree on the significance level. The most general time-reversible model significantly outperformed any of the simplifications. We consider the relative merits of all these models for molecular phylogenetics.
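
As a small illustration of the model comparison described in this abstract, the Akaike information criterion is AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood; lower values are preferred. The sketch below assumes this standard definition. The model names, log-likelihoods, and parameter counts in the usage note are hypothetical placeholders, not values from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(models):
    """Return model names sorted from best (lowest AIC) to worst.

    `models` maps a name to a (maximized log-likelihood, parameter count) pair.
    """
    return sorted(models, key=lambda name: aic(*models[name]))
```

For example, `rank_by_aic({"model_a": (-1050.0, 5), "model_b": (-1040.0, 20)})` returns `["model_a", "model_b"]`: the hypothetical `model_b` fits better but pays a larger penalty for its extra parameters.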
15.
16.
Chen Z 《Bioinformatics (Oxford, England)》2003,19(18):2456-2460
MOTIVATION: Comprehensive performance assessment is important for improving sequence database search methods. Sensitivity, selectivity and speed are three major yet usually conflicting evaluation criteria. The average precision (AP) measure aims to combine the sensitivity and selectivity features of a search algorithm. It can be easily visualized and extended to analyze results from a set of queries. Finally, the time-AP plot can clearly show the overall performance of different search methods. RESULTS: Experiments are performed based on the SCOP database. Popular sequence comparison algorithms, namely Smith-Waterman (SSEARCH), FASTA, BLAST and PSI-BLAST are evaluated. We find that (1) the low-complexity segment filtration procedure in BLAST actually harms its overall search quality; (2) AP scores of different search methods are approximately in proportion to the logarithm of search time; and (3) homologs in protein families with many members tend to be more obscure than those in small families. This measure may be helpful for developing new search algorithms and can guide researchers in selecting the most suitable search methods. AVAILABILITY: Test sets and source code of this evaluation tool are available upon request.
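
The average precision measure named in this abstract can be sketched as follows. This assumes the standard information-retrieval definition (the mean, over all true homologs, of the precision at the rank where each one is retrieved); the abstract does not spell out its exact formula, so the paper's variant may differ. The function name and arguments are hypothetical.

```python
def average_precision(ranked_hits, n_relevant=None):
    """Average precision of a ranked hit list.

    `ranked_hits` is a sequence of booleans in rank order (True = true homolog).
    `n_relevant` is the total number of true homologs in the database; it
    defaults to the number found in the list, so homologs that were never
    retrieved only count against the score when it is given explicitly.
    """
    if n_relevant is None:
        n_relevant = sum(ranked_hits)
    if n_relevant == 0:
        return 0.0
    found = 0
    precision_sum = 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            found += 1
            precision_sum += found / rank  # precision at this rank
    return precision_sum / n_relevant
```

A list that ranks every homolog ahead of every false hit scores 1.0, so the single number rewards both finding homologs (sensitivity) and ranking them above chance matches (selectivity).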
17.
A class of non-linear similarity functions s_1 has been proposed for comparing subalignments of biological sequences. The distribution of maximal
s_1-similarities is well approximated by the extreme value distribution. The significance levels of s_1 are studied for a variety of nucleotide
frequency distributions as well as for several matrices of amino acid substitution costs. Also, the significance levels of s_1 are explored for
comparing three biological sequences. Several previously described subalignments of bovine proenkephalin
and porcine prodynorphin are shown to be highly significant.
18.
A survey of multiple sequence comparison methods (total citations: 7, self-citations: 0, citations by others: 7)
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey
of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular
sequences. Emphasis is given to the different approaches of the heuristic methods. Four distance measures derived from information
engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use of entropy, which plays a central role in information theory as measures of information, choice and uncertainty, is proposed as a simple
measure for the evaluation of the optimality of an alignment in the absence of any a priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between
alternative alignments of the same set of 5S RNAs as obtained by several different heuristic methods.
19.
Abstract Organisms are said to be in developmental rate isomorphy when the proportions of developmental stage durations are unaffected by temperature. Comprehensive stage‐specific developmental data were generated on the cabbage beetle, Colaphellus bowringi Baly (Coleoptera: Chrysomelidae), at eight temperatures ranging from 16°C to 30°C (in 2°C increments) and five analytical methods were used to test the rate isomorphy hypothesis, including: (i) direct comparison of lower developmental thresholds with standard errors based on the traditional linear equation describing developmental rate as the linear function of temperature; (ii) analysis of covariance to compare the lower developmental thresholds of different stages based on the Ikemoto‐Takai linear equation; (iii) testing the significance of the slope item in the regression line of versus temperature, where p is the ratio of the developmental duration of a particular developmental stage to the entire pre‐imaginal developmental duration for one insect or mite species; (iv) analysis of variance to test for significant differences between the ratios of developmental stage durations to that of pre‐imaginal development; and (v) checking whether there is an element less than a given level of significance in the p‐value matrix of rotating regression line. The results revealed no significant difference among the lower developmental thresholds or among the aforementioned ratios, and thus convincingly confirmed the rate isomorphy hypothesis.
20.
Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure (total citations: 31, self-citations: 0, citations by others: 31)
The degree of similarity in the three-dimensional structures of two proteins can be examined by comparing the patterns of hydrophobicity found in their amino acid sequences. Each type of amino acid residue is assigned a numerical hydrophobicity, and the correlation coefficient rH is computed between all pairs of residues in the two sequences. In tests on sequences from two properly aligned proteins of similar three-dimensional structures, rH is found in the range 0.3 to 0.7. Improperly aligned sequences or unrelated sequences give rH near zero. By considering the observed frequency of amino acid replacements among related structures, a set of optimal matching hydrophobicities (OMHs) was derived. With this set of OMHs, significant correlation coefficients are calculated for similar three-dimensional structures, even though the two sequences contain few identical residues. An example is the two similar folding domains of rhodanese (rH = 0.5). Predictions are made of similar three-dimensional structures for the alpha and beta chains of the various phycobiliproteins, and for delta hemolysin and melittin.
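
The correlation coefficient rH described in this abstract can be sketched as a Pearson correlation over the hydrophobicities of aligned residue pairs. The OMH values derived in the paper are not given in the abstract, so this sketch substitutes the Kyte-Doolittle hydropathy scale; that substitution, the function name, and the handling of gap characters are all assumptions made for illustration.

```python
import math

# Kyte-Doolittle hydropathy values, used here as a stand-in for the paper's
# OMH set, which the abstract does not list.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_correlation(aligned_a, aligned_b, scale=KYTE_DOOLITTLE):
    """Pearson correlation of hydrophobicities over aligned residue pairs.

    Both inputs must already be aligned and of equal length. Columns where
    either symbol is missing from the scale (for example a gap "-") are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (scale[a], scale[b])
        for a, b in zip(aligned_a, aligned_b)
        if a in scale and b in scale
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two aligned residue pairs")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("hydrophobicity is constant in one sequence")
    return cov / math.sqrt(var_a * var_b)
```

Because the correlation depends on hydrophobicity rather than residue identity, two aligned sequences with few identical residues can still score highly when hydrophobic positions line up, which is the property the abstract relies on.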