Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
A method is described for estimating the distribution, and hence testing the statistical significance, of sequence similarity scores obtained during a data-bank search. Maximum likelihood is used to fit a model to the scores, avoiding any costly simulation of random sequences. The method is applied in detail to the Smith-Waterman algorithm when gaps are allowed, and is shown to give results very similar to those obtained by simulation.
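As a rough illustration of fitting a score distribution by maximum likelihood instead of simulating random sequences, the sketch below fits a Gumbel (extreme value) distribution to a list of scores and turns a score into a P-value. The choice of a plain Gumbel model, the function names `fit_gumbel_mle` and `gumbel_pvalue`, and the bisection solver are assumptions made for this example; the abstract does not specify the model used in the paper, which may well differ.

```python
import math


def fit_gumbel_mle(scores, iterations=100):
    """Maximum-likelihood fit of a Gumbel distribution (for maxima).

    Returns (mu, beta): the location and scale parameters.
    Illustrative sketch only; not the model from the abstract.
    """
    xs = [float(x) for x in scores]
    n = len(xs)
    x_min, x_max = min(xs), max(xs)
    spread = x_max - x_min
    if n < 2 or spread == 0.0:
        raise ValueError("need at least two distinct scores")
    mean = sum(xs) / n

    def weights(beta):
        # Shift by x_min so every exponent is <= 0 (no overflow).
        return [math.exp(-(x - x_min) / beta) for x in xs]

    def g(beta):
        # The ML scale solves: beta = mean - sum(x*w) / sum(w).
        w = weights(beta)
        weighted_mean = sum(x * wi for x, wi in zip(xs, w)) / sum(w)
        return beta - (mean - weighted_mean)

    # g is negative near zero and non-negative at beta = spread,
    # so bisection brackets the root.
    lo, hi = spread * 1e-6, spread
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    mu = x_min - beta * math.log(sum(weights(beta)) / n)
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) under the fitted Gumbel distribution."""
    return -math.expm1(-math.exp(-(score - mu) / beta))
```

Once `mu` and `beta` have been fitted to the scores of one search, `gumbel_pvalue(s, mu, beta)` gives the probability of seeing a chance score of at least `s` under that fitted model.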

2.
Protein sequence comparison: methods and significance   Cited by: 1 (self-citations: 0, citations by others: 1)

3.
Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.

4.
SUMMARY: MaxBench is a web-based system available for evaluating the results of sequence and structure comparison methods, based on the SCOP protein domain classification. The system makes it easy for developers to both compare the overall performance of their methods to standard algorithms and investigate the results of individual comparisons. AVAILABILITY: http://www.sanger.ac.uk/Users/lp1/MaxBench/

5.

Background  

Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) an MSA position and a set of predicted residue frequencies, and (2) two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families.

6.

Background  

The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remain a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.

7.
8.
Beginning with the concept of near-optimal sequence alignments, we can assign a probability that each element in one sequence is paired in an alignment with each element in another sequence. This involves a sum over the set of all possible pairwise alignments. The method employs a designed hidden Markov model (HMM) and the rigorous forward and forward-backward algorithms of Rabiner. The approach can use any standard sequence-element-to-element probabilistic similarity measures and affine gap penalty functions. This allows the positional alignment statistical significance to be obtained as a function of such variables. A measure of the probabilistic relationship between any single sequence and a set of sequences can be directly obtained. In addition, the employed HMM with the Viterbi algorithm provides a simple link to the standard dynamic programming optimal alignment algorithms.

9.
General methods of sequence comparison   Cited by: 9 (self-citations: 0, citations by others: 9)
Mathematical methods for comparison of nucleic acid sequences are reviewed. There are two major methods of sequence comparison: dynamic programming and a method referred to here as the regions method. The problem types discussed are comparison of two sequences, location of long matching segments, efficient database searches and comparison of several sequences. This work was supported by a grant from the System Development Foundation.

10.

Background  

In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive.

11.
Current analyses of protein sequence/structure relationships have focused on expected similarity relationships for structurally similar proteins. To survey and explore the basis of these relationships, we present a general sequence/structure map that covers all combinations of similarity/dissimilarity relationships and provide novel energetic analyses of these relationships. To aid our analysis, we divide protein relationships into four categories: expected/unexpected similarity (S and S(?)) and expected/unexpected dissimilarity (D and D(?)) relationships. In the expected similarity region S, we show that trends in the sequence/structure relation can be derived based on the requirement of protein stability and the energetics of sequence and structural changes. Specifically, we derive a formula relating sequence and structural deviations to a parameter characterizing protein stiffness; the formula fits the data reasonably well. We suggest that the absence of data in region S(?) (high structural but low sequence similarity) is due to unfavorable energetics. In contrast to region S, region D(?) (high sequence but low structural similarity) is well-represented by proteins that can accommodate large structural changes. Our analyses indicate that there are several categories of similarity relationships and that protein energetics provide a basis for understanding these relationships.

12.
Comparative accuracy of methods for protein sequence similarity search   Cited by: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: Searching a protein sequence database for homologs is a powerful tool for discovering the structure and function of a sequence. Two new methods for searching sequence databases have recently been described: Probabilistic Smith-Waterman (PSW), which is based on Hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments. RESULTS: This paper compares and contrasts the effectiveness of these methods with three older methods (Smith-Waterman: SSEARCH, FASTA and BLASTP). The analysis indicates that the new methods are useful, and often offer improved accuracy. These tools are compared using a curated (by Bill Pearson) version of the annotated portion of PIR 39. Three different statistical criteria are utilized: equivalence number, minimum errors and the receiver operating characteristic. For complete-length protein query sequences from large families, PSW's accuracy is superior to that of the other methods, but its accuracy is poor when used with partial-length query sequences. False negatives are twice as common as false positives irrespective of the search methods if a family-specific threshold score that minimizes the total number of errors (i.e. the most favorable threshold score possible) is used. Thus, sensitivity, not selectivity, is the major problem. Among the analyzed methods using default parameters, the best accuracy was obtained from SSEARCH and PSW for complete-length proteins, and the two BLAST programs, plus SSEARCH, for partial-length proteins.

13.
14.
Savill NJ, Hoyle DC, Higgs PG. Genetics 2001, 157(1): 399-411
We test models for the evolution of helical regions of RNA sequences, where the base pairing constraint leads to correlated compensatory substitutions occurring on either side of the pair. These models are of three types: 6-state models include only the four Watson-Crick pairs plus GU and UG; 7-state models include a single mismatch state that combines all of the 10 possible mismatches; 16-state models treat all mismatch states separately. We analyzed a set of eubacterial ribosomal RNA sequences with a well-established phylogenetic tree structure. For each model, the maximum-likelihood values of the parameters were obtained. The models were compared using the Akaike information criterion, the likelihood-ratio test, and Cox's test. With a high significance level, models that permit a nonzero rate of double substitutions performed better than those that assume zero double substitution rate. Some models assume symmetry between GC and CG, between AU and UA, and between GU and UG. Models that relaxed this symmetry assumption performed slightly better, but the tests did not all agree on the significance level. The most general time-reversible model significantly outperformed any of the simplifications. We consider the relative merits of all these models for molecular phylogenetics.

15.
16.
MOTIVATION: Comprehensive performance assessment is important for improving sequence database search methods. Sensitivity, selectivity and speed are three major yet usually conflicting evaluation criteria. The average precision (AP) measure aims to combine the sensitivity and selectivity features of a search algorithm. It can be easily visualized and extended to analyze results from a set of queries. Finally, the time-AP plot can clearly show the overall performance of different search methods. RESULTS: Experiments are performed based on the SCOP database. Popular sequence comparison algorithms, namely Smith-Waterman (SSEARCH), FASTA, BLAST and PSI-BLAST, are evaluated. We find that (1) the low-complexity segment filtration procedure in BLAST actually harms its overall search quality; (2) AP scores of different search methods are approximately proportional to the logarithm of search time; and (3) homologs in protein families with many members tend to be more obscure than those in small families. This measure may be helpful for developing new search algorithms and can guide researchers in selecting the most suitable search methods. AVAILABILITY: Test sets and source code of this evaluation tool are available upon request.
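To make the average precision idea concrete, here is a minimal sketch using the usual information-retrieval definition: the mean of the precision values measured at the rank of each true homolog in a ranked hit list. The function name `average_precision` and the optional `n_relevant` argument are assumptions for this example; the abstract does not give the exact formula used in the paper.

```python
def average_precision(ranked_hits, n_relevant=None):
    """Average precision of a ranked hit list.

    ranked_hits: booleans, best-scoring hit first; True marks a true homolog.
    n_relevant: total number of true homologs in the database. If omitted,
    only the homologs present in the list are counted (an assumption of
    this sketch), so homologs the search missed entirely are ignored.
    """
    found = 0
    precision_sum = 0.0
    for rank, is_homolog in enumerate(ranked_hits, start=1):
        if is_homolog:
            found += 1
            # Precision at this rank: true hits so far / hits examined.
            precision_sum += found / rank
    denominator = found if n_relevant is None else n_relevant
    return precision_sum / denominator if denominator else 0.0
```

Passing `n_relevant` penalizes a search that never reports some homologs, which is how sensitivity enters the score; high-ranking false positives lower the per-rank precision, which is how selectivity enters.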

17.
A class of non-linear similarity functions s1 has been proposed for comparing subalignments of biological sequences. The distribution of maximal s1-similarities is well approximated by the extreme value distribution. The significance levels of s1 are studied for a variety of nucleotide frequency distributions as well as for several matrices of amino acid substitution costs. Also, the significance levels of s1 are explored for comparing three biological sequences. Several previously described subalignments of bovine proenkephalin and porcine prodynorphin are shown to be highly significant.

18.
A survey of multiple sequence comparison methods   Cited by: 7 (self-citations: 0, citations by others: 7)
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular sequences. Emphasis is given to the different approaches of the heuristic methods. Four distance measures derived from information engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use of entropy, which plays a central role in information theory as a measure of information, choice and uncertainty, is proposed as a simple measure for the evaluation of the optimality of an alignment in the absence of any a priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between alternative alignments of the same set of 5S RNAs as obtained by several different heuristic methods.
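One simple way to read the entropy proposal is to sum the Shannon entropy of each alignment column, so that a lower total means more conserved columns. The sketch below does this; the function names, the use of base-2 logarithms, and the treatment of the gap character as an ordinary symbol are assumptions of this example and are not taken from the article.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies for an alignment given as equal-length rows.

    Gap characters are counted like any other symbol (an assumption here).
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(column_entropy(col) for col in zip(*rows))
```

Given two alternative alignments of the same sequences, the one with the smaller `alignment_entropy` value would be preferred under this reading.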

19.
Organisms are said to be in developmental rate isomorphy when the proportions of developmental stage durations are unaffected by temperature. Comprehensive stage-specific developmental data were generated on the cabbage beetle, Colaphellus bowringi Baly (Coleoptera: Chrysomelidae), at eight temperatures ranging from 16°C to 30°C (in 2°C increments) and five analytical methods were used to test the rate isomorphy hypothesis, including: (i) direct comparison of lower developmental thresholds with standard errors based on the traditional linear equation describing developmental rate as the linear function of temperature; (ii) analysis of covariance to compare the lower developmental thresholds of different stages based on the Ikemoto-Takai linear equation; (iii) testing the significance of the slope item in the regression line of versus temperature, where p is the ratio of the developmental duration of a particular developmental stage to the entire pre-imaginal developmental duration for one insect or mite species; (iv) analysis of variance to test for significant differences between the ratios of developmental stage durations to that of pre-imaginal development; and (v) checking whether there is an element less than a given level of significance in the p-value matrix of rotating regression line. The results revealed no significant difference among the lower developmental thresholds or among the aforementioned ratios, and thus convincingly confirmed the rate isomorphy hypothesis.

20.
The degree of similarity in the three-dimensional structures of two proteins can be examined by comparing the patterns of hydrophobicity found in their amino acid sequences. Each type of amino acid residue is assigned a numerical hydrophobicity, and the correlation coefficient rH is computed between all pairs of residues in the two sequences. In tests on sequences from two properly aligned proteins of similar three-dimensional structures, rH is found in the range 0.3 to 0.7. Improperly aligned sequences or unrelated sequences give rH near zero. By considering the observed frequency of amino acid replacements among related structures, a set of optimal matching hydrophobicities (OMHs) was derived. With this set of OMHs, significant correlation coefficients are calculated for similar three-dimensional structures, even though the two sequences contain few identical residues. An example is the two similar folding domains of rhodanese (rH = 0.5). Predictions are made of similar three-dimensional structures for the alpha and beta chains of the various phycobiliproteins, and for delta hemolysin and melittin.
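The rH calculation can be sketched as a Pearson correlation over the hydrophobicity values of aligned residue pairs. The abstract's OMH values are not given here, so this example substitutes the Kyte-Doolittle hydropathy scale; that substitution, the function name `hydrophobicity_correlation`, and the rule of skipping gap positions are all assumptions of this sketch.

```python
import math

# Kyte-Doolittle hydropathy values, used as a stand-in for the OMH scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_correlation(aligned_a, aligned_b, scale=KYTE_DOOLITTLE):
    """Pearson correlation of hydrophobicities over aligned residue pairs.

    Both inputs must come from the same alignment (equal length). Columns
    where either sequence has a gap or an unknown symbol are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(scale[a], scale[b])
             for a, b in zip(aligned_a, aligned_b)
             if a in scale and b in scale]
    if len(pairs) < 2:
        raise ValueError("need at least two aligned residue pairs")
    mean_a = sum(p[0] for p in pairs) / len(pairs)
    mean_b = sum(p[1] for p in pairs) / len(pairs)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("hydrophobicity is constant in one sequence")
    return cov / math.sqrt(var_a * var_b)
```

Because the correlation depends only on hydrophobicity values, two sequences with few identical residues can still score well when hydrophobic positions line up with hydrophobic positions, which is the effect the abstract describes.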
