首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION: The BLAST program for comparing two sequences assumes independent sequences in its random model. The resulting random alignment matrices have correlations across their diagonals. Analytic formulas for the BLAST p-value essentially neglect these correlations and are equivalent to a random model with independent diagonals. Progress on the independent diagonals model has been surprisingly rapid, but the practical magnitude of the correlations it neglects remains unknown. In addition, BLAST uses a finite-size correction that is particularly important when either of the sequences being compared is short. Several formulas for the finite-size correction have now been given, but the corresponding errors in the BLAST p-values have not been quantified. As the lengths of compared sequences tend to infinity, it is also theoretically unknown whether the neglected correlations vanish faster than the finite-size correction. RESULTS: Because we required certain analytic formulas, our study restricted its computer experiments to ungapped sequence alignment. We expect some of our conclusions to extend qualitatively to gapped sequence alignment, however. With this caveat, the finite-size correction appeared to vanish faster than the neglected correlations. Although the finite-size correction underestimated the BLAST p-value, it improved the approximation substantially for all but very short sequences. In practice, the Altschul-Gish finite-size correction was superior to Spouge's. The independent diagonals model was always within a factor of 2 of the true BLAST p-value, although fitting p-value parameters from it probably is unwise. CONTACT: spouge@ncbi.nlm.nih.gov  相似文献   

2.
MOTIVATION: Pairwise local sequence alignment is commonly used to search data bases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices available to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and data base sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated, if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.  相似文献   

3.
The complexity of searches and the volume of genomic data make sequence alignment one of bioinformatics most active research areas. New alignment approaches have incorporated digital signal processing techniques. Among these, correlation methods are highly sensitive. This paper proposes a novel sequence alignment method based on 2-dimensional images, where each nucleic acid base is represented as a fixed gray intensity pixel. Query and known database sequences are coded to their pixel representation and sequence alignment is handled as object recognition in a scene problem. Query and database become object and scene, respectively. An image correlation process is carried out in order to search for the best match between them. Given that this procedure can be implemented in an optical correlator, the correlation could eventually be accomplished at light speed. This paper shows an initial research stage where results were "digitally" obtained by simulating an optical correlation of DNA sequences represented as images. A total of 303 queries (variable lengths from 50 to 4500 base pairs) and 100 scenes represented by 100 x 100 images each (in total, one million base pair database) were considered for the image correlation analysis. The results showed that correlations reached very high sensitivity (99.01%), specificity (98.99%) and outperformed BLAST when mutation numbers increased. However, digital correlation processes were hundred times slower than BLAST. We are currently starting an initiative to evaluate the correlation speed process of a real experimental optical correlator. By doing this, we expect to fully exploit optical correlation light properties. As the optical correlator works jointly with the computer, digital algorithms should also be optimized. The results presented in this paper are encouraging and support the study of image correlation methods on sequence alignment.  相似文献   

4.
MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone, does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. RESULTS: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation. AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. CONTACT: geoff@ebi.ac.uk  相似文献   

5.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.  相似文献   

6.
Detection of homologous proteins by an intermediate sequence search   总被引:2,自引:0,他引:2  
We developed a variant of the intermediate sequence search method (ISS(new)) for detection and alignment of weakly similar pairs of protein sequences. ISS(new) relates two query sequences by an intermediate sequence that is potentially homologous to both queries. The improvement was achieved by a more robust overlap score for a match between the queries through an intermediate. The approach was benchmarked on a data set of 2369 sequences of known structure with insignificant sequence similarity to each other (BLAST E-value larger than 0.001); 2050 of these sequences had a related structure in the set. ISS(new) performed significantly better than both PSI-BLAST and a previously described intermediate sequence search method. PSI-BLAST could not detect correct homologs for 1619 of the 2369 sequences. In contrast, ISS(new) assigned a correct homolog as the top hit for 121 of these 1619 sequences, while incorrectly assigning homologs for only nine targets; it did not assign homologs for the remainder of the sequences. By estimate, ISS(new) may be able to assign the folds of domains in approximately 29,000 of the approximately 500,000 sequences unassigned by PSI-BLAST, with 90% specificity (1 - false positives fraction). In addition, we show that the 15 alignments with the most significant BLAST E-values include the nearly best alignments constructed by ISS(new).  相似文献   

7.
We consider character sequences evolving on a phylogenetic tree under the TKF91 model. We show that as the sequence lengths tend to infinity the topology of the phylogenetic tree and the edge lengths are determined by any one of (a) the alignment of sequences (b) the collection of sequence lengths. We also show that the probability of any homology structure on a collection of sequences related by a TKF91 process on a tree is independent of the root location.  相似文献   

8.
MOTIVATION: Filtration is an important technique used to speed up local alignment as exemplified in the BLAST programs. Recently, Ma et al. discovered that better filtering can be achieved by spacing out the matching positions according to a certain pattern, instead of contiguous positions to trigger a local alignment in their PatternHunter program. Such a match pattern is called a spaced seed. RESULTS: Our numerical computation shows that the ranks of spaced seeds (based on sensitivity) change with the sequences similarity. Since homologous sequences may have diverse similarity, we assess the sensitivity of spaced seeds over a range of similarity levels and present a list of good spaced seeds for facilitating homology search in DNA genomic sequences. We validate that the listed spaced seeds are indeed more sensitive using three arbitrarily chosen pairs of DNA genomic sequences.  相似文献   

9.
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.  相似文献   

10.
The challenge of similarity search in massive DNA sequence databases has inspired major changes in BLAST-style alignment tools, which accelerate search by inspecting only pairs of sequences sharing a common short "seed," or pattern of matching residues. Some of these changes raise the possibility of improving search performance by probing sequence pairs with several distinct seeds, any one of which is sufficient for a seed match. However, designing a set of seeds to maximize their combined sensitivity to biologically meaningful sequence alignments is computationally difficult, even given recent advances in designing single seeds. This work describes algorithmic improvements to seed design that address the problem of designing a set of n seeds to be used simultaneously. We give a new local search method to optimize the sensitivity of seed sets. The method relies on efficient incremental computation of the probability that an alignment contains a match to a seed pi, given that it has already failed to match any of the seeds in a set Pi. We demonstrate experimentally that multi-seed designs, even with relatively few seeds, can be significantly more sensitive than even optimized single-seed designs.  相似文献   

11.
MOTIVATION: Two proteins can have a similar 3-dimensional structure and biological function, but have sequences sufficiently different that traditional protein sequence comparison algorithms do not identify their relationship. The desire to identify such relations has led to the development of more sensitive sequence alignment strategies. One such strategy is the Intermediate Sequence Search (ISS), which connects two proteins through one or more intermediate sequences. In its brute-force implementation, ISS is a strategy that repetitively uses the results of the previous query as new search seeds, making it time-consuming and difficult to analyze. RESULTS: Saturated BLAST is a package that performs ISS in an efficient and automated manner. It was developed using Perl and Perl/Tk and implemented on the LINUX operating system. Starting with a protein sequence, Saturated BLAST runs a BLAST search and identifies representative sequences for the next generation of searches. The procedure is run until convergence or until some predefined criteria are met. Saturated BLAST has a friendly graphic user interface, a built-in BLAST result parser, several multiple alignment tools, clustering algorithms and various filters for the elimination of false positives, thereby providing an easy way to edit, visualize, analyze, monitor and control the search. Besides detecting remote homologies, Saturated BLAST can be used to maintain protein family databases and to search for new genes in genomic databases.  相似文献   

12.
MOTIVATION: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. RESULTS: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'. AVAILABILITY: The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

13.
Alignment of protein sequences by their profiles   总被引:7,自引:0,他引:7  
The accuracy of an alignment between two protein sequences can be improved by including other detectably related sequences in the comparison. We optimize and benchmark such an approach that relies on aligning two multiple sequence alignments, each one including one of the two protein sequences. Thirteen different protocols for creating and comparing profiles corresponding to the multiple sequence alignments are implemented in the SALIGN command of MODELLER. A test set of 200 pairwise, structure-based alignments with sequence identities below 40% is used to benchmark the 13 protocols as well as a number of previously described sequence alignment methods, including heuristic pairwise sequence alignment by BLAST, pairwise sequence alignment by global dynamic programming with an affine gap penalty function by the ALIGN command of MODELLER, sequence-profile alignment by PSI-BLAST, Hidden Markov Model methods implemented in SAM and LOBSTER, pairwise sequence alignment relying on predicted local structure by SEA, and multiple sequence alignment by CLUSTALW and COMPASS. The alignment accuracies of the best new protocols were significantly better than those of the other tested methods. For example, the fraction of the correctly aligned residues relative to the structure-based alignment by the best protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 48%, 50%, 49%, 43%, and 43% for the other methods, respectively. The new method is currently applied to large-scale comparative protein structure modeling of all known sequences.  相似文献   

14.
All popular algorithms of pair-wise alignment of protein primary structures (e.g. Smith-Waterman (SW), FASTA, BLAST, et al.) utilize only amino acid sequences. The SW-algorithm is the most accurate among them, i.e. it produces alignments that are most similar to the alignments obtained by superposition of protein 3D-structures. But even the SW-algorithm is unable to restore the 3D-based alignment if similarity of amino acid sequences (%id) is below 30%. We have proposed a novel alignment method that explicitly takes into account the secondary structure of the compared proteins. We have shown that it creates significantly more accurate alignments compared to SW-algorithm. In particular, for sequences with %id < 30% the average accuracy of the new method is 58% compared to 35% for SW-algorithm (the accuracy of an algorithmic sequence alignment is the part of restored position of a "golden standard" alignment obtained by superposition of corresponding 3D-structures). The accuracy of the proposed method is approximately identical both for experimental, and for theoretically predicted secondary structures. Thus the method can be applied for alignment of protein sequences even if protein 3D-structure is unknown. The program is available at ftp://194.149.64.196/STRUSWER/.  相似文献   

15.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author.  相似文献   

16.
MOTIVATION: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in (gamman2) memory, where gamma is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences only. RESULTS: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(alphan) space, where alpha is the sum of the lengths of constraints and usually alpha < n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. AVAILABILITY: http://genome.life.nctu.edu.tw/MUSICME.  相似文献   

17.
MOTIVATION: The number of Single Nucleotide Polymorphisms (SNPs) detectable in an alignment is a function of the length and the number of the aligned sequences. The latter is called sample size. However, a typical alignment, for instance obtained as a BLAST-search result of a query sequence against an EST database, does not evenly cover the query sequence. Therefore, it is usually not clear what the actual sample size is. RESULTS: We present a method to calculate the effective sample size, called n(eff), for a given BLAST alignment. This method takes into account that multiple coverage contributes only logarithmically to the SNP yield of a given sequence stretch. We show that the effective sample size n(eff) is usually much smaller than would be expected for a given amount of coverage and illustrate this with two typical examples.  相似文献   

18.
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. We propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step - semigapped alignment - compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose a heuristic - restricted insertion alignment - that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in blast. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.  相似文献   

19.
MOTIVATION: Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. RESULTS: We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. AVAILABILITY: The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.  相似文献   

20.
Protein structure prediction by comparative modeling benefits greatly from the use of multiple sequence alignment information to improve the accuracy of structural template identification and the alignment of target sequences to structural templates. Unfortunately, this benefit is limited to those protein sequences for which at least several natural sequence homologues exist. We show here that the use of large diverse alignments of computationally designed protein sequences confers many of the same benefits as natural sequences in identifying structural templates for comparative modeling targets. A large-scale massively parallelized application of an all-atom protein design algorithm, including a simple model of peptide backbone flexibility, has allowed us to generate 500 diverse, non-native, high-quality sequences for each of 264 protein structures in our test set. PSI-BLAST searches using the sequence profiles generated from the designed sequences ("reverse" BLAST searches) give near-perfect accuracy in identifying true structural homologues of the parent structure, with 54% coverage. In 41 of 49 genomes scanned using reverse BLAST searches, at least one novel structural template (not found by the standard method of PSI-BLAST against PDB) is identified. Further improvements in coverage, through optimizing the scoring function used to design sequences and continued application to new protein structures beyond the test set, will allow this method to mature into a useful strategy for identifying distantly related structural templates.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号