首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 484 毫秒
1.
Kann MG  Goldstein RA 《Proteins》2002,48(2):367-376
A detailed analysis of the performance of hybrid, a new sequence alignment algorithm developed by Yu and coworkers that combines Smith Waterman local dynamic programming with a local version of the maximum-likelihood approach, was made to access the applicability of this algorithm to the detection of distant homologs by sequence comparison. We analyzed the statistics of hybrid with a set of nonhomologous protein sequences from the SCOP database and found that the statistics of the scores from hybrid algorithm follows an Extreme Value Distribution with lambda approximately 1, as previously shown by Yu et al. for the case of artificially generated sequences. Local dynamic programming was compared to the hybrid algorithm by using two different test data sets of distant homologs from the PFAM and COGs protein sequence databases. The studies were made with several score functions in current use including OPTIMA, a new score function originally developed to detect remote homologs with the Smith Waterman algorithm. We found OPTIMA to be the best score function for both both dynamic programming and the hybrid algorithms. The ability of dynamic programming to discriminate between homologs and nonhomologs in the two sets of distantly related sequences is slightly better than that of hybrid algorithm. The advantage of producing accurate score statistics with only a few simulations may overcome the small differences in performance and make this new algorithm suitable for detection of homologs in conjunction with a wide range of score functions and gap penalties.  相似文献   

2.
The score statistics of a recently introduced 'hybrid alignment' algorithm is studied in detail numerically. An extensive survey across the 2216 models of protein domains contained in the Pfam v5.4 database (Bateman et al., Nucleic Acids Res., 28, 263-266, 2000) verifies the theoretical predictions: For the position-specific scoring functions used in the Pfam models, the score statistics of hybrid alignment obey the Gumbel distribution, with the key Gumbel parameter lambda taking on the asymptotic value 1 universally for all models. Thus, the use of hybrid alignment eliminates the time-consuming computer simulations normally needed to assign p-values to alignment scores, freeing the users to experiment with different scoring parameters and functions. The performance of the hybrid algorithm in detecting sequence homology is also studied. For protein sequences from the SCOP database (Murzin et al., J. Mol. Biol., 247, 536-540, 1995) using uniform scoring functions, the performance is found to be comparable to the best of the existing methods. Preliminary results using the PfamA database suggest that the hybrid algorithm achieves similar performance as existing methods for position-specific scoring systems as well. Hybrid alignment is thereby established as a high performance alignment algorithm with well-characterized, universal statistics.  相似文献   

3.
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D)-structure. We investigated correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pair-wise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, "islands" between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive score, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas resulting alignments are in average of the same quality with respect to the gold standard. This finding shows that the decrease of alignment accuracy is not necessarily a price for the computational efficiency.  相似文献   

4.
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound” (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.  相似文献   

5.
A new approach to sequence comparison: normalized sequence alignment   总被引:3,自引:0,他引:3  
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment ) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n2log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.  相似文献   

6.
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign was found to be as good as Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/  相似文献   

7.

Background

The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Though various fast Smith-Waterman implementations are developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignment, or are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.

Results

To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost of efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman become really useful in genomic applications. SSW is available both as a C/C++ software library, as well as a stand-alone alignment tool at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.

Conclusions

The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.  相似文献   

8.
In many applications, the algorithmically obtained alignment ideally should restore the "golden standard" (GS) alignment, which superimposes positions originating from the same position of the common ancestor of the compared sequences. The average similarity between the algorithmically obtained and GS alignments ("the quality") is an important characteristic of an alignment algorithm. We proposed to determine the quality of an algorithm, using sequences that were artificially generated in accordance with an appropriate evolution model; the approach was applied to the global version of the Smith-Waterman algorithm (SWA). The quality of SWA is between 97% (for a PAM distance of 60) and 70% (for a PAM distance of 300). The percentage of identical aligned residues is the same for algorithmic and GS alignments. The total length of indels in algorithmic alignments is less than in the GS-mainly due to a substantial decrease in the number of indels in algorithmic alignments.  相似文献   

9.

Background  

Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.  相似文献   

10.
The review considers the original works on the primary structure of biopolymers, which were carried out from 1983 to 2003. Most works were supported by the Russian program Human Genome and earlier similar Russian programs. Little-known publications of 1983-1993 and recent unpublished results are described in detail. In the field of genome comparisons, these concern the OWEN hierarchic algorithm aligning syntenic regions of two genome sequences. The resulting global alignment is obtained as an ordered chain of local similarities. Alignment of sequences sized about 10(6) nucleotides takes several minutes. The concept of local similarity conflicts is generalized to multiple comparisons. New algorithms aligning protein sequences are described and compared with the Smith-Waterman algorithm, which is now most accurate. The ANCHOR hierarchic algorithm generates alignments of much the same accuracy and is twice as rapid as the Smith-Waterman one. The STRSWer algorithm takes an account of the secondary structures of proteins under study. With the secondary structures predicted using the PSI-PRED software for pairs of proteins having 10-30% similarity, the average accuracy of alignments generated by STRSWer is 15% higher than that achieved with the Smith-Waterman algorithm.  相似文献   

11.

Background  

In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested if this could be validated when applied to existing, evolutionary related protein sequences.  相似文献   

12.

Background  

To infer homology and subsequently gene function, the Smith-Waterman (SW) algorithm is used to find the optimal local alignment between two sequences. When searching sequence databases that may contain hundreds of millions of sequences, this algorithm becomes computationally expensive.  相似文献   

13.
The accuracy of the global Smith-Waterman alignments and Pareto-optimal alignments depending on the degree of sequence similarity (percent of coincidence, % id, and the number of remote fragments NGap) has been examined. An algorithm for constructing a set of three to six alignments has been developed of which the accuracy of the best alignment exceeds on the average the accuracy of the best alignment that can be constructed using the Smith-Waterman algorithm. For weakly homologous sequences (% id 15, NGap 20), the increase in the accuracy is on the average about 8%, with the average accuracy of the global Smith-Waterman alignments being about 38% (the accuracy was estimated on model test sets).  相似文献   

14.
MOTIVATION: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences. RESULTS: We used a simple LRA-based on the maximal similarity (or minimal distance) scores of the two top ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel and compression based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.  相似文献   

15.
Locality is an important and well-studied notion in comparative analysis of biological sequences. Similarly, taking into account affine gap penalties when calculating biological sequence alignments is a well-accepted technique for obtaining better alignments. When dealing with RNA, one has to take into consideration not only sequential features, but also structural features of the inspected molecule. This makes the computation more challenging, and usually prohibits the comparison only to small RNAs. In this paper we introduce two local metrics for comparing RNAs that extend the Smith-Waterman metric and its normalized version used for string comparison. We also present a global RNA alignment algorithm which handles affine gap penalties. Our global algorithm runs in O(m(2)n(1 + lg n/m)) time, while our local algorithms run in O(m(2)n(1 + lg n/m)) and O(n(2)m) time, respectively, where m 相似文献   

16.
We describe a new algorithm for protein classification and the detection of remote homologs. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods such as profiles and profile hidden Markov models which focus on vertical information as they model the columns of the alignment independently and to family pairwise search which focuses on horizontal information as it treats given sequences separately. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence is at each position aligned to one sequence of the multiple alignment, called the "reference sequence." In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compare it to profiles, profile hidden Markov models, and family pairwise search on a subset of the SCOP database of protein domains. The discriminative quality is assessed by median false positive counts (med-FP-counts). For moderate med-FP-counts, the number of successful searches with our method is considerably higher than with the competing methods.  相似文献   

17.
MOTIVATION: The only algorithm guaranteed to find the optimal local alignment is the Smith-Waterman. It is also one of the slowest due to the number of computations required for the search. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level. RESULTS: A faster implementation of the Smith-Waterman algorithm is presented. This algorithm achieved 2-8 times performance improvement over other SIMD based Smith-Waterman implementations. On a 2.0 GHz Xeon Core 2 Duo processor, speeds of >3.0 billion cell updates/s were achieved. AVAILABILITY: http://farrar.michael.googlepages.com/Smith-waterman  相似文献   

18.
Pairwise sequence alignments aim to decide whether two sequences are related and, if so, to exhibit their related domains. Recent works have pointed out that a significant number of true homologous sequences are missed when using classical comparison algorithms. This is the case when two homologous sequences share several little blocks of homology, too small to lead to a significant score. On the other hand, classical alignment algorithms, when detecting homologies, may fail to recognize all the significant biological signals. The aim of the paper is to give a solution to these two problems. We propose a new scoring method which tends to increase the score of an alignment when "blocks" are detected. This so-called Block-Scoring algorithm, which makes use of dynamic programming, is worth being used as a complementary tool to classical exact alignments methods. We validate our approach by applying it on a large set of biological data. Finally, we give a limit theorem for the score statistics of the algorithm.  相似文献   

19.
MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone, does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. RESULTS: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation. AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. CONTACT: geoff@ebi.ac.uk  相似文献   

20.
MOTIVATION: The analyses of the increasing number of genome sequences requires shortcuts for the detection of orthologs, such as Reciprocal Best Hits (RBH), where orthologs are assumed if two genes each in a different genome find each other as the best hit in the other genome. Two BLAST options seem to affect alignment scores the most, and thus the choice of a best hit: the filtering of low information sequence segments and the algorithm used to produce the final alignment. Thus, we decided to test whether such options would help better detect orthologs. RESULTS: Using Escherichia coli K12 as an example, we compared the number and quality of orthologs detected as RBH. We tested four different conditions derived from two options: filtering of low-information segments, hard (default) versus soft; and alignment algorithm, default (based on matching words) versus Smith-Waterman. All options resulted in significant differences in the number of orthologs detected, with the highest numbers obtained with the combination of soft filtering with Smith-Waterman alignments. We compared these results with those of Reciprocal Shortest Distances (RSD), supposed to be superior to RBH because it uses an evolutionary measure of distance, rather than BLAST statistics, to rank homologs and thus detect orthologs. RSD barely increased the number of orthologs detected over those found with RBH. Error estimates, based on analyses of conservation of gene order, found small differences in the quality of orthologs detected using RBH. However, RSD showed the highest error rates. Thus, RSD have no advantages over RBH. AVAILABILITY: Orthologs detected as Reciprocal Best Hits using soft masking and Smith-Waterman alignments can be downloaded from http://popolvuh.wlu.ca/Orthologs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号