首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Post-processing long pairwise alignments   总被引:2,自引:0,他引:2  
MOTIVATION: The local alignment problem for two sequences requires determining similar regions, one from each sequence, and aligning those regions. For alignments computed by dynamic programming, current approaches for selecting similar regions may have potential flaws. For instance, the criterion of Smith and Waterman can lead to inclusion of an arbitrarily poor internal segment. Other approaches can generate an alignment scoring less than some of its internal segments. RESULTS: We develop an algorithm that decomposes a long alignment into sub-alignments that avoid these potential imperfections. Our algorithm runs in time proportional to the original alignment's length. Practical applications to alignments of genomic DNA sequences are described.  相似文献   

2.
生物信息学中,Smith Waterman算法用于同源长序列的局部联配时,经常会出现马赛克问题(相似度很低的保守区域夹在两个相似度很高的区域中间)。在分析问题成因的基础上,提出利用动态加速扣分策略解决马赛克问题,即在计算得分矩阵的过程中.如果存在保守区域,则加大扣分的力度,争取在离开保守区域前让得分为0,从而将保守区域切断。实验结果表明,动态加速扣分策略顺利解决了序列局部联配中的马赛克问题,并且没有显著增加算法的时间复杂度和空间复杂度。  相似文献   

3.
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment is then proposed and studied in detail. It is predicted that the score statistics of the hybrid algorithm is of the Gumbel universal form, with the key Gumbel parameter lambda taking on a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe for the computation of the "relative entropy," and from it the finite size correction to lambda, is also given. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000 examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in the detection of sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of the Smith-Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment.  相似文献   

4.
BALSA: Bayesian algorithm for local sequence alignment   总被引:3,自引:1,他引:2       下载免费PDF全文
The Smith–Waterman algorithm yields a single alignment, which, albeit optimal, can be strongly affected by the choice of the scoring matrix and the gap penalties. Additionally, the scores obtained are dependent upon the lengths of the aligned sequences, requiring a post-analysis conversion. To overcome some of these shortcomings, we developed a Bayesian algorithm for local sequence alignment (BALSA), that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters and all possible alignments. The algorithm can return both the joint and the marginal optimal alignments, samples of alignments drawn from the posterior distribution and the posterior probabilities of gap penalties and scoring matrices. Furthermore, it automatically adjusts for variations in sequence lengths. BALSA was compared with SSEARCH, to date the best performing dynamic programming algorithm in the detection of structural neighbors. Using the SCOP databases PDB40D-B and PDB90D-B, BALSA detected 19.8 and 41.3% of remote homologs whereas SSEARCH detected 18.4 and 38% at an error rate of 1% errors per query over the databases, respectively.  相似文献   

5.
We have written two programs for searching biological sequencedatabases that run on Intel hypercube computers. PSCANLJB comparesa single sequence against a sequence library, and PCOMPLIB comparesall the entries in one sequence library against a second library.The programs provide a general framework for similarity searching;they include functions for reading in query sequences, searchparameters and library entries, and reporting the results ofa search. We have isolated the code for the specific functionthat calculates the similarity score between the query and librarysequence; alternative searching algorithms can be implementedby editing two files. We have implemented the rapid FASTA sequencecomparison algorithm and the more rigorous Smith — Watermanalgorithm within this framework. The PSCANLIB program on a 16node iPSC/2 80386-based hypercube can compare a 229 amino acidprotein sequence with a 3.4 million residue sequence libraryin {small tilde}16s with the FASTA algorithm. Using the Smith— Waterman algorithm, the same search takes 35 min. ThePCOMPUB program can compare a 0.8 millon amino acid proteinsequence library with itself in 5.3 min with FASTA on a third-generation32 node Intel iPSC/860 hypercube. Received on September 8, 1990; accepted on December 15, 1990  相似文献   

6.
Comparative accuracy of methods for protein sequence similarity search   总被引:2,自引:0,他引:2  
MOTIVATION: Searching a protein sequence database for homologs is a powerful tool for discovering the structure and function of a sequence. Two new methods for searching sequence databases have recently been described: Probabilistic Smith-Waterman (PSW), which is based on Hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments. RESULTS: This paper compares and contrasts the effectiveness of these methods with three older methods (Smith- Waterman: SSEARCH, FASTA and BLASTP). The analysis indicates that the new methods are useful, and often offer improved accuracy. These tools are compared using a curated (by Bill Pearson) version of the annotated portion of PIR 39. Three different statistical criteria are utilized: equivalence number, minimum errors and the receiver operating characteristic. For complete-length protein query sequences from large families, PSW's accuracy is superior to that of the other methods, but its accuracy is poor when used with partial-length query sequences. False negatives are twice as common as false positives irrespective of the search methods if a family-specific threshold score that minimizes the total number of errors (i.e. the most favorable threshold score possible) is used. Thus, sensitivity, not selectivity, is the major problem. Among the analyzed methods using default parameters, the best accuracy was obtained from SSEARCH and PSW for complete-length proteins, and the two BLAST programs, plus SSEARCH, for partial-length proteins.   相似文献   

7.
The POLYFIT rigid‐body algorithm for automated global pairwise and multiple protein structural alignment is presented. Smith–Waterman local alignment is used to establish a set of seed equivalences that are extended using Needleman–Wunsch dynamic programming techniques. Structural and functional interaction constraints provided by evolution are encoded as one‐dimensional residue physical environment strings for alignment of highly structurally overlapped protein pairs. Local structure alignment of more distantly related pairs is carried out using rigid‐body conformational matching of 15‐residue fragments, with allowance made for less stringent conformational matching of metal‐ion and small molecule ligand‐contact, disulphide bridge, and cis‐peptide correspondences. Protein structural plasticity is accommodated through the stepped adjustment of a single empirical distance parameter value in the calculation of the Smith–Waterman dynamic programming matrix. Structural overlap is used both as a measure of similarity and to assess alignment quality. Pairwise alignment accuracy has been benchmarked against that of 10 widely used aligners on the Sippl and Wiederstein set of difficult pairwise structure alignment problems, and more extensively against that of Matt, SALIGN, and MUSTANG in pairwise and multiple structural alignments of protein domains with low shared sequence identity in the SCOP‐ASTRAL 40% compendium. The results demonstrate the advantages of POLYFIT over other aligners in the efficient and robust identification of matching seed residue positions in distantly related protein targets and in the generation of longer structurally overlapped alignment lengths. Superposition‐based application areas include comparative modeling and protein and ligand design. POLYFIT is available on the Web server at http://polyfit.insa‐toulouse.fr . Proteins 2013; 81:1823–1839. © 2013 Wiley Periodicals, Inc.  相似文献   

8.
Summary There are several algorithms designed for searches for homologous sequences (Fitch 1966; Needleman and Wunsch 1970; Chva'tal and Sankoff 1975; Griggs 1977; Sannkoff 1972; Smith and Waterman 1981; Smith et al. 1981, Wagner and Fisher 1974; Waterman et al. 1976). This paper presents some very simple and useful high speed, text editing algorithms that search for exact nucleotide sequence repetition and genome duplication. The last algorithm suggested here is specifically adapted for the 4-letter alphabet of nucleotide sequences. Owing to the rapid accumulation of nucleotide sequences and the frequent need to search for sequence repetition or where a given set of nucleotides occurs in long sequences, efficient algorithms of this type are a necessity.  相似文献   

9.
The score statistics of a recently introduced 'hybrid alignment' algorithm is studied in detail numerically. An extensive survey across the 2216 models of protein domains contained in the Pfam v5.4 database (Bateman et al., Nucleic Acids Res., 28, 263-266, 2000) verifies the theoretical predictions: For the position-specific scoring functions used in the Pfam models, the score statistics of hybrid alignment obey the Gumbel distribution, with the key Gumbel parameter lambda taking on the asymptotic value 1 universally for all models. Thus, the use of hybrid alignment eliminates the time-consuming computer simulations normally needed to assign p-values to alignment scores, freeing the users to experiment with different scoring parameters and functions. The performance of the hybrid algorithm in detecting sequence homology is also studied. For protein sequences from the SCOP database (Murzin et al., J. Mol. Biol., 247, 536-540, 1995) using uniform scoring functions, the performance is found to be comparable to the best of the existing methods. Preliminary results using the PfamA database suggest that the hybrid algorithm achieves similar performance as existing methods for position-specific scoring systems as well. Hybrid alignment is thereby established as a high performance alignment algorithm with well-characterized, universal statistics.  相似文献   

10.
Pairwise sequence alignments aim to decide whether two sequences are related and, if so, to exhibit their related domains. Recent works have pointed out that a significant number of true homologous sequences are missed when using classical comparison algorithms. This is the case when two homologous sequences share several little blocks of homology, too small to lead to a significant score. On the other hand, classical alignment algorithms, when detecting homologies, may fail to recognize all the significant biological signals. The aim of the paper is to give a solution to these two problems. We propose a new scoring method which tends to increase the score of an alignment when "blocks" are detected. This so-called Block-Scoring algorithm, which makes use of dynamic programming, is worth being used as a complementary tool to classical exact alignments methods. We validate our approach by applying it on a large set of biological data. Finally, we give a limit theorem for the score statistics of the algorithm.  相似文献   

11.
Kann M  Qian B  Goldstein RA 《Proteins》2000,41(4):498-503
The growth in protein sequence data has placed a premium on ways to infer structure and function of the newly sequenced proteins. One of the most effective ways is to identify a homologous relationship with a protein about which more is known. While close evolutionary relationships can be confidently determined with standard methods, the difficulty increases as the relationships become more distant. All of these methods rely on some score function to measure sequence similarity. The choice of score function is especially critical for these distant relationships. We describe a new method of determining a score function, optimizing the ability to discriminate between homologs and non-homologs. We find that this new score function performs better than standard score functions for the identification of distant homologies.  相似文献   

12.
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound” (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.  相似文献   

13.
The algorithm of Smith & Waterman for identification of maximally similar subsequences is extended to allow identification of all non-intersecting similar subsequences with similarity score at or above some preset level. The resulting alignments are found in order of score, with the highest scoring alignment first. In the case of single gaps or multiple gaps weighted linear with gap length, the algorithm is extremely efficient, taking very little time beyond that of the initial calculation of the matrix. The algorithm is applied to comparisons of tRNA-rRNA sequences from Escherichia coli. A statistical analysis is important for proper evaluation of the results, which differ substantially from the results of an earlier analysis of the same sequences by Bloch and colleagues.  相似文献   

14.
A new approach to sequence comparison: normalized sequence alignment   总被引:3,自引:0,他引:3  
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment ) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n2log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.  相似文献   

15.
MOTIVATION: Much information about new protein sequences is derived from identifying homologous proteins. Such tasks are difficult when the evolutionary relationships are distant. Some modern methods achieve better results by building a model of a set of related sequences, and then identifying new proteins that fit the model. A further advance was the development of iterative methods that refine the model as more homologs are discovered. These methods are generally limited by ad hoc methods of sequence weighting, neglect of underlying evolutionary relationships and the representation of the set with a single one-size-fits-all model. These limitations are avoided through the use of a Tree hidden Markov model (T-HMM) approach. Our previous work described how a non-iterative version of the T-HMM method could identify distant homologs with superior performance compared with other non-iterated approaches, and described how this method was particularly appropriate for being implemented as an iterative algorithm. RESULTS: We describe an iterative version of the T-HMM algorithm, and evaluate its performance for the detection of distant homologs. Significant improvement over other commonly used methods is found. AVAILABILITY: The software (C++, Perl) is available from the corresponding author.  相似文献   

16.
The programme pscan has been developed to distribute proteindatabank scans over a network of computers that share a commonfilesystem. pscan may be used in conjunction with most conventionalsequence comparison programmes with few modifications. In testruns using the Smith — Waterman dynamic programming algorithm,the time required to scan a 6858 sequence databank using a querysequence 740 residues long was reduced from 50 min for a singleprocessor, to 11 minutes for five processors. Accordingly, pscanprovides a low-cost, portable alternative to dedicated parallelprocessing computers. Received on August 27, 1990; accepted on September 25, 1990  相似文献   

17.
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.  相似文献   

18.
Biological Sequence Comparison is one of the most important operations in Computational Biology since it is used to determine how similar two sequences are. Smith and Waterman proposed an exact algorithm (SW), based on dynamic programming, that is able to obtain the best local alignment between two sequences in quadratic time and space. In order to compare long biological sequences, SW is rarely used since the computation time and the amount of memory required becomes prohibitive. For this reason, heuristic methods like BLAST are widely used. Although faster, these heuristic methods do not guarantee that the best result will be produced. In this paper, we propose an exact parallel variant of the SW algorithm that obtains the best local alignments in quadratic time and reduced space. The results obtained in two clusters (8-machine and 16-machine) for DNA sequences longer than 32 KBP (kilo base-pairs) were very close to linear and, in some cases, superlinear. For very long DNA sequences (1.6 MBP), we were able to reduce execution time from 12.25 hours to 1.54 hours, in our 8-machine cluster. As far as we know, this is the first time 1.6 MBP sequences are compared with an exact SW variant. In this case, 30240 best local alignments were obtained.
Azzedine BoukercheEmail:
  相似文献   

19.
The determination of distant evolutionary relationships remains an important biological problem, and distant homologs often appear in statistically insignificant regions of sequence similarity searches. Intersect is a computer program designed to identify and visualize the overlaps between sets of sequences reported by multiple database searches. This capability extends the usefulness of database search results and aids researchers in identifying the individual sequences that best bridge sequence families and superfamilies. AVAILABILITY: The Intersect program is available from the Babbitt laboratory website at http://www.babbittlab.ucsf.edu/software/intersect  相似文献   

20.
Methods for protein structure (3D)-sequence (1D) compatibility evaluation (threading) have been developed during the past decade. The protocol in which a sequence can recognize its compatible structure in the structural library (i.e., the fold recognition or the forward-folding search) is available for the structure prediction of new proteins. However, the reverse protocol, in which a structure recognizes its homologous sequences among a sequence database, named the inverse-folding search, is a more difficult application. In this study, we have investigated the feasibility of the latter approach. A structural library, composed of about 400 well-resolved structures with mutually dissimilar sequences, was prepared, and 163 of them had remote homologs in the library. We examined whether they could correctly seek their homologs by both forward- and inverse-folding searches. The results showed that the inverse-folding protocol is more effective than the forward-folding protocol, once the reference states of the compatibility functions are appropriately adjusted. This adjustment only slightly affects the ability of the forward-folding search. We noticed that the scoring, in which a given sequence is re-mounted onto a structure according to the 3D-1D alignment determined by the dynamic programming method, is only effective in the forward-folding protocol and not in the inverse-folding protocol. Namely, the inverse-folding search works significantly better with the score given by the 3D-1D alignment per se, rather than that obtained by the re-mounting. The implications of these results are discussed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号