共查询到20条相似文献,搜索用时 0 毫秒
1.
A fast and sensitive multiple sequence alignment algorithm 总被引:4,自引:0,他引:4
A two-step multiple alignment strategy is presented that allowsrapid alignment of a set of homologous sequences and comparisonof pre-aligned groups of sequences. Examples are given demonstratingthe improvement in the quality of alignments when comparingentire groups instead of single sequences. The modular designof computer programs based on this algorithm allows for storageof aligned sequences and successive alignment of any numberof sequences.
Received on August 23, 1988; accepted on December 6, 1988 相似文献
2.
MOTIVATION: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence-search algorithm is that of Smith-Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. RESULTS: A fast implementation of the Smith-Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented. This implementation is based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel's latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith-Waterman implementation on the same hardware was achieved by an optimized 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500 MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. 相似文献
3.
Background
Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. 相似文献4.
Improved sensitivity of biological sequence database searches 总被引:26,自引:0,他引:26
Brutlag Douglas L.; Dautricourt Jean-Pierre; Maulik Sunil; Relnh John 《Bioinformatics (Oxford, England)》1990,6(3):237-245
We have increased the sensitivity ofDNA and protein sequencedatabase searches by allowing similar but non-identical aminoacids or nucleotides to match. In addition, one can match k-tuplesor words instead of matching individual residues in order tospeed the search. A matching matrix specifies which k-tuplesmatch each other. The matching matrix can be calculated froma similarity matrix of amino acids and a threshold of similarityrequired for matching. This permits amino acid similarity matricesor replacement matrices (PAM matrices) to be used in the firststep of a sequence comparison rather than in a secondary scoringphase. The concept of matching non-identical k-tuples also increasesthe power ofDNA database searches. For example, a matrix thatspecifies that any 3-tuple in a DNA sequence can match any other3-tuple encoding the same amino acid permits a DNA databasesearch using a DNA query sequence for regions that would encodea similar amino acid sequence.
Received on October 10, 1989; accepted on May 1, 1990 相似文献
5.
Miller Perry L.; Nadkarni Prakash M.; Carriero Nicholas M. 《Bioinformatics (Oxford, England)》1991,7(1):71-78
We have parallelized the FASTA algorithm for biological sequencecomparison using Linda, a machine-independent parallel programminglanguage. The resulting parallel program runs on a variety ofdifferent parallel machines. A straightforward parallelizationstrategy works well if the amount of computation to be doneis relatively large. When the amount of computation is reduced,however, disk I/O becomes a bottleneck which may prevent additionalspeed-up as the number of processors is increased. The paperdescribes the parallelization of FASTA, and uses FASTA to illustratethe I/O bottleneck problem that may arise when performing paralleldatabase search with a fast sequence comparison algorithm. Thepaper also describes several program design strategies thatcan help with this problem. The paper discusses how this bottleneckis an example of a general problem that may occur when parallelizing,or otherwise speeding up, a time-consuming computation.
Received on July 25, 1990; accepted on October 15, 1990 相似文献
6.
The Smith–Waterman algorithm yields a single alignment, which, albeit optimal, can be strongly affected by the choice of the scoring matrix and the gap penalties. Additionally, the scores obtained are dependent upon the lengths of the aligned sequences, requiring a post-analysis conversion. To overcome some of these shortcomings, we developed a Bayesian algorithm for local sequence alignment (BALSA), that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters and all possible alignments. The algorithm can return both the joint and the marginal optimal alignments, samples of alignments drawn from the posterior distribution and the posterior probabilities of gap penalties and scoring matrices. Furthermore, it automatically adjusts for variations in sequence lengths. BALSA was compared with SSEARCH, to date the best performing dynamic programming algorithm in the detection of structural neighbors. Using the SCOP databases PDB40D-B and PDB90D-B, BALSA detected 19.8 and 41.3% of remote homologs whereas SSEARCH detected 18.4 and 38% at an error rate of 1% errors per query over the databases, respectively. 相似文献
7.
Multiple sequence alignment by a pairwise algorithm 总被引:1,自引:0,他引:1
An algorithm is described that processes the results of a conventionalpairwise sequence alignment program to automatically producean unambiguous multiple alignment of many sequences. Unlikeother, more complex, multiple alignment programs, the methoddescribed here is fast enough to be used on almost any multiplesequence alignment problem.
Received on September 25, 1986; accepted on January 29, 1987 相似文献
8.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author. 相似文献
9.
Lombard V Camon EB Parkinson HE Hingamp P Stoesser G Redaschi N 《Bioinformatics (Oxford, England)》2002,18(5):763-764
The submission of multiple sequence alignment data to EMBL has grown 30-fold in the past 10 years, creating a problem of archiving them. The EBI has developed a new public database of multiple sequence alignments called EMBL-Align. It has a dedicated web-based submission tool, Webin-Align. Together they represent a comprehensive data management solution for alignment data. Webin-Align accepts all the common alignment formats and can display data in CLUSTALW format as well as a new standard EMBL-Align flat file format. The alignments are stored in the EMBL-Align database and can be queried from the EBI SRS (Sequence Retrieval System) server. AVAILABILITY: Webin-Align: http://www.ebi.ac.uk/embl/Submission/align_top.html, EMBL-Align: ftp://ftp.ebi.ac.uk/pub/databases/embl/align, http://srs.ebi.ac.uk/ 相似文献
10.
Ishikawa Masato; Toya Tomoyuki; Hoshida Masaki; Nitta Katsumi; Ogiwara Atushi; Kanehisa Minoru 《Bioinformatics (Oxford, England)》1993,9(3):267-273
We have developed simulated annealing algorithms to solve theproblem of multiple sequence alignment. The algorithm wns shownto give the optimal solution as confirmed by the rigorous dynamicprogramming algorithm for three-sequence alignment. To overcomelong execution times for simulated annealing, we utilized aparallel computer. A sequential algorithm, a simple parallelalgorithm and the temperature parallel algorithm were testedon a problem. The results were compared with the result obtainedby a conventional tree-based algorithm where alignments weremerged by two-' dynamic programming. Every annealing algorithmproduced a better energy value than the conventional algorithm.The best energy value, which probably represents the optimalsolution, wns reached within a reasonable time by both of theparallel annealing algorithms. We consider the temperature parallelalgorithm of simulated annealing to be the most suitable forfinding the optimal multiple sequence alignment because thealgorithm does not require any scheduling for optimization.The algorithm is also usefiui for refining multiple alignmentsobtained by other hewistic methods. 相似文献
11.
Background
We present a complete re-implementation of the segment-based approach to multiple protein alignment that contains a number of improvements compared to the previous version 2.2 of DIALIGN. This previous version is superior to Needleman-Wunsch-based multi-alignment programs on locally related sequence sets. However, it is often outperformed by these methods on data sets with global but weak similarity at the primary-sequence level. 相似文献12.
Using data from the CATH structure classification, we have assessed the blastp, fasta, smith-waterman and gapped-blast algorithms, developed a portable normalization scheme and identified safe thresholds for database searching. Of the four methods assessed, fasta, smith-waterman and gapped-blast perform similarly, whereas the sensitivity of blastp was much lower. Introduction of an intermediate sequence search substantially improved the results. When tested on a set of relationships that could not be identified by blastp, intermediate sequences were able to find double the number of relationships identified by the smith-waterman algorithm alone. However, we found that the benefit of using intermediates varied considerably between each family and depended not only on the number of available sequences, but also their diversity. In an attempt to increase sensitivity further, a multiple intermediate sequence search (MISS) procedure was developed. When assessed on 1906 cases from a wide range of homologous families that could not be detected by the previous approaches, MISS was able to identify 241 additional relationships. MISS uses the full extent of sequence diversity to detect additional relationships, but does not consider any structure-specific information. For this reason, it is more generally applicable than fold recognition and threading methods, which require a library of known structures. 相似文献
13.
An implementation of Profilesearch (a technique to search forrelationships between a protein sequence and multiply alignedsequences) for a parallel computer is described. The numbercrunchingmachine, consisting of 21 T800 transputers, is connected toa Macintosh IIcx host computer. The program utilizes a standardMacintosh application as its userinterface, resultingin a transparent and userfriendly environment for addressingthe parallel computer. The program is independent of the nwnberof available processors and exceeds the speed of a VAXstation3200 with only one transputer in operation, thus allowing cheapand fast database searches with a PC frontend. For a largernwnber of processors, the speed increase is approximately linearwith no obvious symptoms of saturation with the available maximwnof 21 transputers. The program and environment are usefid tosearch quickly and easily for similarities between a singlesequence or sequence set and individual sequences containedin a large database. The alignment is determined by typicaldynamic programming techniques. 相似文献
14.
Multiple sequence alignment plays an important role in molecular sequence analysis. An alignment is the arrangement of two (pairwise alignment) or more (multiple alignment) sequences of 'residues' (nucleotides or amino acids) that maximizes the similarities between them. Algorithmically, the problem consists of opening and extending gaps in the sequences to maximize an objective function (measurement of similarity). A simple genetic algorithm was developed and implemented in the software MSA-GA. Genetic algorithms, a class of evolutionary algorithms, are well suited for problems of this nature since residues and gaps are discrete units. An evolutionary algorithm cannot compete in terms of speed with progressive alignment methods but it has the advantage of being able to correct for initially misaligned sequences; which is not possible with the progressive method. This was shown using the BaliBase benchmark, where Clustal-W alignments were used to seed the initial population in MSA-GA, improving outcome. Alignment scoring functions still constitute an open field of research, and it is important to develop methods that simplify the testing of new functions. A general evolutionary framework for testing and implementing different scoring functions was developed. The results show that a simple genetic algorithm is capable of optimizing an alignment without the need of the excessively complex operators used in prior study. The clear distinction between objective function and genetic algorithms used in MSA-GA makes extending and/or replacing objective functions a trivial task. 相似文献
15.
We describe a new approach to multiple sequence alignment using genetic algorithms and an associated software package called SAGA. The method involves evolving a population of alignments in a quasi evolutionary manner and gradually improving the fitness of the population as measured by an objective function which measures multiple alignment quality. SAGA uses an automatic scheduling scheme to control the usage of 22 different operators for combining alignments or mutating them between generations. When used to optimise the well known sums of pairs objective function, SAGA performs better than some of the widely used alternative packages. This is seen with respect to the ability to achieve an optimal solution and with regard to the accuracy of alignment by comparison with reference alignments based on sequences of known tertiary structure. The general attraction of the approach is the ability to optimise any objective function that one can invent. 相似文献
16.
17.
18.
19.
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform 总被引:38,自引:3,他引:38
A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homo logous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy. 相似文献
20.
DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches 总被引:15,自引:4,他引:15 下载免费PDF全文
DbClustal addresses the important problem of the automatic multiple alignment of the top scoring full-length sequences detected by a database homology search. By combining the advantages of both local and global alignment algorithms into a single system, DbClustal is able to provide accurate global alignments of highly divergent, complex sequence sets. Local alignment information is incorporated into a ClustalW global alignment in the form of a list of anchor points between pairs of sequences. The method is demonstrated using anchors supplied by the Blast post-processing program, Ballast. The rapidity and reliability of DbClustal have been demonstrated using the recently annotated Pyrococcus abyssi proteome where the number of alignments with totally misaligned sequences was reduced from 20% to <2%. A web site has been implemented proposing BlastP database searches with automatic alignment of the top hits by DbClustal. 相似文献