首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author.  相似文献   

2.
MSAT     
This article describes the development of a new method for multiple sequence alignment based on fold-level protein structure alignments, which provides an improvement in accuracy compared with the most commonly used sequence-only-based techniques. This method integrates the widely used, progressive multiple sequence alignment approach ClustalW with the Topology of Protein Structure (TOPS) topology-based alignment algorithm. The TOPS approach produces a structural alignment for the input protein set by using a topology-based pattern discovery program, providing a set of matched sequence regions that can be used to guide a sequence alignment using ClustalW. The resulting alignments are more reliable than a sequence-only alignment, as determined by 20-fold cross-validation with a set of 106 protein examples from the CATH database, distributed in seven superfold families. The method is particularly effective for sets of proteins that have similar structures at the fold level but low sequence identity. The aim of this research is to contribute towards bridging the gap between protein sequence and structure analysis, in the hope that this can be used to assist the understanding of the relationship between sequence, structure and function. The tool is available at http://balabio.dcs.gla.ac.uk/msat/.  相似文献   

3.
Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.  相似文献   

4.
Several methods have been developed for simultaneous estimation of alignment and tree, of which POY is the most popular. In a 2007 paper published in Systematic Biology, Ogden and Rosenberg reported on a simulation study in which they compared POY to estimating the alignment using ClustalW and then analyzing the resultant alignment using maximum parsimony. They found that ClustalW+MP outperformed POY with respect to alignment and phylogenetic tree accuracy, and they concluded that simultaneous estimation techniques are not competitive with two-phase techniques. Our paper presents a simulation study in which we focus on the NP-hard optimization problem that POY addresses: minimizing treelength. Our study considers the impact of the gap penalty and suggests that the poor performance observed for POY by Ogden and Rosenberg is due to the simple gap penalties they used to score alignment/tree pairs. Our study suggests that optimizing under an affine gap penalty might produce alignments that are better than ClustalW alignments, and competitive with those produced by the best current alignment methods. We also show that optimizing under this affine gap penalty produces trees whose topological accuracy is better than ClustalW+MP, and competitive with the current best two-phase methods.  相似文献   

5.
为了解决生物信息学中基因多序列比对的计算速度慢和软件陈旧的问题,提出了基于Yarn(Yet Another Resource Negotiator)云平台的生物基因多序列比对并行计算方法Yarn_clustalW。分析了clustalW算法的数学模型及其面向MapReduce的任务划分方式,Yarn_clustalW中综合考虑了基因的长度和数目,采用一种基于阈值刻度的任务划分方式。利用NCBI的GenBank生物基因数据作为案例程序进行了测试。实验结果表明:Yarn_clustalW比起多序列比对clustalW串行计算方法具有更快的运行时间与加速比,可以使生物科研人员节省很多时间与精力,方便对于药物靶标的发现,缩短生物药物的开发周期。  相似文献   

6.
Direct optimization frameworks for simultaneously estimating alignments and phylogenies have recently been developed. One such method, implemented in the program POY, is becoming more common for analyses of variable length sequences (e.g., analyses using ribosomal genes) and for combined evidence analyses (morphology + multiple genes). Simulation of sequences containing insertion and deletion events was performed in order to directly compare a widely used method of multiple sequence alignment (ClustalW) and subsequent parsimony analysis in PAUP* with direct optimization via POY. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (clocklike, non-clocklike, and ultrametric). Alignment accuracy scores for the implied alignments from POY and the multiple sequence alignments from ClustalW were calculated and compared. In almost all cases (99.95%), ClustalW produced more accurate alignments than POY-implied alignments, judged by the proportion of correctly identified homologous sites. Topological accuracy (distance to the true tree) for POY topologies and topologies generated under parsimony in PAUP* from the ClustalW alignments were also compared. In 44.94% of the cases, Clustal alignment tree reconstructions via PAUP* were more accurate than POY, whereas in 16.71% of the cases POY reconstructions were more topologically accurate (38.38% of the time they were equally accurate). Comparisons between POY hypothesized alignments and the true alignments indicated that, on average, as alignment error increased, topological accuracy decreased.  相似文献   

7.
Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.  相似文献   

8.
The reconstruction of phylogenetic history is predicated on being able to accurately establish hypotheses of character homology, which involves sequence alignment for studies based on molecular sequence data. In an empirical study investigating nucleotide sequence alignment, we inferred phylogenetic trees for 43 species of the Apicomplexa and 3 of Dinozoa based on complete small-subunit rDNA sequences, using six different multiple-alignment procedures: manual alignment based on the secondary structure of the 18S rRNA molecule, and automated similarity-based alignment algorithms using the PileUp, ClustalW, TreeAlign, MALIGN, and SAM computer programs. Trees were constructed using neighboring-joining, weighted-parsimony, and maximum- likelihood methods. All of the multiple sequence alignment procedures yielded the same basic structure for the estimate of the phylogenetic relationship among the taxa, which presumably represents the underlying phylogenetic signal. However, the placement of many of the taxa was sensitive to the alignment procedure used; and the different alignments produced trees that were on average more dissimilar from each other than did the different tree-building methods used. The multiple alignments from the different procedures varied greatly in length, but aligned sequence length was not a good predictor of the similarity of the resulting phylogenetic trees. We also systematically varied the gap weights (the relative cost of inserting a new gap into a sequence or extending an already-existing gap) for the ClustalW program, and this produced alignments that were at least as different from each other as those produced by the different alignment algorithms. Furthermore, there was no combination of gap weights that produced the same tree as that from the structure alignment, in spite of the fact that many of the alignments were similar in length to the structure alignment. We also investigated the phylogenetic information content of the helical and nonhelical regions of the rDNA, and conclude that the helical regions are the most informative. We therefore conclude that many of the literature disagreements concerning the phylogeny of the Apicomplexa are probably based on differences in sequence alignment strategies rather than differences in data or tree-building methods.   相似文献   

9.
PCMA (profile consistency multiple sequence alignment) is a progressive multiple sequence alignment program that combines two different alignment strategies. Highly similar sequences are aligned in a fast way as in ClustalW, forming pre-aligned groups. The T-Coffee strategy is applied to align the relatively divergent groups based on profile-profile comparison and consistency. The scoring function for local alignments of pre-aligned groups is based on a novel profile-profile comparison method that is a generalization of the PSI-BLAST approach to profile-sequence comparison. PCMA balances speed and accuracy in a flexible way and is suitable for aligning large numbers of sequences. AVAILABILITY: PCMA is freely available for non-commercial use. Pre-compiled versions for several platforms can be downloaded from ftp://iole.swmed.edu/pub/PCMA/.  相似文献   

10.
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.  相似文献   

11.
We describe a program (and a website) to reformat the ClustalX/ClustalW outputs to a format that is widely used in the presentation of sequence alignment data in SNP analysis and molecular systematic studies. This program, CLOURE, CLustal OUtput REformatter, takes the multiple sequence alignment file (nucleic acid or protein) generated from Clustal as input files. The CLOURE-D format presents the Clustal alignment in a format that highlights only the different nucleotides/residues relative to the first query sequence. The program has been written in Visual Basic and will run on a Windows platform. The downloadable program, as well as a web-based server which has also been developed, can be accessed at http://imtech.res.in/~anand/cloure.html.  相似文献   

12.
Amino acid sequence alignment is an extremely useful tool in protein family analysis. Most family characteristics, such as the localization of functional residues, structural constraints and evolutionary relationships may be retrieved through the observation of the conservation pattern highlighted by the alignments. A quantitative score for the conservation in the alignment allows different stages of an alignment to be compared and consequently the alignment information to be efficiently exploited. Many scoring methods have been proposed during the last three decades. Claude Shannon's theory of communication (1948) paved the way for a consistent scoring of protein alignments by considering the residue (or symbol) frequency. A number of modifications have been proposed since that time, but the core statistical approach is still considered one of the best. By combining many database managing tools for treatment of protein sequences, a ClustalW software integration, a flexible symbols treatment and gap normalization functions, Entropy Calculator software has been developed. This new tool provides a global and optimal approach to multiple sequence alignment scoring by offering an easy graphic interface and a series of modification options that help in interpreting alignments and allow conservation pattern inferences to be performed.  相似文献   

13.
MOTIVATION:Aligning multiple proteins based on sequence information alone is challenging if sequence identity is low or there is a significant degree of structural divergence. We present a novel algorithm (SATCHMO) that is designed to address this challenge. SATCHMO simultaneously constructs a tree and a set of multiple sequence alignments, one for each internal node of the tree. The alignment at a given node contains all sequences within its sub-tree, and predicts which positions in those sequences are alignable and which are not. Aligned regions therefore typically get shorter on a path from a leaf to the root as sequences diverge in structure. Current methods either regard all positions as alignable (e.g. ClustalW), or align only those positions believed to be homologous across all sequences (e.g. profile HMM methods); by contrast SATCHMO makes different predictions of alignable regions in different subgroups. SATCHMO generates profile hidden Markov models at each node; these are used to determine branching order, to align sequences and to predict structurally alignable regions. RESULTS: In experiments on the BAliBASE benchmark alignment database, SATCHMO is shown to perform comparably to ClustalW and the UCSC SAM HMM software. Results using SATCHMO to identify protein domains are demonstrated on potassium channels, with implications for the mechanism by which tumor necrosis factor alpha affects potassium current. AVAILABILITY: The software is available for download from http://www.drive5.com/lobster/index.htm  相似文献   

14.
MOTIVATION: We consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. RESULTS: We introduce a new graph-based method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two parameters to permit flexibility in time/accuracy trade off: (1) The size of the window for local optimization. (2) The sparsity of the K-partite graph. Unlike traditional progressive methods, QOMA is independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP score than the existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA. The difference is more significant for distant proteins. AVAILABILITY: The software is available from the authors upon request.  相似文献   

15.
MOTIVATION: Multiple sequence alignment is an essential part of bioinformatics tools for a genome-scale study of genes and their evolution relations. However, making an accurate alignment between remote homologs is challenging. Here, we develop a method, called SPEM, that aligns multiple sequences using pre-processed sequence profiles and predicted secondary structures for pairwise alignment, consistency-based scoring for refinement of the pairwise alignment and a progressive algorithm for final multiple alignment. RESULTS: The alignment accuracy of SPEM is compared with those of established methods such as ClustalW, T-Coffee, MUSCLE, ProbCons and PRALINE(PSI) in easy (homologs) and hard (remote homologs) benchmarks. Results indicate that the average sum of pairwise alignment scores given by SPEM are 7-15% higher than those of the methods compared in aligning remote homologs (sequence identity <30%). Its accuracy for aligning homologs (sequence identity >30%) is statistically indistinguishable from those of the state-of-the-art techniques such as ProbCons or MUSCLE 6.0. AVAILABILITY: The SPEM server and its executables are available on http://theory.med.buffalo.edu.  相似文献   

16.
NDFl、IPFl和HNF4是与胰岛素基因表达有关的DNA结合蛋白,通过比较SWISSPROT蛋白质数据库中人类、小鼠、大鼠这三种核蛋白氨基酸一级序列、模体和结构域,发现其结构十分相似,根据蛋白质结构和功能的关系,推测这些DNA结合蛋白与胰岛素基因结合的核苷酸序列相似;从GenBanl(核酸数据库中获得人类、小鼠、大鼠胰岛素DNA序列,用ClustalW比较三者Promoter区的核苷酸序列,显示有一段核苷酸序列较为相似,同时搜索TRANSFAC基因转录数据库中NDFl、IPFl和NHF4蛋白核苷酸结合位点,发现核酸比对保守的部分序列与TRANSFAC数据库中这三个转录因子的DNA结合位点一致,另外一些核酸保守序列可能为其他未知DNA结合蛋白的结合位点。这种核酸序列比对设计为分子生物学实验寻找和验证胰岛素DNA结合蛋白与核苷酸的结合位点提供了简单而实用的方法。  相似文献   

17.
18.
Multiple sequence alignment (MSA) is a crucial first step in the analysis of genomic and proteomic data. Commonly occurring sequence features, such as deletions and insertions, are known to affect the accuracy of MSA programs, but the extent to which alignment accuracy is affected by the positions of insertions and deletions has not been examined independently of other sources of sequence variation. We assessed the performance of 6 popular MSA programs (ClustalW, DIALIGN-T, MAFFT, MUSCLE, PROBCONS, and T-COFFEE) and one experimental program, PRANK, on amino acid sequences that differed only by short regions of deleted residues. The analysis showed that the absence of residues often led to an incorrect placement of gaps in the alignments, even though the sequences were otherwise identical. In data sets containing sequences with partially overlapping deletions, most MSA programs preferentially aligned the gaps vertically at the expense of incorrectly aligning residues in the flanking regions. Of the programs assessed, only DIALIGN-T was able to place overlapping gaps correctly relative to one another, but this was usually context dependent and was observed only in some of the data sets. In data sets containing sequences with non-overlapping deletions, both DIALIGN-T and MAFFT (G-INS-I) were able to align gaps with near-perfect accuracy, but only MAFFT produced the correct alignment consistently. The same was true for data sets that comprised isoforms of alternatively spliced gene products: both DIALIGN-T and MAFFT produced highly accurate alignments, with MAFFT being the more consistent of the 2 programs. Other programs, notably T-COFFEE and ClustalW, were less accurate. For all data sets, alignments produced by different MSA programs differed markedly, indicating that reliance on a single MSA program may give misleading results. It is therefore advisable to use more than one MSA program when dealing with sequences that may contain deletions or insertions, particularly for high-throughput and pipeline applications where manual refinement of each alignment is not practicable.  相似文献   

19.
We present a method, called BlockMatch, for aligning two blocks, where a block is an RNA multiple sequence alignment with the consensus secondary structure of the alignment in Stockholm format. The method employs a quadratic-time dynamic programming algorithm for aligning columns and column pairs of the multiple alignments in the blocks. Unlike many other tools that can perform pairwise alignment of either single sequences or structures only, BlockMatch takes into account the characteristics of all the sequences in the blocks along with their consensus structures during the alignment process, thus being able to achieve a high-quality alignment result. We apply BlockMatch to phylogeny reconstruction on a set of 5S rRNA sequences taken from fifteen bacteria species. Experimental results showed that the phylogenetic tree generated by our method is more accurate than the tree constructed based on the widely used ClustalW tool. The BlockMatch algorithm is implemented into a web server, accessible at http://bioinformatics.njit.edu/blockmatch. A jar file of the program is also available for download from the web server.  相似文献   

20.
Quality assessment of multiple alignment programs   总被引:7,自引:0,他引:7  
A renewed interest in the multiple sequence alignment problem has given rise to several new algorithms. In contrast to traditional progressive methods, computationally expensive score optimization strategies are now predominantly employed. We systematically tested four methods (Poa, Dialign, T-Coffee and ClustalW) for the speed and quality of their alignments. As test sequences we used structurally derived alignments from BAliBASE and synthetic alignments generated by Rose. The tests included alignments of variable numbers of domains embedded in random spacer sequences. Overall, Dialign was the most accurate in cases with low sequence identity, while T-Coffee won in cases with high sequence identity. The fast Poa algorithm was almost as accurate, while ClustalW could compete only in strictly global cases with high sequence similarity.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号