首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The length of an alignment of biological sequences is typically longer than the mean length of its component sequences. (This arises from the insertion of gaps in the alignment.) When such an alignment is used as a profile for the alignment of further sequences (or profiles), it will have a bias toward additional sequences that match the length of the profile, rather than the mean length of sequences in the profile, as the alignment of these well entail fewer (or smaller) insertions) so avoiding gap-penalties). An algorithm is described to correct this bias that entails monitoring the correspondence, for every pair of positions, of the mean separations in both profiles as they are aligned. The correction was incorporated into a standard dynamic programming algorithm through a modification of the gap-penalty, but, unlike other approaches, this modification is not local and takes into consideration the overall alignment of the sequences. This implies that the algorithm cannot guarantee to find the optimal alignment, but tests suggest that close approximations are obtained. The method was tested on protein families by measuring the area in the parameter space of the phase containing the correct multiple alignment. No improvement (increase in phase area) was found with a family that required few gaps to be aligned correctly. However, for highly gapped alignments, a 50% increase in area was obtained with one family and the correct alignment was found for another that could not be aligned with the unbiased method.  相似文献   

2.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author.  相似文献   

3.
Multiple sequence alignment is an essential tool in many areas of biological research, and the accuracy of an alignment can strongly affect the accuracy of a downstream application such as phylogenetic analysis, identification of functional motifs, or polymerase chain reaction primer design. The heads or tails (HoT) method (Landan G, Graur D. 2007. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 24:1380-1383.) assesses the consistency of an alignment by comparing the alignment of a set of sequences with the alignment of the same set of sequences written in reverse order. This study shows that HoT scores and the alignment accuracies are positively correlated, so alignments with higher HoT scores are preferable. However, HoT scores are overestimates of alignment accuracy in general, with the extent of overestimation depending on the method used for multiple sequence alignment.  相似文献   

4.
Multiple sequence alignment with hierarchical clustering.   总被引:155,自引:8,他引:147       下载免费PDF全文
F Corpet 《Nucleic acids research》1988,16(22):10881-10890
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.  相似文献   

5.
6.
An algorithm is presented for the multiple alignment of protein sequences that is both accurate and rapid computationally. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, two sequences are aligned, then the third sequence is aligned against the alignment of both sequences one and two. Similarly, the fourth sequence is aligned against one, two and three. This is repeated until all sequences have been aligned. Iteration is then performed to yield a final alignment. The accuracy of sequence alignment is evaluated from alignment of the secondary structures in a family of proteins. For the globins, the multiple alignment was on average 99% accurate compared to 90% for pairwise comparison of sequences. For the alignment of immunoglobulin constant and variable domains, the use of many sequences yielded an alignment of 63% average accuracy compared to 41% average for individual variable/constant alignments. The multiple alignment algorithm yields an assignment of disulphide connectivity in mammalian serotransferrin that is consistent with crystallographic data, whereas pairwise alignments give an alternative assignment.  相似文献   

7.

Background  

Phylogeny-aware progressive alignment has been found to perform well in phylogenetic alignment benchmarks and to produce superior alignments for the inference of selection on codon sequences. Its implementation in the PRANK alignment program package also allows modelling of complex evolutionary processes and inference of posterior probabilities for sequence sites evolving under each distinct scenario, either simultaneously with the alignment of sequences or as a post-processing step for an existing alignment. This has led to software with many advanced features, and users may find it difficult to generate optimal alignments, visualise the full information in their alignment results, or post-process these results, e.g. by objectively selecting subsets of alignment sites.  相似文献   

8.
An Eulerian path approach to global multiple alignment for DNA sequences.   总被引:3,自引:0,他引:3  
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.  相似文献   

9.
The aim of the work is to develop a common method for estimating the pairwise alignment quality versus the evolutionary distance (degree of homology) between the sequences being compared and versus the type of alignment procedure. 3D alignments or any data on 3D protein structure are not used in the study. Based on the accepted protein sequences evolution model, it is possible to estimate the capability of the concrete alignment algorithm to recover the genuine alignment. In this study a classical Needleman and Wunsch global alignment algorithm has been tested on a set of sequences from the Prefab database. Accuracy and confidence of a global alignment procedure were calculated as dependent on the shares of insertions/deletions and mutations.  相似文献   

10.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.  相似文献   

11.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.  相似文献   

12.
Page RD 《Nucleic acids research》2000,28(20):3839-3845
Comparative analysis is the preferred method of inferring RNA secondary structure, but its use requires considerable expertise and manual effort. As the importance of secondary structure for accurate sequence alignment and phylogenetic analysis becomes increasingly realised, the need for secondary structure models for diverse taxonomic groups becomes more pressing. The number of available structures bears little relation to the relative diversity or importance of the different taxonomic groups. Insects, for example, comprise the largest group of animals and yet are very poorly represented in secondary structure databases. This paper explores the utility of maximum weighted matching (MWM) to help automate the process of comparative analysis by inferring secondary structure for insect mitochondrial small subunit (12S) rRNA sequences. By combining information on correlated changes in substitutions and helix dot plots, MWM can rapidly generate plausible models of secondary structure. These models can be further refined using standard comparative techniques. This paper presents a secondary structure model for insect 12S rRNA based on an alignment of 225 insect sequences and an alignment for 16 exemplar insect sequences. This alignment is used as a template for a web server that automatically generates secondary structures for insect sequences.  相似文献   

13.
An evolutionary model for maximum likelihood alignment of DNA sequences   总被引:16,自引:0,他引:16  
Summary Most algorithms for the alignment of biological sequences are not derived from an evolutionary model. Consequently, these alignment algorithms lack a strong statistical basis. A maximum likelihood method for the alignment of two DNA sequences is presented. This method is based upon a statistical model of DNA sequence evolution for which we have obtained explicit transition probabilities. The evolutionary model can also be used as the basis of procedures that estimate the evolutionary parameters relevant to a pair of unaligned DNA sequences. A parameter-estimation approach which takes into account all possible alignments between two sequences is introduced; the danger of estimating evolutionary parameters from a single alignment is discussed.  相似文献   

14.
Multiple sequence alignment plays an important role in molecular sequence analysis. An alignment is the arrangement of two (pairwise alignment) or more (multiple alignment) sequences of 'residues' (nucleotides or amino acids) that maximizes the similarities between them. Algorithmically, the problem consists of opening and extending gaps in the sequences to maximize an objective function (measurement of similarity). A simple genetic algorithm was developed and implemented in the software MSA-GA. Genetic algorithms, a class of evolutionary algorithms, are well suited for problems of this nature since residues and gaps are discrete units. An evolutionary algorithm cannot compete in terms of speed with progressive alignment methods but it has the advantage of being able to correct for initially misaligned sequences; which is not possible with the progressive method. This was shown using the BaliBase benchmark, where Clustal-W alignments were used to seed the initial population in MSA-GA, improving outcome. Alignment scoring functions still constitute an open field of research, and it is important to develop methods that simplify the testing of new functions. A general evolutionary framework for testing and implementing different scoring functions was developed. The results show that a simple genetic algorithm is capable of optimizing an alignment without the need of the excessively complex operators used in prior study. The clear distinction between objective function and genetic algorithms used in MSA-GA makes extending and/or replacing objective functions a trivial task.  相似文献   

15.
A general protein sequence alignment methodology for detecting a priori unknown common structural and functional regions is described. The method proposed in this paper is based on two basic requirements for a meaningful alignment. First, each sequence or segment of a sequence is characterized by a multivariate physicochemical profile. Second, the alignment is performed by considering all the sequences simultaneously, and the algorithm detects those regions that form a set of similar profiles. In order to test the structural meaning of the alignment obtained from the sequences, quantitative comparisons are performed with structurally conserved regions (SCR) determined from the X-ray structures of three serine proteases. Results suggest that the limits of the SCR may be predicted from the similarities between the physicochemical profiles of the sequences. The procedures are not completely automated. The final step requires a visual screening of alternative pathways in order to determine an optimal alignment.  相似文献   

16.
MOTIVATION: Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. RESULTS: We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. AVAILABILITY: The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.  相似文献   

17.
Summary A method for detecting homology between two protein or nucleic acid sequences which require insertions or deletions for optimum alignment has been devised for use with a computer. Sequences are assessed for possible relationship by Monte Carlo methods involving comparisons between the alignment of the real sequences and alignments of randomly scrambled sequences of the Same composition as the real sequences, each alignment having the optimum number of gaps. As each gap is successively introduced into a comparison (real or random) a maximum score is determined from the similarity of the aligned residues. From the distribution of the maximum alignment scores of randomly scrambled sequences having the same number of gaps, the percentage of random comparisons having higher scores is determined, and the smallest of these percentage levels for each pair of sequences (real or random) indicates the optimum alignment. The fraction of the comparisons of random sequences having percentage levels at their optimum alignment below that of the real sequence comparison at its optimum estimates the probability that such an alignment might have arisen by chance. Related sequences are detected since their optimum alignment score, by virtue of a contribution from ancestral homology in addition to optimised random considerations, occupies a more extreme position in the appropriate frequency distribution of scores than do the majority of optimum scores of randomly scrambled sequences in their appropriate distributions.Application of this optimum match method of sequence comparison shows that the sensitivity of the maximum match method of Needleman and Wunsch (1970) decreases quite dramatically with sequence comparisons which require only a few gaps for a reasonable alignment, or when sequences differ greatly in length. The maximum match method as applied by Barker and Dayhoff (1972) has the additional disadvantage that deletions which have occurred in the longer of two homologous protein sequences further decrease the sensitivity of detection of relationship. The constrained match method of Sankoff and Cedergren (1973) is seen to be misleading since large increments in the alignment score from added gaps do not necessarily result in a high total alignment score required to demonstrate sequence homology.  相似文献   

18.

Background

Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem may be efficiently solved when only two sequences are considered, the exact inference of the optimal alignment easily gets computationally intractable for the multiple sequence alignment case. To cope with the high computational expenses, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods however are not flexible to change the alignment of an already aligned group of sequences in the view of new data, resulting thus in compromises on the quality of the deriving alignment. In this paper we present ReformAlign, a novel meta-alignment approach that may significantly improve on the quality of the deriving alignments from popular aligners. We call ReformAlign a meta-aligner as it requires an initial alignment, for which a variety of alignment programs can be used. The main idea behind ReformAlign is quite straightforward: at first, an existing alignment is used to construct a standard profile which summarizes the initial alignment and then all sequences are individually re-aligned against the formed profile. From each sequence-profile comparison, the alignment of each sequence against the profile is recorded and the final alignment is indirectly inferred by merging all the individual sub-alignments into a unified set. The employment of ReformAlign may often result in alignments which are significantly more accurate than the starting alignments.

Results

We evaluated the effect of ReformAlign on the generated alignments from ten leading alignment methods using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner approach may often lead to statistically significant more accurate alignments. Furthermore, we show that ReformAlign results in more substantial improvement in cases where the starting alignment is of relatively inferior quality or when the input sequences are harder to align.

Conclusions

The proposed profile-based meta-alignment approach seems to be a promising and computationally efficient method that can be combined with practically all popular alignment methods and may lead to significant improvements in the generated alignments.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-265) contains supplementary material, which is available to authorized users.  相似文献   

19.

Background  

Genomic sequence data cannot be fully appreciated in isolation. Comparative genomics – the practice of comparing genomic sequences from different species – plays an increasingly important role in understanding the genotypic differences between species that result in phenotypic differences as well as in revealing patterns of evolutionary relationships. One of the major challenges in comparative genomics is producing a high-quality alignment between two or more related genomic sequences. In recent years, a number of tools have been developed for aligning large genomic sequences. Most utilize heuristic strategies to identify a series of strong sequence similarities, which are then used as anchors to align the regions between the anchor points. The resulting alignment is globally correct, but in many cases is suboptimal locally. We describe a new program, GenAlignRefine, which improves the overall quality of global multiple alignments by using a genetic algorithm to improve local regions of alignment. Regions of low quality are identified, realigned using the program T-Coffee, and then refined using a genetic algorithm. Because a better COFFEE (Consistency based Objective Function For alignmEnt Evaluation) score generally reflects greater alignment quality, the algorithm searches for an alignment that yields a better COFFEE score. To improve the intrinsic slowness of the genetic algorithm, GenAlignRefine was implemented as a parallel, cluster-based program.  相似文献   

20.
BlastAlign uses NCBI blastn to build a multiple nucleotide alignment and is intended for use with sequences that have large indels or are otherwise difficult to align globally. The program builds a matrix representing regions of homology along the sequences, from which it selects the 'most representative' sequence and then extracts the blastn query-anchored multiple alignment for this sequence. The matrix is printed and allows subgroups to be identified visually and an option allows other sequences to be used as the 'most representative'. The program contains elements of both Perl and Python and will run on UNIX (including Mac OSX) and DOS. An additional Perl program BlastAlignP uses tblastn to align nucleotide sequences to a single amino acid sequence, thus allowing an open reading frame to be maintained in the resulting multiple alignment. AVAILABILITY: It is freely available at http://www.bio.ic.ac.uk/research/belshaw/BlastAlign.tar and at http://evolve.zoo.ox.ac.uk/software/blastalign.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号