Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity the number of short local match fragments easily becomes intractable. Then, fragment chaining is a powerful lever to quickly connect, score, and rank the fragments to improve the specificity.
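
The chaining idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the method of the paper: the fragment tuple layout, the `best_chain` name, the quadratic-time dynamic program, and the absence of any gap penalty between fragments are all simplifying assumptions made here for illustration.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (q_start, q_end, t_start, t_end, score),
# with half-open, non-empty intervals on the query (q) and target (t).
Fragment = Tuple[int, int, int, int, float]


def best_chain(fragments: List[Fragment]) -> Tuple[float, List[Fragment]]:
    """Return the highest-scoring colinear chain of non-overlapping fragments.

    Fragment j may follow fragment i if i ends before j starts on both
    sequences. The chain score is the plain sum of fragment scores
    (no gap penalty, which is an assumption of this sketch).
    """
    if not fragments:
        return 0.0, []
    # Sorting by query start guarantees predecessors come first.
    frags = sorted(fragments, key=lambda f: (f[0], f[2]))
    n = len(frags)
    best = [f[4] for f in frags]   # best chain score ending at each fragment
    prev = [-1] * n                # back-pointers for chain recovery
    for j in range(n):
        for i in range(j):
            if frags[i][1] <= frags[j][0] and frags[i][3] <= frags[j][2]:
                candidate = best[i] + frags[j][4]
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    end = max(range(n), key=best.__getitem__)
    score = best[end]
    chain = []
    while end != -1:
        chain.append(frags[end])
        end = prev[end]
    chain.reverse()
    return score, chain
```

Ranking chains by `score` then gives a simple way to prefer long, consistent runs of fragments over isolated short matches.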

Fast, optimal alignment of three sequences using linear gap costs   (Cited by: 2 in total; 0 self-citations, 2 by others)
Alignment algorithms can be used to infer a relationship between sequences when the true relationship is unknown. Simple alignment algorithms use a cost function that gives a fixed cost to each possible point mutation: mismatch, deletion, insertion. These algorithms tend to find optimal alignments that have many small gaps. It is more biologically plausible to have fewer longer gaps rather than many small gaps in an alignment. To address this issue, linear gap cost algorithms are in common use for aligning biological sequence data. More reliable inferences are obtained by aligning more than two sequences at a time. The obvious dynamic programming algorithm for optimally aligning k sequences of length n runs in O(n^k) time. This is impractical if k ≥ 3 and n is of any reasonable length. Thus, there are many heuristics for aligning k sequences; however, they are not guaranteed to find an optimal alignment. In this paper, we present a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs. This gives the same results as the dynamic programming algorithm for three sequences, but typically does so much more quickly. It is particularly fast when the (three-way) edit distance is small. Our algorithm uses a speed-up technique based on Ukkonen's greedy algorithm (Ukkonen, 1983), which he presented for two sequences and simple costs.

Xie  Jingzhao  Zhou  Run  Sun  Gang  Sun  Jian  Yu  Hongfang 《Cluster computing》2022,25(2):1321-1339
Cluster Computing - With the rapid development of wireless communication technology and the rapid popularization of various mobile devices, an increasing number of users want the ability to access...

Optimal sequence alignment using affine gap costs   (Cited by: 27 in total; 0 self-citations, 27 by others)
When comparing two biological sequences, it is often desirable for a gap to be assigned a cost not directly proportional to its length. If affine gap costs are employed, in other words if opening a gap costs v and each null in the gap costs u, the algorithm of Gotoh (1982, J. molec. Biol. 162, 705) finds the minimum cost of aligning two sequences in order MN steps. Gotoh's algorithm attempts to find only one from among possibly many optimal (minimum-cost) alignments, but does not always succeed. This paper provides an example for which this part of Gotoh's algorithm fails and describes an algorithm that finds all and only the optimal alignments. This modification of Gotoh's algorithm still requires order MN steps. A more precise form of path graph than previously used is needed to represent accurately all optimal alignments for affine gap costs.
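
As a rough illustration of the order MN computation described in this abstract, the following sketch computes only the minimum alignment cost under affine gap costs, using the standard three-state dynamic program. It does not reproduce the traceback or the path graph discussed in the paper. The function name, the default values of `v` and `u`, and the uniform `mismatch` cost are assumptions chosen for the example.

```python
import math


def affine_alignment_cost(a: str, b: str, mismatch: float = 1.0,
                          v: float = 2.0, u: float = 0.5) -> float:
    """Minimum global alignment cost where a gap of length k costs v + u*k."""
    m, n = len(a), len(b)
    inf = math.inf
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a null;
    # Y: b[j-1] aligned to a null.
    M = [[inf] * (n + 1) for _ in range(m + 1)]
    X = [[inf] * (n + 1) for _ in range(m + 1)]
    Y = [[inf] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = v + u * i          # one leading gap of length i
    for j in range(1, n + 1):
        Y[0][j] = v + u * j          # one leading gap of length j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap costs u; opening a new one costs v + u.
            X[i][j] = min(X[i - 1][j] + u,
                          M[i - 1][j] + v + u,
                          Y[i - 1][j] + v + u)
            Y[i][j] = min(Y[i][j - 1] + u,
                          M[i][j - 1] + v + u,
                          X[i][j - 1] + v + u)
    return min(M[m][n], X[m][n], Y[m][n])
```

With these illustrative parameters, aligning "ACGT" against "AT" is cheapest with a single gap of length 2 (cost v + 2u) rather than two separate gaps of length 1.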

MOTIVATION: Protein structures are flexible and undergo structural rearrangements as part of their function, and yet most existing protein structure comparison methods treat them as rigid bodies, which may lead to incorrect alignment. RESULTS: We have developed the Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT), a new method for structural alignment of proteins. The FATCAT approach simultaneously addresses the two major goals of flexible structure alignment: optimizing the alignment and minimizing the number of rigid-body movements (twists) around pivot points (hinges) introduced in the reference protein. In contrast, currently existing flexible structure alignment programs treat the hinge detection as a post-process of a standard rigid body alignment. We illustrate the advantages of the FATCAT approach by several examples of comparison between proteins known to adopt different conformations, where the FATCAT algorithm achieves more accurate structure alignments than current methods, while at the same time introducing fewer hinges.



Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs, G(k) = a + c ln k, are more biologically realistic than affine gap costs, G(k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy significantly enough to merit their use over the faster affine gap costs.
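
The two gap cost functions named in this abstract can be compared directly. The sketch below simply evaluates G(k) = a + bk and G(k) = a + c ln k; the function names and the parameter values in the demonstration loop are illustrative assumptions and are not taken from the paper.

```python
import math


def affine_gap_cost(k: int, a: float, b: float) -> float:
    """Affine gap cost G(k) = a + b*k for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * k


def log_gap_cost(k: int, a: float, c: float) -> float:
    """Logarithmic gap cost G(k) = a + c*ln(k) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + c * math.log(k)


if __name__ == "__main__":
    a, b, c = 4.0, 1.0, 1.0  # hypothetical parameter values
    for k in (1, 10, 100):
        print(k, affine_gap_cost(k, a, b), round(log_gap_cost(k, a, c), 2))
```

For equal coefficients the logarithmic cost grows far more slowly with gap length, so long indels are penalized much less than under the affine scheme.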

We have found certain conserved motifs and secondary structural patterns present in the vicinity of interior domain boundary points (dbps) by a data-driven approach without any a priori constraint on the type and number of such features, and without any requirement of sequence homology. We have used these motifs and patterns to rerank the solutions obtained by the well-known domain guess by size (DGS) algorithm. We predict, overall, five solutions. The overall (i.e., top five) predictions of our method [domain boundary prediction using conserved patterns (DPCP)] improve the average accuracy of the top five solutions of DGS from 71.74 to 82.88 % in the case of two-continuous-domain proteins, and from 21.38 to 80.56 % for two-discontinuous-domain proteins. Considering only the top solution, the gains in accuracy are from 0 to 72.74 % for two-continuous-domain proteins with chain lengths up to 300 residues, and from 0 to 62.85 % for those with up to 400 residues. In the case of discontinuous domains, top_min solutions (the minimum number of solutions required for predicting all dbps of a protein) of DPCP improve the average accuracy of DGS prediction from 12.5 to 76.3 % in proteins with chain lengths up to 300 residues, and from 13.33 to 70.84 % for proteins with up to 400 residues. In our validation experiments, the performance of DPCP was also found to be superior to that of domain identification from secondary structure element alignment (DomSSEA), the best method reported so far for efficient prediction of domain boundaries using predicted secondary structure. The average accuracies of the topmost solution of DomSSEA are 61 and 52 % for proteins with up to 300 and 400 residues, respectively, in the case of continuous domains; the corresponding accuracies for the discontinuous case are 28 and 21 %.

Multiple sequence alignment is one of the dominant problems in computational molecular biology. Numerous scoring functions and methods have been proposed, most of which result in NP-hard problems. In this paper we propose for the first time a general formulation for multiple alignment with arbitrary gap costs based on an integer linear program (ILP). In addition we describe a branch-and-cut algorithm to effectively solve the ILP to optimality. We evaluate the performance of our approach in terms of running time and quality of the alignments using the BAliBase database of reference alignments. The results show that our implementation ranks amongst the best programs developed so far.

Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively nonconserved regions. To take advantage of this structure, a simple generalization of affine gap costs is proposed that allows nonconserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs is shown empirically to follow an extreme value distribution. Examples are presented for which generalized affine gap costs yield superior alignments from the standpoints both of statistical significance and of alignment accuracy. Guidelines for selecting generalized affine gap costs are discussed, as is their possible application to multiple alignment. Proteins 32:88–96, 1998. Published 1998 Wiley-Liss, Inc.
  • 1 This article is a US government work and, as such, is in the public domain in the United States of America.

    Hunter CG  Subramaniam S 《Proteins》2003,50(4):580-588
    A novel clustering method is used to cluster protein fragments by shape. The centroids (mean fragments from each cluster) form a basis set of structural motifs. A database of 156,643 seven-residue fragments is used, and eight different basis sets with varying levels of resolution are generated. Coarse basis sets contain tens of centroids and provide meaningful local shapes, which are more detailed than the traditional secondary structure categories. High-resolution basis sets contain thousands of centroids and can be used to model tertiary structure of longer segments. The basis sets generated fit nontraining set proteins with the expected accuracy.

    MOTIVATION: It is widely recognized that homology search and ortholog clustering are very useful for analyzing biological sequences. However, recent growth of sequence database size makes homolog detection difficult, and rapid and accurate methods are required. RESULTS: We present a novel method for fast and accurate homology detection, assuming that the Smith-Waterman (SW) scores between all similar sequence pairs in a target database are computed and stored. In this method, SW alignment is computed only if the upper bound, which is derived from our novel inequality, is higher than the given threshold. In contrast to other methods such as FASTA and BLAST, this method is guaranteed to find all sequences whose scores against the query are higher than the specified threshold. Results of computational experiments suggest that the method is dozens of times faster than SSEARCH if genome sequence data of closely related species are available.

    The outcome of a phylogenetic analysis based on DNA sequence data is highly dependent on the homology-assignment step and may vary with alignment parameter costs. Robustness to changes in parameter costs is therefore a desired quality of a data set because the final conclusions will be less dependent on selecting a precise optimal cost set. Here, node stability is explored in relationship to separate versus combined analysis in three different data sets, all including several data partitions. Robustness to changes in cost sets is measured as the number of successive changes that can be made in a given cost set before a specific clade is lost. The changes are in all cases base change cost, gap penalties, and adding/removing/changing affine gap costs. When combining data partitions, the number of clades that appear in the entire parameter space is not remarkably increased; in some cases this number even decreased. However, when combining data partitions, the trees from cost sets including affine gap costs were always more similar than the trees from cost sets without affine gap costs. This was not the case when the data partitions were analyzed independently. When data sets were combined, 80% of the clades found under cost sets including affine gap costs resisted at least one change to the cost set.

    Exploring a large number of parameter sets in sensitivity analyses of direct optimization parsimony can be costly in terms of time and computing resources, and there is little a priori guidance available for reasonable limits to these search parameters. For this reason, we sought a general‐purpose upper limit for gap costs in the direct optimization program POY to streamline this process. To test the performance of POY as gap costs increase, we simulated data onto a pre‐set topology using a GTR + I + G model modified to include gaps by adding them according to a negative‐binomial model. Gaps were then removed and the data were analysed in POY at increasing gap costs. Increasing gap costs consistently resulted in reduced phylogenetic accuracy across trees of different relative branch lengths. Decoupling gap insertion and gap extension costs recovered a fraction of the accuracy lost by having both high gap insertion and gap extension costs, but only in trees with long internal nodes. To determine whether loss of phylogenetic accuracy was node‐specific, we designed a small dataset with a constrained node, where all possible combinations of cost substitution and different percentages of gap versus nucleotide changes were explored. These analyses showed that the effects of gap insertion and extension are node‐specific, and the minimum threshold for convergence on gap‐supported nodes is similar to the threshold for accuracy loss found in the larger simulated datasets. Subsequent analyses of empirical data revealed that a similar pattern of loss with gap cost increase can occur with ribosomal genes (18S, 28S, 16S and 12S) but this pattern was not seen in the intron data (myoglobin II) examined. 
In conjunction with previously published congruence‐based studies, the results suggest that POY sensitivity analyses can be streamlined and made more accurate if gap insertion and extension costs follow, as a guideline, a limit of four times the highest base‐transformation cost. © The Willi Hennig Society 2008.

    Ngila is an application that will find the best alignment of a pair of sequences using log-affine gap costs, which are the most biologically realistic gap costs. AVAILABILITY: Portable source code for Ngila can be downloaded from its development website, http://scit.us/projects/ngila/. It compiles on most operating systems.

    A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
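
The objective that an RF supertree minimizes, the total RF distance to the input trees, can be sketched as follows. This is only the scoring function, not the ECR-based search heuristic of the paper. The representation is an assumption of this sketch: each unrooted tree is given as a collection of non-trivial splits, each split recorded by one of its two sides, and the supertree is restricted to the leaf set of each input tree before comparison. All function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

Split = FrozenSet[str]


def canonical_split(side: Iterable[str], leaves: FrozenSet[str]) -> Optional[Split]:
    """Restrict a split to `leaves` and return a canonical side, or None if trivial."""
    inside = frozenset(side) & leaves
    outside = leaves - inside
    if len(inside) < 2 or len(outside) < 2:
        return None  # trivial splits carry no topological information
    reference = min(leaves)
    # Always report the side that does not contain the reference taxon.
    return outside if reference in inside else inside


def restricted_splits(splits: Iterable[Iterable[str]],
                      leaves: FrozenSet[str]) -> Set[Split]:
    """Set of canonical non-trivial splits after restriction to `leaves`."""
    result = set()
    for side in splits:
        canon = canonical_split(side, leaves)
        if canon is not None:
            result.add(canon)
    return result


def rf_distance(splits_a, splits_b, leaves: FrozenSet[str]) -> int:
    """RF distance: number of splits present in exactly one of the two trees."""
    return len(restricted_splits(splits_a, leaves) ^ restricted_splits(splits_b, leaves))


def total_rf_distance(supertree_splits,
                      input_trees: List[Tuple[Iterable[str], Iterable[Iterable[str]]]]) -> int:
    """Sum of RF distances between the restricted supertree and each input tree."""
    return sum(rf_distance(supertree_splits, splits, frozenset(leaves))
               for leaves, splits in input_trees)
```

A local search heuristic would repeatedly modify the supertree and keep changes that lower `total_rf_distance`.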

    ABSTRACT: BACKGROUND: Single nucleotide polymorphism (SNP) genotyping assays normally give rise to a certain percentage of no-calls; the problem becomes severe when the target organisms, such as cattle, do not have a high-resolution genomic sequence. Missing SNP genotypes, when related to target traits, would confound downstream data analyses such as genome-wide association studies (GWAS). Existing methods for recovering the missing values are successful only to some extent: they are either accurate but not fast enough, or fast but not accurate enough. RESULTS: For a target missing genotype, we take only the SNP loci within a genetic distance vicinity and only the samples within a similarity vicinity into our local imputation process. For missing genotype imputation, the comparative performance evaluations through extensive simulation studies using real human and cattle genotype datasets demonstrated that our nearest neighbor based local imputation method was one of the most efficient methods, and outperformed existing methods except the time-consuming fastPHASE; for missing haplotype allele imputation, the comparative performance evaluations using real mouse haplotype datasets demonstrated that our method was not only one of the most efficient methods, but also one of the most accurate methods. CONCLUSIONS: Given that fastPHASE requires a long imputation time on medium to high density datasets, and that our nearest neighbor based local imputation method only performed slightly worse, yet better than all other methods, one might want to adopt our method as an alternative missing SNP genotype or missing haplotype allele imputation method.

    ABSTRACT: BACKGROUND: ProGraphMSA is a state-of-the-art multiple sequence alignment tool which produces phylogenetically sensible gap patterns while maintaining robustness by allowing alternative splicings and errors in the branching pattern of the guide tree. RESULTS: This is achieved by incorporating a graph-based sequence representation combined with the advantages of the phylogeny-aware gap placement algorithm of Prank. Further, we account for variations in the substitution pattern by implementing context-specific profiles as in CS-Blast and by estimating amino acid frequencies from input data. CONCLUSIONS: ProGraphMSA shows good performance and competitive execution times in various benchmarks.

    Many closely related populations are distinguished by variation in sexual signals and this variation is hypothesized to play an important role in reproductive isolation and speciation. Within populations, there is considerable evidence that sexual signals provide information about the incidence and severity of parasite infections, but it remains unclear if variation in parasite communities across space could play a role in initiating or maintaining sexual trait divergence. To test for variation in parasite-associated selection, we compared three barn swallow subspecies with divergent sexual signals. We found that parasite community structure and host tolerance to ecologically similar parasites varied between subspecies. Across subspecies we also found that different parasites were costly in terms of male survival and reproductive success. For each subspecies, the preferred sexual signal(s) were associated with the most costly local parasite(s), indicating that divergent signals are providing relevant information to females about local parasite communities. Across subspecies, the same traits were often associated with different parasites, indicating that parasite-sexual signal links are quite flexible and may evolve relatively quickly. This study provides evidence for (1) variation in parasite communities and (2) different parasite-sexual signal links among three closely related subspecies with divergent sexual signal traits, suggesting that parasites may play an important role in initiating and/or maintaining the divergence of sexual signals among these closely related, yet geographically isolated populations.

