Similar articles
 Found 20 similar articles (search time: 0 ms)
1.

Background  

Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity, the number of short local match fragments easily becomes intractable. In that case, fragment chaining is a powerful way to quickly connect, score, and rank the fragments and thereby improve specificity.
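As a rough illustration of what fragment chaining means, the sketch below connects colinear, non-overlapping match fragments into the highest-scoring chain with a simple quadratic dynamic program. It is not the method of the paper summarized above: the fragment layout `(q_start, q_end, t_start, t_end, score)` with half-open intervals, the function name `chain_fragments`, and the absence of any gap penalty between fragments are all assumptions made for the example.

```python
def chain_fragments(fragments):
    """Return (best_score, chain) for colinear, non-overlapping fragments.

    Each fragment is a hypothetical tuple (q_start, q_end, t_start, t_end, score)
    with half-open intervals of positive length on the query and the target.
    """
    if not fragments:
        return 0, []
    # Sorting by query start guarantees every valid predecessor comes earlier.
    frags = sorted(fragments, key=lambda f: (f[0], f[2]))
    n = len(frags)
    best = [f[4] for f in frags]   # best chain score ending at fragment i
    prev = [-1] * n                # back-pointer to the previous fragment
    for i in range(n):
        q_start, _, t_start, _, score = frags[i]
        for j in range(i):
            # Fragment j may precede i only if it ends before i starts
            # on both the query and the target.
            if frags[j][1] <= q_start and frags[j][3] <= t_start:
                if best[j] + score > best[i]:
                    best[i] = best[j] + score
                    prev[i] = j
    # Trace back from the fragment with the highest chain score.
    end = max(range(n), key=best.__getitem__)
    best_score = best[end]
    chain = []
    while end != -1:
        chain.append(frags[end])
        end = prev[end]
    chain.reverse()
    return best_score, chain


example = [
    (0, 10, 0, 10, 5),
    (12, 20, 15, 25, 4),
    (11, 18, 2, 8, 6),
    (25, 30, 30, 35, 3),
]
score, chain = chain_fragments(example)
print(score, chain)
```

Practical chaining tools typically add penalties for the distance between consecutive fragments and use faster data structures than this O(n^2) loop; the sketch only shows the core idea of ranking fragments by the best chain they belong to.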

2.
Fast, optimal alignment of three sequences using linear gap costs   Total citations: 2 (self-citations: 0, citations by others: 2)
Alignment algorithms can be used to infer a relationship between sequences when the true relationship is unknown. Simple alignment algorithms use a cost function that gives a fixed cost to each possible point mutation: mismatch, deletion, or insertion. These algorithms tend to find optimal alignments that have many small gaps. It is more biologically plausible to have fewer, longer gaps rather than many small gaps in an alignment. To address this issue, linear gap cost algorithms are in common use for aligning biological sequence data. More reliable inferences are obtained by aligning more than two sequences at a time. The obvious dynamic programming algorithm for optimally aligning k sequences of length n runs in O(n^k) time. This is impractical if k ≥ 3 and n is of any reasonable length. Thus, there are many heuristics for aligning k sequences; however, they are not guaranteed to find an optimal alignment. In this paper, we present a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs. This gives the same results as the dynamic programming algorithm for three sequences, but typically does so much more quickly. It is particularly fast when the (three-way) edit distance is small. Our algorithm uses a speed-up technique based on Ukkonen's greedy algorithm (Ukkonen, 1983), which he presented for two sequences and simple costs.
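To make the O(n^k) baseline concrete for k = 3, here is a minimal sketch of the "obvious" cubic dynamic program. It deliberately uses simple costs (a fixed cost per mismatch and per gap symbol, summed over the three sequence pairs in each column), not the linear gap costs or the Ukkonen-style speed-up of the paper above. The function name `align3_cost` and the default cost values are assumptions made for illustration.

```python
from itertools import product


def align3_cost(a, b, c, mismatch=1, gap=1):
    """Minimum sum-of-pairs cost of aligning three sequences (simple costs)."""

    def pair_cost(x, y):
        # None stands for a gap symbol in the current column.
        if x is None and y is None:
            return 0
        if x is None or y is None:
            return gap
        return 0 if x == y else mismatch

    inf = float("inf")
    la, lb, lc = len(a), len(b), len(c)
    # cost[i][j][k] = best cost of aligning a[:i], b[:j], c[:k]
    cost = [[[inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    cost[0][0][0] = 0
    # Each column consumes a symbol from at least one of the three sequences.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                here = cost[i][j][k]
                for di, dj, dk in moves:
                    ni, nj, nk = i + di, j + dj, k + dk
                    if ni > la or nj > lb or nk > lc:
                        continue
                    x = a[i] if di else None
                    y = b[j] if dj else None
                    z = c[k] if dk else None
                    step = pair_cost(x, y) + pair_cost(x, z) + pair_cost(y, z)
                    if here + step < cost[ni][nj][nk]:
                        cost[ni][nj][nk] = here + step
    return cost[la][lb][lc]


print(align3_cost("ACGT", "ACGT", "AGGT"))
```

The table has (n+1)^3 cells and seven transitions per cell, which is why this approach quickly becomes impractical for long sequences and motivates faster exact algorithms.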

3.
Xie Jingzhao, Zhou Run, Sun Gang, Sun Jian, Yu Hongfang. Cluster Computing, 2022, 25(2): 1321-1339
Cluster Computing - With the rapid development of wireless communication technology and the rapid popularization of various mobile devices, an increasing number of users want the ability to access...

4.
MOTIVATION: Protein structures are flexible and undergo structural rearrangements as part of their function, and yet most existing protein structure comparison methods treat them as rigid bodies, which may lead to incorrect alignment. RESULTS: We have developed the Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT), a new method for structural alignment of proteins. The FATCAT approach simultaneously addresses the two major goals of flexible structure alignment: optimizing the alignment and minimizing the number of rigid-body movements (twists) around pivot points (hinges) introduced in the reference protein. In contrast, currently existing flexible structure alignment programs treat hinge detection as a post-processing step after a standard rigid-body alignment. We illustrate the advantages of the FATCAT approach with several examples of comparisons between proteins known to adopt different conformations, where the FATCAT algorithm achieves more accurate structure alignments than current methods while introducing fewer hinges.

5.

Background  

Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs, G(k) = a + c ln k, are more biologically realistic than affine gap costs, G(k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy significantly enough to merit their use over the faster affine gap costs.
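The two gap-cost models can be compared directly. The sketch below implements G(k) = a + bk and G(k) = a + c ln k as given in the abstract; the function names and the parameter values used in the loop are illustrative assumptions, not values from the paper.

```python
import math


def affine_gap_cost(k, a, b):
    """Affine gap cost G(k) = a + b*k for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * k


def log_gap_cost(k, a, c):
    """Logarithmic gap cost G(k) = a + c*ln(k) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + c * math.log(k)


# Illustrative parameters only (assumed, not taken from the paper).
for k in (1, 10, 100):
    print(k, affine_gap_cost(k, a=2.0, b=1.0), round(log_gap_cost(k, a=2.0, c=1.0), 3))
```

With comparable parameters, the logarithmic cost grows far more slowly with gap length, so long indels are penalized much less than under the affine model.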

6.
We have found certain conserved motifs and secondary structural patterns present in the vicinity of interior domain boundary points (dbps) by a data-driven approach without any a priori constraint on the type and number of such features, and without any requirement of sequence homology. We have used these motifs and patterns to rerank the solutions obtained by the well-known domain guess by size (DGS) algorithm. We predict, overall, five solutions. The average accuracy of the overall (i.e., top five) predictions by our method [domain boundary prediction using conserved patterns (DPCP)] improves on the average accuracy of the top five solutions of DGS from 71.74 to 82.88% in the case of two-continuous-domain proteins, and from 21.38 to 80.56% for two-discontinuous-domain proteins. Considering only the top solution, the gains in accuracy are from 0 to 72.74% for two-continuous-domain proteins with chain lengths up to 300 residues, and from 0 to 62.85% for those with up to 400 residues. In the case of discontinuous domains, top_min solutions (the minimum number of solutions required for predicting all dbps of a protein) of DPCP improve the average accuracy of DGS prediction from 12.5 to 76.3% in proteins with chain lengths up to 300 residues, and from 13.33 to 70.84% for proteins with up to 400 residues. In our validation experiments, the performance of DPCP was also found to be superior to that of domain identification from secondary structure element alignment (DomSSEA), the best method reported so far for efficient prediction of domain boundaries using predicted secondary structure. The average accuracies of the topmost solution of DomSSEA are 61 and 52% for proteins with up to 300 and 400 residues, respectively, in the case of continuous domains; the corresponding accuracies for the discontinuous case are 28 and 21%.

7.
Multiple sequence alignment is one of the dominant problems in computational molecular biology. Numerous scoring functions and methods have been proposed, most of which result in NP-hard problems. In this paper we propose for the first time a general formulation for multiple alignment with arbitrary gap costs based on an integer linear program (ILP). In addition, we describe a branch-and-cut algorithm to effectively solve the ILP to optimality. We evaluate the performance of our approach in terms of running time and quality of the alignments using the BAliBase database of reference alignments. The results show that our implementation ranks amongst the best programs developed so far.

8.
MOTIVATION: It is widely recognized that homology search and ortholog clustering are very useful for analyzing biological sequences. However, the recent growth in sequence database size makes homolog detection difficult, and rapid and accurate methods are required. RESULTS: We present a novel method for fast and accurate homology detection, assuming that the Smith-Waterman (SW) scores between all similar sequence pairs in a target database are computed and stored. In this method, SW alignment is computed only if the upper bound, which is derived from our novel inequality, is higher than the given threshold. In contrast to other methods such as FASTA and BLAST, this method is guaranteed to find all sequences whose scores against the query are higher than the specified threshold. Results of computational experiments suggest that the method is dozens of times faster than SSEARCH if genome sequence data of closely related species are available.

9.
The outcome of a phylogenetic analysis based on DNA sequence data is highly dependent on the homology-assignment step and may vary with alignment parameter costs. Robustness to changes in parameter costs is therefore a desired quality of a data set because the final conclusions will be less dependent on selecting a precise optimal cost set. Here, node stability is explored in relationship to separate versus combined analysis in three different data sets, all including several data partitions. Robustness to changes in cost sets is measured as the number of successive changes that can be made in a given cost set before a specific clade is lost. The changes are in all cases base change cost, gap penalties, and adding/removing/changing affine gap costs. When combining data partitions, the number of clades that appear in the entire parameter space is not remarkably increased; in some cases this number even decreased. However, when combining data partitions, the trees from cost sets including affine gap costs were always more similar than the trees from cost sets without affine gap costs. This was not the case when the data partitions were analyzed independently. When data sets were combined, 80% of the clades found under cost sets including affine gap costs resisted at least one change to the cost set.

10.
BACKGROUND: Single nucleotide polymorphism (SNP) genotyping assays normally give rise to a certain percentage of no-calls; the problem becomes severe when the target organisms, such as cattle, do not have a high-resolution genomic sequence. Missing SNP genotypes, when related to target traits, would confound downstream data analyses such as genome-wide association studies (GWAS). Existing methods for recovering the missing values are successful to some extent: they are either accurate but not fast enough, or fast but not accurate enough. RESULTS: For a target missing genotype, we take only the SNP loci within a genetic distance vicinity and only the samples within a similarity vicinity into our local imputation process. For missing genotype imputation, comparative performance evaluations through extensive simulation studies using real human and cattle genotype datasets demonstrated that our nearest neighbor based local imputation method was one of the most efficient methods, and outperformed existing methods except the time-consuming fastPHASE; for missing haplotype allele imputation, comparative performance evaluations using real mouse haplotype datasets demonstrated that our method was not only one of the most efficient methods, but also one of the most accurate. CONCLUSIONS: Given that fastPHASE requires a long imputation time on medium to high density datasets, and that our nearest neighbor based local imputation method performed only slightly worse, yet better than all other methods, one might want to adopt our method as an alternative missing SNP genotype or missing haplotype allele imputation method.

11.
A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
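For readers unfamiliar with the RF distance itself, the following minimal sketch computes it for two rooted trees on the same leaf set by counting the clusters (sets of leaves below an internal node) found in exactly one of the trees. The nested-tuple tree encoding and the function names are assumptions for this example; the supertree setting in the abstract additionally handles unrooted trees and input trees with differing leaf sets, which this sketch does not cover.

```python
def clusters(tree):
    """Return the non-trivial clusters of a rooted tree.

    A tree is a leaf label (str) or a tuple of subtrees (hypothetical encoding).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root_leaves = walk(tree)
    found.discard(root_leaves)  # the full leaf set is shared by every tree
    return found


def rf_distance(tree1, tree2):
    """Number of clusters present in exactly one of the two rooted trees."""
    return len(clusters(tree1) ^ clusters(tree2))


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
print(rf_distance(t1, t2))
```

The total RF distance of a candidate supertree would then be the sum of such distances to all input trees, after restricting the supertree to each input tree's leaves.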

12.
Exploring a large number of parameter sets in sensitivity analyses of direct optimization parsimony can be costly in terms of time and computing resources, and there is little a priori guidance available for reasonable limits to these search parameters. For this reason, we sought a general‐purpose upper limit for gap costs in the direct optimization program POY to streamline this process. To test the performance of POY as gap costs increase, we simulated data onto a pre‐set topology using a GTR + I + G model modified to include gaps by adding them according to a negative‐binomial model. Gaps were then removed and the data were analysed in POY at increasing gap costs. Increasing gap costs consistently resulted in reduced phylogenetic accuracy across trees of different relative branch lengths. Decoupling gap insertion and gap extension costs recovered a fraction of the accuracy lost by having both high gap insertion and gap extension costs, but only in trees with long internal nodes. To determine whether loss of phylogenetic accuracy was node‐specific, we designed a small dataset with a constrained node, where all possible combinations of cost substitution and different percentages of gap versus nucleotide changes were explored. These analyses showed that the effects of gap insertion and extension are node‐specific, and the minimum threshold for convergence on gap‐supported nodes is similar to the threshold for accuracy loss found in the larger simulated datasets. Subsequent analyses of empirical data revealed that a similar pattern of loss with gap cost increase can occur with ribosomal genes (18S, 28S, 16S and 12S) but this pattern was not seen in the intron data (myoglobin II) examined. In conjunction with previously published congruence‐based studies, the results suggest that POY sensitivity analyses can be streamlined and made more accurate if gap insertion and extension costs follow, as a guideline, a limit of four times the highest base‐transformation cost. 
© The Willi Hennig Society 2008.

13.
BACKGROUND: ProGraphMSA is a state-of-the-art multiple sequence alignment tool which produces phylogenetically sensible gap patterns while maintaining robustness by allowing alternative splicings and errors in the branching pattern of the guide tree. RESULTS: This is achieved by incorporating a graph-based sequence representation combined with the advantages of the phylogeny-aware gap placement algorithm of Prank. Further, we account for variations in the substitution pattern by implementing context-specific profiles as in CS-Blast and by estimating amino acid frequencies from input data. CONCLUSIONS: ProGraphMSA shows good performance and competitive execution times in various benchmarks.

14.
Ngila is an application that will find the best alignment of a pair of sequences using log-affine gap costs, which are the most biologically realistic gap costs. AVAILABILITY: Portable source code for Ngila can be downloaded from its development website, http://scit.us/projects/ngila/. It compiles on most operating systems.

15.
16.
17.
We have developed a novel analyte injection method for the SensíQ Pioneer surface plasmon resonance-based biosensor referred to as “FastStep.” By merging buffer and sample streams immediately prior to the reaction flow cells, the instrument is capable of automatically generating a two- or threefold dilution series (of seven or five concentrations, respectively) from a single analyte sample. Using sucrose injections, we demonstrate that the production of each concentration within the step gradient is highly reproducible. For kinetic studies, we developed analysis software that utilizes the sucrose responses to automatically define the concentration of analyte at any point during the association phase. To validate this new approach, we compared the results of standard and FastStep injections for ADP binding to a target kinase and a panel of compounds binding to carbonic anhydrase II. Finally, we illustrate how FastStep can be used in a primary screening mode to obtain a full concentration series of each compound in a fragment library.
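The arithmetic behind such a dilution series is simple to write down. The sketch below lists the concentrations of a twofold series of seven steps and a threefold series of five steps. It assumes that the series starts at the undiluted sample concentration and that each step divides by the dilution factor; the abstract does not state this, and the function name and example concentrations are hypothetical.

```python
def dilution_series(top_concentration, factor, steps):
    """Concentrations of a serial dilution, highest first.

    Assumes the first step is the undiluted sample (an assumption, not
    something stated in the abstract).
    """
    if factor <= 1 or steps < 1:
        raise ValueError("factor must exceed 1 and steps must be positive")
    return [top_concentration / factor ** i for i in range(steps)]


# Example top concentrations are arbitrary illustrative values.
print(dilution_series(64.0, 2, 7))   # twofold series, seven concentrations
print(dilution_series(81.0, 3, 5))   # threefold series, five concentrations
```

Both series span a similar dynamic range (64-fold versus 81-fold), which is consistent with pairing the smaller factor with more steps.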

18.
Gap size and gap shape are two important properties of forest gaps that can influence microsite conditions in a forest stand and determine the recruitment and establishment of trees. There is no universally adopted method for measuring gap size, although several options are available. In addition, few methods have been proposed for measuring gap shape. This paper proposes a photographic method of estimating canopy gap size and gap shape. The proposed method is based on a vertical hemispherical photograph of the gap and is thus named the hemispherical photograph method (HPM). We tested the accuracy of the HPM measure of gap size against two ground-based methods and compared the HPM with other methods. Our results indicate that the HPM measurement of canopy gap size is accurate, but is significantly influenced by the location of the camera. Compared with the ground-based methods, the HPM is more objective and repeatable. Compared with other photographic methods, the HPM is more accurate because its assumptions are more realistic, but it is more labor-intensive because more field measurements are necessary. We conclude that the HPM is a powerful tool for comparative and long-term studies of forest gaps.

19.
20.
A new fragment picker has been developed for CS-Rosetta that combines beneficial features of the original fragment picker, MFR, used with CS-Rosetta, and the fragment picker, NNMake, that was used for purely sequence-based fragment selection in the context of ROSETTA de novo structure prediction. Additionally, the new fragment picker has reduced sensitivity to outliers and other difficult-to-match data points, rendering the protocol more robust and less likely to introduce bias towards wrong conformations in cases where data are bad, missing, or inconclusive. The fragment picker protocol gives significant improvements on 6 of 23 CS-Rosetta targets. An independent benchmark on 39 protein targets, whose NMR data sets were published only after protocol optimization had been finished, also shows significantly improved performance for the new fragment picker (van der Schot et al. in J Biomol NMR, 2013).
