首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
We present a model of contextual alignment of biological sequences. It is an extension of the classical alignment, in which we assume that the cost of a substitution depends on the surrounding symbols. In this model the cost of transforming one sequence into another depends on the order of editing operations. We present efficient algorithms for calculating this cost, as well as reconstructing (the representation of) all the orders of operations which yield this optimal cost. A precise characterization of the families of linear orders which can emerge this way is given.  相似文献   

2.
We perform a computational study using a new approach to the analysis of protein sequences. The contextual alignment model, proposed recently by Gambin et al. (2002), is based on the assumption that, while constructing an alignment, the score of a substitution of one residue by another depends on the surrounding residues. The contextual alignment scores calculated in this model were used to hierarchical clustering of several protein families from the database of Clusters of Orthologous Groups (COG). The clustering has been also constructed based on the standard approach. The comparative analysis shows that the contextual model results in more consistent clustering trees. The difference, although small, is with no exception in favour of the contextual model. The consistency of the family of trees is measured by several consensus and agreement methods, as well as by the inter-tree distance approach.  相似文献   

3.
We present a software tool CTX-BLAST that incorporates contextual alignment model into the popular protein BLAST program. Our alignment tool allows us to investigate the effect of context-dependency in the protein alignment much more efficient than using previous dynamic algorithms. The software makes use of non-symmetric contextual substitution tables and calculates the statistical significance of a given alignment according to the contextual statistical model. AVAILABILITY: CTX-BLAST is an open source software freely available from www.sourceforge.net/projects/CTX-BLAST. A program for statistical estimation of E-value parameters and the contextual substitution table CTX-BLOSUM62 are also provided. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

4.
This study is motivated by a real problem encountered in the manufacturing and distribution process at a local electronic manufacturer of security devices. We investigate the impact of operations redesign (i.e., operations merging) on the cost of safety stock in a supply chain. A simple safety stock method is used to derive a model for estimating safety stock levels. Our result shows that operations redesign can have a significant impact on safety stock investment. We extend and complement the existing literature in the following aspects: (i) we address the issue of safety stock deployment, i.e., we not only investigate the problem of how many operations should be delayed, but also determine which operations need to be delayed, (ii) we provide an efficient heuristic algorithm to determine which operations need to be merged, and (iii) we find the optimal operations redesign strategies under some special cases. Our analysis also reveals some important conditions and insights for better operations redesign, which enable us not only to decide when an operations redesign is appropriate, but also to suggest the scale and the format of the operations redesign.  相似文献   

5.

Background  

Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.  相似文献   

6.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.  相似文献   

7.
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound” (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.  相似文献   

8.
In the class of repeated sequences that occur in DNA, minisatellites have been found polymorphic and became useful tools in genetic mapping and forensic studies. They consist of a heterogeneous tandem array of a short repeat unit. The slightly different units along the array are called variants. Minisatellites evolve mainly through tandem duplications and tandem deletions of variants. Jeffreys et al. (1997) devised a method to obtain the sequence of variants along the array in a digital code and called such sequences maps. Minisatellite maps give access to the detail of mutation processes at work on such loci. In this paper, we design an algorithm to compare two maps under an evolutionary model that includes deletion, insertion, mutation, tandem duplication, and tandem deletion of a variant. Our method computes an optimal alignment in reasonable time; and the alignment score, i.e., the weighted sum of its elementary operations, is a distance metric between maps. The main difficulty is that the optimal sequence of operations depends on the order in which they are applied to the map. Taking the maps of the minisatellite MSY1 of 609 men, we computed all pairwise distances and reconstructed an evolutionary tree of these individuals. MSY1 (DYF155S1) is a hypervariable locus on the Y chromosome. In our tree, the populations of some haplogroups are monophyletic, showing that one can decipher a microevolutionary signal using minisatellite maps comparison.  相似文献   

9.
RNA molecules whose secondary structures contain similar substructures often have similar functions. Therefore, an important task in the study of RNA is to develop methods for discovering substructures in RNA secondary structures that occur frequently (also referred to as motifs). In this paper, we consider the problem of computing an optimal local alignment of two given labeled ordered forests F1 and F2. This problem asks for a substructure of F1 and a substructure of F2 that exhibit a high similarity. Since an RNA molecule's secondary structure can be represented as a labeled ordered forest, the problem we study has a direct application to finding potential motifs. We generalize the previously studied concept of a closed subforest to a gapped subforest and present the first algorithm for computing the optimal local gapped subforest alignment of F1 and F2. We also show that our technique can improve the time and space complexity of the previously most efficient algorithm for optimal local closed subforest alignment. Furthermore, we prove that a special case of our local gapped subforest alignment problem is equivalent to a problem known in the literature as the local sequence-structure alignment problem (lssa) and modify our main algorithm to obtain a much faster algorithm for lssa than the one previously proposed. An implementation of our algorithm is available at www.comp.nus.edu.sg/~bioinfo/LGSFAligner/. Its running time is significantly faster than the original lssa program.  相似文献   

10.
Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.  相似文献   

11.
We present a model of disease transmission on a regular and small world network and compare different control options. Comparison is based on a total cost of epidemic, including cost of palliative treatment of ill individuals and preventive cost aimed at vaccination or culling of susceptible individuals. Disease is characterized by pre-symptomatic phase, which makes detection and control difficult. Three general strategies emerge: global preventive treatment, local treatment within a neighborhood of certain size and only palliative treatment with no prevention. While the choice between the strategies depends on a relative cost of palliative and preventive treatment, the details of the local strategy and, in particular, the size of the optimal treatment neighborhood depend on the epidemiological factors. The required extent of prevention is proportional to the size of the infection neighborhood, but depends on time till detection and time till treatment in a non-nonlinear (power) law. The optimal size of control neighborhood is also highly sensitive to the relative cost, particularly for inefficient detection and control application. These results have important consequences for design of prevention strategies aiming at emerging diseases for which parameters are not nessecerly known in advance.  相似文献   

12.
Nonhomogeneous substitution models have been introduced for phylogenetic inference when the substitution process is nonstationary, for example, when sequence composition differs between lineages. Existing models can have many parameters, and it is then difficult and computationally expensive to learn the parameters and to select the optimal model complexity. We extend an existing nonhomogeneous substitution model by introducing a reversible jump Markov chain Monte Carlo method for efficient Bayesian inference of the model order along with other phylogenetic parameters of interest. We also introduce a new hierarchical prior which leads to more reasonable results when only a small number of lineages share a particular substitution process. The method is implemented in the PHASE software, which includes specialized substitution models for RNA genes with conserved secondary structure. We apply an RNA-specific nonhomogeneous model to a structure-based alignment of rRNA sequences spanning the entire tree of life. A previous study of the same genes from a similar set of species found robust evidence for a mesophilic last universal common ancestor (LUCA) by inference of the G+C composition at the root of the tree. In the present study, we find that the helical GC composition at the root is strongly dependent on the root position. With a bacterial rooting, we find that there is no longer strong support for either a mesophile or a thermophile LUCA, although a hyperthermophile LUCA remains unlikely. We discuss reasons why results using only RNA helices may differ from results using all aligned sites when applying nonhomogeneous models to RNA genes.  相似文献   

13.
This article proposes a novel approach to statistical alignment of nucleotide sequences by introducing a context dependent structure on the substitution process in the underlying evolutionary model. We propose to estimate alignments and context dependent mutation rates relying on the observation of two homologous sequences. The procedure is based on a generalized pair-hidden Markov structure, where conditional on the alignment path, the nucleotide sequences follow a Markov distribution. We use a stochastic approximation expectation maximization (saem) algorithm to give accurate estimators of parameters and alignments. We provide results both on simulated data and vertebrate genomes, which are known to have a high mutation rate from CG dinucleotide. In particular, we establish that the method improves the accuracy of the alignment of a human pseudogene and its functional gene.  相似文献   

14.
Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach involves adding only two extra parameters to a standard distribution, a wide range of rate distributions can be captured. Using simulated data, we demonstrate that a "beta-" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta- model can fit as well or better than a model with multiple discrete rate categories, and compares favorably with a model which fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.  相似文献   

15.
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, the rest of the proteins on the binding surfaces also have very different substitution rates from residues. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.  相似文献   

16.
MOTIVATION: The evolution of viruses is very rapid and in addition to local point mutations (insertion, deletion, substitution) it also includes frequent recombinations, genome rearrangements and horizontal transfer of genetic materials (HGTS). Evolutionary analysis of viral sequences is therefore a complicated matter for two main reasons: First, due to HGTs and recombinations, the right model of evolution is a network and not a tree. Second, due to genome rearrangements, an alignment of the input sequences is not guaranteed. These facts encourage developing methods for inferring phylogenetic networks that do not require aligned sequences as input. RESULTS: In this work, we present the first computational approach which deals with both genome rearrangements and horizontal gene transfers and does not require a multiple alignment as input. We formalize a new set of computational problems which involve analyzing such complex models of evolution. We investigate their computational complexity, and devise algorithms for solving them. Moreover, we demonstrate the viability of our methods on several synthetic datasets as well as four biological datasets. AVAILABILITY: The code is available from the authors upon request.  相似文献   

17.
Chen H  Kihara D 《Proteins》2008,71(3):1255-1274
The error in protein tertiary structure prediction is unavoidable, but it is not explicitly shown in most of the current prediction algorithms. Estimated error of a predicted structure is crucial information for experimental biologists to use the prediction model for design and interpretation of experiments. Here, we propose a method to estimate errors in predicted structures based on the stability of the optimal target-template alignment when compared with a set of suboptimal alignments. The stability of the optimal alignment is quantified by an index named the SuboPtimal Alignment Diversity (SPAD). We implemented SPAD in a profile-based threading algorithm and investigated how well SPAD can indicate errors in threading models using a large benchmark dataset of 5232 alignments. SPAD shows a very good correlation not only to alignment shift errors but also structure-level errors, the root mean square deviation (RMSD) of predicted structure models to the native structures (i.e. global errors), and local errors at each residue position. We have further compared SPAD with seven other quality measures, six from sequence alignment-based measures and one atomic statistical potential, discrete optimized protein energy (DOPE), in terms of the correlation coefficient to the global and local structure-level errors. In terms of the correlation to the RMSD of structure models, when a target and a template are in the same SCOP family, the sequence identity showed a best correlation to the RMSD; in the superfamily level, SPAD was the best; and in the fold level, DOPE was best. However, in a head-to-head comparison, SPAD wins over the other measures. Next, SPAD is compared with three other measures of local errors. In this comparison, SPAD was best in all of the family, the superfamily and the fold levels. Using the discovered correlation, we have also predicted the global and local error of our predicted structures of CASP7 targets by the SPAD. Finally, we proposed a sausage representation of predicted tertiary structures which intuitively indicate the predicted structure and the estimated error range of the structure simultaneously.  相似文献   

18.
The limits on maximum information that can be transferred by single neurons may help us to understand how sensory and other information is being processed in the brain. According to the efficient-coding hypothesis (Barlow, Sensory Comunication, MIT press, Cambridge, 1961), neurons are adapted to the statistical properties of the signals to which they are exposed. In this paper we employ methods of information theory to calculate, both exactly (numerically) and approximately, the ultimate limits on reliable information transmission for an empirical neuronal model. We couple information transfer with the metabolic cost of neuronal activity and determine the optimal information-to-metabolic cost ratios. We find that the optimal input distribution is discrete with only six points of support, both with and without a metabolic constraint. However, we also find that many different input distributions achieve mutual information close to capacity, which implies that the precise structure of the capacity-achieving input is of lesser importance than the value of capacity.  相似文献   

19.
A hidden Markov model for progressive multiple alignment   总被引:4,自引:0,他引:4  
MOTIVATION: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. RESULTS: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus, progressively constructing a multiple alignment. Our method allows for the computation of each column minimum posterior probability and we show that this value correlates with the correctness of the result, hence, providing an efficient mean by which unreliably aligned columns can be filtered out from a multiple alignment.  相似文献   

20.
Fast, optimal alignment of three sequences using linear gap costs   总被引:2,自引:0,他引:2  
Alignment algorithms can be used to infer a relationship between sequences when the true relationship is unknown. Simple alignment algorithms use a cost function that gives a fixed cost to each possible point mutation-mismatch, deletion, insertion. These algorithms tend to find optimal alignments that have many small gaps. It is more biologically plausible to have fewer longer gaps rather than many small gaps in an alignment. To address this issue, linear gap cost algorithms are in common use for aligning biological sequence data. More reliable inferences are obtained by aligning more than two sequences at a time. The obvious dynamic programming algorithm for optimally aligning k sequences of length n runs in O(n(k)) time. This is impractical if k>/=3 and n is of any reasonable length. Thus, for this problem there are many heuristics for aligning k sequences, however, they are not guaranteed to find an optimal alignment. In this paper, we present a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs. This gives the same results as the dynamic programming algorithm for three sequences, but typically does so much more quickly. It is particularly fast when the (three-way) edit distance is small. Our algorithm uses a speed-up technique based on Ukkonen's greedy algorithm (Ukkonen, 1983) which he presented for two sequences and simple costs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号