首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 363 毫秒
1.
Summary Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.Offprint requests to: L. Allison  相似文献   

2.
Reconstruction of strings past   总被引:1,自引:0,他引:1  
A major use of string–alignment algorithms is to comparemacro molecules that are thought to have evolved from a commonancestor to estimate the duration of, or the amount of mutationin, their separate evolution and to infer as much as possibleabout their most recent common ancestor. Minimum message lengthencoding, a method of inductive inference, is applied to thestnng-oJignment pmblem. it leads to an alignment method thataverages over all alignments in a weighted fashion. Esperimentsindicates that this method can recover the actual parametersof evolution with high accuracy and over a wide range of values,whereas the use of a single optimal alignment gives biased results.  相似文献   

3.
Efficient methods for multiple sequence alignment with guaranteed error bounds   总被引:11,自引:0,他引:11  
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and given two computationaly efficient multiple alignment methods (one for each measure) whose deviation from the optimal value isguaranteed to be less than a factor of two. This is the novel feature of these methods, but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obviouslower bound on the value of the optimal alignment.  相似文献   

4.
Wu CH  Drummond AJ 《Genetics》2011,188(1):151-164
We provide a framework for Bayesian coalescent inference from microsatellite data that enables inference of population history parameters averaged over microsatellite mutation models. To achieve this we first implemented a rich family of microsatellite mutation models and related components in the software package BEAST. BEAST is a powerful tool that performs Bayesian MCMC analysis on molecular data to make coalescent and evolutionary inferences. Our implementation permits the application of existing nonparametric methods to microsatellite data. The implemented microsatellite models are based on the replication slippage mechanism and focus on three properties of microsatellite mutation: length dependency of mutation rate, mutational bias toward expansion or contraction, and number of repeat units changed in a single mutation event. We develop a new model that facilitates microsatellite model averaging and Bayesian model selection by transdimensional MCMC. With Bayesian model averaging, the posterior distributions of population history parameters are integrated across a set of microsatellite models and thus account for model uncertainty. Simulated data are used to evaluate our method in terms of accuracy and precision of estimation and also identification of the true mutation model. Finally we apply our method to a red colobus monkey data set as an example.  相似文献   

5.
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.  相似文献   

6.
The re‐analysis of mtDNA sequence data on sea bass (Dicentrarchus labrax) in the north‐east Atlantic revealed the presence of a ‘slippage error’ in the alignment of the mitochondrial DNA (mtDNA) control region sequences. This induced an overestimation of the genetic variability within this species, and hence the inference of a striking multi‐clade partitioning of D. labrax populations within this area. After correction, the existence of highly distinct D. labrax haplogroups in Atlantic areas does not hold anymore, but the robust dichotomy between a single Mediterranean and a single Atlantic subgroup is confirmed. Here we present the new results related to the amended mtDNA control region alignment, and also summarize the key message in relation to the microsatellite data, which are unaffected by this revision. © 2012 The Linnean Society of London, Biological Journal of the Linnean Society, 2012, 106 , 455–458.  相似文献   

7.
A chaos dynamics model of inductive inference is proposed. Inductive inference is a process to predict the future from a finite length of a complex sequence of data. Chaos dynamics is able to generate a complex sequence of data governed by a deterministic rule. A certain deterministic rule can be assigned to a finite length of any complex sequence of data in reverse.Once a deterministic rule is assigned to this finite length of complex data, the rule will in turn generate another complex sequence of data in the future. This process of generating data in the future is equivalent to future prediction. The overall process of assigning a deterministic rule and generating data in the future, therefore, is considered to be an inductive inference process.This chaos dynamics model may provide an effective concept for theorizing inductive inference. It is also suggestive of future realization of an inductive inference machine simulating the information processing of the brain.  相似文献   

8.
A protein is defined as an indexed string of elements at each level in the hierarchy of protein structure: sequence, secondary structure, super-secondary structure, etc. The elements, for example, residues or secondary structure segments such as helices or beta-strands, are associated with a series of properties and can be involved in a number of relationships with other elements. Element-by-element dissimilarity matrices are then computed and used in the alignment procedure based on the sequence alignment algorithm of Needleman & Wunsch, expanded by the simulated annealing technique to take into account relationships as well as properties. The utility of this method for exploring the variability of various aspects of protein structure and for comparing distantly related proteins is demonstrated by multiple alignment of serine proteinases, aspartic proteinase lobes and globins.  相似文献   

9.
A generalized single-step stepwise mutation model (SMM) is developed that takes into account an arbitrary initial state to a certain partial difference equation. This is solved in both the approximate continuum limit and the more exact discrete form. A time evolution model is developed for Y DNA or mtDNA that takes into account the reflective boundary modeling minimum microsatellite length and the original difference equation. A comparison is made between the more widely known continuum Gaussian model and a discrete model, which is based on modified Bessel functions of the first kind. A correction is made to the SMM model for the probability that two individuals are related that takes into account a reflecting boundary modeling minimum microsatellite length. This method is generalized to take into account the general n-step model and exact solutions are found. A new model is proposed for the step distribution.  相似文献   

10.
Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features-that is, helices and strands of sheet-by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist automatically to identify secondary structure, the impreciseness of definitions, along with errors and inconsistencies in experimental structure data, drastically limit their applicability to generate reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure. AVAILABILITY: http://www.csse.monash.edu.au/~karun/pmml.  相似文献   

11.
Sample size for a phylogenetic inference.   总被引:1,自引:0,他引:1  
The objective of this work is to describe sample-size calculations for the inference of a nonzero central branch length in an unrooted four-species phylogeny. Attention is restricted to independent binary characters, such as might be obtained from an alignment of the purine-pyrimidine sequences of a nucleic acid molecule. A statistical test based on a multinomial model for character-state configurations is described. The importance of including invariable sites in models for sequence change is demonstrated, and their effect on sample size is quantified. The methods are applied to a four-species alignment of small-subunit rRNA sequences derived from two archaebacteria, a eubacteria and a eukaryote. We conclude that the information in these sequences is not sufficient to resolve the branching order of this tree. Estimates of the number of aligned nucleotide positions required to provide a reasonably powerful test are given.  相似文献   

12.
BEAST: Bayesian evolutionary analysis by sampling trees   总被引:2,自引:0,他引:2  

Background  

The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented.  相似文献   

13.
14.
Molecular sequence data are often aligned on the basis of secondary and/or tertiary structure models. However, these models are regularly updated and sometimes differ depending on the way in which they were constructed. We examined whether the choice of a particular 18S rRNA secondary structure model as alignment basis influences phylogeny inference. We therefore compared 18S rRNA phylogenies derived from alignments based on different models. We used: 1. Maximum parsimony; 2. The neighbour-joining method; 3. The maximum-likelihood approach; and 4. Evolutionary parsimony. This demonstrated that the secondary structure model on which an alignment is based may influence: 1. The tree topologies found by these four methods; 2. The numbers of most parsimonious trees found; and 3. The statistical values calculated by the evolutionary parsimony method.  相似文献   

15.
Summary We describe an algorithm for the concurrent comparison of three or more amino acid sequences. The basis of the approach is a progressive evaluation of selected segments from each sequence. Only a small subset of all possible segments from each sequence is compared, and a minimum of information is retained for the trace-back of the alignment. As a result, this method has the advantage of being both rapid and minimally consumptive of computer memory when constructing an alignment. This being the case, there are no practical limits on the length of sequences that may be aligned. A computer program for the alignment of three sequences is described, and this method is compared with two three-sequence extensions of the Needleman and Wunsch variety, including a recently published approach. In addition, we have made simultaneous alignments of sets of four and five sequences with this selected-segment method.  相似文献   

16.
A model for the control of mitosis is presented and, along with four other models described previously, is tested by the response of Physarum polycephalum to UV irradiation. Plasmodia were irradiated following the second mitosis (MII) after fusion of microplasmodia. As shown by other authors, the onset of the next mitosis (MIII) was delayed but the period MIII–MIV was shortened relative to control plasmodia. It is shown that the period MIII–MIV cannot be shortened beyond a minimum of 6 h despite increasing doses of UV. This minimum length is shown to be relatively independent of growth rate. If conditions were such that the length of MIII–MIV was shortened to this minimum value the length of MIV–MV was also shorter than the corresponding control period. If the period MII–MIV was longer than the minimum following irradiation then the length of MIV–MV was not shortened. It is argued that only the latter situation allows models to be tested and it is shown how the observed result is consistent with only two of the five models considered. A further test compared the length of MIII–MIV under these conditions with that predicted from the amount of DNA destroyed by the UV. This result was consistent only with the same two models.  相似文献   

17.
To infer a phylogenetic tree from a set of DNA sequences, typically a multiple alignment is first used to obtain homologous bases. The inferred phylogeny can be very sensitive to how the alignment was created. We develop tools for analyzing the robustness of phylogeny to perturbations in alignment parameters in the NW algorithm. Our main tool is parametric alignment, with novel improvements that are of general interest in parametric inference. Using parametric alignment and a Gaussian distribution on alignment parameters, we derive probabilities of optimal alignment summaries and inferred phylogenies. We apply our method to analyze intronic sequences from Drosophila flies. We show that phylogeny estimates can be sensitive to the choice of alignment parameters, and that parametric alignment elucidates the relationship between alignment parameters and reconstructed trees.  相似文献   

18.
Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment rather than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction.  相似文献   

19.
Similarity problems intensively investigated in computational molecular biology have the following two stringology models: find the longest string included in any string of a given finite language, and find the shortest string including every string of a given finite language. These two problems are exemplified by the two well-known pairs of problems, the longest common subsequence (or substring) problem and the shortest common supersequence (or superstring) problem. interpretations.

In this paper we consider opposite problems connected with string non-inclusion relations: find the shortest string included in no string of a given finite language and find the longest string including no string of a given finite language. The predicate “string is not included in string β” is interpreted either as “ is not a subsequence of β” or as “ is not a substring of β”. The main purpose is to determine the complexity status of the non-similarity problems. Using graph approaches, we present NP-hardness proofs for the first interpretation and polynomial-time algorithms for the second one. Special cases of the problems, and related issues are discussed.  相似文献   


20.
In the single-particle tracking experiment, the internal motion of a single DNA or polymer molecule whose one end is attached to a microsphere (optical marker) and the other end is anchored to a substratum is studied (Finzi and Gelles, 1995). The stochastic Brownian dynamics of the sphere reflect the spontaneous fluctuations, thus the physical characteristics, of the DNA or polymer molecule (Qian and Elson, 1999, Qian, 2000). In this paper, two continuous models of polymer molecules, a flexible elastic string and a weakly bentable elastic rod, are analyzed. Both models are cast mathematically in terms of linear stochastic differential equations. Based on Fourier analyses, we calculate the mean square displacement (MSD) of the particle motion, the key observable in the experiment. We obtain for both models the short-time asymptotics for the MSD, as well as the long-time behavior in terms of the smallest non-zero eigenvalues. It is shown that: (i) the long-time dynamics of continuous elastic string model quantitatively agree with that of the discrete bead-spring model. (ii) The short-time MSD of both models are controlled by the tethered particle, with linear dependence on t. (iii) The two models show characteristic difference for long-time behavior: The longest relaxation time is proportional to L 2 for long elastic string and to L for short elastic string, but is proportional to L 4 for both long and short weakly bentable rod. Received: 26 March 1998 / Revised version: 9 June 2000 / Published online: 14 September 2000  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号