Similar literature
1.
Hidden Markov models (HMMs) are a class of stochastic models that have proven to be powerful tools for the analysis of molecular sequence data. A hidden Markov model can be viewed as a black box that generates sequences of observations. The unobservable internal state of the box is stochastic and is determined by a finite state Markov chain. The observable output is stochastic with distribution determined by the state of the hidden Markov chain. We present a Bayesian solution to the problem of restoring the sequence of states visited by the hidden Markov chain from a given sequence of observed outputs. Our approach is based on a Monte Carlo Markov chain algorithm that allows us to draw samples from the full posterior distribution of the hidden Markov chain paths. The problem of estimating the probability of individual paths and the associated Monte Carlo error of these estimates is addressed. The method is illustrated by considering a problem of DNA sequence multiple alignment. The special structure for the hidden Markov model used in the sequence alignment problem is considered in detail. In conclusion, we discuss certain interesting aspects of biological sequence alignments that become accessible through the Bayesian approach to HMM restoration.

2.
The most commonly used models for analysing local dependencies in DNA sequences are (high-order) Markov chains. Incorporating knowledge about possible groupings of the nucleotides makes it possible to define dedicated sub-classes of Markov chains. The problem of formulating lumpability hypotheses for a Markov chain is therefore addressed. In the classical approach to lumpability, this problem can be formulated as the determination of an appropriate state space (smaller than the original state space) such that the lumped chain defined on this state space retains the Markov property. We propose a different perspective on lumpability where the state space is fixed and the partitioning of this state space is represented by a one-to-many probabilistic function within a two-level stochastic process. Three nested classes of lumped processes can be defined in this way as sub-classes of first-order Markov chains. These lumped processes enable parsimonious reparameterizations of Markov chains that help to reveal relevant partitions of the state space. Characterizations of the lumped processes on the original transition probability matrix are derived. Different model selection methods relying either on hypothesis testing or on penalized log-likelihood criteria are presented, as well as extensions to lumped processes constructed from high-order Markov chains. The relevance of the proposed approach to lumpability is illustrated by the analysis of DNA sequences. In particular, the use of lumped processes makes it possible to highlight differences between intronic sequences and gene untranslated region sequences.

3.
This paper proposes a graphical method for detecting interspecies recombination in multiple alignments of DNA sequences. A fixed-size window is moved along a given DNA sequence alignment. For every position, the marginal posterior probability over tree topologies is determined by means of a Markov chain Monte Carlo simulation. Two probabilistic divergence measures are plotted along the alignment, and are used to identify recombinant regions. The method is compared with established detection methods on a set of synthetic benchmark sequences and two real-world DNA sequence alignments.

4.
The Wiener method of nonlinear system identification is extended to systems with a Markov chain input. Multivariate functionals are constructed that are orthonormal with respect to the probability measure of the Markov input. Any system operating on a Markov chain may be represented by an orthogonal expansion in these functionals. The coefficients of the orthogonal expansion may be evaluated by cross-correlation. Application of this technique to nonlinear neural systems with a Markov action-potential input is discussed.

5.
A Bayesian approach to DNA sequence segmentation   Total citations: 3, self-citations: 0, citations by others: 3
Boys RJ, Henderson DA. Biometrics 2004, 60(3):573-581
Many deoxyribonucleic acid (DNA) sequences display compositional heterogeneity in the form of segments of similar structure. This article describes a Bayesian method that identifies such segments by using a Markov chain governed by a hidden Markov model. Markov chain Monte Carlo (MCMC) techniques are employed to compute all posterior quantities of interest and, in particular, allow inferences to be made regarding the number of segment types and the order of Markov dependence in the DNA sequence. The method is applied to the segmentation of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.

6.
A kinetic model for the synthesis of proteins in prokaryotes is presented and analysed. This model is based on a Markov model for the state of the DNA strand encoding the protein. The states that the DNA strand can occupy are: ready, repressed, or having an mRNA chain of length i in the process of being completed. The case i = 0 corresponds to the RNA polymerase attached, but no nucleotides attached to the chain. The Markov model consists of differential equations for the rates of change of the probabilities. The rate of production of the mRNA molecules is equal to the probability that the chain is assembled to the penultimate nucleotide, times the rate at which that nucleotide is attached. Similarly, the mRNA molecules can also be in different states, including: ready and having an amino acid chain of length j attached. The rate of protein synthesis is the rate at which the chain is completed. A Michaelis-Menten type of analysis is done, assuming that the rate of protein degradation determines the ‘slow’ time, and that all the other kinetic rates are ‘fast’. In the self-regulated case, this results in a single ordinary differential equation for the protein concentration.

7.
Ionizing radiation damage to a mammalian genome is modeled using continuous time Markov chains. Models are given for the initial infliction of DNA double strand breaks by radiation and for the enzymatic processing of this initial damage. Damage processing pathways include DNA double strand break repair and chromosome exchanges. Linear, saturable, or inducible repair is considered, competing kinetically with pairwise interactions of the DNA double strand breaks. As endpoints, both chromosome aberrations and the inability of cells to form clones are analyzed. For the post-irradiation behavior, using the discrete time Markov chain embedded at transitions gives the ultimate distribution of damage more simply than does integrating the Kolmogorov forward equations. In a representative special case explicit expressions for the probability distribution of damage at large times are given in the form used for numerical computations and comparisons with experiments on human lymphocytes. A principle of branching ratios, that late assays can only measure appropriate ratios of repair and interaction functions, not the functions themselves, is derived and discussed. This work was supported in # DMS-9025103

8.
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast to the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
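The per-position measure described in this abstract (negative logarithm of the probability estimate, averaged into bits per base) can be sketched for a single order-k model. This is only an illustrative sketch, not the authors' implementation: it uses one model rather than multiple competing ones, ignores inverted repeats, and assumes an additive-smoothing estimator whose parameter `alpha` is hypothetical. The function names are also invented for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the sequence contains only these four symbols


def information_profile(seq, k, alpha=1.0):
    """Per-position cost in bits under one adaptive order-k finite-context model.

    Counts are updated only after a symbol has been scored, so each estimate
    depends on the past alone. `alpha` is a hypothetical smoothing parameter.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - k):i]  # shorter contexts at the sequence start
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))   # negative log of the probability estimate
        ctx_counts[symbol] += 1
    return profile


def bits_per_base(seq, k, alpha=1.0):
    """Average of the information profile over the whole sequence."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile)
```

With no counts yet, every symbol is estimated at probability 1/4, so the first position always costs 2 bits; a highly repetitive sequence then drives the average well below 2 bits per base.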

9.
Comparing DNA sequences reveals their degree of similarity, which helps in inferring their relationships in terms of structure, function and evolution. In this paper, we introduce the fundamentals of a weighted relative entropy based on a 2-step Markov model for comparing DNA sequences. A DNA sequence, consisting of the four characters A, T, C, G, can be considered as a Markov chain. By taking the state space I = {A, T, C, G} and describing the DNA sequences with a 2-step transition probability matrix, we can get the eigenvalue of the DNA sequence to define the similarity metric. We thus obtain a new method for comparing DNA sequences, which is used to classify chromosomal DNA sequences obtained from 30 species. The phylogenetic tree built by this alignment-free method from the distance matrix derived from the weighted relative entropy shows a clearer and more accurate division.
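As a rough illustration of the first step in this abstract, the sketch below estimates a one-step transition matrix over I = {A, T, C, G} from a sequence and squares it. Reading "2-step transition probability matrix" as the matrix product P·P is an assumption, as are the function names and the uniform fallback for unseen states; the weighted relative entropy and the eigenvalue-based metric are not specified in the abstract and are not reproduced here.

```python
STATES = "ATCG"  # state space I = {A, T, C, G}


def transition_matrix(seq):
    """Maximum-likelihood one-step transition matrix estimated from `seq`."""
    index = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(seq, seq[1:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: a state with no observed transitions gets a uniform row.
        matrix.append([c / total for c in row] if total else [1 / n] * n)
    return matrix


def two_step_matrix(seq):
    """Two-step transition matrix, taken here to be P multiplied by itself."""
    p = transition_matrix(seq)
    n = len(p)
    return [[sum(p[i][k] * p[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For the periodic sequence "ATCGATCGA", each base is always followed by the same successor, so two steps from A lead to C with probability 1.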

10.
Crowley EM. Biopolymers 2001, 58(2):165-174
A goal of the human genome project is to determine the entire sequence of DNA (3 × 10^9 base pairs) found in chromosomes. The massive amounts of data produced by this project require interpretation. A Bayesian model is developed for locating regulatory regions in a DNA sequence. Regulatory regions are areas of DNA to which specific proteins bind and control whether or not a gene is transcribed to produce templates for protein synthesis. Each human cell contains the same DNA sequence. Thus the particular function of different cells is determined by the genes that are transcribed in that cell. A hidden Markov chain is used to model whether a small interval of the DNA is in a regulatory region or not. This can be regarded as a changepoint problem where the changepoints are the start of a regulatory or nonregulatory region. The data consist of protein-binding elements, which are short subsequences, or "words," in the DNA sequence. Although these words can occur anywhere in the sequence, a larger number are expected in regulatory regions. Therefore, regulatory regions are detected by locating clusters of words. For a particular DNA sequence, the model automatically selects those words that best predict regions of interest. Markov chain Monte Carlo methods are used to explore the posterior distribution of the hidden Markov chain. The model is tested by means of simulations, and applied to several DNA sequences.

11.
Exact formulas for the mean and variance of the proportion of different types in a fixed generation of a multi-type Galton-Watson process are derived. The formulas are given in terms of iterates of the probability generating function of the offspring distribution. It is also shown that the sequence of types backwards from a randomly sampled particle in a fixed generation is a non-homogeneous Markov chain where the transition probabilities can be given explicitly, again in terms of probability generating functions. Two biological applications are considered: mutations in mitochondrial DNA and the polymerase chain reaction. Received: 10 June 2001 / Revised version: 21 November 2001 / Published online: 23 August 2002. Mathematics Subject Classification (2000): Primary 60J80, Secondary 92D10, 92D25. Key words or phrases: Multi-type Galton-Watson process; sampling formula; PCR; mitochondrial DNA

12.
Liu L, Ho YK, Yau S. DNA and Cell Biology 2007, 26(7):477-483
The inhomogeneous Markov chain model is used to discriminate acceptor and donor sites in genomic DNA sequences. It outperforms statistical methods such as the homogeneous Markov chain model and higher-order and interpolated Markov chain models, as well as machine-learning methods such as k-nearest neighbor and support vector machines. Besides its high accuracy, another advantage of the inhomogeneous Markov chain model is its computational simplicity. In the three-state system (acceptor, donor, and neither), the inhomogeneous Markov chain model is combined with a three-layer feed-forward neural network. Using this combined system, 3175 primate splice-junction gene sequences have been tested, with a prediction accuracy of greater than 98%.

13.
It has been suggested that when an organism is exposed to ionizing radiation the initial damage results from the occurrence of ionization in a so-called sensitive volume due to absorption of radiation quanta. The initial radiation damage is then transmitted or amplified to a level of macroscopic perception. In this paper a mechanism by which this transmission may take place and a finite Markov chain model applicable to this transmission are postulated and discussed. This mechanism is assumed to be the depolymerization of essential chain molecules which are connected to some “central group” associated with the sensitive volume. The depolymerization of the macromolecules following a hit in the sensitive volume is postulated to be determined by a chain mechanism, which acts in a manner inverse to the mechanism controlling the polymerization process. A mathematical study of this problem is made using the theory of Markov chains. The probability of complete degradation of the chain macromolecule, and the probability of recombination of the units to give the intact chain, were determined, assuming that the probabilities of successive steps in the degradation increase linearly from the intact state to that of complete breakdown.

14.
A Bayesian framework for the analysis of cospeciation   Total citations: 8, self-citations: 0, citations by others: 8
Information on the history of cospeciation and host switching for a group of host and parasite species is contained in the DNA sequences sampled from each. Here, we develop a Bayesian framework for the analysis of cospeciation. We suggest a simple model of host switching by a parasite on a host phylogeny in which host switching events are assumed to occur at a constant rate over the entire evolutionary history of associated hosts and parasites. The posterior probability density of the parameters of the model of host switching is evaluated numerically using Markov chain Monte Carlo. In particular, the method generates the probability density of the number of host switches and of the host switching rate. Moreover, the method provides information on the probability that an event of host switching is associated with a particular pair of branches. A Bayesian approach has several advantages over other methods for the analysis of cospeciation. In particular, it does not assume that the host or parasite phylogenies are known without error; many alternative phylogenies are sampled in proportion to their probability of being correct.

15.
MOTIVATION: We present a statistical method for detecting recombination, whose objective is to accurately locate the recombinant breakpoints in DNA sequence alignments of small numbers of taxa (4 or 5). Our approach explicitly models the sequence of phylogenetic tree topologies along a multiple sequence alignment. Inference under this model is done in a Bayesian way, using Markov chain Monte Carlo (MCMC). The algorithm returns the site-dependent posterior probability of each tree topology, which is used for detecting recombinant regions and locating their breakpoints. RESULTS: The method was tested on one synthetic and three real DNA sequence alignments, where it was found to outperform the established detection methods PLATO, RECPARS, and TOPAL.

16.
17.
Ewing G, Nicholls G, Rodrigo A. Genetics 2004, 168(4):2407-2420
We present a Bayesian statistical inference approach for simultaneously estimating mutation rate, population sizes, and migration rates in an island-structured population, using temporal and spatial sequence data. Markov chain Monte Carlo is used to collect samples from the posterior probability distribution. We demonstrate that this chain implementation successfully reaches equilibrium and recovers truth for simulated data. A real HIV DNA sequence data set with two demes, semen and blood, is used as an example to demonstrate the method by fitting asymmetric migration rates and different population sizes. This data set exhibits a bimodal joint posterior distribution, with modes favoring different preferred migration directions. This full data set was subsequently split temporally for further analysis. Qualitative behavior of one subset was similar to the bimodal distribution observed with the full data set. The temporally split data showed significant differences in the posterior distributions and estimates of parameter values over time.

18.
Fuchs C. Gene 1980, 10(4):371-373
Several Markov chain models (up to fourth order) have been fitted to the sequences of the seven DNAs presented in Fuchs et al. (1980). Two methods for determining the order of the Markov chain are applied to the data. The two methods lead to different conclusions and we discuss these discrepancies. When the distribution of the nucleotides in a DNA sequence is investigated, it is suggested that the study on the order of the Markov model should be supplemented with additional analysis.
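The abstract does not say which two order-determination methods were used. Purely as an illustration of what choosing a Markov order up to four can look like, the sketch below compares orders with the Bayesian information criterion (BIC); the criterion, the function names, and the choice to condition every model on the first `max_order` symbols are all assumptions, not the paper's procedure.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order, start):
    """Maximized log-likelihood of an order-`order` chain for seq[start:].

    All orders are scored on the same positions (from `start` onwards) so
    that their likelihoods are comparable.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - order:i]
        pair_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    return sum(n * math.log(n / context_counts[ctx])
               for (ctx, _), n in pair_counts.items())


def bic(seq, order, max_order, alphabet_size=4):
    """BIC = -2 log L + (number of free parameters) * log(number of observations)."""
    n_obs = len(seq) - max_order
    n_params = (alphabet_size ** order) * (alphabet_size - 1)
    log_lik = markov_log_likelihood(seq, order, max_order)
    return -2 * log_lik + n_params * math.log(n_obs)


def select_order(seq, max_order=4):
    """Return the order in 0..max_order with the smallest BIC."""
    return min(range(max_order + 1), key=lambda k: bic(seq, k, max_order))
```

A different criterion, such as a likelihood-ratio test or AIC, penalizes parameters differently and can select a different order on the same data, which is the kind of discrepancy the abstract reports.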

20.
Markov chain Monte Carlo (MCMC) has recently gained use as a method of estimating required probability and likelihood functions in pedigree analysis, when exact computation is impractical. However, when a multiallelic locus is involved, irreducibility of the constructed Markov chain, an essential requirement of the MCMC method, may fail. Solutions proposed by several researchers, which do not identify all the noncommunicating sets of genotypic configurations, are inefficient with highly polymorphic loci. This is a particularly serious problem in linkage analysis, because highly polymorphic markers are much more informative and thus are preferred. In the present paper, we describe an algorithm that finds all the noncommunicating classes of genotypic configurations on any pedigree. This leads to a more efficient method of defining an irreducible Markov chain. Examples, including a pedigree from a genetic study of familial Alzheimer disease, are used to illustrate how the algorithm works and how penetrances are modified for specific individuals to ensure irreducibility.
