Similar articles
Found 20 similar articles (search time: 15 ms)
1.
In searching for strong homologies between multiple nucleic acid or protein sequences, researchers commonly look at fixed-length segments in common to the sequences. Such homologies form the foundation of segment-based algorithms for multiple alignment of protein sequences. The researcher uses settings of “unusualness of multiple matches” to calibrate the algorithms. In applications where a researcher has found a multiple matching word, statistical significance helps gauge the unusualness of the observed match. Previous approximations for the unusualness of multiple matches are based on large sample theory, and are sometimes quite inaccurate. Section 2 illustrates this inaccuracy, and provides accurate approximations for the probability of a common word in R out of R sequences. Section 3 generalizes the approximation to multiple matching in R out of S sequences. Section 4 describes a more complex approximation that incorporates exact probabilities and yields excellent accuracy; this approximation is useful for checking the simpler approximations over a range of values.

2.
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
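The statistic in this abstract, the length of the longest word common to two sequences, can be computed directly, so any approximate distribution can be checked by simulation. The following is a minimal sketch only: the function names, the uniform independent base model, and all parameters are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word occurring in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: match length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the diagonal run
                best = max(best, cur[j])
        prev = cur
    return best


def simulated_distribution(n: int, m: int, trials: int, seed: int = 0) -> dict:
    """Empirical distribution of the statistic for random sequences.

    Assumption: bases are independent and uniform over A, C, G, T.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        counts[longest_common_word(a, b)] += 1
    return {length: c / trials for length, c in sorted(counts.items())}
```

An observed longest match can then be compared with the upper tail of `simulated_distribution(n, m, trials)` as a rough significance check.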

3.
Probability of fixation of an advantageous mutant in a viral quasispecies   Cited by: 7 (self-citations: 0, citations by others: 7)
Wilke CO. Genetics, 2003, 163(2): 467-474
The probability that an advantageous mutant rises to fixation in a viral quasispecies is investigated in the framework of multitype branching processes. Whether fixation is possible depends on the overall growth rate of the quasispecies that will form if invasion is successful rather than on the individual fitness of the invading mutant. The exact fixation probability can be calculated only if the fitnesses of all potential members of the invading quasispecies are known. Quasispecies fixation has two important characteristics: First, a sequence with negative selection coefficient has a positive fixation probability as long as it has the potential to grow into a quasispecies with an overall growth rate that exceeds that of the established quasispecies. Second, the fixation probabilities of sequences with identical fitnesses can nevertheless vary over many orders of magnitude. Two approximations for the probability of fixation are introduced. Both approximations require only partial knowledge about the potential members of the invading quasispecies. The performance of these two approximations is compared to the exact fixation probability on a network of RNA sequences with identical secondary structure.

4.
The exact distribution of word counts in random sequences and several approximations have been proposed in the past few years. The exact distribution has no theoretical limit but may require prohibitive computation time. On the other hand, approximate distributions can be rapidly calculated but, in practice, are only accurate under specific conditions. After making a survey of these distributions, we compare them according to both their accuracy and computational cost. Rules are suggested for choosing between Gaussian approximations, compound Poisson approximation, and exact distribution. This work is illustrated with the detection of exceptional words in the phage Lambda genome.

5.
The calculation of the survival probability of a selectively advantageous allele is a central part of the quantitative theory of genetic evolution. However, several areas of investigation in population genetics theory, including the generalized neutrality theory, the concept of Muller's ratchet, and the risk of extinction of sexually reproducing populations due to the accumulation of deleterious mutations, rely on the calculation of the survival probability of selectively disadvantageous mutant genes. The calculation of these probabilities in the standard Wright-Fisher model of genetic evolution appears to be intractable, and yet is a key element in the above investigations. In this paper we find bounds for the fixation probability of deleterious and advantageous additive mutants, as well as finding close approximations for these probabilities. In addition, we derive analytical estimates for the relative error of our approximations and compare our results with those from numerical computation. Our results justify the diffusion approximation for the fixation probability of a single mutant.
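For orientation, the diffusion approximation mentioned at the end of this abstract is commonly written in Kimura's form, u = (1 - e^{-2s}) / (1 - e^{-4Ns}), for a single new additive mutant at initial frequency 1/(2N). The sketch below evaluates that formula. The parametrization (diploid population of size N, effective size equal to N, selection coefficient s) is an assumption made for illustration; the paper's own bounds and notation may differ.

```python
import math


def kimura_fixation_probability(N: int, s: float) -> float:
    """Diffusion approximation for fixation of a single new additive mutant.

    Assumptions (not from the abstract): diploid population of size N,
    effective size equal to N, initial frequency 1/(2N).
    """
    if s == 0.0:
        return 1.0 / (2 * N)  # neutral limit of the formula
    # (1 - exp(-2s)) / (1 - exp(-4Ns)); expm1 keeps precision for small s.
    # Note: math.expm1 raises OverflowError when -4*N*s exceeds roughly 709.
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * N * s)
```

For a weakly advantageous mutant in a large population the value is close to 2s, and for a deleterious mutant it is positive but below the neutral value 1/(2N).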

6.
In this paper, we give an overview about the different results existing on the statistical distribution of word counts in a Markovian sequence of letters. Results concerning the number of overlapping occurrences, the number of renewals and the number of clumps will be presented. Counts of single words and also multiple words are considered. Most of the results are approximations as the length of the sequence tends to infinity. We will see that Gaussian approximations switch to (compound) Poisson approximations for rare words. Modeling DNA sequences or proteins by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
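The three counts named in this abstract differ only in how overlapping occurrences of a word are treated. The sketch below illustrates one common reading of the definitions: every occurrence counts as an overlapping occurrence, a renewal is an occurrence that does not overlap the previous renewal, and a clump is a maximal run of mutually chained overlapping occurrences. The function names are hypothetical, and the definitions should be checked against the paper before use.

```python
def occurrence_positions(seq: str, word: str) -> list:
    """Start positions of all (possibly overlapping) occurrences of word."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def count_renewals(positions: list, k: int) -> int:
    """Occurrences kept greedily from the left so that none overlap."""
    count, next_free = 0, 0
    for p in positions:
        if p >= next_free:
            count += 1
            next_free = p + k  # the next renewal must start after this word
    return count


def count_clumps(positions: list, k: int) -> int:
    """Maximal groups of occurrences, each overlapping its predecessor."""
    count, last = 0, None
    for p in positions:
        if last is None or p - last >= k:  # no overlap: a new clump starts
            count += 1
        last = p
    return count
```

For the word `AAA` in `AAAAAA` this gives four overlapping occurrences, two renewals, and one clump, which shows why the three counts need separate approximations for self-overlapping words.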

7.
8.
Outpatient appointment scheduling balances efficiency with access to healthcare services, yet appointment no-shows, cancellations, and delay are significant barriers to effective healthcare delivery. Patients with longer appointment delay often waste appointments more frequently, prompting a need for greater flexibility in appointment allocation. We present a joint capacity control and overbooking model where a clinic maximizes profits by controlling bookings from two sequential patient classes with different no-show rates. When booking advance requests, the clinic must balance high no-show probability with the probability of subsequent requests at lower waste rates. We show the optimal policy is computationally intensive to derive; therefore, we develop bounds and approximations which we compare via numerical study with the optimal policy as well as policies from practice and previous literature. We find the optimal policy increases profits 17.8% over first-come-first-serve allocation. We develop a simple policy which performs 0.3% below optimal on average. While pure open access can achieve optimality, it performs 23.0% below optimal on average.

9.
In this paper, we propose two metrics to compare DNA and protein sequences based on a Poisson model of word occurrences. Instead of comparing the frequencies of all fixed-length words in two sequences, we consider (1) the probability of ‘generating’ one sequence under the Poisson model estimated from the other; (2) their different expression levels of words. Phylogenetic trees of 25 viruses including SARS-CoVs are constructed to illustrate our approach.

10.
Sequencing by hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all $4^9$ 9-mers and assuming no hybridization errors, only about 400-bases-long sequences can be reconstructed unambiguously. Drmanac et al. (1989) suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper, we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is $1 + O\!\left(\frac{p}{(1-p)^{4}} \cdot \frac{1}{d}\right)$, where d is the average length of subfragments, and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. (1996). Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.
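The notion of a spectrum, and why reconstruction from it can be ambiguous, can be seen on a toy case. The sketch below is illustrative only: the function name is hypothetical, and the two 10-base sequences are a textbook-style example rather than data from the paper.

```python
def spectrum(seq: str, k: int) -> set:
    """The set of all k-mers occurring in seq (multiplicities are ignored)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Two different sequences that share the same 3-mer spectrum, so an
# errorless chip with k = 3 could not tell them apart.
seq_a = "ATGCGTGGCA"
seq_b = "ATGGCGTGCA"
same_spectrum = spectrum(seq_a, 3) == spectrum(seq_b, 3)

# Number of probes on a universal chip holding every 9-mer.
n_probes = 4 ** 9
```

Here `same_spectrum` is true even though the sequences differ, and `n_probes` equals 262144.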

11.
We develop a novel mathematical model for microsatellite mutations during polymerase chain reaction (PCR). Based on the model, we study the first- and second-order moments of the number of repeat units in a randomly chosen molecule after n PCR cycles and their corresponding mean field approximations. We give upper bounds for the approximation errors and show that the approximation errors are small when the mutation rate is low. Based on the theoretical results, we develop a moment estimation method to estimate the mutation rate per-repeat-unit per PCR cycle and the probability of expansion when mutations occur. Simulation studies show that the moment estimation method can accurately recover the true mutation rate and probability of expansion. Finally, the method is applied to experimental data from single-molecule PCR experiments.

12.
When two strings of symbols are aligned it is important to know whether the observed number of matches is better than that expected between two independent sequences with the same frequency of symbols. When strings are of different lengths, nulls need to be inserted in order to align the sequences. One approach is to use simple approximations of sampling with replacement. We describe an algorithm for exactly determining the frequencies of given numbers of matches, sampling without replacement. This does not lead to a simple closed form expression. However, we show examples where sampling with, or without, replacement give very similar results and the simple approach may be adequate for all but the smallest cases.

13.
We study the establishment probability of invaders in stochastically fluctuating environments and the related issue of extinction probability of small populations in such environments, by means of an inhomogeneous branching process model. In the model it is assumed that individuals reproduce asexually during discrete reproduction periods. Within each period, individuals have (independent) Poisson distributed numbers of offspring. The expected numbers of offspring per individual are independently identically distributed over the periods. It is shown that the establishment probability of an invader varies over the reproduction periods according to a stable distribution. We give a method for simulating the establishment probabilities and approximations for the expected establishment probability. Furthermore, we show that, due to the stochasticity of the establishment success over different periods, the expected success of sequential invasions is larger than that of simultaneous invasions and we study the effects of environmental fluctuations on the extinction probability of small populations and metapopulations. The results can easily be generalized to other offspring distributions than the Poisson.
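With Poisson offspring, the probability that a single invader's lineage is still alive after a given run of reproduction periods follows from composing the per-period generating functions f_t(x) = exp(m_t (x - 1)). The sketch below applies that standard branching-process identity. It is not the simulation method of the paper, and the function name and the lognormal environment in the usage note are assumptions made for illustration.

```python
import math


def survival_probability(means: list) -> float:
    """P(lineage of one invader is alive after len(means) periods).

    means[t] is the expected offspring number per individual in period t
    (Poisson offspring assumed). The extinction probability is
    f_1(f_2(...f_T(0))), evaluated from the last period backwards.
    """
    q = 0.0  # extinction probability looking forward from the end
    for m in reversed(means):
        q = math.exp(m * (q - 1.0))  # Poisson generating function at q
    return 1.0 - q
```

In a constant environment with mean 2, `survival_probability([2.0] * 200)` is about 0.797, the classical establishment probability for Poisson(2) offspring. A fluctuating environment can be explored by passing randomly drawn means, for example `[rng.lognormvariate(0.3, 0.5) for _ in range(200)]` with `rng = random.Random(1)`, where the distribution and its parameters are purely hypothetical.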

14.
Probability matching is a classic theory of decision making that was first developed in models of cognition. Posterior probability matching, a variant in which observers match their response probabilities to the posterior probability of each response being correct, is being used increasingly often in models of perception. However, little is known about whether posterior probability matching is consistent with the vast literature on vision and hearing that has developed within signal detection theory. Here we test posterior probability matching models using two tools from detection theory. First, we examine the models’ performance in a two-pass experiment, where each block of trials is presented twice, and we measure the proportion of times that the model gives the same response twice to repeated stimuli. We show that at low performance levels, posterior probability matching models give highly inconsistent responses across repeated presentations of identical trials. We find that practised human observers are more consistent across repeated trials than these models predict, and we find some evidence that less practised observers are more consistent as well. Second, we compare the performance of posterior probability matching models on a discrimination task to the performance of a theoretical ideal observer that achieves the best possible performance. We find that posterior probability matching is very inefficient at low-to-moderate performance levels, and that human observers can be more efficient than is ever possible according to posterior probability matching models. These findings support classic signal detection models, and rule out a broad class of posterior probability matching models for expert performance on perceptual tasks that range in complexity from contrast discrimination to symmetry detection. However, our findings leave open the possibility that inexperienced observers may show posterior probability matching behaviour, and our methods provide new tools for testing for such a strategy.

15.
Representation of sequence similarity by dot matrix plots is a method widely used for comparing biological sequences. The user is presented with an overall view of similarity between two sequences. Computation of this plot has been reconsidered here. An improvement is proposed through the preprocessing of the data into an automaton recognizing the word structure of a sequence. The main advantage of this approach is to systematically eliminate the repetitions during word comparison. Simple heuristics are also considered to greatly speed up pattern matching. As a result, large sequences are handled very efficiently. This is illustrated by a comparison of large genomic DNA. The algorithm has been implemented in an interactive application on a microcomputer.
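The benefit of preprocessing one sequence before computing a dot plot can be illustrated with a simple word index. The sketch below uses a hash-based index rather than the automaton described in the abstract, so it is a stand-in technique rather than the authors' algorithm; function names are hypothetical.

```python
from collections import defaultdict


def word_index(seq: str, k: int) -> dict:
    """Map each k-letter word of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_plot(a: str, b: str, k: int) -> list:
    """All (i, j) with a[i:i+k] == b[j:j+k], i.e. the dots of the plot.

    Each distinct word of b is looked up once per occurrence in the index
    of a, instead of being compared against every position of a.
    """
    index = word_index(a, k)
    dots = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            dots.append((i, j))
    return sorted(dots)
```

For example, `dot_plot("ACGT", "CGTA", 2)` yields the two dots `(1, 0)` and `(2, 1)`, which lie on one diagonal and so mark the shared segment `CGT`.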

16.
Multiplexed high-throughput pyrosequencing is currently limited in complexity (number of samples sequenced in parallel), and in capacity (number of sequences obtained per sample). Physical-space segregation of the sequencing platform into a fixed number of channels allows limited multiplexing, but obscures available sequencing space. To overcome these limitations, we have devised a novel barcoding approach to allow for pooling and sequencing of DNA from independent samples, and to facilitate subsequent segregation of sequencing capacity. Forty-eight forward–reverse barcode pairs are described: each forward and each reverse barcode unique with respect to at least 4 nt positions. With improved read lengths of pyrosequencers, combinations of forward and reverse barcodes may be used to sequence from as many as $n^2$ independent libraries for each set of ‘n’ forward and ‘n’ reverse barcodes, for each defined set of cloning-linkers. In two pilot series of barcoded sequencing using the GS20 Sequencer (454/Roche), we found that over 99.8% of obtained sequences could be assigned to 25 independent, uniquely barcoded libraries based on the presence of either a perfect forward or a perfect reverse barcode. The false-discovery rate, as measured by the percentage of sequences with unexpected perfect pairings of unmatched forward and reverse barcodes, was estimated to be <0.005%.
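The design constraint stated here, that barcodes differ in at least 4 nucleotide positions, and the count of libraries addressable by forward and reverse pairs can both be checked mechanically. The sketch below is a minimal illustration; the barcode strings are made up for the example and are not the barcodes from the paper.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list) -> int:
    """Smallest Hamming distance over all pairs in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def library_count(n_forward: int, n_reverse: int) -> int:
    """Libraries addressable when every forward/reverse pair is distinct."""
    return n_forward * n_reverse


# Hypothetical 6-nt barcodes, chosen only to illustrate the check.
example_barcodes = ["AAAAAA", "AACCCC", "CCAACC", "CCCCAA"]
```

With these four example barcodes the minimum pairwise distance is 4, and `library_count(48, 48)` gives 2304 possible combinations for 48 forward and 48 reverse barcodes.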

17.
The large amount and high quality of genomic data available today enable, in principle, accurate inference of evolutionary histories of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytical form. Existing approximations build on the computationally intensive diffusion limit or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (0 and 1). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here we introduce the beta with spikes, an extension of the beta approximation that explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations with comparable performance to an existing state-of-the-art method.

18.
The results of quantitative risk assessments are key factors in a risk manager's decision of the necessity to implement actions to reduce risk. The extent of the uncertainty in the assessment will play a large part in the degree of confidence a risk manager has in the reported significance and probability of a given risk. The two main sources of uncertainty in such risk assessments are variability and incertitude. In this paper we use two methods, a second-order two-dimensional Monte Carlo analysis and probability bounds analysis, to investigate the impact of both types of uncertainty on the results of a food-web exposure model. We demonstrate how the full extent of uncertainty in a risk estimate can be fully portrayed in a way that is useful to risk managers. We show that probability bounds analysis is a useful tool for identifying the parameters that contribute the most to uncertainty in a risk estimate and how it can be used to complement established practices in risk assessment. We conclude by promoting the use of probability analysis in conjunction with Monte Carlo analyses as a method for checking how plausible Monte Carlo results are in the full context of uncertainty.

19.
Efficient methods for multiple sequence alignment with guaranteed error bounds   Cited by: 11 (self-citations: 0, citations by others: 11)
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and give two computationally efficient multiple alignment methods (one for each measure) whose deviation from the optimal value is guaranteed to be less than a factor of two. This is the novel feature of these methods, but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obvious lower bound on the value of the optimal alignment.

20.

Background

DNA clustering is an important technique for automatically finding inherent relationships in large collections of DNA sequences, but clustering quality can still be improved considerably. The sequence similarity metric is one of the key factors in clustering. Alignment-free methods are a popular way to calculate DNA sequence similarity: they normally convert a sequence into a feature space based on the probability distribution of words rather than matching strings directly. Existing alignment-free models, e.g. k-tuple, employ only word frequency information and ignore many types of useful information contained in the DNA sequence, such as the classification of nucleotide bases and position. Better data mining results are believed to be achievable with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve DNA clustering quality.

Results

This paper proposes a Category-Position-Frequency (CPF) model, which uses the word frequency, position, and classification information of nucleotide bases in DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare CPF with mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings.

Conclusions

The following conclusions can be drawn from the experiments. (1) The hybrid information model performs better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which offers a clear performance advantage. (3) The CPF model achieves efficient, stable performance and broad generalization.
