Found 20 similar articles (search time: 9 ms)
1.
Observed patterns in macromolecular sequences are often considered as words and compared with their probabilities of occurring in random sequences. Calculation of these probabilities, however, often lacks rigour. We have developed an algorithm for exact computation of such probabilities for stochastic sequences that follow a Markov chain model. The method is applicable to the case that a random sequence contains one out of two given patterns P and Q, or both simultaneously. Another application yields the probability function P(x) that a sequence contains pattern P exactly x times. An application to patterns that include wild-card characters yields probabilities for homonucleotide clusters of a given length. We prove the probability of multiple runs of single nucleotides in the SV40 genome to be in accordance with the dinucleotide composition of the sequence, although it is in conflict with mononucleotide composition. Received on January 10, 1990; accepted on April 23, 1990
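The idea of an exact pattern probability under a Markov chain can be illustrated with a small dynamic program. This is a generic sketch built on a string-matching automaton, not the algorithm of the paper summarized above; the function name `pattern_probability` and its parameters are hypothetical, and it handles only a single pattern without wild cards under a first-order chain.

```python
def pattern_probability(pattern, length, alphabet, initial, transition):
    """Probability that a random sequence of `length` symbols (length >= 1)
    contains `pattern` at least once under a first-order Markov chain.

    initial[c]       : probability that the first symbol is c
    transition[a][b] : probability that symbol b follows symbol a
    """
    m = len(pattern)

    def step(k, c):
        # Longest prefix of `pattern` that is a suffix of pattern[:k] + c.
        s = pattern[:k] + c
        for j in range(min(m, len(s)), 0, -1):
            if s.endswith(pattern[:j]):
                return j
        return 0

    # dist[(k, c)]: probability of having matched a prefix of length k,
    # with last symbol c, and no full occurrence of the pattern so far.
    dist = {}
    hit = 0.0  # probability mass that has already seen the pattern
    for c in alphabet:
        k = step(0, c)
        if k == m:
            hit += initial[c]
        else:
            dist[(k, c)] = dist.get((k, c), 0.0) + initial[c]

    for _ in range(length - 1):
        new = {}
        for (k, prev), p in dist.items():
            for c in alphabet:
                q = p * transition[prev][c]
                if q == 0.0:
                    continue
                k2 = step(k, c)
                if k2 == m:
                    hit += q  # first occurrence completed; absorb this mass
                else:
                    new[(k2, c)] = new.get((k2, c), 0.0) + q
        dist = new
    return hit
```

With uniform, independent nucleotides (a degenerate Markov chain), the result can be checked by hand: of the 64 sequences of length 3, seven contain "AA".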
2.
In a random number generation task, participants are asked to generate a random sequence of numbers, most typically the digits 1 to 9. Such number sequences are not mathematically random, and both extent and type of bias allow one to characterize the brain's "internal random number generator". We assume that certain patterns and their variations will frequently occur in humanly generated random number sequences. Thus, we introduce a pattern-based analysis of random number sequences. Twenty healthy subjects randomly generated two sequences of 300 numbers each. Sequences were analysed to identify the patterns of numbers predominantly used by the subjects and to calculate the frequency of a specific pattern and its variations within the number sequence. This pattern analysis is based on the Damerau-Levenshtein distance, which counts the number of edit operations that are needed to convert one string into another. We built a model that predicts not only the next item in a humanly generated random number sequence based on the item's immediate history, but also the deployment of patterns in another sequence generated by the same subject. When a history of seven items was computed, the mean correct prediction rate rose up to 27% (with an individual maximum of 46%, chance performance of 11%). Furthermore, we assumed that when predicting one subject's sequence, predictions based on statistical information from the same subject should yield a higher success rate than predictions based on statistical information from a different subject. When provided with two sequences from the same subject and one from a different subject, an algorithm identifies the foreign sequence in up to 88% of the cases. In conclusion, the pattern-based analysis using the Damerau-Levenshtein distance is able both to predict humanly generated random number sequences and to identify person-specific information within a humanly generated random number sequence.
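The edit distance mentioned in this abstract can be sketched as follows. This is the restricted variant (often called optimal string alignment), which counts insertions, deletions, substitutions and adjacent transpositions but never edits a transposed pair again; whether the study used this variant or the unrestricted one is not stated, so treat the choice, and the name `osa_distance`, as assumptions.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j]: distance between the first i symbols of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i symbols
    for j in range(cols):
        d[0][j] = j  # insert all j symbols
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "34" <-> "43".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For digit strings such as those in the task, `osa_distance("1234", "1243")` is 1, because swapping two neighbouring digits counts as a single operation.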
3.
Background
The currently used kth-order Markov models estimate the probability of generating a single nucleotide conditional upon the immediately preceding (gap = 0) k units. However, this neither takes into account the joint dependency of multiple neighboring nucleotides, nor does it consider the long-range dependency with gap > 0.
4.
First and second moment of counts of words in random texts generated by Markov chains (total citations: 3; self-citations: 0; citations by others: 3)
An exact expression for the variance of random frequency that a given word has in text generated by a Markov chain is presented. The result is applied to periodic Markov chains, which describe the protein-coding DNA sequences better than simple Markov chains. A new solution to the problem of word overlap is proposed. It was found that the expected frequency and overlapping properties determine most of the variance. The expectation and variance of counts for triplets are compared with experimental counts in Escherichia coli coding sequences.
5.
Incorporating biological pathways via a Markov random field model in genome-wide association studies
Genome-wide association studies (GWAS) examine a large number of markers across the genome to identify associations between genetic variants and disease. Most published studies examine only single markers, which may be less informative than considering multiple markers and multiple genes jointly because genes may interact with each other to affect disease risk. Much knowledge has been accumulated in the literature on biological pathways and interactions. It is conceivable that appropriate incorporation of such prior knowledge may improve the likelihood of making genuine discoveries. Although a number of methods have been developed recently to prioritize genes using prior biological knowledge, such as pathways, most methods treat genes in a specific pathway as an exchangeable set without considering the topological structure of a pathway. However, how genes are related to each other in a pathway may be very informative for identifying association signals. To make use of the connectivity information among genes in a pathway in GWAS analysis, we propose a Markov Random Field (MRF) model to incorporate pathway topology for association analysis. We show that the conditional distribution of our MRF model takes on a simple logistic regression form, and we propose an iterated conditional modes algorithm as well as a decision theoretic approach for statistical inference of each gene's association with disease. Simulation studies show that our proposed framework is more effective at identifying genes associated with disease than a single gene-based method. We also illustrate the usefulness of our approach through its applications to a real data example.
6.
Pavy N Rombauts S Déhais P Mathé C Ramana DV Leroy P Rouzé P 《Bioinformatics (Oxford, England)》1999,15(11):887-899
MOTIVATION: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes. RESULTS: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software. AVAILABILITY: The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/. CONTACT: Pierre.Rouze@gengenp.rug.ac.be.
7.
ROC ('receiver operator characteristics') analysis is a visual as well as numerical method used for assessing the performance of classification algorithms, such as those used for predicting structures and functions from sequence data. This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison. We overview the available programs and provide evaluation guidelines for genomic/proteomic data, with particular regard to applications to large and heterogeneous databases used in bioinformatics.
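The numerical side of ROC analysis is usually summarized by the area under the curve (AUC). A minimal sketch, assuming binary labels (1 for positives, 0 for negatives) and real-valued classifier scores: the AUC equals the fraction of positive/negative pairs in which the positive item scores higher, with ties counted as one half. The function name `roc_auc` is illustrative and is not taken from the review.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; for the large databases the review has in mind, a rank-based computation would be the more practical choice.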
8.
The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution
of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently,
the investigation of distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance
and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representation
of sequences where its statistical properties can be conveniently investigated. In this study a family of kernel density functions
is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently,
enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore,
the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics
as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework
for a diverse suite of sequence analysis techniques.
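For readers unfamiliar with the coordinate system this abstract refers to, the basic Chaos Game Representation iteration can be sketched in a few lines. Each nucleotide moves the current point halfway towards that nucleotide's corner of the unit square. The corner assignment and the starting point at the centre are common conventions assumed here, and the fractal kernel density itself is not reproduced.

```python
# Assumed corner assignment on the unit square (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after each symbol of `sequence`."""
    x, y = start
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Because each step halves the distance to a corner, sequences that share a suffix of length n end up in the same sub-square of side 2^-n, which is the source of the fractal structure discussed above.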
9.
F C Eden 《Biochemistry》1985,24(1):229-233
Structural relationships within a family of long repeated DNA sequences have been determined by molecular cloning of individual family members. About half of the family members are truncated at one end. There is a short, tandemly repeating region flanked by direct repeats associated with truncation. Recombination in a region near the tandemly repeating segment has apparently generated much of the diversity in this family.
10.
A model of perfect tandem repeat with random pattern has been considered. It expands the notion of approximate tandem repeat and describes a new kind of latent periodicity in biological sequences, which has been named profile periodicity. Based on this model, an original spectral-statistical approach has been proposed for estimating the size of the periodicity pattern in sequences of approximate tandem repeats. In contrast to existing approaches, this is applicable in practical conditions of unrepresentative samples (for standard criteria). This approach is suitable and effective for preliminary automated detection of latent periodicity. The advantages of the spectral-statistical approach to estimation of the periodicity pattern for approximate tandem repeats have been demonstrated in comparison with the other methods.
11.
Judith E. Dayhoff 《Bulletin of mathematical biology》1984,46(4):529-543
This paper concerns sequences of letters in which certain “distinguished” words are of interest. Such sequences arise as data
in numerous fields including genetics and neuroscience. A probability distribution is given for the number of occurrences
of a chosen word in a randomized sequence of letters. Such words are considered “favored” if they occur more than expected
at random. Favored words have been discovered in nerve impulse trains and may reflect a neural coding scheme.
This article is dedicated to my mother, Margaret Oakley Dayhoff, whose enthusiasm encouraged me to pursue research in mathematical
biology.
12.
Earlier work [Knapp et al.: Hum Hered 1994;44:44-51] focusing on affected sib pair (ASP) data established the equivalence between the mean test and a test based on a simple recessive lod score, as well as equivalences between certain forms of the maximum likelihood score (MLS) statistic [Risch: Am J Hum Genet 1990;46:242-253] and particular forms of the lod score. Here we extend the results of Knapp et al. [1994] by reconsidering these equivalences for ASP data, but in the presence of locus heterogeneity. We show that Risch's MLS statistic under the possible triangle constraints [Holmans: Am J Hum Genet 1993;52:362-374] is locally equivalent to the ordinary heterogeneity lod score assuming a simple recessive model (HLOD/R); while the one-parameter MLS assuming no dominance variance is locally equivalent to the (homogeneity) recessive lod. The companion paper (this issue, pp 199-208) showed that when considering multiple data sets in the presence of locus heterogeneity, the HLOD can suffer appreciable losses in power. We show here that in ASP data, these equivalences ensure that this same loss in power is incurred by both forms of the MLS statistic as well. The companion paper also introduced an adaptation of the lod, the compound lod score (HLOD/C). We confirm that the HLOD/C maintains higher power than these 'model-free' methods when applied to multiple heterogeneous data sets, even when it is calculated assuming the wrong genetic model.
13.
A method was previously developed for computation of pattern probabilities in random sequences under Markov chain models. We extend this method to the calculation of the joint distribution for two patterns. An application yields the distribution of the right choice measure for expressivity and how significance bounds depend on sequence length. These bounds are used to show that the choice of pyrimidine in codon position 3 of Escherichia coli genes deviates considerably from a general Markov process model for coding regions. We also derive some statistical evidence that this significant deviation is limited to codon position 3.
14.
Recent advances in the selection of biologically active DNA sequences from random populations are reviewed. Within the framework of evolution, forces are considered that have precluded the testing of all possible DNA sequences, purely with regard to their functionality as genetic regulatory elements or protein coding sequences. Examples are drawn from cassette mutagenesis of enzyme active sites, protein domain replacement by fusion with random genomic digests, and the selection of bacterial promoters from random DNA. Efforts to derive new activities are examined, and the likelihood of future success is evaluated.
15.
Ensemble-based approaches to RNA secondary structure prediction have become increasingly appreciated in recent years. Here,
we utilize sampling and clustering of the Boltzmann ensemble of RNA secondary structures to investigate whether biological
sequences exhibit ensemble features that are distinct from their random shuffles. Representative messenger RNAs (mRNAs), structural
RNAs, and precursor microRNAs (miRNAs) are analyzed for nine ensemble features. These include structure clustering features,
the energy gap between the minimum free energy (MFE) and the ensemble, the numbers of high-frequency base pairs in the ensemble
and in clusters, the average base-pair distance between the MFE structure and the ensemble, and between-cluster and within-cluster
sums of squares. For each of the features, we observe a lack of significant distinction between mRNAs and their random shuffles.
For five features, significant differences are found between structural RNAs and random counterparts. For seven features including
the five for structural RNAs, much greater differences are observed between precursor miRNAs and random shuffles. These findings
reveal differences in the Boltzmann structure ensemble among different types of functional RNAs. In addition, for two ensemble
features, we observe distinctive, non-overlapping distributions for precursor miRNAs and random shuffles. A distributional
separation can be particularly useful for the prediction of miRNA genes.
16.
M. ten Hoopen 《Biological cybernetics》1967,4(1):1-10
Summary: A mathematical model is presented that is supposed to describe those types of neuronal discharges which show a preponderance of short intervals, as well as one or more preferred intervals of a longer duration. It is assumed that via two channels impulses impinge upon a nerve cell and that each impulse gives rise to a response. The intervals between impulses in one channel are distributed according to an exponential, or an exponential-like, function; those in the other channel are distributed according to a monomodal, or a multimodal, function. The interval distributions and the expectation density (auto-correlation) functions of the model are in particular compared with data on thalamic neuron discharge patterns reported in the literature. The properties of superimposed time series of events would seem to be of a wider interest, stretching beyond the field of theoretical neurophysiology. It is indicated how the theory is of use in the detection of hidden rhythms in records which are composed of a mixture of different signals.
17.
K. L. Svenson Y. C. Cheah K. L. Shultz J. L. Mu B. Paigen W. G. Beamer 《Mammalian genome》1995,6(12):867-872
We typed 147 simple sequence length polymorphisms in the SWXJ recombinant inbred (RI) strain set spanning Chromosomes (Chrs) 1–6. The strain distribution pattern for these loci was combined with data from 18 previously typed loci for SWXJ, resulting in new chromosome maps for this RI set, with an average density of 3.5 cM between loci. This is the first systematic effort to develop a more highly resolved genetic map for the SWXJ RI set and thereby improves the usefulness of this genetic tool for mapping genes underlying both simple and complex genetic disorders.
18.
Three-dimensional numerical simulations of multi-cryo-needle surgery were performed with cryo-needle temperature variations taken from matched experimental data. The transient temperatures and frozen volumes generated by simultaneously operating up to three 1.47 mm OD cryo-needles embedded in a phase-changing gel simulating the properties of biological tissues, were studied. In all cases studied, the volumes enclosed by the "lethal", -40 degrees C isotherm, achieved most of their final size in the first few minutes of operation, thus obviating the need for long application times. After 30 min of application of the one-, two- or three-cryo-needles, the ablation ratio attained 3%, 3-6% and 3-8%, respectively, depending on cryo-needle placement configurations. Synergistic effects of using multi-cryo-needles were reflected in the increased expansion of both the radial and axial locations of the isothermal contours. Within each number of cryo-needles used, however, the differences in these locations were rather small, and, as a general rule, tended to somewhat decrease with increasing the placement "density" of the cryo-needles. For each two- and three-cryo-needle application, there is a certain combination of placement configuration and application time that would produce the largest, temperature-specific, volume. As a general guideline, multiple cryo-needles should not be placed too close to each other in order to enhance their synergistic effect. Results of this study should be useful in the design of cryo-needle placement and operation protocols and in understanding the limitations of the freezing-ablation process.
19.
Clustering by soft-constraint affinity propagation: applications to gene-expression data (total citations: 4; self-citations: 0; citations by others: 4)
MOTIVATION: Similarity-measure-based clustering is a crucial problem appearing throughout scientific data analysis. Recently, a powerful new algorithm called Affinity Propagation (AP) based on message-passing techniques was proposed by Frey and Dueck (2007a). In AP, each cluster is identified by a common exemplar to which all other data points of the same cluster refer, and exemplars have to refer to themselves. Despite its proven power, AP in its present form suffers from a number of drawbacks. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly shaped clusters, and leads to suboptimal performance, e.g. in analyzing gene expression data. RESULTS: This limitation can be overcome by relaxing the AP hard constraints. A new parameter controls the importance of the constraints compared to the aim of maximizing the overall similarity, and allows interpolation between the simple case where each data point selects its closest neighbor as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) becomes more informative and accurate, and leads to more stable clustering. Even though a new a priori free parameter is introduced, the overall dependence of the algorithm on external tuning is reduced, as robustness is increased and an optimal strategy for parameter selection emerges more naturally. SCAP is tested on biological benchmark data, including in particular microarray data related to various cancer types. We show that the algorithm efficiently unveils the hierarchical cluster structure present in the data sets. Furthermore, it allows the extraction of sparse gene expression signatures for each cluster.
20.
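Characteristics of land cover pattern change in the source region of the Yangtze River based on a Markov chain model (total citations: 2; self-citations: 0; citations by others: 2)
《生态学杂志》 (Chinese Journal of Ecology) 2015,34(1)
Using remote-sensing images of the source region of the Yangtze River from three periods (1986, 2000 and 2014), combined with field surveys, land cover type maps of the region were obtained for these three time points. Transition probabilities between land types were determined from the changes in land cover pattern between periods, and a Markov chain model for the region was then constructed, tested and used for prediction. The results show that, from 1986 to 2014, the changes in land cover pattern in the source region of the Yangtze River were consistent with a Markov process, and the Markov chain model could effectively simulate the pattern change process in the region. Land cover degradation in the source region was evident: the areas of wetland and of medium- and high-coverage grassland declined steadily, while the areas of bare land, sandy land and low-coverage grassland kept increasing. After 2000, owing to factors such as the establishment of the Sanjiangyuan nature reserve and an increase in precipitation, vegetation degradation in the source region of the Yangtze River improved markedly.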
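The transition-probability step of such a land cover Markov model can be sketched as follows. This is a generic illustration, not the study's procedure: the function names, the per-cell class lists and the class labels in the usage note are all hypothetical, and a class that never appears in the earlier map is simply assumed to stay unchanged.

```python
def estimate_transitions(before, after, classes):
    """Row-normalized transition matrix from two co-registered class maps.

    `before` and `after` are equal-length lists of class labels, one per cell.
    """
    counts = {a: {b: 0 for b in classes} for a in classes}
    for old, new in zip(before, after):
        counts[old][new] += 1
    matrix = {}
    for a in classes:
        total = sum(counts[a].values())
        if total == 0:
            # Assumption: an unobserved class keeps its own label.
            matrix[a] = {b: 1.0 if a == b else 0.0 for b in classes}
        else:
            matrix[a] = {b: counts[a][b] / total for b in classes}
    return matrix


def project(shares, matrix, steps=1):
    """Apply the transition matrix `steps` times to the class area shares."""
    for _ in range(steps):
        shares = {
            b: sum(shares[a] * matrix[a][b] for a in matrix) for b in matrix
        }
    return shares
```

With a toy pair of four-cell maps, `estimate_transitions(["grass", "grass", "grass", "bare"], ["grass", "grass", "bare", "bare"], ["grass", "bare"])` gives grass a 2/3 chance of persisting, and projecting shares of 0.75 grass and 0.25 bare one step ahead yields 0.5 each.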
