Similar Literature
20 similar documents retrieved.
1.
Information content of protein sequences (total citations: 1; self-citations: 0, other citations: 1)
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimation of the entropy of the source is not possible due to finite-sample effects. Compression algorithms also indicate that the redundancy is on the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the observed redundancy. The findings are related to numerical and biochemical experiments with random polypeptides.
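As a minimal, illustrative sketch of the kind of entropy estimation described above (not the authors' estimators): the snippet below computes empirical block entropies of a toy amino-acid string and of a composition-preserving shuffled surrogate. The toy sequence, the block lengths and the plain shuffle are assumptions; finite-sample effects of the kind mentioned in the abstract dominate at larger block lengths.

```python
# Minimal sketch (assumed toy data, not the paper's estimators): empirical
# block entropy of an amino-acid string vs. a composition-preserving shuffle.
import math
import random
from collections import Counter

def block_entropy(seq, k):
    """Shannon entropy of overlapping k-mers, in bits per residue."""
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    n = sum(counts.values())
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / k

if __name__ == "__main__":
    # toy amino-acid string standing in for a non-redundant protein set
    protein = ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALP"
               "DAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWE")
    surrogate = list(protein)
    random.shuffle(surrogate)  # destroys correlations, keeps composition
    for k in (1, 2, 3):
        print(k, round(block_entropy(protein, k), 3),
              round(block_entropy("".join(surrogate), k), 3))
```

At k = 1 the two values coincide (the shuffle preserves composition); any difference at k >= 2 reflects correlations, but with so short a toy string undersampling quickly dominates, mirroring the finite-sample caveat in the abstract.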

2.
The language of RNA: a formal grammar that includes pseudoknots (total citations: 9; self-citations: 0, other citations: 9)
MOTIVATION: In a previous paper, we presented a polynomial time dynamic programming algorithm for predicting optimal RNA secondary structure including pseudoknots. However, a formal grammatical representation for RNA secondary structure with pseudoknots was still lacking. RESULTS: Here we show a one-to-one correspondence between that algorithm and a formal transformational grammar. This grammar class encompasses the context-free grammars and goes beyond to generate pseudoknotted structures. The pseudoknot grammar avoids the use of general context-sensitive rules by introducing a small number of auxiliary symbols used to reorder the strings generated by an otherwise context-free grammar. This formal representation of the residue correlations in RNA structure is important because it means we can build full probabilistic models of RNA secondary structure, including pseudoknots, and use them to optimally parse sequences in polynomial time.

3.
MOTIVATION: Computationally identifying non-coding RNA regions on the genome has much scope for investigation and is substantially harder than gene finding for protein-coding regions. Since comparative sequence analysis is effective for non-coding RNA detection, efficient computational methods are needed for structural alignment of RNA sequences. Hidden Markov Models (HMMs), meanwhile, have played important roles in modeling and analysing biological sequences; in particular, the concept of Pair HMMs (PHMMs) has been examined extensively as a mathematical model for alignment and gene finding. RESULTS: We propose pair HMMs on tree structures (PHMMTSs), an extension of PHMMs to alignments of trees that provides a unifying framework and an automata-theoretic model for alignments of trees, structural alignments and pair stochastic context-free grammars. By structural alignment, we mean a pairwise alignment that aligns an unfolded RNA sequence to an RNA sequence of known secondary structure. First, we extend the notion of PHMMs defined on alignments of 'linear' sequences to pair stochastic tree automata, called PHMMTSs, defined on alignments of 'trees'. PHMMTSs support various types of tree alignments, such as affine-gap alignments, and provide an automata-theoretic model for alignment of trees. Second, based on the observation that an RNA secondary structure can be represented by a tree, we apply PHMMTSs to the problem of structural alignment of RNAs: we modify PHMMTSs so that they take as input a pair consisting of a 'linear' sequence and a 'tree' representing an RNA secondary structure, and produce a structural alignment. Furthermore, PHMMTSs that take a pair of linear sequences as input are mathematically equivalent to pair stochastic context-free grammars. We present computational experiments demonstrating the effectiveness of our method for structural alignment and discuss a complexity issue of PHMMTSs.

4.
The multifractal analysis of binary images of DNA is studied in order to define a methodological approach to the classification of DNA sequences. This method is based on computing multifractality parameters on a suitable binary image of DNA that takes the nucleotide distribution into account. The binary image of DNA is obtained as a dot-plot (recurrence plot) of the indicator matrix. The fractal geometry of these images is characterized by the fractal dimension (FD), lacunarity, and succolarity. These parameters are compared with other coefficients such as complexity and Shannon information entropy. It is shown that the complexity parameters are more or less equivalent to FD, while the multifractality parameters behave differently, in the sense that sequences with higher FD might have lower lacunarity and/or succolarity. In particular, the genome of Drosophila melanogaster has been considered, focusing on chromosome 3r, which shows the highest fractality and a correspondingly higher level of complexity. We single out some results on the nucleotide distribution in 3r with respect to complexity and fractality; in particular, sequences with higher FD also have a higher frequency of guanine, while lower FD is characterized by a higher presence of adenine.
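A minimal sketch of the dot-plot construction and a box-counting estimate of the fractal dimension follows (the lacunarity and succolarity computations and the authors' exact pipeline are not reproduced; the toy sequence and box sizes are assumptions):

```python
# Illustrative sketch: indicator-matrix dot-plot of a DNA string and a
# box-counting fractal-dimension estimate via a log-log fit (numpy assumed).
import numpy as np

def indicator_matrix(seq):
    """u[i, j] = 1 when the i-th and j-th nucleotides are identical."""
    arr = np.frombuffer(seq.encode(), dtype=np.uint8)
    return (arr[:, None] == arr[None, :]).astype(np.uint8)

def box_counting_dimension(m, sizes=(1, 2, 4, 8, 16)):
    """Fit log N(eps) against log(1/eps), where N counts boxes containing a 1."""
    n = m.shape[0]
    counts = []
    for s in sizes:
        k = n // s
        trimmed = m[:k * s, :k * s]
        blocks = trimmed.reshape(k, s, k, s).any(axis=(1, 3))
        counts.append(blocks.sum())
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

if __name__ == "__main__":
    dna = "ACGTACGTTTAGGCCATCGATCGGGCTAACGTATATATCGCGCGATTACGGCATGCCGTA" * 4
    print("fractal dimension ~", round(box_counting_dimension(indicator_matrix(dna)), 3))
```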

5.
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. Order within genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
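The sketch below illustrates the flavour of such a measure: observed q-mer frequencies are compared, via a relative entropy, with frequencies reconstructed from shorter strings. The Markov-style reconstruction rule and the toy fragment are assumptions made for illustration; the paper's exact definition of the dictionary of most probable continuations is not reproduced.

```python
# Sketch (assumed reconstruction rule): relative entropy of observed q-mer
# frequencies against frequencies extended from (q-1)-mer statistics.
import math
from collections import Counter

def kmer_freqs(seq, k):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def expected_from_shorter(seq, q):
    """Expected q-mer frequencies reconstructed from (q-1)-mer statistics."""
    f_long = kmer_freqs(seq, q - 1)
    f_short = kmer_freqs(seq, q - 2) if q > 2 else None
    expected = {}
    for w in kmer_freqs(seq, q):
        denom = f_short[w[1:-1]] if f_short else 1.0
        expected[w] = f_long[w[:-1]] * f_long[w[1:]] / denom
    return expected

def relative_entropy(seq, q):
    observed = kmer_freqs(seq, q)
    expected = expected_from_shorter(seq, q)
    z = sum(expected.values())                      # normalise the reconstruction
    return sum(p * math.log2(p / (expected[w] / z)) for w, p in observed.items())

if __name__ == "__main__":
    fragment = "ATGCGTACGTTAGCATCGATGCATGCTAGCTAGGCTACGATCGATCGTAGCTAGCATCGA" * 5
    for q in range(3, 9):
        print(q, round(relative_entropy(fragment, q), 4))
```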

6.
According to unconscious thought theory (UTT), unconscious thought is more adept at complex decision-making than is conscious thought. Related research has mainly focused on the complexity of decision-making tasks as determined by the amount of information provided. However, the complexity of the rules generating this information also influences decision making. Therefore, we examined whether unconscious thought facilitates the detection of rules during a complex decision-making task. Participants were presented with two types of letter strings. One type matched a grammatical rule, while the other did not. Participants were then divided into three groups according to whether they made decisions using conscious thought, unconscious thought, or immediate decision. The results demonstrated that the unconscious thought group was more accurate in identifying letter strings that conformed to the grammatical rule than were the conscious thought and immediate decision groups. Moreover, performance of the conscious thought and immediate decision groups was similar. We conclude that unconscious thought facilitates the detection of complex rules, which is consistent with UTT.

7.

Background

DNA clustering is an important technique for automatically finding inherent relationships among large numbers of DNA sequences, but clustering quality can still be improved greatly. The similarity metric between DNA sequences is one of the key points of clustering. Alignment-free methods are a popular way to calculate DNA sequence similarity: rather than matching strings directly, a sequence is converted into a feature space based on the probability distribution of words. Existing alignment-free models, e.g. k-tuple, merely employ word-frequency information and ignore many other types of useful information contained in the DNA sequence, such as classifications of nucleotide bases and position. It is believed that better data-mining results can be achieved with such compound information. Therefore, we present a new alignment-free model that employs compound information to improve DNA clustering quality.

Results

This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases in DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of the nucleotide bases, and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare against mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings.

Conclusions

The following conclusions can be drawn from the experiments. (1) The hybrid-information model is better than a model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding-window size for CPF is two, which markedly improves performance. (3) The CPF model achieves efficient, stable performance and generalizes broadly.
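A rough sketch of a CPF-style feature computation is given below. The three binary classifications of bases (purine/pyrimidine, amino/keto, strong/weak hydrogen bonding) and the frequency-times-positional-entropy weighting are assumptions chosen for illustration, not the authors' formulas; only the overall shape — three category sequences, words of length two, a 12-dimensional vector — follows the description above.

```python
# Rough sketch of a CPF-style feature vector (assumed classifications and
# weighting; the paper's exact entropy model is not reproduced).
import math

CATEGORIES = {
    "purine/pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino/keto":        {"A": "M", "C": "M", "G": "K", "T": "K"},
    "strong/weak":       {"G": "S", "C": "S", "A": "W", "T": "W"},
}

def category_sequence(dna, mapping):
    return "".join(mapping[b] for b in dna)

def word_features(seq, window=2, bins=4):
    """Assumed feature: word frequency times the entropy of its positional histogram."""
    positions = {}
    for i in range(len(seq) - window + 1):
        positions.setdefault(seq[i:i + window], []).append(i)
    total = len(seq) - window + 1
    feats = {}
    for word, pos in positions.items():
        freq = len(pos) / total
        hist = {}
        for p in pos:
            b = min(bins - 1, p * bins // len(seq))
            hist[b] = hist.get(b, 0) + 1
        h = -sum((c / len(pos)) * math.log2(c / len(pos)) for c in hist.values())
        feats[word] = freq * h
    return feats

def cpf_vector(dna):
    vec = []
    for mapping in CATEGORIES.values():
        symbols = sorted(set(mapping.values()))            # e.g. ['R', 'Y']
        words = [a + b for a in symbols for b in symbols]   # 4 two-letter words
        feats = word_features(category_sequence(dna, mapping))
        vec.extend(feats.get(w, 0.0) for w in words)        # 3 * 4 = 12 values
    return vec

if __name__ == "__main__":
    print(cpf_vector("ATGCGTACGTTAGCATCGATGCATGCTAGCTAGGCTACGATCGATCGAT"))
```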

8.
Sequence detectors as a basis of grammar in the brain (total citations: 1; self-citations: 0, other citations: 1)
Grammar processing may build upon serial-order mechanisms known from non-human species. A circuit similar to the one underlying direction-sensitive movement detection in arthropods and vertebrates may become selective for sequences of words, thus yielding grammatical sequence detectors in the human brain. Sensitivity to the order of neuronal events arises from unequal connection strengths between two word-specific neural units and a third element, the sequence detector. This mechanism, which critically depends on the dynamics of the neural units, can operate at the single-neuron level and may be relevant at the level of neuronal ensembles as well. Owing to the repeated occurrence of sequences, for example word strings, the sequence-sensitive elements become more firmly established and, by substitution of elements between strings, a process called auto-associative substitution learning (AASL) is triggered. AASL links the neuronal counterparts of the string elements involved in the substitution process to the sequence detector, thereby providing a brain basis for what can be described linguistically as the generalization of rules of grammar. A network of sequence detectors may constitute grammar circuits in the human cortex on which a separate set of mechanisms establishing temporary binding and recursion can operate.
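A toy numerical illustration of order sensitivity through unequal connection strengths (an assumed correlation-detector analogue, not the authors' neural model): the detector combines a slowly decaying trace from word unit A with a fast, weaker drive from word unit B, so it crosses threshold only when B follows A within a suitable interval.

```python
# Toy sketch (assumed parameters): a sequence detector that responds to the
# order A -> B but not B -> A, via asymmetric weights and a decaying trace.
import math

TAU = 5.0          # decay constant of A's activity trace (time steps)
W_A, W_B = 1.0, 0.5
THRESHOLD = 1.0

def detector_response(t_A, t_B):
    """Detector drive evaluated at the moment word B occurs."""
    trace_A = math.exp(-(t_B - t_A) / TAU) if t_B >= t_A else 0.0
    return W_A * trace_A + W_B

if __name__ == "__main__":
    print("A then B:          ", detector_response(t_A=0, t_B=3) >= THRESHOLD)   # fires
    print("B then A:          ", detector_response(t_A=6, t_B=3) >= THRESHOLD)   # silent
    print("A .. long gap .. B:", detector_response(t_A=0, t_B=30) >= THRESHOLD)  # silent
```

The long-gap case also shows the dependence on the units' dynamics mentioned in the abstract: if A's trace has decayed, the correct order alone is not enough.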

9.
Protein combinatorial libraries provide new ways to probe the determinants of folding and to discover novel proteins. Such libraries are often constructed by expressing an ensemble of partially random gene sequences. Given the intractably large number of possible sequences, some limitation on diversity must be imposed. A non-uniform distribution of nucleotides can be used to reduce the number of possible sequences and to encode peptide sequences having a predetermined set of amino acid probabilities at each residue position, i.e., an amino acid sequence profile. Such profiles can be determined by inspection, by multiple sequence alignment or by physically based computational methods. Here we present a computational method that takes as input a desired sequence profile and calculates the individual nucleotide probabilities among partially random genes. The calculated gene library can readily be used with standard DNA synthesis to generate a protein library with essentially the desired profile. The fidelity between the desired profile and the one coded by these partially random genes is quantitatively evaluated using the linear correlation coefficient and a relative entropy, each of which measures profile agreement at each position of the sequence. On average, this method of identifying codon frequencies performs as well as or better than other methods with regard to fidelity to the original profile. Importantly, the method presented here provides much better yields of complete sequences that do not contain stop codons, a feature that is particularly important when all or large fractions of a gene are subject to combinatorial mutation.
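The sketch below illustrates the general idea under simplifying assumptions: per-position nucleotide probabilities are obtained by naively averaging over each target amino acid's codons (the paper computes them by a dedicated calculation), and the resulting library is scored by the relative entropy against the target profile and by the probability of drawing a stop codon.

```python
# Sketch under simplifying assumptions: codon-averaged nucleotide
# probabilities, scored by relative entropy and stop-codon probability.
import math
from collections import defaultdict

CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def nucleotide_probs(aa_profile):
    """P(base) at each of the three codon positions, by naive codon averaging."""
    probs = [defaultdict(float) for _ in range(3)]
    for aa, p_aa in aa_profile.items():
        codons = [c for c, a in CODON_TABLE.items() if a == aa]
        for codon in codons:
            for k, base in enumerate(codon):
                probs[k][base] += p_aa / len(codons)
    return probs

def profile_fidelity(aa_profile):
    """Relative entropy of target vs. realised profile, and stop-codon probability."""
    base_probs = nucleotide_probs(aa_profile)
    realized = defaultdict(float)
    for codon, aa in CODON_TABLE.items():
        realized[aa] += (base_probs[0][codon[0]] * base_probs[1][codon[1]]
                         * base_probs[2][codon[2]])
    stop = realized.pop("*", 0.0)
    norm = sum(realized.values())
    rel_entropy = sum(p * math.log2(p / (realized[aa] / norm))
                      for aa, p in aa_profile.items() if p > 0)
    return rel_entropy, stop

if __name__ == "__main__":
    target = {"A": 0.4, "G": 0.3, "S": 0.2, "W": 0.1}   # toy single-position profile
    d, stop = profile_fidelity(target)
    print("relative entropy:", round(d, 4), " stop-codon probability:", round(stop, 4))
```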

10.
11.
The human heartbeat series is more variable and, hence, more complex in healthy subjects than in congestive heart failure (CHF) patients. However, little is known about the complexity of the heart rate variations on a beat-to-beat basis. We present an analysis based on symbolic dynamics that focuses on the dynamic features of such beat-to-beat variations on a small time scale. The sequence of acceleration and deceleration of eight successive heartbeats is represented by a binary sequence consisting of ones and zeros. The regularity of such binary patterns is quantified using approximate entropy (ApEn). Holter electrocardiograms from 30 healthy subjects, 15 patients with CHF, and their surrogate data were analyzed with respect to the regularity of such binary sequences. The results are compared with spectral analysis and ApEn of heart rate variability. Counterintuitively, healthy subjects show a large amount of regular beat-to-beat patterns in addition to a considerable amount of irregular patterns. CHF patients show a predominance of one regular beat-to-beat pattern (alternation of acceleration and deceleration), as well as some irregular patterns similar to the patterns observed in the surrogate data. In healthy subjects, regular beat-to-beat patterns reflect the physiological adaptation to different activities, i.e., sympathetic modulation, whereas irregular patterns may arise from parasympathetic modulation. The patterns observed in CHF patients indicate a largely reduced influence of the autonomic nervous system. In conclusion, analysis of short beat-to-beat patterns with respect to regularity leads to a considerable increase of information compared with spectral analysis or ApEn of heart-rate variations.
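An illustrative sketch of the symbolic-dynamics step (the toy RR series and parameter choices are assumptions; the clinical analysis is not reproduced): successive RR intervals are encoded as acceleration/deceleration bits, 8-beat binary words are counted, and approximate entropy is computed on the binary series.

```python
# Sketch with assumed toy data: binary beat-to-beat symbols, 8-beat word
# counts, and approximate entropy (ApEn) of the symbolic series.
import math
import random
from collections import Counter

def binary_pattern(rr):
    """1 = acceleration (interval shorter than the previous one), 0 = deceleration."""
    return [1 if b < a else 0 for a, b in zip(rr, rr[1:])]

def pattern_counts(symbols, word_len=8):
    return Counter(tuple(symbols[i:i + word_len])
                   for i in range(len(symbols) - word_len + 1))

def approximate_entropy(series, m=2, r=0.5):
    """Standard ApEn; with a binary series and r < 1, template matches are exact."""
    n = len(series)
    def phi(mm):
        templates = [series[i:i + mm] for i in range(n - mm + 1)]
        ratios = [sum(1 for u in templates
                      if max(abs(a - b) for a, b in zip(t, u)) <= r) / len(templates)
                  for t in templates]
        return sum(math.log(c) for c in ratios) / len(ratios)
    return phi(m) - phi(m + 1)

if __name__ == "__main__":
    rng = random.Random(1)
    rr = [800 + 20 * math.sin(0.5 * i) + rng.randint(-15, 15) for i in range(300)]
    sym = binary_pattern(rr)
    print("three most common 8-beat words:", pattern_counts(sym).most_common(3))
    print("ApEn of the binary series:", round(approximate_entropy(sym), 3))
```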

12.
Abstract

The process of designing novel RNA sequences by inverse RNA folding, available in tools such as RNAinverse and InfoRNA, can be thought of as a reconstruction of RNAs from secondary structure. In this reconstruction problem, no physical measures are considered as additional constraints that are independent of structure, aside from the goal of reaching the same secondary structure as the input using energy minimization methods. An extension of the reconstruction problem can be formulated, since in many cases of natural RNAs it is desirable to analyze the sequence and structure of RNA molecules using various quantifiable physical measures. Prior work using secondary structure predictions has shown that natural RNAs differ significantly from random RNAs in some of these measures. We therefore relax the problem of reconstructing RNAs from secondary structure into reconstructing RNAs from shapes, and in turn incorporate physical quantities as constraints. This allows the design of novel RNA sequences by inverse folding while considering physical quantities of interest such as thermodynamic stability, mutational robustness, and linguistic complexity. At the expense of altering the number of nucleotides in stems and loops, for example, such physical measures can be taken into account. We use evolutionary computation for the new reconstruction problem and illustrate the procedure on various natural RNAs.

13.
An effective algorithm for identifying repetitive sequences (total citations: 1; self-citations: 0, other citations: 1)
李冬冬  王正志  倪青山 《生物信息学》2005,3(4):163-166,174
The analysis of repetitive sequences is an important topic in genome research, and its foundation is finding repeats in genomic sequences quickly and effectively. A projection-assembly algorithm is presented: random projection is used to obtain a set of candidate fragments, which are then joined by fragment assembly to discover the repetitive sequences in the genome. The computational complexity of the algorithm is analyzed, semi-simulated test data are constructed, and the test results demonstrate the algorithm's effectiveness.

14.
Dynamic invariants are often estimated from experimental time series with the aim of differentiating between different physical states in the underlying system. The most popular schemes for estimating dynamic invariants are capable of estimating confidence intervals, however, such confidence intervals do not reflect variability in the underlying dynamics. We propose a surrogate based method to estimate the expected distribution of values under the null hypothesis that the underlying deterministic dynamics are stationary. We demonstrate the application of this method by considering four recordings of human pulse waveforms in differing physiological states and show that correlation dimension and entropy are insufficient to differentiate between these states. In contrast, algorithmic complexity can clearly differentiate between all four rhythms.
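The sketch below illustrates the surrogate-testing idea in its simplest form: a complexity statistic is computed on the original series and on an ensemble of surrogates, and the original value is located within the null distribution. The paper's surrogates are constructed to respect the hypothesised stationary dynamics, whereas the plain shuffle used here is only the crudest possible null; permutation entropy stands in for the invariants (correlation dimension, entropy, algorithmic complexity) discussed above.

```python
# Sketch (assumed shuffle surrogates and stand-in statistic): null
# distribution of permutation entropy versus the observed value.
import math
import random
from collections import Counter

def permutation_entropy(x, order=3):
    """Shannon entropy (bits) of ordinal patterns of length `order`."""
    patterns = Counter(
        tuple(sorted(range(order), key=lambda k: x[i + k]))
        for i in range(len(x) - order + 1)
    )
    total = sum(patterns.values())
    return -sum((c / total) * math.log2(c / total) for c in patterns.values())

def surrogate_test(x, statistic, n_surrogates=200, seed=0):
    rng = random.Random(seed)
    observed = statistic(x)
    null = []
    for _ in range(n_surrogates):
        s = list(x)
        rng.shuffle(s)               # order-destroying surrogate
        null.append(statistic(s))
    rank = sum(1 for v in null if v <= observed) / n_surrogates
    return observed, rank

if __name__ == "__main__":
    rng = random.Random(42)
    # toy "pulse waveform": deterministic oscillation plus a little noise
    series = [math.sin(0.3 * i) + 0.1 * rng.random() for i in range(500)]
    obs, rank = surrogate_test(series, permutation_entropy)
    print("permutation entropy:", round(obs, 3),
          " fraction of surrogates <= observed:", rank)
```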

15.
It is generally assumed that fetal heart rate variability increases with gestation, reflecting prenatal development of the autonomic nervous system. We examined standard measures quantifying fetal heart rate variability, as well as a complexity measure, approximate entropy, in 66 fetal magnetocardiograms recorded from 22 healthy pregnant women between the 16th and 42nd week of gestation. In particular, regularity in the fetal RR-interval time series was assessed on the basis of symbolic dynamics. The results showed that, besides an overall increase in fetal heart rate variability and complexity during pregnancy, there was also an increase in specific sets of binary patterns with low approximate entropy, i.e., a high degree of regularity. These sets were characterized by short epochs of heart rate acceleration and deceleration, and comparison with surrogate data confirmed that their random occurrence is rare. The results most likely reflect the influence of increasingly differentiated fetal behavioral states, and of transitions between them, in association with fetal development.

16.

17.
18.
We describe a new distance measure for comparing DNA sequence profiles. For this measure, columns in a multiple alignment are treated as character frequency vectors (with frequencies summing to one). The distance between two vectors is based on the minimum path length along an entropy surface. Path length is estimated using a random graph generated on the entropy surface and Dijkstra's algorithm for all shortest paths to a source. We use the new distance measure to analyze similarities within families of tandem repeats in the C. elegans genome and show that it gives a more accurate refinement of family relationships than a method based on comparing consensus sequences.
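A rough sketch of the path-length idea follows, under stated assumptions: alignment columns are treated as A/C/G/T frequency vectors, random points on the probability simplex are lifted onto the entropy surface z = H(p), a k-nearest-neighbour graph is built with Euclidean edge lengths in the lifted space, and Dijkstra's algorithm returns the shortest path between the two columns. The sampling scheme, the neighbourhood size and the edge metric are all illustrative choices, not the construction used in the paper.

```python
# Rough sketch (assumed graph construction): shortest path between two
# frequency vectors over a random graph lifted onto the entropy surface.
import heapq
import math
import random

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def lift(p):
    """Embed a frequency vector as a point on the entropy surface."""
    return tuple(p) + (entropy(p),)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def random_simplex_point(rng, dim=4):
    cuts = sorted(rng.random() for _ in range(dim - 1))
    return [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]

def surface_distance(p, q, n_samples=300, k=8, seed=0):
    rng = random.Random(seed)
    nodes = [lift(p), lift(q)] + [lift(random_simplex_point(rng)) for _ in range(n_samples)]
    edges = {i: [] for i in range(len(nodes))}
    for i, a in enumerate(nodes):                 # k-nearest-neighbour graph
        for d, j in sorted((euclid(a, b), j) for j, b in enumerate(nodes) if j != i)[:k]:
            edges[i].append((j, d))
            edges[j].append((i, d))
    dist = {0: 0.0}
    heap = [(0.0, 0)]
    while heap:                                   # Dijkstra from column p (node 0)
        d, u = heapq.heappop(heap)
        if u == 1:                                # reached column q
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

if __name__ == "__main__":
    col_a = [0.70, 0.10, 0.10, 0.10]   # two alignment columns (A, C, G, T frequencies)
    col_b = [0.10, 0.10, 0.10, 0.70]
    print("path length on the entropy surface:", round(surface_distance(col_a, col_b), 3))
    print("straight-line distance, lifted space:", round(euclid(lift(col_a), lift(col_b)), 3))
```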

19.
Parameterized complexity analysis in computational biology (total citations: 2; self-citations: 0, other citations: 2)
Many computational problems in biology involve parameters for which a small range of values covers important applications. We argue that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability. At issue in the theory of parameterized complexity is whether a problem can be solved in time O(n^α) for each fixed parameter value, where α is a constant independent of the parameter. In addition to surveying this complexity framework, we describe a new result for the Longest Common Subsequence problem. In particular, we show that the problem is hard for W[t] for all t when parameterized by the number of strings and the size of the alphabet. Lower bounds on the complexity of this basic combinatorial problem imply lower bounds on more general sequence alignment and consensus discovery problems. We also describe a number of open problems pertaining to the parameterized complexity of problems in computational biology where small parameter values are important.

20.
李楠  李春 《生物信息学》2012,10(4):238-240
Based on 16 classification models of amino acids, derived sequences are generated from protein sequences; weighted quasi-entropy and LZ complexity are then combined to construct a 34-dimensional feature vector representing each protein sequence. Using a Bayesian classifier, protein structural class prediction on the 640 dataset (sequence homology no greater than 25%) achieves an accuracy of 71.28%.
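A sketch of the LZ-complexity ingredient follows (the 16 classification models, the weighted quasi-entropy and the Bayesian classifier are not reproduced; the hydrophobic/hydrophilic split below is just one assumed binary grouping): a protein is mapped to a derived binary sequence whose Lempel-Ziv (LZ76) complexity would serve as one coordinate of such a feature vector.

```python
# Sketch (assumed grouping): derived binary sequence of a protein and its
# Lempel-Ziv (LZ76) complexity, i.e. the number of phrases in the parsing.
HYDROPHOBIC = set("AVLIMFWYC")   # assumed binary classification for illustration

def derived_sequence(protein):
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in protein)

def lz76_complexity(s):
    """Phrase count of the Lempel-Ziv (1976) parsing (Kaspar-Schuster scheme)."""
    n = len(s)
    if n < 2:
        return n
    i, k, l = 0, 1, 1
    c, k_max = 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:             # no earlier start reproduces the extension: new phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
    binary = derived_sequence(protein)
    print(binary)
    print("LZ76 complexity:", lz76_complexity(binary))
```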
