Similar articles
 19 similar articles found (search time: 171 ms)
1.
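In studies of DNA sequence similarity, the commonly used dynamic programming algorithms rely on gap penalty functions that lack a theoretical basis and are therefore subjective, which leads to differing results. This paper proposes a DNA sequence similarity measure based on the DTW (Dynamic Time Warping) distance to address this problem. A graphical representation converts each DNA sequence into a time series, and the DTW distance between the series is then computed to measure sequence similarity and characterize the sequences. The resulting measure was used to compare the gene sequences of seven neurotoxins from the East Asian scorpion (Buthus martensi Karsch), which verified its effectiveness and accuracy.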
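
The DTW comparison described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which graphical representation is used, so the purine/pyrimidine cumulative walk in `dna_to_series` is a hypothetical stand-in, and `dtw_distance` is the textbook DTW recurrence with the absolute difference as local cost.

```python
def dna_to_series(seq):
    """Map a DNA string to a numeric series (hypothetical encoding:
    purines A/G step up by 1, pyrimidines C/T step down by 1)."""
    step = {"A": 1, "G": 1, "C": -1, "T": -1}
    total = 0
    series = []
    for base in seq.upper():
        total += step[base]
        series.append(total)
    return series


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning a[:i-1] with b[:j].
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


if __name__ == "__main__":
    s1 = dna_to_series("AGGCT")
    s2 = dna_to_series("AGCT")
    print(dtw_distance(s1, s2))
```

Because DTW lets one series stretch against the other, sequences of different lengths can be compared without choosing a gap penalty, which is the motivation given in the abstract.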

2.
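Component analysis of noncoding regions of eukaryotic DNA   Total citations: 4 (self-citations: 0, citations by others: 4)
At the whole-genome level, four methods (histograms, chaos-representation grayscale images, distance dissimilarity, and information-entropy dissimilarity) were used to study the short-oligonucleotide composition and compositional complexity of three kinds of regions (introns, intergenic DNA, and exons) in Arabidopsis, nematode, and fruit fly. The results show that: (a) across genomes, regardless of the number of genes, the compositional complexity of exons obtained with the four methods is quite similar, whereas that of the noncoding regions is large. This shows quantitatively that the difference in complexity between species is reflected mainly in the noncoding regions rather than in the coding regions. (b) Within a genome, the short-oligonucleotide compositional complexity of introns is similar, and that of exons and intergenic DNA is also similar. (c) Introns and intergenic DNA differ greatly in transcription, splicing, and secondary structure, yet differ very little in short-oligonucleotide composition, which indicates that these differences are not constrained through short-oligonucleotide composition.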

3.
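A new method for pattern recognition of neural spike trains is proposed. The spike train is first represented by a staircase-shaped response function, and its first- and second-order formal derivatives and its formal integral are then defined. These three features have distinct geometric and physical meanings, so using them together characterizes the pattern of a spike train fairly completely. Reconstruction of spike trains also shows that these features capture the information contained in the train well. As an application, the quantification method is used to study the firing patterns produced by a cold/warm receptor model; the results show that it can distinguish firing patterns under different temperature conditions.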

4.
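To extract more of the information hidden in protein sequences, this study arranges the 20 amino acids evenly on the unit circle, which gives each amino acid a two-dimensional coordinate, and combines these coordinates with six physicochemical indices of the amino acids, so that a protein sequence is characterized by an eight-dimensional vector. To prevent differences in data range from affecting the analysis, the eight-dimensional vectors are normalized. Based on the normalized vector representation, a neural network is used to classify protein sequences, and the Euclidean distance between vectors quantifies the similarity between sequences. Finally, using the ND5 protein sequences of 9 species and the ND6 protein sequences of 8 species as examples, with the Clustal W sequence alignment method as the benchmark, the proposed method is validated and compared with the 5-letter method; the results show that the proposed method is effective.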
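
The geometric part of this representation can be sketched as follows. Several details are assumptions, since the abstract does not give them: the ordering of amino acids around the circle (alphabetical by one-letter code here), the way per-residue coordinates are combined into one vector per sequence (a simple mean here), and the six physicochemical indices, which are omitted entirely. Only the unit-circle placement and the Euclidean distance come from the abstract.

```python
import math

# Assumed ordering: alphabetical one-letter codes (not stated in the abstract).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def circle_coords():
    """Place the 20 amino acids at equal angular spacing on the unit circle."""
    n = len(AMINO_ACIDS)
    return {
        aa: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k, aa in enumerate(AMINO_ACIDS)
    }


def sequence_vector(seq):
    """Summarize a protein sequence as the mean of its residues' coordinates
    (hypothetical aggregation, chosen only for illustration)."""
    coords = circle_coords()
    xs = [coords[aa][0] for aa in seq]
    ys = [coords[aa][1] for aa in seq]
    return (sum(xs) / len(seq), sum(ys) / len(seq))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


if __name__ == "__main__":
    v1 = sequence_vector("MKTAYIAK")
    v2 = sequence_vector("MKTAYLAK")
    print(euclidean(v1, v2))
```

In the full method the two circle coordinates would be extended with the six physicochemical values and normalized before distances are computed.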

5.
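Based on two-dimensional graphical representations of biological sequences, the Balaban index and the information distribution index are used to compare the similarity of biological sequences. The method is illustrated with DNA sequences from 9 species, including human, and with 6 proteins, including yar029w.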

6.
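Based on a 3D graphical representation of DNA sequences, each DNA sequence is characterized by a three-dimensional vector composed of the normalized leading eigenvalues of L/L matrices. Using this method, the similarity of 11 species is analyzed with the first exon of the β-globin gene.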

7.
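Through-path topological distance between topological trees   Total citations: 1 (self-citations: 1, citations by others: 0)
A new topological distance between phylogenetic trees is presented. Phylogenetic trees were constructed with three methods (NJ, MP, and UPGMA) from DNA sequence data for 14 mitochondrial genes (including combined ones) of 13 animal species, and the partition topological distance and the through-path topological distance proposed here were used to compare these 14 trees with one another and with the true tree. The results show that NJ was the most efficient at recovering the known tree, followed by MP, with UPGMA the lowest. The topological distance between the trees built from the 14 DNA sequences and the known tree generally decreased as DNA sequence length increased, but the correlation between the two did not reach significance. The partition topological distance reflects overall differences in tree topology, but its measurement precision is lower than that of the through-path topological distance.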

8.
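Protein structural class prediction is an important research area in bioinformatics and protein science. Based on the discrete pseudo amino acid model framework proposed by Chou, and starting from the protein sequence, a new pseudo amino acid composition is designed to represent protein sequence samples. The frequencies of amino acid combinations (10-D) in the sequence and hydrophobic amino acid patterns (6-D) are extracted as additional features and combined with the conventional amino acid composition (20-D) to form a 36-dimensional pseudo amino acid composition vector that represents the protein sequence. A genetic algorithm optimizes the weight coefficients of the additional features. The pseudo amino acid composition vector serves as the input data, and a fuzzy support vector machine serves as the prediction tool. Three commonly used benchmark datasets are used to evaluate the performance of the algorithm. Jackknife test results indicate that the method achieves high accuracy and has the potential to become a tool for predicting protein function.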

9.
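Current applications of information entropy in DNA sequence analysis   Total citations: 1 (self-citations: 0, citations by others: 1)
Zhan Qing (詹青), 《生物信息学》 (Bioinformatics), 2012, 10(1): 44-49
Information entropy theory is an important tool in bioinformatics research and is widely applied in DNA sequence analysis. This paper reviews in detail recent progress in applying information entropy to a range of DNA sequence analysis problems and discusses future research directions.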

10.
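Illegal trade in rare wild animals of the Qinling Mountains is now frequent, and wild animals are difficult to identify accurately from appearance alone. Mitochondrial DNA Cytb gene barcodes were constructed for 15 mammal species from the Qinling region of Shaanxi Province: takin, forest musk deer, goral, serow, golden snub-nosed monkey, black bear, Reeves's muntjac, tufted deer, masked palm civet, leopard cat, wild boar, giant panda, Siberian weasel, ferret-badger, and badger. The barcodes were then used for molecular species identification of 3 unidentifiable animal muscle samples seized by the Qinling forest police. Phylogenetic trees were constructed with the neighbor-joining method and the unweighted pair group method with arithmetic mean (UPGMA), and the genetic distances and sequence similarities between the Cytb sequences of the three samples and those of the 15 species were compared. Sample A1 clustered with goral, with a sequence similarity of 99.4% and a genetic distance of 0.006; sample A2 clustered with serow, with a sequence similarity of 98.8% and a genetic distance of 0.012; sample A3 clustered with black bear, with the highest sequence similarity, 100%, and a genetic distance of 0.000. The three samples were thus identified as goral, serow, and black bear. Gene barcodes facilitate species identification and wildlife protection.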
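
The reported pairs of values (99.4% and 0.006, 98.8% and 0.012, 100% and 0.000) are consistent with an uncorrected p-distance, that is, one minus the fraction of identical sites. The abstract does not state which distance model was used, so the sketch below is an assumption that happens to reproduce this relationship; the function names are illustrative.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected p-distance: fraction of aligned sites that differ.
    Both sequences must already be aligned to the same length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def similarity(seq_a, seq_b):
    """Sequence similarity as the fraction of identical aligned sites."""
    return 1.0 - p_distance(seq_a, seq_b)


if __name__ == "__main__":
    sample = "ATGACCAACATT"
    reference = "ATGACCAATATT"
    print(p_distance(sample, reference), similarity(sample, reference))
```

Under this reading, a sample is assigned to the reference species with the smallest distance, which matches how A1, A2, and A3 are assigned in the abstract.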

11.
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite the many methods available for sequence comparison, few retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two-dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and the graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information, and the Yau-Hausdorff distance is mathematically proved to be a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. Phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that the Yau-Hausdorff distance is a natural metric for DNA and protein sequences with a high level of stability. The approach can also be applied to similarity analysis of protein sequences by graphical representations, as well as to general two-dimensional shape matching.
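
For orientation, the ordinary Hausdorff distance between two fixed point sets can be sketched as below. This is not the Yau-Hausdorff method itself: that method additionally minimizes over all translations and rotations of one curve, which this brute-force sketch does not attempt. The point sets are assumed to be lists of 2-D coordinate tuples taken from the graphical curves.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )


if __name__ == "__main__":
    curve_1 = [(0, 0), (1, 1), (2, 0)]
    curve_2 = [(0, 0), (1, 2), (2, 0)]
    print(hausdorff(curve_1, curve_2))
```

The brute-force version costs O(|A|·|B|) distance evaluations for one fixed placement; `math.dist` requires Python 3.8 or later.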

12.
Apoptosis, or programmed cell death, plays an important role in the development of an organism. Obtaining information on the subcellular location of apoptosis proteins is very helpful for understanding the apoptosis mechanism. In this paper, based on the concept that the position distribution information of amino acids is closely related to the structure and function of proteins, we introduce the concept of distance frequency [Matsuda, S., Vert, J.P., Ueda, N., Toh, H., Akutsu, T., 2005. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 14, 2804-2813] and propose a novel way to calculate distance frequencies. To calculate the local features, each protein sequence is separated into p parts of equal length. We then use the novel representation of protein sequences and adopt a support vector machine to predict subcellular location. The overall prediction accuracy, evaluated by the jackknife test, is significantly improved.

13.
The Shannon information entropy of protein sequences.   Total citations: 6 (self-citations: 1, citations by others: 5)
A comprehensive database is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman gambler" is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As is the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.
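
The composition-based figure quoted above is the zeroth-order Shannon entropy, H = -Σ p_i log2 p_i over amino acid frequencies. The sketch below computes that quantity for a single sequence; it illustrates the formula only and is not the paper's k-tuplet, Zipf, or gambler estimator. For reference, a perfectly uniform 20-letter alphabet gives log2(20) ≈ 4.32 bits, so any nonuniform composition falls below that bound.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Zeroth-order Shannon entropy of a sequence, in bits per symbol."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)
    # H = -sum(p * log2(p)) over the observed symbol frequencies.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    uniform = "ACDEFGHIKLMNPQRSTVWY"  # each of the 20 amino acids once
    skewed = "AAAAAAAAGGGGLLLKKE"     # illustrative nonuniform composition
    print(shannon_entropy(uniform), shannon_entropy(skewed))
```

Estimators that condition on neighboring residues, like the k-tuplet analysis in the abstract, can only lower this value, which is why the reported 2.5 bits sits below the composition-only 4.18 bits.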

14.
Several classifications of protein spatial structures and their structural elements are known, which makes revealing the relation between these structural elements and sequence fragments a topical problem. The most important step in this direction would be to determine the levels and ranges of positional sensitivity between residues in protein sequences. In this work, the Shannon-Weaver informational entropy was used as a disorder criterion for solving this problem. This entropy was computed as a function of the distance between amino acid residues in different sets of non-homologous protein sequences. This function was shown to be similar for different sets of protein sequences. Analysis of informational entropy allows detection of a long-range positional correlation (≥30) between amino acid residues and of oscillations with periods of 3.6 and 2.9. These oscillation periods correspond to the periodicity of alpha- and 3(10)-helices.

15.
We consider a novel 2-D graphical representation of DNA sequences according to the chemical structures of bases, reflecting the distribution of bases with different chemical structures, preserving information on the sequential adjacency of bases, and allowing numerical characterization. The representation avoids the loss of information accompanying alternative 2-D representations in which the curve standing for DNA overlaps and intersects itself. Based on this representation, we present a numerical characterization approach using the leading eigenvalues of the matrices associated with the DNA sequences. The utility of the approach is illustrated on the coding sequences of the first exon of the human beta-globin gene.

16.
The usefulness of information-theoretic measures of the Shannon-Weaver type, when applied to molecular biological systems such as DNA or protein sequences, has been critically evaluated. It is shown that entropy can be re-expressed in dimensionless terms, thereby making it commensurate with information. Further, we have identified processes in which entropy S and information H change in opposite directions. These processes of opposing signs for ΔS and ΔH demonstrate that while the Second Law of Thermodynamics mandates that entropy always increases, it places no such restrictions on changes in information. Additionally, we have developed equations permitting information calculations, incorporating conditional occurrence probabilities, on DNA and protein sequences. When the results of such calculations are compared for sequences of various general types, there are no informational content patterns. We conclude that information-theoretic calculations of the present level of sophistication do not provide any useful insights into molecular biological sequences.

17.
The advent of completely sequenced genomes is leading to an unprecedented growth of sequence information, while adequate structure information is often lacking. Genetic algorithm simulations have been refined and applied as a helpful tool for this question. Modified strategies are tested first on simple lattice protein models. This includes consideration of entropy (protein-adjacent water shell) and improved search strategies (pioneer search +14%, systematic recombination +50% in search efficiency). Next, extension to grid-free simulations of proteins in full main chain representation is examined. Our protein main chain simulations are further refined by independent criteria, such as fitness per residue, to judge predicted structures obtained at the end of a simulation. Protein families and protein interactions predicted from the complete H. pylori genomic sequence demonstrate how the full main chain simulations are then applied to model new protein sequences and protein families apparent from genome analysis.

18.
19.
A large protein sequence database with over 31,000 sequences and 10 million residues has been analysed. The pair probabilities have been converted to entropies using Boltzmann’s law of statistical thermodynamics. A scoring weight corresponding to the “mixing entropy” of the amino acid pairs has been developed, from which the entropies of the protein sequences have been calculated. The entropy values of natural sequences are lower than those of their random counterparts of the same length and similar amino acid composition. Based on these results, it is proposed that natural sequences are a special set of polypeptides with the additional qualification of biological functionality, which can be quantified using the entropy concept as worked out in this paper.
