Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words

Authors:	P A Pevzner, M Yu Borodovsky, A A Mironov

Institution:	Institute for Genetics of Microorganisms, Moscow, USSR.

Abstract:	Mathematical models of the generation of genetic texts appeared at the same time as the first DNA sequencing. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence, and to identify "meaningful" words. The present paper deals with two problems: 1) The significance of deviations from the mean statistical characteristics in a genetic text. Anyone who has undertaken the statistical analysis of sequenced DNA is familiar with the question: which deviations from the expected frequencies of occurrence of particular words testify to the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in the text, with allowance for word overlaps, making it possible to assess the significance of deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
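
The abstract does not reproduce the variance formula, so the following is only a minimal sketch of the first problem under a stated assumption: the text is modeled as independent letters with fixed frequencies (a Bernoulli model), which may differ from the model used in the paper. All function names (`overlap_indicators`, `count_stats`, `z_score`) are hypothetical. Under this assumption, occurrences starting at least one word length apart are independent, so only self-overlaps of the word contribute covariance terms.

```python
from math import sqrt


def word_probability(word, freqs):
    """Probability of seeing `word` at a fixed position (independent letters assumed)."""
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    return p


def overlap_indicators(word):
    """eps[d] is True when the word can overlap a copy of itself shifted by d letters."""
    m = len(word)
    return {d: word[d:] == word[:m - d] for d in range(1, m)}


def count_stats(word, n, freqs):
    """Mean and variance of the number of (possibly overlapping) occurrences
    of `word` in a random text of length n, under the independent-letter assumption."""
    m = len(word)
    positions = n - m + 1          # number of possible start positions
    if positions <= 0:
        return 0.0, 0.0
    p = word_probability(word, freqs)
    mean = positions * p
    var = positions * p * (1.0 - p)
    eps = overlap_indicators(word)
    for d in range(1, m):
        pairs = positions - d      # pairs of start positions at distance d
        if pairs <= 0:
            break
        # Joint probability of occurrences at i and i + d: the word followed
        # by its last d letters, possible only if the word self-overlaps at shift d.
        joint = p * word_probability(word[m - d:], freqs) if eps[d] else 0.0
        var += 2.0 * pairs * (joint - p * p)
    return mean, var


def z_score(observed, word, n, freqs):
    """Standardized deviation of an observed count from its expectation."""
    mean, var = count_stats(word, n, freqs)
    if var <= 0.0:
        raise ValueError("variance is zero; z-score is undefined")
    return (observed - mean) / sqrt(var)
```

A self-overlapping word such as `AA` gets a larger variance than a non-overlapping word of the same length and probability such as `AC`, so the same raw deviation from the expected count is less significant for the former.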
