Similar literature
20 similar documents found.
1.
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from a fully ordered one. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.

2.
A new method to compare two (or several) symbol sequences is developed. The method is based on comparing the frequencies of small fragments of the compared sequences; it requires neither string editing nor other transformations of the compared objects. The comparison is carried out by calculating the specific entropy of a frequency dictionary against a special dictionary called the hybrid one; the latter is the statistical ancestor of the group of sequences under comparison. Some applications of the developed method in the fields of genetics and bioinformatics are discussed.

3.
Classification of 16S RNA sequences over their frequency dictionaries, both real and transformed, was studied. Two entities were considered structurally close to each other if their frequency dictionaries were close in the Euclidean metric. A transformation procedure for a frequency dictionary has been implemented that reveals the peculiarities of the information structure of a nucleotide sequence. A comparative study was carried out of two classifications, one developed over the real frequency dictionary and the other over the transformed frequency dictionary. A strong correlation is revealed between the classification and the taxonomy of the 16S RNA bearer. For the classes isolated, the information-valuable words were identified. These words are the main factors of the difference between the classes. The frequency dictionaries containing words of length 3 exhibit the best correlation between a class and a genus. A genus, as a rule, is included in a single class, and exceptions are sporadic. A hierarchical classification developed over the transformed frequency dictionaries separated one or two taxonomic groups at each stage of classification. The unexpectedly frequent or, on the contrary, unexpectedly rare occurrence of words (of length 3) in the entities under consideration makes the structural difference between the classes of nucleotide sequences.
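The closeness criterion described in this abstract can be illustrated with a minimal sketch. It assumes a DNA alphabet (`ACGT`), overlapping words read along a linear sequence, and words of length 3; the function names `frequency_dictionary` and `euclidean_distance` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt


def frequency_dictionary(seq, q=3):
    """Relative frequencies of all overlapping words of length q in seq."""
    seq = seq.upper()
    total = len(seq) - q + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + q] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(dict_a, dict_b, alphabet="ACGT", q=3):
    """Euclidean distance between two frequency dictionaries.

    The sum runs over all len(alphabet)**q possible words; words that
    contain symbols outside the alphabet (an assumption of this sketch)
    are ignored.
    """
    words = ("".join(p) for p in product(alphabet, repeat=q))
    return sqrt(sum((dict_a.get(w, 0.0) - dict_b.get(w, 0.0)) ** 2
                    for w in words))
```

Two sequences would then be treated as structurally close when `euclidean_distance(frequency_dictionary(a), frequency_dictionary(b))` is small; the transformed dictionaries mentioned in the abstract are not reproduced here.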

4.
The information capacity of nucleotide sequences measures the unexpectedness of a continuation of a given string of nucleotides, and thus has a sound relation to a variety of biological issues. A continuation is defined in a way that maximizes the entropy of the ensemble of such continuations. The capacity is defined as the mutual entropy of the real frequency dictionary of a sequence with respect to the one bearing the most expected continuations; it does not depend on the length of the strings contained in a dictionary. Various genomes exhibit a multi-minima pattern in the dependence of information capacity on string length, reflecting an order within a sequence. The strings with a significant deviation of the expected frequency from the real one are the words of increased information value. Such words exhibit a non-random distribution along a sequence, making it possible to retrieve the correlation between a structure and a function encoded within a sequence.
Alexander S. Shchepanovsky
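As a rough illustration of the idea in this abstract, the sketch below computes a relative entropy between the real dictionary of words of length q and a dictionary reconstructed from shorter words. Two points are assumptions of this sketch rather than statements from the abstract: the expected frequency of a word is taken as f(prefix) · f(suffix) / f(middle), a commonly used maximum-entropy extension, and the sequence is read as a ring so that the dictionaries of different word lengths are mutually consistent. The function names are hypothetical.

```python
from collections import Counter
from math import log


def ring_dictionary(seq, q):
    """Frequencies of words of length q, reading seq as a closed ring."""
    extended = seq + seq[:q - 1]          # wrap around the end
    total = len(seq)
    counts = Counter(extended[i:i + q] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def information_capacity(seq, q):
    """Relative entropy of the real q-word dictionary against the
    dictionary reconstructed from (q-1)-words (assumed formula)."""
    if q < 3:
        raise ValueError("this sketch needs q >= 3")
    if len(seq) < q:
        raise ValueError("sequence is shorter than the word length")
    f_q = ring_dictionary(seq, q)
    f_q1 = ring_dictionary(seq, q - 1)
    f_q2 = ring_dictionary(seq, q - 2)
    capacity = 0.0
    for word, freq in f_q.items():
        # expected frequency of the most probable continuation
        expected = f_q1[word[:-1]] * f_q1[word[1:]] / f_q2[word[1:-1]]
        capacity += freq * log(freq / expected)
    return capacity
```

Under these assumptions the value is zero when every word of length q is fully predicted by its shorter fragments, and grows as the real frequencies deviate from the expected ones.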

5.
A large protein sequence database with over 31,000 sequences and 10 million residues has been analysed. The pair probabilities have been converted to entropies using Boltzmann’s law of statistical thermodynamics. A scoring weight corresponding to the “mixing entropy” of the amino acid pairs has been developed, from which the entropies of the protein sequences have been calculated. The entropy values of natural sequences are lower than those of their random counterparts of the same length and similar amino acid composition. Based on the results, it is proposed that natural sequences are a special set of polypeptides with the additional qualification of biological functionality, which can be quantified using the entropy concept worked out in this paper.

6.
Recently, there has been growing interest in the sparse representation of signals over learned, overcomplete dictionaries. Instead of using fixed transforms such as wavelets and their variants, an alternative is to train a redundant dictionary from the image itself. This paper presents a novel de-speckling scheme for medical ultrasound and speckle-corrupted photographic images using sparse representations over a learned overcomplete dictionary. It is shown that the proposed algorithm can remove speckle effectively when an existing pre-processing stage is applied before an adaptive dictionary is learned for sparse representation. Extensive simulations are carried out to show the effectiveness of the proposed filter for the removal of speckle noise, both visually and quantitatively.

7.
Information-theoretical entropy as a measure of sequence variability. (Cited 11 times: 0 self-citations, 11 by others)
We propose the use of the information-theoretical entropy, $S = -\sum_{i=1}^{n} p_i \log_2 p_i$, as a measure of variability at a given position in a set of aligned sequences. $p_i$ stands for the fraction of times the $i$-th type appears at a position. For protein sequences the sum has up to 20 terms, for nucleotide sequences up to 4 terms, and for codon sequences up to 61 terms. We compare S and Vs, a related measure, in detail with Vk, the traditional measure of immunoglobulin sequence variability, both in the abstract and as applied to the immunoglobulins. We conclude that S has desirable mathematical properties that Vk lacks and has intuitive and statistical meanings that accord well with the notion of variability. We find that Vk and the S-based measures are highly correlated for the immunoglobulins. We show by analysis of sequence data and by means of a mathematical model that this correlation is due to a strong tendency for the frequency of occurrence of amino acid types at a given position to be log-linear. It is not known whether the immunoglobulins are typical or atypical of protein families in this regard, nor is the origin of the observed rank-frequency distribution obvious, although we discuss several possible etiologies.
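The entropy S described in this abstract can be computed per alignment column with a short sketch. It assumes the sequences are already aligned to equal length and treats every distinct character, including a gap symbol, as its own type; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed at one position."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def alignment_entropies(aligned):
    """Entropy at every position of a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*aligned)]
```

A fully conserved position gives 0 bits, and a nucleotide position where all four bases are equally frequent gives the maximum of 2 bits.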

8.
Huang SW, Hwang JK. Proteins, 2005, 59(4): 802-809
A complete protein sequence can usually determine a unique conformation; however, the situation is different for shorter subsequences: some of them are able to adopt unique conformations, independent of context, while others assume diverse conformations in different contexts. The conformations of subsequences are determined by the interplay between local and nonlocal interactions. A quantitative measure of such structural conservation or variability will be useful in the understanding of the sequence-structure relationship. In this report, we developed an approach using the support vector machine method to compute the conformational variability directly from sequences, which is referred to as the sequence structural entropy. As a practical application, we studied the relationship between sequence structural entropy and the hydrogen exchange for a set of well-studied proteins. We found that the slowest exchange cores usually comprise amino acids of the lowest sequence structural entropy. Our results indicate that structural conservation is closely related to the local structural stability. This relationship may have interesting implications in the protein folding processes, and may be useful in the study of the sequence-structure relationship.

9.
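A six-dimensional representation of protein sequences is presented. The representation has three different forms, which are used to construct the information entropy of distance matrices; the similarity between sequences is then compared through the Euclidean distance and the angle between the information-entropy vectors.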

10.
Humulus lupulus, commonly known as hops, is a member of the family Moraceae. Many projects are currently underway that are leading to the accumulation of voluminous genomic and expressed sequence tag (EST) sequences in public databases. The genetically characterized domains in these databases are limited because reliable molecular markers are not available. A large body of EST sequence data is available for hops. In the present study, simple sequence repeat (SSR) markers extracted from EST data are used as molecular markers for genetic characterization. In total, 25,495 EST sequences were examined and assembled to obtain full-length sequences. Mononucleotide SSR motifs showed the maximum frequency, i.e. 60.44% in contigs and 62.16% in singletons, whereas the minimum frequencies were observed for hexanucleotide SSRs in contigs (0.09%) and pentanucleotide SSRs in singletons (0.12%). Most trinucleotide motifs code for glutamic acid (GAA), while AT/TA was the most frequent repeat among dinucleotide SSRs. Flanking primer pairs were designed in silico for the SSR-containing sequences. Functional categorization of the SSR-containing sequences was done through gene ontology terms such as biological process, cellular component and molecular function.
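A minimal sketch of how perfect SSRs of unit length 1 to 6 could be located in an EST with regular expressions is shown here. The abstract does not name the tool or the thresholds used, so the minimum repeat counts in `MIN_REPEATS` and the function name `find_ssrs` are illustrative assumptions only.

```python
import re

# Hypothetical minimum number of repeats per unit length (not from the study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # a unit of unit_len bases followed by at least min_rep - 1 copies of itself
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT"
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)
```

Counting the hits by unit length over all contigs and singletons would give a frequency distribution of the kind reported in the abstract; primer design and gene ontology annotation are outside the scope of this sketch.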

11.
Swine genomic DNA segments containing repetitive sequences were isolated from a porcine genomic library using genomic DNA as a probe. Three fragments containing the repetitive sequences from two of the primary phage clones were subcloned for sequence analysis, which revealed six new PRE-1 repetitive families other than those reported earlier by Singer et al. (Nucleic Acids Research 15, 2780, 1987). The frequency of the repetitive sequences in the swine genome was estimated at 2 × 10^6 per diploid genome. Sequence analysis revealed similarities between these repetitive sequences and that of the arginine-tRNA gene.

12.
13.
A novel method for predicting the secondary structures of proteins from amino acid sequence is presented. Protein secondary structure seqlets, which are analogous to the words of a natural language, are extracted. These seqlets capture the relationship between amino acid sequence and the secondary structures of proteins and together form the protein secondary structure dictionary. More specifically, the dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice is used to represent the results of the word segmentation, and a maximum entropy model is used to calculate the probability of a seqlet being tagged as a certain secondary structure type. The method is Markovian in the seqlets, permitting efficient exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentations and their tags are computed as the results of protein secondary structure prediction. The method is applied to predict the secondary structures of proteins of four organisms and compared with the PHD method. The results show that the performance of this method is higher than that of PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining it with locally similar protein sequences obtained by BLAST can give better prediction. The method is also tested on the 50 CASP5 target proteins, with a Q3 accuracy of 78.9% and an SOV accuracy of 77.1%. A web server for protein secondary structure prediction has been constructed, which is available at http://www.insun.hit.edu.cn:81/demos/biology/index.html.

14.
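Mblast: a multiple sequence alignment program based on the Blast program (Cited 3 times: 0 self-citations, 3 by others)
Similarity analysis of nucleic acid and protein sequences is increasingly becoming the core of bioinformatics research. NCBI's Blast program is the most powerful tool for this kind of analysis. Although it provides a preliminary scheme for aligning multiple sequences together, the practical results are far from satisfactory. Based on a careful analysis of the output of the Blast program, and following the idea of "seeking common ground while preserving differences", we wrote a multiple sequence alignment program, Mblast. Compared with currently popular multiple sequence alignment programs, this program detects the homologous regions of sequences more easily.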

15.
Selection can have a significant effect on sequence evolution and this will be reflected in the information contained within the phylogenetic relationships between species. Selection will reduce the frequency of any deleterious nucleotides, and this can be used to test for the presence of selection. The frequencies of different nucleotides can be predicted theoretically and compared to observed values. If a sample of sequences has an unusually low frequency of a particular nucleotide then selection might be inferred to have acted upon these sequences. This conclusion can be true only if the sequences are not too closely related and if sufficient mutations have occurred during their evolution. Otherwise, the unusual pattern of nucleotides in the sequences may be caused by recent common ancestry. An algorithm is presented to obtain maximum-likelihood estimates of selection coefficients using the phylogenetic information contained within sequence data. A k-allele model is developed that uses the phylogeny to measure relative mutation rates and degrees of relatedness and to evaluate the likelihood in the presence of selection. The method is illustrated with examples from the NS2 genes of influenza viruses and the MHC genes of mice. It is shown that the maximum-likelihood estimates for mutation rates are very large for influenza viruses and that statistically significant selection acts to maintain a specific coding sequence. Overall, the MHC genes also have significant selection to preserve the coding sequence, but at the antigen recognition site, this selection is reversed to promote genetic variation. Maximum-likelihood estimates of these selection coefficients are provided.

16.
A new set of 148 apple microsatellite markers has been developed and mapped on the apple reference linkage map Fiesta x Discovery. One hundred and seventeen markers were developed from genomic libraries enriched with the repeats GA, GT, AAG, AAC and ATC; 31 were developed from EST sequences. Markers derived from sequences containing dinucleotide repeats were generally more polymorphic than those from sequences containing trinucleotide repeats. An additional eight SSRs from published apple, pear, and Sorbus torminalis SSRs, whose position on the apple genome was unknown, have also been mapped. Transfer of SSRs across Maloideae species was efficient, with 41% of the markers successfully transferred. For all 156 SSRs, the primer sequences, repeat type, map position, and quality of the amplification products are reported. Also presented are allele sizes, ranges, and number of SSRs found in a set of nine cultivars. All this information and that of the previous CH-SSR series can be searched at the apple SSR database, to which updates and comments can be added. A large number of apple ESTs containing SSR repeats are available and should be used for the development of new apple SSRs. The apple SSR database is also meant to become an international platform for coordinating this effort. The increased coverage of the apple genome with SSRs allowed the selection of a set of 86 reliable, highly polymorphic SSRs that are well scattered over the apple genome. These SSRs cover about 85% of the genome with an average distance of one marker per 15 cM.
E. Silfverberg-Dilworth and C. L. Matasci contributed equally to this work.

17.
In this study, we describe the first set of SNP markers for the South African abalone, Haliotis midae. A cDNA library was constructed from which ESTs were selected for the screening of SNPs. The observed frequency of SNPs in this species was estimated at one every 185 bp. When characterized in wild-caught abalone, the minor allele frequencies and F(ST) estimates for every SNP indicated that these markers may potentially be useful for population analysis, parentage assignment and linkage mapping in Haliotis midae. No linkage disequilibrium was observed between SNPs originating from different EST sequences. These SNPs, together with additional SNPs currently being developed, will provide a useful complementary set of markers to the currently available genetic markers in abalone.

18.
Summary: A high-frequency transformation system for the methylotrophic yeast Hansenula polymorpha has been developed. This system depends on complementation of isolated uracil auxotrophs by the URA3 gene of Saccharomyces cerevisiae. Maintenance of the uracil prototrophy is based on integration of plasmid YIp5 at random sites within the H. polymorpha genome and on autonomously replicating plasmids containing ARS1 of S. cerevisiae or related sequences cloned from the host DNA. The sequence of one autonomously replicating sequence (HARS1) from H. polymorpha has been determined, showing an AT-rich region of 9 bp with some similarity to the consensus sequence of known eukaryotic replication origins. Mitotic loss of autonomously replicating sequences is high; selection for stable uracil prototrophs yields multiple tandem arrangements of the transformed DNA with no detectable loss of the phenotype on non-selective medium. These features offer the possibility for extensive gene expression in H. polymorpha.

19.
Here we propose a weighted measure for the similarity analysis of DNA sequences. It is based on LZ complexity and the (0,1) characteristic sequences of DNA sequences. This weighted measure enables biologists to extract similarity information from biological sequences according to their requirements. For example, with this weighted measure one can obtain either the full similarity information or a similarity analysis from a given biological aspect. Moreover, the length of the DNA sequence is not a limitation. The application of the weighted measure to the similarity analysis of β-globin genes from nine species shows its flexibility.
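The two ingredients named in this abstract can be sketched as follows. The mapping of bases to (0,1) sequences by chemical class (purine, amino, weak hydrogen bond) is an assumption based on common usage, and `lz_complexity` counts phrases of the Lempel-Ziv (1976) exhaustive parsing; the weighting scheme itself is not specified in the abstract and is not reproduced. All names are hypothetical.

```python
# Assumed base classes: purines (A, G), amino bases (A, C), weak H-bond bases (A, T).
CLASSES = {"purine": "AG", "amino": "AC", "weak": "AT"}


def characteristic_sequences(dna):
    """Map a DNA string to three (0,1) strings, one per base classification."""
    dna = dna.upper()
    return {name: "".join("1" if base in members else "0" for base in dna)
            for name, members in CLASSES.items()}


def lz_complexity(s):
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # extend the phrase while it can still be copied from the text
        # that precedes its last symbol
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count
```

A similarity measure in the spirit of the abstract would combine the complexities of the three characteristic sequences of each gene with user-chosen weights; how they are combined is left open here because the source does not say.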

20.
Following the original idea of Maynard Smith on the evolution of the protein sequence space, a novel tool is developed that allows a "space walk" from one sequence to its likely evolutionary relative and further on. At a given threshold of identity between consecutive steps, walks of many steps are possible. The sequences at the ends of the walks may differ substantially from one another. In a sequence space of randomized (shuffled) sequences the walks are very short. The approach opens new perspectives for protein evolutionary studies and sequence annotation.
