首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Chaos game representation of gene structure.   总被引:21,自引:2,他引:19       下载免费PDF全文
This paper presents a new method for representing DNA sequences. It permits the representation and investigation of patterns in sequences, visually revealing previously unknown structures. Based on a technique from chaotic dynamics, the method produces a picture of a gene sequence which displays both local and global patterns. The pictures have a complex structure which varies depending on the sequence. The method is termed Chaos Game Representation (CGR). CGR raises a new set of questions about the structure of DNA sequences, and is a new tool for investigating gene structure.  相似文献   

2.
Summary Chaos game representation (CGR) is a novel holistic approach that provides a visual image of a DNA sequence quite different from the traditional linear arrangement of nucleotides. Although it is known that CGR patterns depict base composition and sequentiality, the biological significance of the specific features of each pattern is not understood. To systematically examine these features, we have examined the coding sequences of 7 human globin genes and 29 relatively conserved alcohol dehydrogenase (Adh) genes from phylogenetically divergent species. The CGRs of human globin cDNAs were similar to one another and to the entire human globin gene complex. Interestingly, human globin CGRs were also strikingly similar to human Adh CGRs. Adh CGRs were similar for genes of the same or closely related species but were different for relatively conserved Adh genes from distantly related species. Dinucleotide frequencies may account for the self-similar pattern that is characteristic of vertebrate CGRs and the genome-specific features of CGR patterns. Mutational frequencies of dinucleotides may vary among genome types. The special features of CG dinucleotides of vertebrates represent such an example. The CGR patterns examined thus far suggest that the evolution of a gene and its coding sequence should not be examined in isolation. Consideration should be given to genome-specific differential mutation rates for different dinucleotides or specific oligonucleotides. Offprint requests to: S. M. Singh  相似文献   

3.
Chaos Game Representation (CGR) can recognize patterns in the nucleotide sequences, obtained from databases, of a class of genes using the techniques of fractal structures and by considering DNA sequences as strings composed of four units, G, A, T and C. Such recognition of patterns relies only on visual identification and no mathematical characterization of CGR is known. The present report describes two algorithms that can predict the presence or absence of a stretch of nucleotides in any gene family. The first algorithm can be used to generate DNA sequences represented by any point in the CGR. The second algorithm can simulate known CGR patterns for different gene families by setting the probabilities of occurrence of different di- or trinucleotides by a trial and error process using some guidelines and approximate rules-of-thumb. The validity of the second algorithm has been tested by simulating sequences that can mimic the CGRs of vertebrate non-oncogenes, proto-oncogenes and oncogenes. These algorithms can provide a mathematical basis of the CGR patterns obtained using nucleotide sequences from databases.  相似文献   

4.
基于混沌游走方法的Rh血型系统中RHD基因的分析   总被引:3,自引:0,他引:3  
高雷  齐斌  朱平 《生命科学研究》2009,13(5):408-412
利用基于经典HP模型的蛋白质序列混沌游走方法(chaos game representation,CGR),给出了RHD基因的蛋白质序列CGR图,可视作蛋白质序列二级结构的一个特征图谱描述.对临床上的血型鉴别有一定的参考价值.另外.还根据由Jeffrey在1990年提出的描绘DNA序列的CGR方法,给出了RHD基因的DNA序列的CGR图.并且根据RHD基因DNA序列的CGR图算出了尺日D基因相应的马尔可夫两步转移概率矩阵,从概率矩阵表可以看出RHD基因对编码氨基酸的三联子的第3个碱基的使用偏好性.  相似文献   

5.
Analysis of genomic sequences by Chaos Game Representation   总被引:4,自引:0,他引:4  
MOTIVATION: Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find the coordinates for their position in a continuous space. This distribution of positions has two properties: it is unique, and the source sequence can be recovered from the coordinates such that distance between positions measures similarity between the corresponding sequences. The possibility of using the latter property to identify succession schemes have been entirely overlooked in previous studies which raises the possibility that CGR may be upgraded from a mere representation technique to a sequence modeling tool. RESULTS: The distribution of positions in the CGR plane were shown to be a generalization of Markov chain probability tables that accommodates non-integer orders. Therefore, Markov models are particular cases of CGR models rather than the reverse, as currently accepted. In addition, the CGR generalization has both practical (computational efficiency) and fundamental (scale independence) advantages. These results are illustrated by using Escherichia coli K-12 as a test data-set, in particular, the genes thrA, thrB and thrC of the threonine operon.  相似文献   

6.
Hai ming Ni  Da wei Qi  Hongbo Mu 《Genomics》2018,110(3):180-190
Converting DNA sequence to image by using chaos game representation (CGR) is an effective genome sequence pretreatment technology, which provides the basis for further analysis between the different genes. In this paper, we have constructed 10 mammal species, 48 hepatitis E virus (HEV), and 10 kinds of bacteria genetic CGR images, respectively, to calculate the mean structural similarity (MSSIM) coefficient between every two CGR images. From our analysis, the MSSIM coefficient of gene CGR images can accurately reflect the similarity degrees between different genomes. Hierarchical clustering analysis was used to calculate the class affiliation and construct a dendrogram. Large numbers of experiments showed that this method gives comparable results to the traditional Clustal X phylogenetic tree construction method, and is significantly faster in the clustering analysis process. Meanwhile MSSIM combined CGR method was also able to efficiently clustering of large genome sequences, which the traditional multiple sequence alignment methods (e.g. Clustal X, Clustal Omega, Clustal W, et al.) cannot classify.  相似文献   

7.
《Genomics》2020,112(2):1847-1852
A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.  相似文献   

8.
We describe a mammalian artificial mini-chromosome lacking human alphoid DNA and mouse minor and major satellite DNA repeats. This mini-chromosome, initially recovered in a mouse embryonic stem (ES) cell line (CGR8), is 2.6 Mb in size and consists of sequences derived from the human Y chromosome and mouse chromosomes 12 and 15. It is not stable in the CGR8 cells but replicates and segregates with high fidelity after transfer into chicken DT40 cells. Combined analysis by immunocytochemistry/fluorescence in situ hybridisation (FISH) on metaphase spreads detected an active neo-centromere on the mini-chromosome in these cells. Further analysis by immunocytochemistry/FISH on stretched chromatin allowed the localisation of the CENP-C protein to the DNA sequence derived from interval 5 of the human Y chromosome.  相似文献   

9.
刘娟  高洁 《生物信息学》2011,9(2):97-101
用时间序列模型来分析乙型、丙型这两种流感病毒,对乙流、丙流病毒DNA序列提供了一种新的时间序列模型,即CGR弧度序列。利用CGR坐标将乙流、丙流病毒DNA序列转换成CGR弧度序列,且引入长记忆ARFIMA模型去拟合这两类序列。发现随机找来的10条乙流序列,10条丙流序列都具有长相关性且拟合很好,并且还发现这两种病毒序列可以尝试用不同的ARFIMA模型ARFIMA(0,d,4)模型,ARFIMA(1,d,1)模型去识别。  相似文献   

10.
Comprehensive knowledge of thermophilic mechanisms about some organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C degree plays a major role for helping to design stable proteins. How to predict function-unknown proteins to be thermophilic is a long but not fairly resolved problem. Chaos game representation (CGR) can investigate hidden patterns in protein sequences, and also can visually reveal their previously unknown structures. In this paper, using the general form of pseudo amino acid composition to represent protein samples, we proposed a novel method for presenting protein sequence to a CGR picture using CGR algorithm. A 24-dimensional vector extracted from these CGR segments and the first two PCA features are used to classify thermophilic and mesophilic proteins by Support Vector Machine (SVM). Our method is evaluated by the jackknife test. For the 24-dimensional vector, the accuracy is 0.8792 and Matthews Correlation Coefficient (MCC) is 0.7587. The 26-dimensional vector by hybridizing with PCA components performs highly satisfaction, in which the accuracy achieves 0.9944 and MCC achieves 0.9888. The results show the effectiveness of the new hybrid method.  相似文献   

11.
Alignment free methods based on Chaos Game Representation (CGR), also known as sequence signature approaches, have proven of great interest for DNA sequence analysis. Indeed, they have been successfully applied for sequence comparison, phylogeny, detection of horizontal transfers or extraction of representative motifs in regulation sequences. Transposing such methods to proteins poses several fundamental questions related to representation space dimensionality. Several studies have tackled these points, but none has, so far, brought the application of CGRs to proteins to their fully expected potential. Yet, several studies have shown that techniques based on n-peptide frequencies can be relevant for proteins. Here, we investigate the effectiveness of a strategy based on the CGR approach using a fixed reverse encoding of amino acids into nucleic sequences. We first explore its relevance to protein classification into functional families. We then attempt to apply it to the prediction of protein structural classes. Our results suggest that the reverse encoding approach could be relevant in both cases. We show that it is able to classify functional families of proteins by extracting signatures close to the ProSite patterns. Applied to structural classification, the approach reaches scores of correct classification close to 84%, i.e. close to the scores of related methods in the field. Various optimizations of the approach are still possible, which open the door for future applications.  相似文献   

12.
A new method to determine entropic profiles in DNA sequences is presented. It is based on the chaos-game representation (CGR) of gene structure, a technique which produces a fractal-like picture of DNA sequences. First, the CGR image was divided into squares 4-m in size (m being the desired resolution), and the point density counted. Second, appropriate intervals were adjusted, and then a histogram of densities was prepared. Third, Shannon's formula was applied to the probability-distribution histogram, thus obtaining a new entropic estimate for DNA sequences, the histogram entropy , a measurement that goes with the level of constraints on the DNA sequence. Lastly, the entropic profile for the sequence was drawn, by considering the entropies at each resolution level, thus providing a way to summarize the complexity of large genomic regions or even entire genomes at different resolution levels. The application of the method to DNA sequences reveals that entropic profiles obtained in this way, as opposed to previously published ones, clearly discriminate between random and natural DNA sequences. Entropic profiles also show a different degree of variability within and between genomes. The results of these analyses are discussed in relation both to the genome compartmentalization in vertebrates and to the differential action of compositional and/or functional constraints on DNA sequences.  相似文献   

13.

Background  

Recently, Almeida and Vinga offered a new approach for the representation of arbitrary discrete sequences, referred to as Universal Sequence Maps (USM), and discussed its applicability to genomic sequence analysis. Their work generalizes and extends Chaos Game Representation (CGR) of DNA for arbitrary discrete sequences.  相似文献   

14.
Obtaining soluble proteins in sufficient concentrations is a major obstacle in various experimental studies. How to predict the propensity of targets in large-scale proteomics projects to be soluble is a significant but not fairly resolved scientific problem. Chaos game representation (CGR) can investigate the patterns hiding in protein sequences, and can visually reveal previously unknown structure. Fractal dimensions are good tools to measure sizes of complex, highly irregular geometric objects. In this paper, we convert each protein sequence into a high-dimensional vector by CGR algorithm and fractal dimension, and then predict protein solubility by these fractal features together with Chou's pseudo amino acid composition features and support vector machine (SVM). We extract and study six groups of features computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test. As the results of comparisons, the group of 445-dimensional vector gets the best results, the average accuracy is 0.8741 and average MCC is 0.7358. The resulting predictor is also compared with existing methods and shows significant improvement.  相似文献   

15.
基于CGR的DNA序列的时间序列模型(英文)   总被引:1,自引:0,他引:1  
高洁  蒋丽丽  徐振源 《生物信息学》2010,8(2):156-160,164
利用DNA序列的混沌游戏表示(chaos game representation,CGR),提出了将2维DNA图谱转化成相应的类谱格式的方法。该方法不仅提供了一个较好的视觉表示,而且可将DNA序列转化成一个时间序列。利用CGR坐标将DNA序列转化成CGR弧度序列,并引入长记忆ARFIMA(p,d,q)模型去拟合此类序列,发现此类序列中有显著的长相关性且拟合度很好。  相似文献   

16.
Similar to the chaos game representation (CGR) of DNA sequences proposed by Jeffrey (Nucleic Acid Res. 18 (1990) 2163), a new CGR of protein sequences based on the detailed HP model is proposed. Multifractal and correlation analyses of the measures based on the CGR of protein sequences from complete genomes are performed. The Dq spectra of all organisms studied are multifractal-like and sufficiently smooth for the Cq curves to be meaningful. The Cq curves of bacteria resemble a classical phase transition at a critical point. The correlation distance of the difference between the measure based on the CGR of protein sequences and its fractal background is also proposed to construct a more precise phylogenetic tree of bacteria.  相似文献   

17.
We used RNA fingerprinting of arbitrarily primed PCR to isolate genes upregulated during the yeast-hyphal transition in Candida albicans. The sequence and expression of one of these genes (CGR1, Candida growth regulation) are presented. Our results suggest that CGR1 expression is associated with a growth cessation of yeast cells, a prerequisite for germination in this organism.  相似文献   

18.
Chaos Game Representation (CGR) is a generalized scale-independent Markov transition table, which is useful for the visualization and comparative study of genomic signature, or for the study of characteristic sequence motifs. However, in order to fully utilize the scale-independent properties of CGR, it should be accessible through scale-independent user interface instead of static images. Here we describe a web server and Perl library for generating zoomable CGR images utilizing Google Maps API, which is also easily searchable for specific motifs. The web server is freely accessible at , and the Perl library as well as the source code is distributed with the G-language Genome Analysis Environment under GNU General Public License.  相似文献   

19.

Background  

In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition therein explored was based on the Rényi entropy of probability density estimation (pdf) using the Parzen's window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles using the new pdf estimations to refine the density estimation of motifs.  相似文献   

20.

Background

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2 -L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

Results

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

Conclusions

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号