Similar literature
Found 20 similar articles (search time: 31 ms)
1.
Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter lambda of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.

2.
Fast-sequencing throughput methods have increased the number of completely sequenced bacterial genomes to about 400 by December 2006, with the number increasing rapidly. These include several strains. In silico methods of comparative genomics are of use in categorizing and phylogenetically sorting these bacteria. Various word-based tools have been used for quantifying the similarities and differences between entire genomes. The simple di-nucleotide frequency comparison, codon specificity and k-mer repeat detection are among the well-known methods. In this paper, we show that the Mutual Information function, which is a measure of correlations and a concept from Information Theory, is very effective in determining the similarities and differences among genome sequences of various strains of bacteria such as the plant pathogen Xylella fastidiosa, marine Cyanobacteria Prochlorococcus marinus or animal and human pathogens such as species of Ehrlichia and Legionella. The short-range three-base periodicity, small sequence repeats and long-range correlations taken together constitute a genome signature that can be used as a technique for identifying new bacterial strains with the help of strains already catalogued in the database. There have been several applications of the Mutual Information function as a measure of correlations in genomics, but this is the first whole genome analysis done to detect strain similarities and differences.
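
As a rough illustration of the Mutual Information function named in the abstract above, the following Python sketch estimates the mutual information between bases separated by a distance k. The function name `mutual_information`, the base-2 logarithm and the plain frequency-count estimator are assumptions made here for illustration; the abstract does not say how the authors computed the quantity.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between symbols that are k positions apart.

    Hypothetical helper: a plain frequency-count estimate, not the authors' code.
    """
    n = len(seq) - k  # number of (left, right) pairs at distance k
    if k < 1 or n < 1:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    pairs = Counter(zip(seq[:n], seq[k:]))  # joint counts of (left, right) symbols
    left = Counter(seq[:n])                 # marginal counts at the left position
    right = Counter(seq[k:])                # marginal counts at the right position
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


# A sequence with exact period 4: knowing one base fixes the base 4 positions later.
print(mutual_information("ACGT" * 50, 4))
```

Plotting this value against k for a real genome would be one way to look for the three-base periodicity and longer-range correlations the abstract refers to.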

3.
Evolutionary branching has been suggested as a mechanism to explain ecological speciation processes. Recent studies indicate, however, that demographic stochasticity and environmental fluctuations may prevent branching through stochastic competitive exclusion. Here we extend previous theory in several ways: we use a more mechanistic ecological model, we incorporate environmental fluctuations in a more realistic way and we include environmental autocorrelation in the analysis. We present a single, comprehensible analytical result which summarizes most effects of environmental fluctuations on evolutionary branching driven by resource competition. Corroborating earlier findings, we show that branching may be delayed or impeded if the underlying resources have uncorrelated or negatively correlated responses to environmental fluctuations. There is also a strong impeding effect of positive environmental autocorrelation, which can be related to results from recent experiments on adaptive radiation in bacterial microcosms. In addition, we find that environmental fluctuations can lead to cycles of repeated branching and extinction.

5.
Analyses of genomic DNA sequences have shown in previous works that base pairs are correlated at large distances with scale-invariant statistical properties. We show in the present study that these correlations between nucleotides (letters) result in fact from long-range correlations (LRC) between sequence-dependent DNA structural elements (words) involved in the packaging of DNA in chromatin. Using the wavelet transform technique, we perform a comparative analysis of the DNA text and of the corresponding bending profiles generated with curvature tables based on nucleosome positioning data. This exploration through the optics of the so-called "wavelet transform microscope" reveals a characteristic scale of 100-200 bp that separates two regimes of different LRC. We focus here on the existence of LRC in the small-scale regime (below 200 bp). Analysis of genomes in the three kingdoms reveals that this regime is specifically associated with the presence of nucleosomes. Indeed, small-scale LRC are observed in eukaryotic genomes and to a lesser extent in archaeal genomes, in contrast with their absence in eubacterial genomes. Similarly, this regime is observed in eukaryotic but not in bacterial viral DNA genomes. There is one exception for genomes of Poxviruses, the only animal DNA viruses that do not replicate in the cell nucleus and do not present small-scale LRC. Furthermore, no small-scale LRC are detected in the genomes of all examined RNA viruses, with one exception in the case of retroviruses. Altogether, these results strongly suggest that small-scale LRC are a signature of the nucleosomal structure. Finally, we discuss possible interpretations of these small-scale LRC in terms of the mechanisms that govern the positioning, the stability and the dynamics of the nucleosomes along the DNA chain.
This paper is mainly devoted to a pedagogical presentation of the theoretical concepts and physical methods which are well suited to perform a statistical analysis of genomic sequences. We review the results obtained with the so-called wavelet-based multifractal analysis when investigating the DNA sequences of various organisms in the three kingdoms. Some of these results have been announced in B. Audit et al. [1, 2].

6.
7.
We have studied the presence of long-range correlations in the complete genomes of ten different dsDNA viruses and Saccharomyces cerevisiae (bakers' yeast) chromosome I. We have also studied the correlation between the distribution of the gene length and the domain of "1/f region" of their genomes. Linear regression analysis was done for the power-law region of these organisms and the slope values obtained were approximately -1, which signifies the existence of "1/f noise" in the low and medium (intermediate) frequency regions. This suggests the presence of long-range correlations in their genomes. The presence of 1/f noise in a given frequency interval indicates the existence of a fractal (self-similar) structure in the corresponding range of wavelengths. The results of our study suggest that genes have correlations within themselves, and the correlations appear to be related to the scaling exponent alpha.

8.
It has been established that the precise positioning of nucleosomes on genomic DNA can be achieved, at least for a minority of them, through sequence-dependent processes. However, to what extent DNA sequences play a role in the positioning of the major part of nucleosomes is still debated. The aim of the present study is to examine to what extent long-range correlations (LRC) are related to the presence of nucleosomes. Using the wavelet transform technique, we perform a comparative analysis of the DNA text and of the corresponding bending profiles generated with curvature tables based on nucleosome positioning data. The exploration of a number of eukaryotic and bacterial genomes through the optics of the so-called "wavelet transform microscope" reveals a characteristic scale of 100-200 bp that separates two regimes of different LRC. Here, we focus on the existence of LRC in the small-scale regime (10-200 bp) which are actually observed in eukaryotic genomes, in contrast to their absence in eubacterial genomes. Analysis of viral DNA genomes shows that, like their host's genomes, eukaryotic viruses present LRC but eubacterial viruses do not. There is one exception for genomes of poxviruses (Vaccinia and Melanoplus sanguinipes) which do not replicate in the cell nucleus and do not exhibit LRC. No small-scale LRC are detected in the genomes of all examined RNA viruses, with the exception of retroviruses. These results together with the observation of LRC between particular sequence motifs known to participate in the formation of nucleosomes (e.g. AA dinucleotides) strongly suggest that the 10-200 bp LRC are a signature of the sequence-dependence of nucleosome positioning. Finally, we discuss possible interpretations of these LRC in terms of the physical mechanisms that might govern the positioning and the dynamics of the nucleosomes along the DNA chain through cooperative processes.

9.
Kihara D, Skolnick J. Proteins, 2004, 55(2): 464-473
The genome scale threading of five complete microbial genomes is revisited using our state-of-the-art threading algorithm, PROSPECTOR_Q. Considering that structure assignment to an ORF could be useful for predicting biochemical function as well as for analyzing pathways, it is important to assess the current status of genome scale threading. The fraction of ORFs to which we could assign protein structures with a reasonably good confidence level for each genome sequence is over 72%, which is significantly higher than in earlier studies. Using the assigned structures, we have predicted the function of several ORFs through "single-function" template structures, obtained from an analysis of the relationship between protein fold and function. The fold distribution of the genomes and the effect of the number of homologous sequences on structure assignment are also discussed.

10.
The sequential organization of genomes, i.e. the relations between distant base pairs and regions within sequences, and its connection to the three-dimensional organization of genomes is still a largely unresolved problem. Long-range power-law correlations were found using correlation analysis on almost the entire observable scale of 132 completely sequenced chromosomes of 0.5 × 10^6 to 3.0 × 10^7 bp from Archaea, Bacteria, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and Homo sapiens. The local correlation coefficients show a species-specific multi-scaling behaviour: close to random correlations on the scale of a few base pairs, a first maximum from 40 to 3,400 bp (for Arabidopsis thaliana and Drosophila melanogaster divided in two submaxima), and often a region of one or more second maxima from 10^5 to 3 × 10^5 bp. Within this multi-scaling behaviour, an additional fine-structure is present and attributable to codon usage in all except the human sequences, where it is related to nucleosomal binding. Computer-generated random sequences assuming a block organization of genomes, the codon usage, and nucleosomal binding explain these results. Mutation by sequence reshuffling destroyed all correlations. Thus, the stability of correlations seems to be evolutionarily tightly controlled and connected to the spatial genome organization, especially on large scales. In summary, genomes show a complex sequential organization related closely to their three-dimensional organization. This article has been submitted as a contribution to the festschrift entitled “Uncovering cellular sub-structures by light microscopy” in honor of Professor Cremer’s 65th birthday.

11.
The detection and quantification of long-range correlations in time series is a fundamental tool to characterize the properties of different dynamical systems, and is applied in many different fields, including physics, biology and engineering. Due to the diversity of applications, many techniques for measuring correlations have been designed. Here, we study systematically the influence of the length of a time series on the results obtained from several techniques commonly used to detect and quantify long-range correlations: the autocorrelation analysis, Hurst's analysis, and detrended fluctuation analysis (DFA). Using the Fourier filtering method, we generate artificial time series with known and controlled long-range correlations and with a broad range of lengths, and apply to them the different correlation measures we have studied. Our results indicate that while the DFA method is practically unaffected by the length of the time series, and almost always provides accurate results, the results from Hurst's analysis and the autocorrelation analysis strongly depend on the length of the time series.
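
To make the detrended fluctuation analysis (DFA) named in the abstract above more concrete, here is a minimal Python sketch. It assumes first-order (linear) detrending in non-overlapping windows, and the function names `dfa_fluctuation` and `dfa_exponent` are hypothetical; the abstract does not state which DFA variant or window sizes the authors used. The test signal is plain uncorrelated Gaussian noise rather than a series produced by the Fourier filtering method.

```python
import math
import random


def dfa_fluctuation(series, window):
    """Return the DFA fluctuation F(window), assuming linear detrending."""
    if window < 3 or len(series) < window:
        raise ValueError("need 3 <= window <= len(series)")
    mean = sum(series) / len(series)
    # Step 1: integrate the mean-subtracted series into a profile.
    profile, total = [], 0.0
    for x in series:
        total += x - mean
        profile.append(total)
    # Step 2: in each non-overlapping window, subtract the least-squares line.
    n_windows = len(profile) // window
    t_mean = (window - 1) / 2
    t_var = sum((i - t_mean) ** 2 for i in range(window))
    squared = 0.0
    for w in range(n_windows):
        seg = profile[w * window:(w + 1) * window]
        y_mean = sum(seg) / window
        slope = sum((i - t_mean) * (y - y_mean) for i, y in enumerate(seg)) / t_var
        squared += sum((y - y_mean - slope * (i - t_mean)) ** 2
                       for i, y in enumerate(seg))
    # Step 3: root-mean-square residual over all windows.
    return math.sqrt(squared / (n_windows * window))


def dfa_exponent(series, windows):
    """Estimate the scaling exponent as the slope of log F(n) against log n."""
    xs = [math.log(n) for n in windows]
    ys = [math.log(dfa_fluctuation(series, n)) for n in windows]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


random.seed(0)
white_noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(white_noise, [8, 16, 32, 64, 128])
print(f"estimated exponent = {alpha:.2f}")  # uncorrelated noise should land near 0.5
```

Repeating the estimate on series of different lengths would be one simple way to probe the length dependence that the abstract investigates.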

12.
All amino acid sequences derived from 248 prokaryotic genomes, 10 invertebrate genomes (plants and fungi) and 10 vertebrate genomes were analysed by the autocorrelation function of charge sequences. The analysis of the total amino acid sequences derived from the 268 biological genomes showed that a significant periodicity of 28 residues is observable for the vertebrate genomes, but not for the other genomes. When proteins with a charge periodicity of 28 residues (PCP28) were selected from the total proteomes, we found that PCP28 in fact exists in all proteomes, but the number of PCP28 is much larger for the vertebrate proteomes than for the other proteomes. Although excess PCP28 in the vertebrate proteomes are only poorly characterized, a detailed inspection of the databases suggests that most excess PCP28 are nuclear proteins.

13.
Immunogenicity arises via many synergistic mechanisms, yet the overall dissimilarity of pathogenic proteins versus the host proteome has been proposed as a key arbiter. We have previously explored this concept in relation to bacterial antigens; here we extend our analysis to antigens of viral and fungal origin. Sets of known viral and fungal antigenic and non-antigenic protein sequences were compared to human and mouse proteomes. Both antigenic and non-antigenic sequences lacked human or mouse homologues. Observed distributions were compared using the non-parametric Mann-Whitney test. The statistical null hypothesis was accepted, indicating that antigens and non-antigens did not differ significantly. Likewise, we could not determine a threshold able to separate non-antigen from antigen meaningfully. We conclude that viral and fungal antigens cannot be predicted from pathogen genomes based solely on their dissimilarity to mammalian genomes.

14.
The genome of human immunodeficiency virus (HIV) has an average nucleotide composition strongly biased as compared to the human genome. The consequence of such nucleotide composition on HIV pathogenicity has not been investigated yet. To address this question, we analyzed the role of nucleotide bias of HIV-derived nucleic acids in stimulating type-I interferon response in vitro. We found that the biased nucleotide composition of HIV is detected in human cells as compared to humanized sequences, and triggers a strong innate immune response, suggesting the existence of cellular immune mechanisms able to discriminate RNA sequences according to their nucleotide composition or to detect specific secondary structures or linear motifs within biased RNA sequences. We then extended our analysis to the entire genome scale by testing more than 1300 HIV-1 complete genomes to look for an association between nucleotide composition of HIV-1 group M subtypes and their pathogenicity. We found that subtype D, which has an increased pathogenicity compared to the other subtypes, has the most divergent nucleotide composition relative to the human genome. These data support the hypothesis that the biased nucleotide composition of HIV-1 may be related to its pathogenicity.

15.
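Bioinformatics analysis of the 2019 novel coronavirus genome   Total citations: 1 (self-citations: 1; citations by others: 0)
In December 2019, pneumonia caused by a coronavirus was reported in Wuhan, China. Its clinical symptoms differed from those of the Severe Acute Respiratory Syndrome (SARS) outbreak of 2003, so the virus was inferred to be a possible new coronavirus variant. Unlike other studies that simply use whole-genome sequences, in 2018 we proposed, for the first time internationally, the idea of combining molecular function with evolutionary analysis, and applied it to the study of the genomes of Betacoronavirus subgroup B (BB coronaviruses). Guided by this idea, the present study used a complemented palindrome in BB coronavirus genomes (named the Nankai complemented palindrome) and the coding region containing it (named the Nankai CDS) to analyze the newly released 2019 novel coronavirus genome (GenBank: MN908947), with the aim of tracing its origin accurately, and carried out a preliminary study of the cross-species transmission and host adaptation of BB coronaviruses. The results of the origin-tracing analysis support the view that the 2019 novel coronavirus originated from bats but differs greatly from the SARS coronavirus, which is consistent with the difference in clinical symptoms between the two. The most important finding of this study is that BB coronaviruses have a large number of alternative translations, which reveals at the molecular level the rapid mutation and high diversity of BB coronaviruses. Information obtained from the alternative translations of BB coronaviruses can be applied to (but is not limited to) rapid detection, genotyping, vaccine development and drug design. In addition, we infer that BB coronaviruses may adapt to different hosts through alternative translation. Based on an empirical analysis of a large amount of genomic data, this study attempts, for the first time internationally, to explain at the molecular level why BB coronaviruses mutate rapidly, have many hosts and show strong host adaptability.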

16.
Many studies have demonstrated the presence of scale invariance and long-range correlation in animal and human neuronal spike trains. The methodologies to extract the fractal or scale-invariant properties, however, do not address the issue as to the existence within the train of fine temporal structures embedded in the global fractal organisation. The present study addresses this question in human spike trains by the chaos game representation (CGR) approach, a graphical analysis with which specific temporal sequences reveal themselves as geometric structures in the graphical representation. The neuronal spike train data were obtained from patients whilst undergoing pallidotomy. Using this approach, we observed highly structured regions in the representation, indicating the presence of specific preferred sequences of interspike intervals within the train. Furthermore, we observed that for a given spike train, the higher the magnitude of its scaling exponent, the more pronounced the geometric patterns in the representation and, hence, the higher the probability of occurrence of specific subsequences. Given its ability to detect and specify in detail the preferred sequences of interspike intervals, we believe that CGR is a useful adjunct to the existing set of methodologies for spike train analysis.

17.
Mapping nucleotide sequences onto a "DNA walk" produces a novel representation of DNA that can then be studied quantitatively using techniques derived from fractal landscape analysis. We used this method to analyze 11 complete genomic and cDNA myosin heavy chain (MHC) sequences belonging to 8 different species. Our analysis suggests an increase in fractal complexity for MHC genes with evolution, with vertebrate > invertebrate > yeast. The increase in complexity is measured by the presence of long-range power-law correlations, which are quantified by the scaling exponent alpha. We develop a simple iterative model, based on known properties of polymeric sequences, that generates long-range nucleotide correlations from an initially noncorrelated coding region. Both this new model and the DNA walk analysis support the intron-late theory of gene evolution.
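
The "DNA walk" mapping mentioned in the abstract above can be sketched in a few lines of Python. The step rule used here (pyrimidines C/T step up by one, purines A/G step down by one) is an assumption, since the abstract does not state which rule the authors applied, and the function name `dna_walk` is hypothetical.

```python
from itertools import accumulate

PYRIMIDINES = frozenset("CT")
PURINES = frozenset("AG")


def dna_walk(seq):
    """Return the cumulative displacement of a one-dimensional DNA walk.

    Assumed rule: +1 for a pyrimidine (C, T), -1 for a purine (A, G).
    """
    steps = []
    for base in seq.upper():
        if base in PYRIMIDINES:
            steps.append(1)
        elif base in PURINES:
            steps.append(-1)
        else:
            raise ValueError(f"unexpected symbol in sequence: {base!r}")
    # Running sum of the steps gives the walker's position after each base.
    return list(accumulate(steps))


print(dna_walk("ACGT"))  # [-1, 0, -1, 0]
```

The fluctuations of such a walk over windows of increasing length are what a scaling exponent such as alpha would then summarize.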

18.
Costantini M, Bernardi G. Gene, 2008, 410(2): 241-248
Many years ago compositional correlations were found to hold between coding and contiguous non-coding sequences. These correlations were essentially studied in whole genomes of mammals, which are characterized by strong compositional heterogeneities. Here we investigated whether these correlations also hold within the much more homogeneous isochore families. This point was checked not only in the case of mammals, but also in that of phylogenetically distant vertebrates, which are characterized by very different compositional patterns. Indeed, these are remarkably different in cold- and warm-blooded vertebrates. Fish genomes, for instance, are much more homogeneous than those of mammals and birds. The compositional correlations between coding sequences and the corresponding introns, or their 5′ and 3′ flanking regions, were studied in the isochore families of the fully sequenced genomes from four fishes (Brachydanio rerio, Oryzias latipes, Gasterosteus aculeatus and Tetraodon nigroviridis), human and chicken.

19.

Background

The periodic occurrence of dinucleotides with a period of 10.4 bases is now undeniably a hallmark of nucleosome positioning. Whereas many eukaryotic genomes contain visible and even strong signals for periodic distribution of dinucleotides, the human genome is rather featureless in this respect. The exact sequence features in the human genome that govern nucleosome positioning remain largely unknown.

Results

When analyzing the human genome sequence with the positional autocorrelation method, we found that only the dinucleotide CG shows the 10.4 base periodicity, which is indicative of the presence of nucleosomes. There is a high occurrence of CG dinucleotides that are either 31 (10.4 × 3) or 62 (10.4 × 6) base pairs apart from one another - a sequence bias known to be characteristic of Alu-sequences. In a similar analysis with repetitive sequences removed, peaks of repeating CG motifs can be seen at positions 10, 21 and 31, the nearest integers of multiples of 10.4.

Conclusions

Although the CG dinucleotides are dominant, other elements of the standard nucleosome positioning pattern are present in the human genome as well. The positional autocorrelation analysis of the human genome demonstrates that the CG dinucleotide is, indeed, one visible element of the human nucleosome positioning pattern, which appears both in Alu sequences and in sequences without repeats. The dominant role that CG dinucleotides play in organizing human chromatin indicates the involvement of human nucleosomes in tuning the regulation of gene expression and chromatin structure, which is very likely due to cytosine-methylation/-demethylation in CG dinucleotides contained in the human nucleosomes. This is further confirmed by the positions of CG-periodical nucleosomes on Alu sequences. Alu repeats appear as monomers, dimers and trimers, harboring two to six nucleosomes in a run. Considering the exceptional role CG dinucleotides play in nucleosome positioning, we hypothesize that Alu-nucleosomes, especially those that form tightly positioned runs, could serve as "anchors" in organizing the chromatin in human cells.
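
A positional autocorrelation of the kind described in the Results above can be approximated by counting how often two occurrences of a dinucleotide lie a given distance apart. The Python sketch below does this with raw, unnormalized counts; the function name `dinucleotide_distance_counts`, the default cutoff of 70 bases and the lack of normalization are assumptions for illustration, not the authors' procedure.

```python
from collections import Counter


def dinucleotide_distance_counts(seq, dinucleotide="CG", max_distance=70):
    """Count pairs of `dinucleotide` occurrences by the distance between them.

    Hypothetical helper: raw pair counts, no normalization or repeat masking.
    """
    seq = seq.upper()
    # Start positions of every occurrence of the dinucleotide, in ascending order.
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]
    counts = Counter()
    for idx, start in enumerate(positions):
        for other in positions[idx + 1:]:
            distance = other - start
            if distance > max_distance:
                break  # later positions are even farther away
            counts[distance] += 1
    return counts


# Toy sequence with CG placed every 10 bases (start positions 0, 10 and 20).
toy = "CG" + "A" * 8 + "CG" + "A" * 8 + "CG"
counts = dinucleotide_distance_counts(toy)
print(counts[10], counts[20])  # 2 1
```

On genomic sequence, peaks in such counts near multiples of 10.4 would correspond to the periodicity discussed in the abstract.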

20.