Similar literature
 20 similar documents found (search time: 15 ms)
1.
Storage of sequence data is a major concern because the amount of data generated at many locations is growing exponentially. There is therefore a need for techniques that store data using compression algorithms. Here we describe an optimal storage algorithm (OPTSDNA) for storing large numbers of DNA sequences of varying length. This paper provides a performance analysis of OPTSDNA in a distributed bioinformatics computing system for the analysis of DNA sequences. The OPTSDNA algorithm is used to store DNA sequences of various sizes in a database; the input sequences range from very small to very large. The algorithm calculates the storage size, and response time is also measured in this work. The efficiency and performance of the algorithm are high (in terms of storage size, expressed as a percentage) when compared with other known sequential approaches.

2.
Repseek, a tool to retrieve approximate repeats from large DNA sequences   Cited by: 2 (self-citations: 0; citations by others: 2)
Chromosomes or other long DNA sequences contain many highly similar repeated sub-sequences. While there are efficient methods for detecting strict repeats or detecting already characterized repeats, there is no software available for detecting approximate repeats in large DNA sequences allowing for weighted substitutions and indels in a coherent statistical framework. Here, we present an implementation of a two-step method (seed detection followed by seed extension) that detects those approximate repeats. Our method is computationally efficient enough to handle large sequences and is flexible enough to account for influencing factors, such as sequence-composition biases at both the seed detection and alignment levels. AVAILABILITY: http://wwwabi.snv.jussieu.fr/public/RepSeek/
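The two-step idea in this abstract (seed detection followed by extension) can be sketched as follows. This is only a minimal illustration under simplifying assumptions: seeds are exact repeated k-mers, extension is ungapped and rightward only, and the scoring values and function names (`find_seeds`, `extend_right`) are hypothetical. It is not RepSeek's actual algorithm, which also handles indels and statistical weighting.

```python
from collections import defaultdict

def find_seeds(seq, k):
    """Return sorted pairs (i, j), i < j, whose k-mers are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    pairs = []
    for pos in positions.values():
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                pairs.append((pos[a], pos[b]))
    return sorted(pairs)

def extend_right(seq, i, j, k, match=1, mismatch=-2, x_drop=3):
    """Ungapped right extension of the seed at (i, j).

    Extension stops at the end of the sequence or when the running
    score falls more than x_drop below the best score seen so far.
    Returns (best_length, best_score).
    """
    score = best = k * match
    length = best_len = k
    while j + length < len(seq):
        score += match if seq[i + length] == seq[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
    return best_len, best

seq = "ACGTTGCACCCCACGTAGCA"
seeds = find_seeds(seq, 4)
repeats = [(i, j) + extend_right(seq, i, j, 4) for i, j in seeds]
```

Here the single seed `ACGT` (positions 0 and 12) is extended across one mismatch into an approximate repeat of length 8.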

3.
One of the critical requirements of data analysis involving large DNA sequences is an effective statistical summarization of those sequences. In this article DNA sequences have been analyzed based on word frequencies. Our analysis focuses on the detection of the structural signature of a genome reflected in word frequencies and the identification of phylogenetic relationships among different species reflected in the variation of word distributions in their DNA sequences. We have carried out a statistical study of the complete genome of baker's yeast, of various ribosomal RNA sequences from different prokaryotic and eukaryotic organisms and of the full genomes of some bacteriophages. Our exploratory analysis amply demonstrates the usefulness of DNA word frequencies in reducing the dimensionality of large sequences while retaining some of the structural information that can have biological significance. Some conceptual issues that arose in the course of our investigation have been addressed. A few interesting problems related to the statistics of DNA words have been pointed out, with some indication of their possible solutions. The work has been partially motivated by the fact that sequence alignment and homology techniques, which are quite popular for comparing and analyzing relatively small DNA sequences of nearly equal sizes, are not applicable to data consisting of large sequences with widely varying sizes, which may contain segments with unknown or no biological functions; consequently, their comparison through functional homology is either impossible or extremely difficult. Received: 15 October 2000; revised version: 8 October 2002; published online: 28 February 2003. Current address: CF186, Salt Lake, Calcutta 700064, India. Research presented here was supported in part by a grant from the Indian Statistical Institute.
Key words or phrases: Average linkage clustering – Chernoff's faces – Dendrograms – DNA words – F-ranks of words – F-ratios of words – l1-distance – Phylogenetic relationships – Rank correlation – Single linkage clustering
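As a rough illustration of the word-frequency summarization described in this abstract, the sketch below counts overlapping k-letter DNA words and compares two sequences by the l1-distance between their frequency vectors. The function names and the restriction to uppercase A/C/G/T are assumptions made for illustration; this is not the authors' procedure.

```python
from collections import Counter
from itertools import product

def word_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA words of length k in seq.

    Overlapping windows are counted; windows containing letters other
    than A, C, G, T are ignored.
    """
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid word of length k")
    return {w: counts[w] / total for w in words}

def l1_distance(f, g):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(f[w] - g[w]) for w in f)

f = word_frequencies("ACACACGT", 2)
g = word_frequencies("TTTTACGT", 2)
d = l1_distance(f, g)
```

A sequence of any length is reduced to a vector of 4^k numbers, which is what makes sequences of widely different sizes comparable without alignment.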

4.
Precision genetic engineering based on stable chromosomal insertion of exogenous DNA in the genomes of large mammals is immensely important for the development of improved biomedical models, pharmaceutical research and accelerated breeding progress. Precision genetic engineering requires (i) a known locus of genomic integration, (ii) a defined status of foreign DNA, (iii) that transgene expression is unaffected by neighbouring chromosomal sequences, (iv) that endogenous genes are not mutated and (v) that no unwanted DNA sequences are present. Recently, advanced molecular techniques exploiting exogenous enzymes have opened up possibilities for more sophisticated genetic engineering. Here, we critically review current developments of enzyme-catalysed approaches for targeted transgenesis in large mammals.

5.
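A DNA barcode is a short, standardized DNA sequence, and DNA barcoding identifies species effectively by analyzing DNA barcode sequences. As large numbers of DNA barcode sequences have been determined, DNA barcode analysis methods have developed rapidly, promoting their application in the molecular identification of organisms. Since 2003, DNA barcoding has been widely applied to the identification of animals, plants, fungi and other species, and has strongly advanced taxonomy, biodiversity research, ecology and related disciplines. Building on a review of DNA barcoding technology, this paper summarizes five main classes of DNA barcode analysis methods, namely analyses based on genetic distance, genetic similarity, phylogenetic trees, sequence characters, and statistical classification, and further discusses prospects for the development and application of DNA barcoding.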
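The first class of methods listed in this abstract, analysis based on genetic distance, can be illustrated with a minimal sketch that assigns a query barcode to the reference with the smallest uncorrected p-distance. The species labels and sequences below are hypothetical toy data, the sequences are assumed to be pre-aligned and of equal length, and real barcode pipelines typically use longer markers and corrected distances.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length, non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def identify(query, references):
    """Return (name, distance) of the reference closest to the query."""
    name = min(references, key=lambda n: p_distance(query, references[n]))
    return name, p_distance(query, references[name])

# Hypothetical reference library of aligned barcode fragments.
references = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGTTC",
    "species_C": "TTGTACGAAC",
}
best = identify("ACGTTCGTAC", references)
```

In practice a distance threshold is usually applied as well, so that a query far from every reference is reported as unidentified rather than forced onto the nearest species.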

6.
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast to the statistical models underlying other methods, in which the extensive data repetitions in DNA sequences are explored and which therefore have a non-local character.
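The per-position measure described in this abstract (the negative logarithm of the estimated probability, averaged into bits per base) can be sketched for a single adaptive order-k model. This is a simplified illustration, not the authors' system: it uses one model rather than several competing ones, ignores inverted repeats, and the additive smoothing parameter `alpha` and the function name are assumptions.

```python
import math
from collections import defaultdict

def bits_per_base(seq, k, alpha=1.0):
    """Adaptive order-k finite-context model over the alphabet ACGT.

    For each position the model estimates P(symbol | previous k symbols)
    from counts seen so far (with additive smoothing alpha), records
    -log2 of that estimate, then updates the counts.
    Returns (average bits per base, per-position information profile).
    """
    if not seq:
        raise ValueError("empty sequence")
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    profile = []
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]       # shorter context near the start
        c = counts[ctx]
        p = (c[sym] + alpha) / (sum(c.values()) + 4 * alpha)
        profile.append(-math.log2(p))
        c[sym] += 1                      # learn only after predicting
    return sum(profile) / len(profile), profile

avg_repetitive, _ = bits_per_base("ACGT" * 50, 3)
avg_first, profile = bits_per_base("ACGT", 0)
```

With no prior counts every symbol costs exactly 2 bits, while a highly repetitive input is quickly learned and its average cost drops well below 2 bits per base.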

7.
Advances in bioinformatics research on transcription factor binding sites   Cited by: 7 (self-citations: 0; citations by others: 7)
侯琳, 钱敏平, 朱云平, 邓明华. 《遗传》 2009, 31(4): 365-373
By using genome in situ hybridization (GISH) on root somatic chromosomes of the allotetraploid derived from the cross Gossypium arboreum × G. bickii, with genomic DNA (gDNA) of G. bickii as a probe, two sets of chromosomes, consisting of 26 chromosomes each, were easily distinguished from each other by their distinctive hybridization signals. GISH analysis directly proved that the hybrid G. arboreum × G. bickii is an allotetraploid amphiploid. The karyotype formula of the species was 2n = 4x = 52 = 46m (4sat) + 6sm (4sat). We identified four pairs of satellites, with two pairs in each sub-genome. FISH analysis using 45S rDNA as a probe showed that the cross G. arboreum × G. bickii contained 14 NORs. At least five pairs of chromosomes in the G sub-genome showed double hybridization (red and blue) in their long arms, which indicates that chromatin introgression from the A sub-genome had occurred.

8.
9.
If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts, which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon.
Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
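The baseline figures in this abstract (two bits per nucleotide for a uniformly random string, slightly less when only single-nucleotide frequencies are used) correspond to the zero-order Shannon entropy H = -Σ p(x) log2 p(x). The sketch below computes that quantity; the function name is an assumption, and it illustrates only the frequency-based baseline, not the authors' inexact-match model.

```python
import math
from collections import Counter

def zero_order_entropy(seq):
    """Shannon entropy (bits per symbol) of single-symbol frequencies."""
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

h_uniform = zero_order_entropy("ACGT" * 25)        # equal frequencies
h_skewed = zero_order_entropy("A" * 70 + "C" * 10 + "G" * 10 + "T" * 10)
```

Equal nucleotide frequencies give the 2-bit maximum; any compositional skew lowers the estimate, which is why a frequency-only code can do slightly better than 2 bits on natural DNA.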

10.
Experimental evidence suggests that DNA mechanical properties, in particular intrinsic curvature and flexibility, have a role in many relevant biological processes. Systematic investigations into the origin of DNA curvature and flexibility have been carried out; however, most of the applied experimental techniques need simplifying models to interpret the data, which can affect the results. Progress in the direct visualization of macromolecules allows the analysis of morphological properties and structural changes of DNA directly from the digitised micrographs of single molecules. In addition, the statistical analysis of a large number of molecules gives information on both the local intrinsic curvature and the flexibility of DNA tracts at the nanometric scale in relatively long sequences. However, the classical worm-like chain (WLC) model, which describes conformations of intrinsically straight homogeneous polymers, must be extended to DNA. This review describes the various methodologies proposed by different authors.

11.
An evolutionary model for maximum likelihood alignment of DNA sequences   Cited by: 16 (self-citations: 0; citations by others: 16)
Summary: Most algorithms for the alignment of biological sequences are not derived from an evolutionary model. Consequently, these alignment algorithms lack a strong statistical basis. A maximum likelihood method for the alignment of two DNA sequences is presented. This method is based upon a statistical model of DNA sequence evolution for which we have obtained explicit transition probabilities. The evolutionary model can also be used as the basis of procedures that estimate the evolutionary parameters relevant to a pair of unaligned DNA sequences. A parameter-estimation approach that takes into account all possible alignments between two sequences is introduced; the danger of estimating evolutionary parameters from a single alignment is discussed.

12.
Static DNA curvature distributions of fully sequenced genomes and large DNA contigs from different organisms were calculated. Very distinctive differences among histogram profiles from archaebacteria, eubacteria, and eukaryotes were observed. Eubacterial profiles were, on average, more curved than archaeal and eukaryotic profiles. A comparative analysis between real and randomized DNA sequences revealed that eubacterial genomes presented, overall, higher curvature values than random sequences. The opposite pattern was exhibited by archaeal and eukaryotic genomes, which displayed a lower frequency of curved regions than their corresponding randomized sequences. The contributions of coding and intergenic regions to the curvature profile were also analyzed. Intergenic regions, on average, were found to be more curved than the overall genomic sequences, especially in prokaryotic organisms. Nevertheless, because of their small size with respect to coding regions, the contribution of intergenic sequences to the overall curvature profile tended to be minor. A clear relationship between codon usage and DNA curvature was demonstrated, and a proposal for the possible coevolution of both systems is discussed. Finally, we present a procedure to quantify the deviation of a curvature profile from randomness through a formal statistical analysis.

13.
Organisms living in or on the sediment layer of water bodies constitute the benthos fauna, which is known to harbour a large number of species of diverse taxonomic groups. The benthos plays a significant role in the nutrient cycle and is, therefore, of high ecological relevance. Here, we have explored a DNA-taxonomic approach to access the meiobenthic organismic diversity, by focusing on obtaining signature sequences from a part of the large ribosomal subunit rRNA (28S), the D3-D5 region. To obtain a broad representation of taxa, benthos samples were taken from 12 lakes in Germany, representing different ecological conditions. In a first approach, we extracted whole DNA from these samples, amplified the respective fragment by PCR, cloned the fragments and sequenced individual clones. However, we found a relatively large number of recombinant clones that must be considered PCR artefacts. In a second approach we therefore directly sequenced PCR fragments that were obtained from DNA extracts of randomly picked individual organisms. In total, we obtained 264 new unique sequences, which can be readily placed into taxon groups based on phylogenetic comparison with currently available database sequences. The groups with the highest taxon abundance were nematodes and protozoa, followed by chironomids. However, we also find that we have by no means exhausted the diversity of organisms in the samples. Still, our data provide a framework within which a meiobenthos DNA signature sequence database can be constructed, which will allow the development of the necessary techniques for studying taxon diversity in the context of ecological analysis. Since many taxa in our analysis are initially identified only via their signature sequences, and not yet by their morphology, we propose to call this approach 'reverse taxonomy'.

14.
Polymerase chain reaction (PCR)-based genome walking techniques are commonly used to clone unknown genomic regions flanking known sequences. However, these methods are typically problematic when applied to highly complex DNA templates isolated from plants with large genomes. Here we describe a reliable and efficient genome walking method that is particularly effective for plants with large genomes. Our ligation-mediated PCR method, Straight Walk, has improved sensitivity and specificity due to optimization of sequences of adaptors and adaptor primers. Successful genome walking in lily, which has one of the largest genomes in plants, indicates that Straight Walk is applicable for most plant species.

15.
The question of where retroviral DNA becomes integrated in chromosomes is important for (i) understanding the mechanisms of viral growth, (ii) devising new anti-retroviral therapy, (iii) understanding how genomes evolve, and (iv) developing safer methods for gene therapy. With the completion of genome sequences for many organisms, it has become possible to study integration targeting by cloning and sequencing large numbers of host–virus DNA junctions, then mapping the host DNA segments back onto the genomic sequence. This allows statistical analysis of the distribution of integration sites relative to the myriad types of genomic features that are also being mapped onto the sequence scaffold. Here we present methods for recovering and analyzing integration site sequences.

16.
Four rodent species with very large heterochromatic regions on the sex chromosomes have been studied using in situ DNA/DNA hybridization techniques. Repetitious DNA fractions were obtained at C0t 0-0.01. Heterochromatic regions of X and X chromosomes of Cricetulus barabensis and Phodopus sungorus, and the heterochromatic long arm of the Y chromosome of Mesocricetus auratus, do not contain disproportionately high amounts of repeated DNA sequences. Heterochromatic regions on sex chromosomes of Microtus subarvalis contain high amounts of repeated DNA sequences. Additional heterochromatic autosomal arms, a heterochromatic arm of the X chromosome, and a short arm of the Y chromosome of Mesocricetus auratus also contain high amounts of repeated DNA sequences.

17.
To study the properties of DNA sequences, we have transformed the sequences of bases into sequences of twist angles along the chain of the DNA double helix by using the Dickerson sum function. The Fourier transform and the auto-correlation function of the twist-angle sequences have been used to study the periodicity and randomness of the original DNA sequences. Based on the correlation coefficient, a "distance" between two DNA fragments has been defined and used to compare some realistic DNA sequences. It is hoped that the techniques developed here could be used to analyze more realistic DNA sequences.

18.
The statistical distribution of nucleic acid similarities.   Cited by: 18 (self-citations: 6; citations by others: 12)
All pairs of a large set of known vertebrate DNA sequences were searched by computer for most similar segments. Analysis of these data shows that the computed similarity scores are distributed proportionally to the logarithm of the product of the lengths of the sequences involved. This distribution is closely related to recent results of Erdos and others on the longest run of heads in coin tossing. A simple rule is derived for determination of statistical significance of the similarity scores and to assist in relating statistical and biological significance.
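The logarithmic scaling mentioned in this abstract can be illustrated in its simplest setting: the longest exactly matching run between two sequences. For independent random sequences of lengths n and m with per-position match probability p, that length is known to grow roughly like log(n·m) / log(1/p), the two-sequence analogue of the longest run of heads. The sketch below is a simplified stand-in for the similarity scores used in the paper (exact matches only, no mismatches or gaps); both function names are assumptions.

```python
import math

def longest_common_run(a, b):
    """Length of the longest exactly matching contiguous segment of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best

def expected_run_scale(n, m, p=0.25):
    """Approximate growth of the longest match for random sequences."""
    return math.log(n * m) / math.log(1 / p)

run = longest_common_run("ACGTAC", "TTGTAG")
scale = expected_run_scale(1000, 1000)
```

An observed match much longer than `expected_run_scale(n, m)` is unlikely to arise by chance, which is the intuition behind a length-dependent significance rule.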

19.
We show that the number of lineages ancestral to a sample, as a function of time back into the past, which we call the number of lineages as a function of time (NLFT), is a nearly deterministic property of large-sample gene genealogies. We obtain analytic expressions for the NLFT for both constant-sized and exponentially growing populations. The low level of stochastic variation associated with the NLFT of a large sample suggests using the NLFT to make estimates of population parameters. Based on this, we develop a new computational method of inferring the size and growth rate of a population from a large sample of DNA sequences at a single locus. We apply our method first to a sample of 1,212 mitochondrial DNA (mtDNA) sequences from China, confirming a pattern of recent population growth previously identified using other techniques, but with much smaller confidence intervals for past population sizes due to the low variation of the NLFT. We further analyze a set of 63 mtDNA sequences from blue whales (BWs), concluding that the population grew in the past. This calls for reevaluation of previous studies that were based on the assumption that the BW population was fixed.
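To give a feel for why the NLFT of a large sample is nearly deterministic, the sketch below evaluates a standard mean-field approximation for a constant-sized population. It assumes time is measured in coalescent units in which each pair of lineages coalesces at rate 1, so that the expected lineage count follows dn/dt = -n(n-1)/2, whose solution is n(t) = n0 / (n0 - (n0 - 1)·exp(-t/2)). This is an illustrative approximation and not necessarily the expression derived in the paper; the function name is hypothetical.

```python
import math

def nlft_constant(n0, t):
    """Mean-field number of ancestral lineages at time t in the past.

    n0: sample size at t = 0.
    t:  time in coalescent units (pairwise coalescence rate = 1),
        for a population of constant size.
    """
    if n0 < 1 or t < 0:
        raise ValueError("need n0 >= 1 and t >= 0")
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2))

# Lineage counts for a sample of 1,212 sequences at a few time points.
curve = [nlft_constant(1212, t) for t in (0.0, 0.001, 0.01, 0.1, 1.0)]
```

The curve drops very steeply at first, since a large sample loses most of its lineages within a tiny fraction of a coalescent time unit, and then approaches 1 as t grows.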

20.
While standard DNA‐sequencing approaches readily yield genotypic sequence data, haplotype information is often of greater utility for population genetic analyses. However, obtaining individual haplotype sequences can be costly and time‐consuming and sometimes requires statistical reconstruction approaches that are subject to bias and error. Advancements have recently been made in determining individual chromosomal sequences in large‐scale genomic studies, yet few options exist for obtaining this information from large numbers of highly polymorphic individuals in a cost‐effective manner. As a solution, we developed a simple PCR‐based method for obtaining sequence information from individual DNA strands using standard laboratory equipment. The method employs a water‐in‐oil emulsion to separate the PCR mixture into thousands of individual microreactors. PCR within these small vesicles results in amplification from only a single starting DNA template molecule and thus a single haplotype. We improved upon previous approaches by including SYBR Green I and a melted agarose solution in the PCR, allowing easy identification and separation of individually amplified DNA molecules. We demonstrate the use of this method on a highly polymorphic estuarine population of the copepod Eurytemora affinis for which current molecular and computational methods for haplotype determination have been inadequate.
