Similar literature
1.
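The period-3 property of the DNA sequence signal spectrum is regarded as an important feature for distinguishing coding regions from non-coding regions. To address the high computational cost of the signal-to-noise ratio (SNR), a fast SNR calculation method based on the Z-curve mapping is presented. The method avoids computing the discrete Fourier transform (DFT) and obtains the SNR directly from the sequence itself. Experimental results show that the fast algorithm is more than a hundred times as efficient as the DFT method, greatly reducing the time needed to compute the SNR of genes.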
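The abstract does not give its formulas, so the following is only a minimal sketch under stated assumptions. It uses binary indicator sequences for A, C, G, and T rather than the Z-curve mapping of the paper, and one common SNR definition: the total power at frequency N/3 divided by the mean power over the non-zero frequencies. Under these assumptions the N/3 component depends only on how often each base occurs at each of the three codon positions, so it can be obtained from counts without a transform. All function names are hypothetical.

```python
import cmath
from collections import Counter

BASES = "ACGT"  # assumption: the sequence contains only these four letters


def period3_power_dft(seq):
    """Total power at k = N/3, evaluating that single DFT term directly."""
    total = 0.0
    for b in BASES:
        # Indicator sequence of base b, multiplied by exp(-2*pi*i*n/3).
        x = sum(cmath.exp(-2j * cmath.pi * i / 3)
                for i, s in enumerate(seq) if s == b)
        total += abs(x) ** 2
    return total


def period3_power_counts(seq):
    """Same quantity from per-codon-position base counts (no DFT)."""
    total = 0
    for b in BASES:
        c = [0, 0, 0]
        for i, s in enumerate(seq):
            if s == b:
                c[i % 3] += 1
        # |c0 + c1*w + c2*w^2|^2 with w = exp(-2*pi*i/3)
        total += (c[0] ** 2 + c[1] ** 2 + c[2] ** 2
                  - c[0] * c[1] - c[1] * c[2] - c[0] * c[2])
    return total


def snr(seq):
    """Assumed SNR definition: power at N/3 over mean power for k = 1..N-1."""
    n = len(seq)
    if n % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    counts = Counter(seq)
    # Parseval: the power summed over all k is n*n for indicator sequences;
    # the k = 0 term equals the sum of squared base counts.
    mean_power = (n * n - sum(counts[b] ** 2 for b in BASES)) / (n - 1)
    return period3_power_counts(seq) / mean_power
```

For example, `snr("ATG" * 10)` evaluates to 14.5 under this definition, because every base sits at a single codon position. A sequence made of one repeated base has zero power at non-zero frequencies, so `snr` would divide by zero for it.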

2.
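With the arrival of the post-genomic era, whose main research topics are functional genomics and proteomics, biological data are growing exponentially. How to infer the structure and function of nucleic acids and proteins from their sequences with effective computational methods, and in particular the gene prediction problem of identifying protein-coding genes in DNA sequences, is one of the research topics in urgent need of a solution. Given the special biological significance of CpG islands for the study of gene coding, this paper determines the positions of CpG islands by three methods. On that basis, it combines a new DNA sequence letter vector with an information-entropy divergence measure to predict gene sequences, which improves the efficiency of identifying coding genes and significantly reduces computation time.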
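The abstract does not say which three methods locate the CpG islands. As a hedged illustration only, the sketch below checks a window against the widely used criteria usually attributed to Gardiner-Garden and Frommer: length of at least 200 bp, G+C content above 50%, and an observed/expected CpG ratio above 0.6. The function names and the choice of these thresholds are assumptions, not the paper's method.

```python
def cpg_stats(window):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Assumed thresholds: length >= 200 bp, G+C > 50%, Obs/Exp CpG > 0.6."""
    if len(window) < min_len:
        return False
    gc_fraction, obs_exp = cpg_stats(window)
    return gc_fraction > min_gc and obs_exp > min_obs_exp
```

In practice such a test is applied to a sliding window along the sequence, and adjacent qualifying windows are merged into one candidate island.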

3.
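Compositional analysis of the non-coding regions of eukaryotic DNA   Total citations: 4, self-citations: 0, citations by others: 4
At the whole-genome level, four methods (histograms, chaos game representation grayscale images, distance dissimilarity, and information-entropy dissimilarity) were used to study the short nucleotide sequence composition and compositional complexity of three types of regions (introns, intergenic DNA, and exons) in Arabidopsis thaliana, nematode, and fruit fly. The results show: (a) Across genomes, regardless of the number of genes, the compositional complexity of the exon portion obtained with all four methods is fairly similar, whereas the compositional complexity of the non-coding portion is large. This shows quantitatively that the degree of complexity between species is reflected mainly in the non-coding regions rather than in the coding regions. (b) Within the same genome, the short nucleotide sequence compositional complexity of introns is similar, and that of exons and intergenic DNA is also similar. (c) Introns and intergenic DNA differ greatly in transcription, splicing, secondary structure, and other respects, yet they differ very little in short nucleotide sequence composition, which indicates that these differences are not constrained through short nucleotide sequence composition.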

4.
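Rapid methods for determining the nucleotide sequence of DNA molecules have emerged. Once the sequence of a DNA molecule has been determined, how to establish as quickly as possible which regions of the molecule code for protein becomes an important question. In recent years, Staden has done some work in this area. Program description: On a TRS-80 Model I microcomputer, we implemented a set of computer programs that predict protein gene regions starting from the nucleotide sequence of a DNA molecule. The programs are written in BASIC and, on a WX4671 plotter, can take the computed results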

5.
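Based on the distribution of stop codons and their reverse-complement stop codons in DNA sequences, a new method for constructing DNA sequence vectors is presented. Using Shannon entropy theory, the Jensen-Shannon divergence and the KL divergence are modified and compared. Experiments show that the method achieves an efficiency of 86% in predicting the boundaries between coding and non-coding regions of DNA sequences, significantly higher than the 70% reported by Bernaola et al.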
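The abstract does not describe how the two divergences are modified. For orientation, here is a minimal sketch of the standard, unmodified definitions for two discrete probability vectors, with base-2 logarithms and equal weights in the Jensen-Shannon case. The function names are hypothetical, and this is not the modified measure of the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights, in bits (0 to 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In entropy-based segmentation, a measure of this kind is typically evaluated between the composition vectors to the left and right of each candidate position, and a boundary is proposed where it peaks.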

6.
秦丹  徐存拴 《遗传》2013,35(11):1253-1264
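Non-coding DNA sequences are DNA sequences in the genome that do not encode proteins. These sequences can bind regulatory factors, be transcribed into functional RNAs, and regulate physiological activities and pathological processes individually or cooperatively. Focusing on their role in regulating gene expression, this article summarizes recent research on non-coding DNA sequences, gives a preliminary account of their structure, function, and possible mechanisms of action, introduces the computational methods and experimental techniques currently used to identify functional elements in non-coding DNA sequences, and discusses prospects for future research on non-coding DNA.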

7.
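For different types of single-base mutations in DNA sequences, digital signal processing methods were used to study the effects of single-base substitution, deletion, and insertion mutations on the period-3 power spectrum of DNA sequences. The results show that, for coding sequences of different lengths, substitution mutations have a small effect on the power spectrum, whereas deletion and insertion mutations have a large effect. As the length of the coding region decreases, the effect of substitution, deletion, and insertion mutations on the power spectrum of the coding region grows. For medium-length exons, insertion mutations affect the period-3 power spectrum the most; for short exons, deletion mutations affect it the most. The results can serve as a reference for identifying and detecting the coding regions of genes that contain mutations.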

8.
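Gene prediction means predicting the protein-coding parts of DNA sequences. With the completion of genome sequencing for most organisms, gene prediction has become especially important. Gene prediction mainly comprises two kinds of methods: first, homology-based methods, also called "extrinsic methods"; second, gene prediction methods, also called "intrinsic methods". This article mainly introduces several "extrinsic methods", such as hidden Markov models, the Fourier transform, and dynamic programming.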

9.
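A preliminary study of coding sequences lacking 3-base periodicity   Total citations: 4, self-citations: 0, citations by others: 4
Analysis of the Fourier spectra of 120 relatively short coding sequences (<1 200 bp) shows that 3-base periodicity is not always present in short coding sequences. Statistical analysis suggests that whether a coding sequence has 3-base periodicity is related to the base composition and distribution of the sequence, the choice and order of amino acids in the encoded protein, and the usage of synonymous codons. In general, the A+U content of non-period-3 sequences is higher than their G+C content, while the opposite holds for period-3 sequences; bases are distributed more evenly over the three codon positions in non-period-3 sequences than in period-3 sequences; and the codon and amino acid usage bias of non-period-3 sequences is smaller than that of period-3 sequences. These phenomena should be taken fully into account when using Fourier analysis to predict genes and exons in DNA sequences.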

10.
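Based on error-control principles, the DNA sequences in genetic information, which consist of vast numbers of bases, are treated as encoded sequences that have acquired certain coding properties through encoding. On this basis, the analysis methods for block codes in coding theory are applied to DNA sequences. A (6,3) block code was used to analyze the DNA sequences of three prokaryotes and two eukaryotes, and clear changes in code distance were observed at the start and end of ORFs. This demonstrates that the method offers useful guidance for DNA sequence analysis.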

11.
We describe a genomic DNA-based signal sequence trap method, signal-exon trap (SET), for the identification of genes encoding secreted and membrane-bound proteins. SET is based on the coupling of an exon trap to the translation of captured exons, which allows screening of the exon-encoded polypeptides for signal peptide function. Since most signal sequences are expected to be located in the 5′-terminal exons of genes, we first demonstrate that trapping of these exons is feasible. To test the applicability of SET for the screening of complex genomic DNA, we evaluated two critical features of the method. Specificity was assessed by the analysis of random genomic DNA and efficiency was demonstrated by screening a 425 kb YAC known to contain the genes of four secretory or membrane-bound proteins. All trapped clones contained a translation initiation signal followed by a hydrophobic stretch of amino acids representing either a known signal peptide, transmembrane domain or novel sequence. Our results suggest that SET is a potentially useful method for the isolation of signal sequence-containing genes and may find application in the discovery of novel members of known secretory gene clusters, as well as in other positional cloning approaches.

12.
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.

13.
The Gibbs sampling method has been widely used for sequence analysis after it was successfully applied to the problem of identifying regulatory motif sequences upstream of genes. Since then, numerous variants of the original idea have emerged: however, in all cases the application has been to finding short motifs in collections of short sequences (typically less than 100 nucleotides long). In this paper, we introduce a Gibbs sampling approach for identifying genes in multiple large genomic sequences up to hundreds of kilobases long. This approach leverages the evolutionary relationships between the sequences to improve the gene predictions, without explicitly aligning the sequences. We have applied our method to the analysis of genomic sequence from 14 genomic regions, totaling roughly 1.8 Mb of sequence in each organism. We show that our approach compares favorably with existing ab initio approaches to gene finding, including pairwise comparison based gene prediction methods which make explicit use of alignments. Furthermore, excellent performance can be obtained with as little as four organisms, and the method overcomes a number of difficulties of previous comparison based gene finding approaches: it is robust with respect to genomic rearrangements, can work with draft sequence, and is fast (linear in the number and length of the sequences). It can also be seamlessly integrated with Gibbs sampling motif detection methods.

14.
We used cDNA amplification for identification of genomic expressed sequences (CAIGES) to identify genes in the glycerol kinase region of the human X chromosome. During these investigations we identified the sequence for a ferritin light chain (FTL) pseudogene in this portion of Xp21. A human liver cDNA library was amplified by vector primers, labeled, and hybridized to Southern blots of EcoRI-digested human genomic DNA from cosmids isolated from yeast artificial chromosomes in the glycerol kinase region of Xp21. A 3.1-kb restriction fragment hybridized with the cDNA library, was subcloned and sequenced, and a 440-bp intronless sequence was found with strong similarity to the FTL coding sequence. Therefore, the FTL pseudogene that had been mapped previously to Xp22.3–21.2 was localized specifically to the glycerol kinase region. The CAIGES method permits rapid screening of genomic material and will identify genomic sequences with similarities to genes expressed in the cDNA library used to probe the cloned genomic DNA, including pseudogenes.

15.
Rapid and inexpensive sequencing technologies are making it possible to collect whole genome sequence data on multiple individuals from a population. This type of data can be used to quickly identify genes that control important ecological and evolutionary phenotypes by finding the targets of adaptive natural selection, and we therefore refer to such approaches as "reverse ecology." To quantify the power gained in detecting positive selection using population genomic data, we compare three statistical methods for identifying targets of selection: the McDonald-Kreitman test, the mkprf method, and a likelihood implementation for detecting d(N)/d(S) > 1. Because the first two methods use polymorphism data we expect them to have more power to detect selection. However, when applied to population genomic datasets from human, fly, and yeast, the tests using polymorphism data were actually weaker in two of the three datasets. We explore reasons why the simpler comparative method has identified more genes under selection, and suggest that the different methods may really be detecting different signals from the same sequence data. Finally, we find several statistical anomalies associated with the mkprf method, including an almost linear dependence between the number of positively selected genes identified and the prior distributions used. We conclude that interpreting the results produced by this method should be done with some caution.
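Of the three methods compared, the McDonald-Kreitman test is simple enough to sketch. It arranges nonsynonymous and synonymous counts for fixed differences (Dn, Ds) and polymorphisms (Pn, Ps) in a 2x2 table and tests for independence. The sketch below is a minimal illustration, assuming a two-sided Fisher's exact test and the usual summary statistics NI = (Pn/Ps)/(Dn/Ds) and alpha = 1 - NI. The function names are hypothetical, and the counts in the usage note are illustrative, not taken from the abstract.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, alpha, p-value); requires ds, ps, dn > 0."""
    ni = (pn / ps) / (dn / ds)
    alpha = 1 - ni  # rough estimate of the adaptive fraction of substitutions
    p_value = fisher_exact_two_sided(dn, pn, ds, ps)
    return ni, alpha, p_value
```

With illustrative counts `mk_test(dn=7, ds=17, pn=2, ps=42)`, the neutrality index is well below 1 and alpha is positive, the pattern conventionally read as an excess of nonsynonymous fixed differences.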

16.
Computational prediction of RNA editing sites   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: Some organisms edit their messenger RNA resulting in differences between the genomic sequence for a gene and the corresponding messenger RNA sequence. This difference complicates experimental and computational attempts to find and study genes in organisms with RNA editing even if the full genomic sequence is known. Nevertheless, knowledge of these editing sites is crucial for understanding the editing machinery of these organisms. RESULTS: We present a computational technique that predicts the position of editing sites in the genomic sequence. It uses a statistical approach drawing on the protein sequences of related genes and general features of editing sites of the organism. We apply the method to the mitochondrion of the slime mold Physarum polycephalum. It correctly predicts over 90% of the amino acids and over 70% of the editing sites.

17.
ABSTRACT: BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.

18.
19.
20.
Phylogeny estimation is extremely crucial in the study of molecular evolution. The increase in the amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction using multilocus sequence data, these methods require heavy computation, and their application is limited to the analysis of a moderate number of genes and taxa. Distance matrix methods present suitable alternatives for analyzing huge amounts of sequence data. However, the manner in which distance methods can be applied to multilocus sequence data remains unknown. Here, we suggest new procedures to estimate molecular phylogeny using multilocus sequence data and evaluate its significance in the framework of the distance method. We found that concatenation of the multilocus sequence data may result in incorrect phylogeny estimation with an extremely high bootstrap probability (BP), which is due to incorrect estimation of the distances and intentional ignorance of the intergene variations. Therefore, we suggest that the distance matrices for multilocus sequence data be estimated separately and these matrices be subsequently combined to reconstruct phylogeny instead of phylogeny reconstruction using concatenated sequence data. To calculate the BPs of the reconstructed phylogeny, we suggest that 2-stage bootstrap procedures be adopted; in this, genes are resampled followed by resampling of the sequence columns within the resampled genes. By resampling the genes during calculation of BPs, intergene variations are properly considered. Via simulation studies and empirical data analysis, we demonstrate that our 2-stage bootstrap procedures are more suitable than the conventional bootstrap procedure that is adopted after sequence concatenation.
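The resampling step of the 2-stage bootstrap described above can be sketched as follows. This is a minimal illustration under assumptions: each gene alignment is represented as a list of columns (one tuple of characters per site), the function name is hypothetical, and the per-gene distance estimation, matrix combination, and tree building that would follow are not shown.

```python
import random


def two_stage_bootstrap(genes, rng):
    """Draw one bootstrap replicate from a list of gene alignments.

    genes: list of alignments, each a list of columns
           (for example, tuples holding one character per taxon).
    rng:   a random.Random instance, passed in for reproducibility.
    """
    # Stage 1: resample genes with replacement.
    sampled_genes = [rng.choice(genes) for _ in genes]
    # Stage 2: within each resampled gene, resample columns with replacement.
    return [[rng.choice(columns) for _ in columns]
            for columns in sampled_genes]
```

Each replicate keeps the number of genes and the length of every drawn gene. Repeating the call many times and rebuilding the tree from each replicate would give the bootstrap probabilities.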
