Similar literature
 Found 20 similar articles (search time: 78 ms)
1.
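Three-base periodicity is the main feature of the coding regions of most genomic sequences. This paper proposes a fast method that computes the Fourier spectrum only at the 1/3 frequency point and uses it to analyze the three-base periodicity of DNA sequences; coding regions are then predicted by filtering with a wavelet transform at a chosen scale. Theoretical analysis and extensive computer experiments confirm that the method is effective and predicts well. The method is fast, requires no training set, and does not depend on information from existing databases.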
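
As a rough illustration of why evaluating the spectrum only at the 1/3 frequency point is cheap, the following minimal sketch uses Voss indicator sequences (an assumption; the abstract does not name its numerical mapping). At frequency 1/3 the complex exponential takes only three distinct values, so the coefficient for each base depends only on how often that base occurs at positions 0, 1, and 2 modulo 3. The function name `period3_power` is hypothetical, and the wavelet filtering step is not shown.

```python
import cmath


def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four Voss indicator sequences.

    When len(seq) is a multiple of 3 this equals the DFT power at bin k = N/3.
    """
    seq = seq.upper()
    w = cmath.exp(-2j * cmath.pi / 3)  # e^{-2*pi*i/3}
    twiddles = (1, w, w * w)           # the only three factors that ever occur
    total = 0.0
    for base in "ACGT":
        # Count occurrences of this base at positions n with n % 3 == 0, 1, 2.
        counts = [0, 0, 0]
        for n, ch in enumerate(seq):
            if ch == base:
                counts[n % 3] += 1
        coeff = sum(c * t for c, t in zip(counts, twiddles))
        total += abs(coeff) ** 2
    return total
```

A perfectly periodic toy sequence such as `"ATG" * 10` gives a power of 300 (up to rounding), whereas a homopolymer such as `"A" * 30` gives essentially zero, because each base is spread evenly over the three codon positions.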

2.
The identification of gene coding regions in DNA sequences through digital signal processing techniques based on the so-called 3-base periodicity has been an emerging problem in bioinformatics. The signal-to-noise ratio (SNR) of a DNA sequence is computed after mapping the symbolic DNA sequence into numerical sequences. Typical mapping schemes include the Voss, Z-curve, and tetrahedron representations, which have been used to construct algorithms for detecting gene coding regions. In this paper, an extended definition of SNR is proposed, which has lower computational cost and wider applicability than the original definition. Furthermore, we analyze the SNRs of different mapping schemes and derive the general relationship between the Voss-based SNR and that of its general affine transformations. We conclude that the SNRs of the Z-curve and tetrahedron maps are also linearly proportional to that of the Voss map. Not only is our conclusion instructive for the design of other affine transformations, but it is also of much significance in understanding the role of the symbolic-to-numerical mapping in the detection of gene coding regions.
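
The proportionality between mappings can be checked numerically in a simple case. The sketch below is not the paper's SNR definition (which the abstract does not spell out). It only compares the period-3 spectral power of the Voss indicators with that of the per-position Z-curve components, using the standard Z-curve combinations (purine/pyrimidine, amino/keto, weak/strong hydrogen bond). The function names are hypothetical, and the sequence length is assumed to be a multiple of 3 so that frequency 1/3 is an exact DFT bin.

```python
import cmath

W = cmath.exp(-2j * cmath.pi / 3)  # e^{-2*pi*i/3}


def voss_coeffs(seq):
    """Fourier coefficient at frequency 1/3 for each Voss indicator sequence."""
    seq = seq.upper()
    return {b: sum(W ** (n % 3) for n, ch in enumerate(seq) if ch == b)
            for b in "ACGT"}


def voss_power(seq):
    """Period-3 power summed over the four Voss indicators."""
    return sum(abs(c) ** 2 for c in voss_coeffs(seq).values())


def zcurve_power(seq):
    """Period-3 power summed over the three per-position Z-curve components."""
    u = voss_coeffs(seq)
    x = u["A"] + u["G"] - u["C"] - u["T"]  # purine vs. pyrimidine
    y = u["A"] + u["C"] - u["G"] - u["T"]  # amino vs. keto
    z = u["A"] + u["T"] - u["G"] - u["C"]  # weak vs. strong hydrogen bond
    return abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2
```

For any sequence whose length is a multiple of 3, `zcurve_power(seq)` equals `4 * voss_power(seq)` up to rounding. The four indicator coefficients sum to zero at a nonzero DFT bin, and the Z-curve combinations together with that sum form an orthogonal transform scaled by a factor of 2.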

3.
With the ever-increasing pace of genome sequencing, there is a great need for fast and accurate computational tools to automatically identify genes in these genomes. Although great progress has been made in the development of gene-finding algorithms during the past decades, there is still room for further improvement. In particular, the issue of recognizing short exons in eukaryotes is still not solved satisfactorily. This article is devoted to assessing various linear and kernel-based classification algorithms and selecting the best combination of Z-curve features to address this issue. Eight state-of-the-art linear and kernel-based supervised pattern recognition techniques were used to identify the short (21-192 bp) coding sequences of human genes. By measuring the prediction accuracy, the tradeoff between sensitivity and specificity, and the time consumption, the partial least squares (PLS) and kernel partial least squares (KPLS) algorithms were found to be the best linear and kernel-based classifiers, respectively. A surprising result was that, by making good use of the interpretability of the PLS and Z-curve methods, a set of 93 Z-curve features proved to be the best selected combination. Using them, the average recognition accuracy improved by as much as 7.7% with KPLS compared with that obtained by Fisher discriminant analysis using 189 Z-curve variables (Gao and Zhang, 2004). The code used is freely available from the following sources (implemented in MATLAB and supported on Linux and MS Windows): (1) SVM: http://www.support-vector-machines.org/SVM_soft.html. (2) GP: http://www.gaussianprocess.org. (3) KPLS and KFDA: Taylor, J.S., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK. (4) PLS: Wise, B.M., and Gallagher, N.B. 2011. PLS-Toolbox for use with MATLAB: ver 1.5.2. Eigenvector Technologies, Manson, WA. 
Supplementary Material for this article is available at www.liebertonline.com/cmb.

4.
5.
The statistical correlation of nucleotides in protein-coding DNA sequences   Cited by: 5 (self-citations: 0; citations by others: 5)
The statistical correlation of nucleotides in a DNA sequence is described by a set of redundancies D_1, D_2, D_3, ... By calculating {D_n} for 2341 coding regions of nucleic acid sequences, it is demonstrated that about 2/3 of the sequences have a correlation length ≤ 2, 10% of the sequences show correlation with 3-periodicity, and the others show long-range aperiodic correlations. The implications of the results from the interactions of random mutation and natural selection are discussed briefly. Project supported by National Science Foundation of China.

6.
Identifying protein-coding regions in DNA sequences is an active issue in computational biology. In this study, we present a self-adaptive spectral rotation (SASR) approach, which visualizes coding regions in DNA sequences based on the triplet periodicity property, without any preceding training process. It is proposed to help with rough prediction of coding regions when the extra information needed to train other leading methods is not available. In this approach, at each position in the DNA sequence, a Fourier spectrum is calculated from the posterior subsequence. From these spectra, a random walk in the complex plane is generated as the SASR's graphic output. Applications of the SASR to real DNA data show that patterns in the graphic output reveal the locations of the coding regions and the frame shifts between them: arcs indicate coding regions, stable points indicate non-coding regions, and the shapes of corners reveal frame shifts. Tests on a genomic data set from Saccharomyces cerevisiae reveal that the graphic patterns for coding and non-coding regions differ to a great extent, so that the coding regions can be visually distinguished. Meanwhile, a time cost test shows that the SASR can be easily implemented with a computational complexity of O(N).

7.
The regions of initiation of replication of some bacterial genomes were studied by the method of Fourier matrix analysis. A generalized spectral portrait of the primary structures of E. coli-like regions of initiation of replication in bacteria was obtained, which reflects the features of their structural and functional organization. It contains well-pronounced peaks that correspond to the periods T = 2, 11, 17, 27, 86-105 nucleotides. The peaks corresponding to T = 9, 13, 14, 18, 19, 33-35, 45-47, 74-85, 106-110 are less pronounced. The uniqueness of the Fourier spectrum corresponding to the region of initiation of replication of E. coli oriC was considered using the complete genome of E. coli as an example. Some regions of the E. coli genome were identified that differ from oriC in primary structure but have Fourier spectra resembling the spectrum of oriC. A number of these regions are alternative points of initiation of replication in sdrA(rnh) mutants of E. coli; the others are localized in as yet unidentified regions of the E. coli genome but are capable, in our opinion, of participating in the initiation of replication. Thus, from the similarity of spectral portraits of different regions of the genome, it was possible to reveal several regions that have similar functions, i.e., are involved in initiation of replication.

8.
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors use the noise subspace concept to find hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several techniques of DNA feature extraction that draw on various fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions while completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
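
For orientation, the sliding-window baseline that the abstract compares against can be sketched as follows. This is a plain windowed period-3 power profile over Voss indicator sequences, not the least-norm estimator itself. The function name `sliding_period3`, the default window of 351 bases, and the step of 3 are illustrative assumptions rather than values taken from the paper.

```python
import cmath


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs: period-3 power of each sliding window.

    The window length should be a multiple of 3 so that frequency 1/3
    is an exact DFT bin of the window.
    """
    w = cmath.exp(-2j * cmath.pi / 3)
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        power = 0.0
        for base in "ACGT":
            # DFT coefficient of this base's indicator sequence at frequency 1/3.
            coeff = sum(w ** (n % 3) for n, ch in enumerate(chunk) if ch == base)
            power += abs(coeff) ** 2
        profile.append((start, power))
    return profile
```

On a toy input such as `"A" * 150 + "ATG" * 50 + "A" * 150` with `window=60`, the profile is near zero in the flanks and peaks over the periodic block. Real coding regions give much noisier profiles, which is the background-noise problem that subspace methods aim to reduce.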

9.
Periodicities in the position of E. coli RNA polymerase promoter contacts on several promoters (lacUV5, T7 A3, tetR, lambda cin, lambda c17, RNA1, and trp S.t.) were found by means of Fourier analysis. The comparison of the Fourier spectrum of core RNA polymerase contacts on the lacUV5 promoter and that of holoenzyme revealed a more prominent 7-periodicity in the Fourier spectrum of holoenzyme contacts. 6-, 7-, and 8-periodicities were found in the primary structure of the majority of E. coli promoters. It is shown that RNA polymerase recognizes specific periodic patterns in the promoter structure.

10.
Noguchi T, Sugiura M. Biochemistry, 2001, 40(6): 1497-1502
Fourier transform infrared (FTIR) difference spectra of all flash-induced S-state transitions of the oxygen-evolving complex were measured using photosystem II (PSII) core complexes of Synechococcus elongatus. The PSII core sample was given eight successive flashes with 1 s intervals at 10 °C, and FTIR difference spectra upon individual flashes were measured. The obtained difference spectra upon the first to fourth flashes showed considerably different spectral features from each other, whereas the fifth, sixth, seventh, and eighth flash spectra were similar to the first, second, third, and fourth flash spectra, respectively. The intensities at the wavenumbers of prominent peaks of the first and second flash spectra showed clear period four oscillation patterns. These oscillation patterns were well fitted with the Kok model with 13% misses. These results indicate that the first, second, third, and fourth flash spectra represent the difference spectra upon the S1 → S2, S2 → S3, S3 → S0, and S0 → S1 transitions, respectively. In these spectra, prominent bands were observed in the symmetric (1300-1450 cm⁻¹) and asymmetric (1500-1600 cm⁻¹) stretching regions of carboxylate groups and in the amide I region (1600-1700 cm⁻¹). Comparison of the band features suggests that the drastic coordination changes of carboxylate groups and the protein conformational changes in the S1 → S2 and S2 → S3 transitions are reversed in the S3 → S0 and S0 → S1 transitions. The flash-induced FTIR measurements during the S-state cycle will be a promising method to investigate the detailed molecular mechanism of photosynthetic oxygen evolution.
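
The period-four oscillation and its damping by misses can be illustrated with a minimal Kok-cycle sketch. It assumes that every center starts in S1 (a common dark-adapted assumption, not stated in the abstract), that each flash either advances a center by one S-state or misses with probability 0.13, and that there are no double hits. The function name `kok_populations` is hypothetical.

```python
def kok_populations(n_flashes, miss=0.13, start=(0.0, 1.0, 0.0, 0.0)):
    """Return the S0..S3 populations before the first flash and after each flash.

    On every flash a fraction `miss` of the centers in each state stays put,
    and the remaining fraction advances by one state (S3 wraps around to S0).
    """
    pops = [list(start)]
    for _ in range(n_flashes):
        prev = pops[-1]
        nxt = [miss * prev[i] + (1 - miss) * prev[(i - 1) % 4] for i in range(4)]
        pops.append(nxt)
    return pops
```

With these assumptions, 87% of the centers are in S2 after the first flash, and after four flashes S1 is again the most populated state but no longer holds all centers. This gradual dephasing is why the fifth to eighth flash spectra resemble, rather than exactly reproduce, the first to fourth.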

11.
Gene recognition from questionable ORFs in bacterial and archaeal genomes   Cited by: 1 (self-citations: 0; citations by others: 1)
The ORFs of microbial genomes in annotation files are usually classified into two groups: the first corresponds to known genes, whereas the second includes 'putative', 'probable', 'conserved hypothetical', 'hypothetical', 'unknown', and 'predicted' ORFs, etc. Since the annotation is not 100% accurate, it is essential to confirm which ORFs of the latter group are coding and which are not. Starting from known genes in the former group, this paper describes an improved Z curve method to recognize genes in the latter. Ten-fold cross-validation tests show that the average accuracy of the algorithm is greater than 99% for recognizing the known genes in 57 bacterial and archaeal genomes. The method is then applied to recognize genes of the latter group. The likely non-coding ORFs in each of the 57 bacterial or archaeal genomes studied here are recognized and listed at the website http://tubic.tju.edu.cn/ZCURVE_C_html/noncoding.html. The working mechanism of the algorithm is discussed in detail. A computer program, called ZCURVE_C, was written to calculate a coding score, called the Z-curve score, for ORFs in the above 57 bacterial and archaeal genomes. An ORF is classified as coding if its Z-curve score is greater than 0 and as non-coding if the score is less than 0. A website has been set up to provide the service of calculating the Z-curve score. A user may submit the DNA sequence of an ORF to the server at http://tubic.tju.edu.cn/ZCURVE_C/Default.cgi, and the Z-curve score of the ORF is calculated and returned to the user immediately.

12.

Background

Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features.

Methodology

We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences, which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not.

Results

To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retention or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.

13.
14.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a large number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for performing further analyses on the genomic sequences themselves. The proposed approach has been applied to some complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.

15.
Frequency-domain analysis of biomolecular sequences   Cited by: 7 (self-citations: 0; citations by others: 7)
MOTIVATION: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. RESULTS: We introduce new computational and visual tools for biomolecular sequence analysis. In particular, we provide an optimization procedure that improves upon the performance of traditional Fourier analysis in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. The resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local 'texture', significant information about biomolecular sequences, thus facilitating understanding of local nature, structure, and function.

16.
17.
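Six new species and one new variety of Begonia from Yunnan are described: 澜沧秋海棠, 角果秋海棠, 盈江秋海棠, 粉叶秋海棠, 蔓耗秋海棠, 斜叶秋海棠, and 红毛香花秋海棠. Supplementary descriptions are given for eight species, and one species is given a new name: 四棱秋海棠, 不显秋海棠, 薄叶秋海棠, 截裂秋海棠, 长柔毛秋海棠, 光叶秋海棠, 变色秋海棠, 假厚叶秋海棠, and 河口秋海棠.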

18.
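The 16S-23S rDNA intergenic spacer of the cyanobacterium Oscillatoria sp. (family Oscillatoriaceae) was sequenced by the chain-termination method, yielding 427 nucleotides of the spacer, which contains one isoleucine tRNA gene (tRNAIle). Spacer sequences of other species in the family were retrieved online from international molecular biology databases. Through comparative analysis, certain taxonomic questions among the genera of the Oscillatoriaceae are discussed at the molecular level, and molecular criteria for delimiting genera within the family are explored on the basis of nucleotide differences among the sequences. The rDNA intergenic spacer is proposed as a good molecular marker that can be used to develop specific nucleic acid probes for cyanobacteria that cause "red tides" or "water blooms".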

19.
20.
Metabolomics offers the potential to assess the effects of toxicants on metabolite levels. To fully realize this potential, a robust analytical workflow for identifying and quantifying treatment-elicited changes in metabolite levels by nuclear magnetic resonance (NMR) spectrometry has been developed that isolates and aligns spectral regions across treatment and vehicle groups to facilitate analytical comparisons. The method excludes noise regions from the resulting reduced spectra, significantly reducing data size. Principal components analysis (PCA) identifies data clusters associated with experimental parameters. Cluster-centroid scores, derived from the principal components that separate treatment from vehicle samples, are used to reconstruct the mean spectral estimates for each treatment and vehicle group. Peak amplitudes are determined by scanning the reconstructed mean spectral estimates. Confidence levels from Mann–Whitney order statistics and amplitude change ratios are used to identify treatment-related changes in peak amplitudes. As a demonstration of the method, analysis of 13C NMR data from hepatic lipid extracts of immature, ovariectomized C57BL/6 mice treated with 30 μg/kg 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) or sesame oil vehicle, sacrificed at 72, 120, or 168 h, identified 152 salient peaks. PCA clustering showed a prominent treatment effect at all three time points studied, and very little difference between time points of treated animals. Phenotypic differences between two animal cohorts were also observed. Based on spectral peak identification, hepatic lipid extracts from treated animals exhibited redistribution of unsaturated fatty acids, cholesterols, and triacylglycerols. This method identified significant changes in peaks without the loss of information associated with spectral binning, increasing the likelihood of identifying treatment-elicited metabolite changes.
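
The final comparison step, in which Mann–Whitney order statistics and amplitude change ratios are computed for each peak, can be sketched as follows. This is a minimal illustration, not the published workflow: the function names are hypothetical, converting the U statistic into a confidence level is left out, and the PCA and spectral reconstruction steps are not shown.

```python
def mann_whitney_u(treated, vehicle):
    """Mann-Whitney U statistic for the treated group.

    Counts the (treated, vehicle) pairs in which the treated amplitude is
    larger; ties contribute one half.
    """
    u = 0.0
    for t in treated:
        for v in vehicle:
            if t > v:
                u += 1.0
            elif t == v:
                u += 0.5
    return u


def amplitude_ratio(treated, vehicle):
    """Ratio of mean peak amplitude in treated samples to that in vehicle samples."""
    return (sum(treated) / len(treated)) / (sum(vehicle) / len(vehicle))
```

For made-up amplitudes of a single peak, for example `[5.1, 6.0, 5.6]` for treated and `[2.0, 2.4, 3.1]` for vehicle, every treated value exceeds every vehicle value, so U reaches its maximum of 9 (3 × 3 pairs) and the amplitude ratio is about 2.2. A peak would be flagged only when both the rank statistic and the ratio pass whatever thresholds the analyst chooses.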


Copyright©北京勤云科技发展有限公司  京ICP备09084417号