首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
《Biologicals》2014,42(1):22-28
The advent of modern high-throughput sequencing has made it possible to generate vast quantities of genomic sequence data. However, the processing of this volume of information, including prediction of gene-coding and regulatory sequences remains an important bottleneck in bioinformatics research. In this work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable of predicting promoter regions with augmented accuracy, specificity and sensitivity. We took our method beyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the six main sigma-subunits of Escherichia coli. This methodology employed successfully re-discovered known promoter sequences recognized by E. coli RNA polymerase subunits σ24, σ28, σ32, σ38, σ54 and σ70, with highlighted accuracies for σ28- and σ54- dependent promoter sequences (values obtained were 80% and 78.8%, respectively). Furthermore, the discrimination of promoters according to the σ factor made it possible to extract functional commonalities for the genes expressed by each type of promoter. The DNA duplex stability rises as a distinctive feature which improves the recognition and classification of σ28- and σ54- dependent promoter sequences. The findings presented in this report underscore the usefulness of including DNA biophysical parameters into NN learning algorithms to increase accuracy, specificity and sensitivity in promoter beyond what is accomplished based on sequence alone.  相似文献   

2.
3.
Neural network models for promoter recognition   总被引:8,自引:0,他引:8  
The problem of recognition of promoter sites in the DNA sequence has been treated with models of learning neural networks. The maximum network capacity admissible for this problem has been estimated on the basis of the total of experimental data available on the determined promoter sequences. The model of a block neural network has been constructed to satisfy this estimate and rules have been elaborated for its learning and testing. The learning process involves a small (of the order of 10%) part of the total set of promoter sequences. During this procedure the neural network develops a system of distinctive features (key words) to be used as a reference in identifying promoters against the background of random sequences. The learning quality is then tested with the whole set. The efficiency of promoter recognition has been found to amount to 94 to 99%. The probability of an arbitrary sequence being identified as a promoter is 2 to 6%.  相似文献   

4.
5.
Abstract

The problem of recognition of promoter sites in the DNA sequence has been treated with models of learning neural networks. The maximum network capacity admissible for this problem has been estimated on the basis of the total of experimental data available on the determined promoter sequences. The model of a block neural network has been constructed to satisfy this estimate and rules have been elaborated for its learning and testing. The learning process involves a small (of the order of 10%) part of the total set of promoter sequences. During this procedure the neural network develops a system of distinctive features (key words) to be used as a reference in identifying promoters against the background of random sequences. The learning quality is then tested with the whole set. The efficiency of promoter recognition has been found to amount to 94 to 99%. The probability of an arbitrary sequence being identified as a promoter is 2 to 6%.  相似文献   

6.
7.
Ab initio gene identification in metagenomic sequences   总被引:1,自引:0,他引:1  
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. Original version of the method was proposed in 1999 and has been used since for (i) reconstructing codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.  相似文献   

8.
9.
Kaur H  Raghava GP 《FEBS letters》2004,564(1-2):47-57
In this study, an attempt has been made to develop a neural network-based method for predicting segments in proteins containing aromatic-backbone NH (Ar-NH) interactions using multiple sequence alignment. We have analyzed 3121 segments seven residues long containing Ar-NH interactions, extracted from 2298 non-redundant protein structures where no two proteins have more than 25% sequence identity. Two consecutive feed-forward neural networks with a single hidden layer have been trained with standard back-propagation as learning algorithm. The performance of the method improves from 0.12 to 0.15 in terms of Matthews correlation coefficient (MCC) value when evolutionary information (multiple alignment obtained from PSI-BLAST) is used as input instead of a single sequence. The performance of the method further improves from MCC 0.15 to 0.20 when secondary structure information predicted by PSIPRED is incorporated in the prediction. The final network yields an overall prediction accuracy of 70.1% and an MCC of 0.20 when tested by five-fold cross-validation. Overall the performance is 15.2% higher than the random prediction. The method consists of two neural networks: (i) a sequence-to-structure network which predicts the aromatic residues involved in Ar-NH interaction from multiple alignment of protein sequences and (ii) a structure-to structure network where the input consists of the output obtained from the first network and predicted secondary structure. Further, the actual position of the donor residue within the 'potential' predicted fragment has been predicted using a separate sequence-to-structure neural network. Based on the present study, a server Ar_NHPred has been developed which predicts Ar-NH interaction in a given amino acid sequence. The web server Ar_NHPred is available at and (mirror site).  相似文献   

10.
11.
Kaur H  Raghava GP 《Proteins》2004,55(1):83-90
In this paper a systematic attempt has been made to develop a better method for predicting alpha-turns in proteins. Most of the commonly used approaches in the field of protein structure prediction have been tried in this study, which includes statistical approach "Sequence Coupled Model" and machine learning approaches; i) artificial neural network (ANN); ii) Weka (Waikato Environment for Knowledge Analysis) Classifiers and iii) Parallel Exemplar Based Learning (PEBLS). We have also used multiple sequence alignment obtained from PSIBLAST and secondary structure information predicted by PSIPRED. The training and testing of all methods has been performed on a data set of 193 non-homologous protein X-ray structures using five-fold cross-validation. It has been observed that ANN with multiple sequence alignment and predicted secondary structure information outperforms other methods. Based on our observations we have developed an ANN-based method for predicting alpha-turns in proteins. The main components of the method are two feed-forward back-propagation networks with a single hidden layer. The first sequence-structure network is trained with the multiple sequence alignment in the form of PSI-BLAST-generated position specific scoring matrices. The initial predictions obtained from the first network and PSIPRED predicted secondary structure are used as input to the second structure-structure network to refine the predictions obtained from the first net. The final network yields an overall prediction accuracy of 78.0% and MCC of 0.16. A web server AlphaPred (http://www.imtech.res.in/raghava/alphapred/) has been developed based on this approach.  相似文献   

12.
13.
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders (ii) careful programming techniques that allow orders as large as sixteen (iii) adequate inverted repeat handling (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well or even better than state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), contrasting with the statistical models underlying other methods, where the extensive data repetitions in DNA sequences is explored, and therefore have a non-local character.  相似文献   

14.
A simple model is put forward to explain the long-known three-base periodicity in coding DNA. We propose the concept of same-phase triplet clustering, i.e. a condition wherein a triplet appears several times in one phase without interruption by the two other possible phases. For instance, in the sequence (i): NTT_GNN_NTT_GNN_NTT_GNN_NNN_NTT_GNN (where N is any nucleotide but combinations producing TTG are excluded) there would be clustering of same-phase TTG because this triplet appears uninterruptedly in phase 2. In contrast, in the sequence (ii): TTG_NTT_GNN_NNT_TGN_NNN_NTT_GNN there is no same-phase clustering because neighboring TTGs are all in different phases. Observe also that in sequence (i) TTG triplets are separated by 3, 3 and 6 nucleotides (3n distances), while in sequence (ii) they are separated by 1, 4 and 5 nucleotides (non-3n distances). In this work, we demonstrate that in coding DNA the 3n distances generated by (i)-type sequences proportionally outnumber the non-3n distances generated by (ii)-type sequences, this condition would be the basis of three-base periodicity. Randomized sequences had (i)- and (ii)-type sequences too but clustering was statistically different. To prove our model we generated (i)-type sequences in a randomized sequence by inducing clustering of same-phase triplets. In agreement with the model this sequence displayed three-base periodicity. Furthermore, two- and four-base periodicities could also be induced by artificially inducing clustering of duplets and tetraplets.  相似文献   

15.
Paul TK  Iba H 《Bio Systems》2005,82(3):208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on signal-to-noise (S2N) ratio, unsupervised learning like clustering and supervised learning like k-nearest neighbor (k NN) classifier are widely used. Instead of S2N ratio, adaptive searches like Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied for selection of a smaller size gene subset that would classify patient samples more accurately. In this paper, we propose a new PMBGA-based method for identification of informative genes from microarray data. By applying our proposed method to classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.  相似文献   

16.
Infection of Samsun NN tobacco with tobacco mosaic virus (TMV) was found to induce the synthesis of mRNA encoding a basic protein with a 67% amino acid sequence homology to the known acidic pathogenesis-related (PR) proteins 1a, 1b and 1c. By Southern blot hybridization it was shown that the tobacco genome contains at least eight genes for acidic PR-1 proteins and a similar number of genes encoding the basic homologues. Clones corresponding to three of the genes for acidic PR-1 proteins were isolated from a genomic library of Samsun NN tobacco. The nucleotide sequence of these genes and their flanking sequences were determined. One clone was found to correspond to the PR-1a gene; the two other clones do not correspond to known TMV-induced PR-1 mRNA's and may represent silent genes. Compared to the PR-1a gene, these genes contain an insertion or deletion in the putative promoter region and mutations affecting the PR-1 reading frame.  相似文献   

17.
18.
Broadly, computational approaches for ortholog assignment is a three steps process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, complete ortholog assignment pipeline (including afree and the iterative graph matching method) available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.  相似文献   

19.
20.
To study the role of the signal sequences in the biogenesis of outer membrane proteins, we have constructed two hybrid genes: a phoE-ompF hybrid gene, which encodes the signal sequence of outer membrane PhoE protein and the structural sequence of outer membrane OmpF protein, and a bla-phoE hybrid gene which encodes the signal sequence as well as 158 amino acids of the structural sequence of the periplasmic enzyme beta-lactamase and the complete structural sequence of PhoE protein. The products of these genes are normally transported to and assembled into the outer membrane These results show: (i) that signal sequences of exported proteins are export signals which function independently of the structural sequence, and (ii) that the information which determines the ultimate location of an outer membrane protein is located in the structural sequence of this protein, and not in the signal sequence.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号