Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Neural network models for promoter recognition   Cited by: 8 (self-citations: 0, citations by others: 8)
The problem of recognizing promoter sites in a DNA sequence has been treated with learning neural network models. The maximum network capacity admissible for this problem has been estimated on the basis of all the experimental data available on determined promoter sequences. A block neural network model has been constructed to satisfy this estimate, and rules have been elaborated for its learning and testing. The learning process involves a small part (of the order of 10%) of the total set of promoter sequences. During this procedure the neural network develops a system of distinctive features (key words) to be used as a reference in identifying promoters against the background of random sequences. The learning quality is then tested with the whole set. The efficiency of promoter recognition has been found to be 94 to 99%. The probability of an arbitrary sequence being identified as a promoter is 2 to 6%.

2.
3.
A new method based on neural networks to cluster proteins into families is described. The network is trained with the Kohonen unsupervised learning algorithm, using matrix pattern representations of the protein sequences as inputs. The components (x, y) of these 20×20 matrix patterns are the normalized frequencies of all pairs xy of amino acids in each sequence. We investigate the influence of different learning parameters on the final topological maps obtained with a learning set of ten proteins belonging to three established families. In all cases, except those where the synaptic vectors remain nearly unchanged during learning, the ten proteins are correctly classified into the expected families. The classification by the trained network of mutated or incomplete sequences of the learned proteins is also analysed. The neural network gives a correct classification for a sequence mutated in 21.5%±7% of its amino acids and for fragments representing 7.5%±3% of the original sequence. Similar results were obtained with a learning set of 32 proteins belonging to 15 families. These results show that a neural network can be trained following the Kohonen algorithm to obtain topological maps of protein sequences, where related proteins are finally associated with the same winner neuron or with neighboring ones, and that the trained network can be applied to rapidly classify new sequences. This approach opens new possibilities for finding rapid and efficient algorithms to organize and search for homologies in the whole protein database.
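The 20×20 input pattern described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "pairs xy" means adjacent residues and that frequencies are normalized by the total number of pairs, neither of which the abstract states. The names `AMINO_ACIDS` and `pair_frequency_matrix` are hypothetical.

```python
# Assumption: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_frequency_matrix(sequence):
    """Return a 20x20 nested list of normalized adjacent-pair frequencies.

    Assumption: a "pair xy" is residue x immediately followed by residue y,
    and each count is divided by the total number of counted pairs.
    """
    counts = [[0.0] * 20 for _ in range(20)]
    total = 0
    for x, y in zip(sequence, sequence[1:]):
        # Skip pairs containing non-standard residue codes.
        if x in INDEX and y in INDEX:
            counts[INDEX[x]][INDEX[y]] += 1
            total += 1
    if total == 0:
        return counts
    return [[c / total for c in row] for row in counts]
```

Flattening the matrix gives a 400-component vector that could serve as the input to a self-organizing map.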

4.
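Objective: Prediction models based on position-specific scoring matrices (PSSM) have achieved good results, and various PSSM-based optimization methods continue to be developed, but their accuracy remains relatively low. To further improve prediction accuracy, this study investigates the convolutional neural network (CNN) algorithm. Methods: Promoter sequences were converted into numerical matrices using PSSM and classified with a CNN. Sigma38, Sigma54 and Sigma70 promoter sequences from Escherichia coli K-12 (E. coli K-12, hereafter E. coli) served as the positive set, and sequences from coding and non-coding regions served as the negative set. Results: In the binary classification of E. coli promoters, accuracy reached 99% and the success rate of promoter prediction approached 100%. In the three-class classification of the Sigma38, Sigma54 and Sigma70 promoters, prediction accuracy was 98%, and the accuracy for each sequence type exceeded 98%. Finally, in the four-class classification of the three promoter types together with either coding or non-coding sequences, the accuracy was 0.98; with balanced samples of the three Sigma promoters, the ten-fold cross-validation accuracy exceeded 0.95 in every case, with a Hamming distance of 0.016 and a Kappa coefficient of 0.97. Conclusion: Compared with other classification algorithms such as the support vector machine (SVM), the CNN classifier has advantages, and because of these advantages the encoding scheme can also be simplified.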

5.
A three-layered back-propagation neural network was trained to recognize E. coli promoters of the 17 base spacing class. To this end, the network was presented with 39 promoter sequences and derivatives of those sequences as positive inputs; 60% A + T random sequences and sequences containing 2 promoter-down point mutations were used as negative inputs. The entire promoter sequence of 58 bases, approximately -50 to +8, was entered as input. The network was asked to associate an output of 1.0 with promoter sequence input and 0.0 with non-promoter input. Generally, after 100,000 input cycles, the network was virtually perfect in classifying the training set. A trained network was about 80% effective in recognizing 'new' promoters which were not in the training set, with a false positive rate below 0.1%. Network searches on pBR322 and on the lambda genome were also performed. Overall the results were somewhat better than the best rule-based procedures. The trained network can be analyzed both for its choice of base and relative weighting, positive and negative, at each position of the sequence. This method, which requires only appropriate input/output training pairs, can be used to define and search for any DNA regulatory sequence for which there are sufficient exemplars.
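The abstract does not say how the 58 bases were presented to the network. One common choice, shown here purely as an assumption, is one-hot encoding, which turns a 58-base window into a vector of 58 × 4 = 232 numbers. The names `BASES` and `one_hot` are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a flat list with four 0/1 values per base.

    Assumption: one-hot encoding in A, C, G, T order; the abstract does not
    specify the input representation. Unknown symbols become all zeros.
    """
    vector = []
    for base in sequence.upper():
        vector.extend(1.0 if base == b else 0.0 for b in BASES)
    return vector
```

With this representation, the weights attached to each group of four inputs could be inspected per position, in the spirit of the base-by-base analysis the abstract mentions.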

6.
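This paper proposes a deep learning model based on convolutional and recurrent neural networks that identifies circular RNA splice sites in the human genome by analyzing genomic sequence data. First, based on the preprocessed nucleotide sequences, 2 network depths, 8 convolution kernel sizes and 3 long short-term memory (LSTM) parameters were designed, giving 16 models in 8 groups. Second, mean pooling and max pooling were tested for the pooling layer, and GC content was added to improve the model's predictive power. Finally, predictions were made for experimentally validated circular RNAs in human seminal plasma. The results show that the model with a convolution kernel size of 32×4, a depth of 1 and an LSTM parameter of 32 achieved the highest recognition rate: 0.9824 on the training set, an accuracy of 0.95 on the test set, and a correct recognition rate of 83% on the experimentally validated data. The model performs well in identifying human circular RNA splice sites.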
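GC content, the extra feature mentioned in this abstract, is the fraction of bases in a sequence that are G or C. A minimal sketch follows; the function name `gc_content` is hypothetical, and how the authors fed this value into their model is not stated in the abstract.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```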

7.
Abstract

A series of CYC1 constructions in which the upstream promoter portion has been replaced by a variety of HIS4 synthetic fragments has demonstrated that the 5′ TGACTC 3′ repeat is crucial for conferring amino acid general control. Efficient regulation, however, is obtained only with fragments containing both the repeat and flanking sequences. Analysis of the flanks shows the presence of a 16 nucleotide long sequence composed of alternations of two purines and two pyrimidines between the upstream and downstream repeats. Such a sequence has very large twist angle variations. Homologous sequences are observed in HIS1, HIS3, and in TRP5 upstream regions between copies of the repeat. Sequences which confer special structural characteristics may aid in protein recognition of the promoter region.

8.
《Biologicals》2014,42(1):22-28
The advent of modern high-throughput sequencing has made it possible to generate vast quantities of genomic sequence data. However, processing this volume of information, including the prediction of gene-coding and regulatory sequences, remains an important bottleneck in bioinformatics research. In this work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable of predicting promoter regions with augmented accuracy, specificity and sensitivity. We took our method beyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the six main sigma subunits of Escherichia coli. The methodology successfully rediscovered known promoter sequences recognized by E. coli RNA polymerase subunits σ24, σ28, σ32, σ38, σ54 and σ70, with notable accuracies for σ28- and σ54-dependent promoter sequences (80% and 78.8%, respectively). Furthermore, discriminating promoters according to the σ factor made it possible to extract functional commonalities for the genes expressed by each type of promoter. DNA duplex stability emerges as a distinctive feature that improves the recognition and classification of σ28- and σ54-dependent promoter sequences. The findings presented in this report underscore the usefulness of including DNA biophysical parameters in NN learning algorithms to increase accuracy, specificity and sensitivity in promoter prediction beyond what is accomplished based on sequence alone.

9.
10.
11.
The singing behavior of songbirds has been investigated as a model of sequence learning and production. The song of the Bengalese finch, Lonchura striata var. domestica, is well described by a finite state automaton including a stochastic transition of the note sequence, which can be regarded as a higher-order Markov process. Focusing on the neural structure of songbirds, we propose a neural network model that generates higher-order Markov processes. The neurons in the robust nucleus of the archistriatum (RA) encode each note; they are activated by RA-projecting neurons in the HVC (used as a proper name). We hypothesize that the same note included in different chunks is encoded by distinct RA-projecting neuron groups. From this assumption, the output sequence of RA is a higher-order Markov process, even though the RA-projecting neurons in the HVC fire on first-order Markov processes. We developed a neural network model of the local circuits in the HVC that explains the mechanism by which RA-projecting neurons transit stochastically on first-order Markov processes. Numerical simulation showed that this model can generate first-order Markov process song sequences.
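The central idea of this abstract, that a first-order Markov chain over hidden states can emit a higher-order Markov note sequence when one note is encoded by several states, can be illustrated with a toy chain. The states, notes and probabilities below are invented for illustration and are not taken from the paper.

```python
import random

# Hypothetical hidden states: "a1" and "a2" both emit the note "a" but
# belong to different chunks. Transitions depend only on the current state.
TRANSITIONS = {
    "a1": [("b", 1.0)],
    "b": [("a2", 1.0)],
    "a2": [("c", 1.0)],
    "c": [("a1", 0.5), ("c", 0.5)],
}
EMIT = {"a1": "a", "a2": "a", "b": "b", "c": "c"}


def generate(start, length, rng):
    """Walk the first-order chain and return the emitted note string."""
    state, notes = start, []
    for _ in range(length):
        notes.append(EMIT[state])
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights)[0]
    return "".join(notes)
```

In the emitted string, the note after an "a" cannot be predicted from the "a" alone: it is "c" when the preceding note was "b" and "b" otherwise. The note sequence is therefore second-order even though the hidden chain is first-order.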

12.

Background  

Many processes in molecular biology involve the recognition of short sequences of nucleic or amino acids, such as the binding of immunogenic peptides to major histocompatibility complex (MHC) molecules. From experimental data, a model of the sequence specificity of these processes can be constructed, such as a sequence motif, a scoring matrix or an artificial neural network. The purpose of these models is two-fold. First, they can provide a summary of experimental results, allowing for a deeper understanding of the mechanisms involved in sequence recognition. Second, such models can be used to predict the experimental outcome for as yet untested sequences. In the past we reported the development of a method to generate such models called the Stabilized Matrix Method (SMM). This method has been successfully applied to predicting peptide binding to MHC molecules, peptide transport by the transporter associated with antigen presentation (TAP) and proteasomal cleavage of protein sequences.

13.
Computational analysis of core promoters in the Drosophila genome   Cited by: 1 (self-citations: 0, citations by others: 1)
Ohler U  Liao GC  Niemann H  Rubin GM 《Genome biology》2002,3(12):research0087.1-8712

14.
王丽  赵云  杨茜  戴欣  朱雅新  董志扬 《Acta Microbiologica Sinica》(微生物学报) 2019,59(11):2218-2228
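[Objective] To screen the genomes of microorganisms from extreme environments for novel promoter elements that can be used in the design of chassis cells for synthetic biology. [Methods] Using the probe plasmid pUC18-GFP, which carries the green fluorescent protein structural gene and a ribosome binding site, as the vector, we constructed a rumen microbial metagenomic plasmid library and rapidly and efficiently screened it for DNA fragments with promoter function. Candidate promoter regions were then identified by neural network-based promoter prediction. Green fluorescent protein and a maltotetraose-forming amylase from Pseudomonas stutzeri were used as reporter genes to verify the function of the new promoter fragments. [Results] From about 3750 transformants, we identified 22 DNA fragments with constitutive promoter function. These fragments show low homology to gene sequences reported in the NCBI database and vary in promoter strength. Through promoter prediction and subcloning, we obtained two entirely new promoter fragments, RFa1p2 (76 bp) and RFb4p (547 bp). These new constitutive promoters can drive high-level expression of heterologous proteins in engineered E. coli strains without the addition of any inducer.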

15.
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.

16.
Zhao Chengshuai  Qiu Yang  Zhou Shuang  Liu Shichao  Zhang Wen  Niu Yanqing 《BMC genomics》2020,21(13):1-12
Background

Researchers have discovered that lncRNA–miRNA regulatory paradigms modulate gene expression patterns and drive major cellular processes. Identification of lncRNA-miRNA interactions (LMIs) is critical to reveal the mechanisms of biological processes and complicated diseases. Because conventional wet experiments are time-consuming, labor-intensive and costly, a few computational methods have been proposed to expedite the identification of lncRNA-miRNA interactions. However, little attention has been paid to fully exploiting the structural and topological information of the lncRNA-miRNA interaction network.

Results

In this paper, we propose novel lncRNA-miRNA prediction methods using graph embedding and ensemble learning. First, we calculate lncRNA-lncRNA sequence similarity and miRNA-miRNA sequence similarity, and then we combine them with the known lncRNA-miRNA interactions to construct a heterogeneous network. Second, we adopt several graph embedding methods to learn embedded representations of lncRNAs and miRNAs from the heterogeneous network, and construct the ensemble models using two ensemble strategies. For the first strategy, we treat individual graph embedding based models as base predictors and integrate their predictions, developing a method named GEEL-PI. For the second, we construct a deep attention neural network (DANN) to integrate various graph embeddings, presenting an ensemble method named GEEL-FI. The experimental results demonstrate that both GEEL-PI and GEEL-FI outperform other state-of-the-art methods. The effectiveness of the two ensemble strategies is validated by further experiments. Moreover, the case studies show that GEEL-PI and GEEL-FI can find novel lncRNA-miRNA associations.

Conclusion

The study reveals that methods based on graph embedding and ensemble learning are effective for integrating heterogeneous information derived from the lncRNA-miRNA interaction network and can achieve better performance on the LMI prediction task. In conclusion, GEEL-PI and GEEL-FI are promising for lncRNA-miRNA interaction prediction.


17.
《IRBM》2023,44(1):100732
Objective

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) is a powerful genome editing technology. Guide RNA (gRNA) plays an essential guiding role in the CRISPR system by complementary base pairing with target DNA. Since the CRISPR targeting mechanism problem has not yet been fully resolved, it remains a challenge to predict gRNA on-target efficiency. Current gRNA design tools often lack efficient information extraction and cannot learn the target efficiency patterns thoroughly.

Material and methods

In this study, CRISPR-OTE is proposed to consider both multi-dimensional sequence information and important complementary prior knowledge based on a simple but effective framework. CRISPR-OTE consists of the local-contextual information branch and the prior knowledge branch. The local-contextual information branch extracts multi-dimensional sequence features from the DNA primary sequence by a parallel framework of Convolutional Neural Networks (CNN) and bidirectional Long Short-Term Memory networks (biLSTM). The prior knowledge branch selects the optimal subset of physicochemical features to provide the neural network with complementary knowledge, such as complex secondary structures. A simple feature fusion strategy is also adopted to fully utilize multi-modal data from the two branches.

Results

The experimental results show that the optimal subset of physicochemical features (RNA secondary structure and melting temperature of 34nt target) can effectively improve the prediction performance. Additionally, combining multi-dimensional sequence features and multi-modal features can extract information more comprehensively. Through transfer learning, CRISPR-OTE trained on the CRISPR-Cpf1 system can also be successfully applied to the CRISPR-Cas9 system.

Conclusion

The performance of CRISPR-OTE is superior to other methods in different CRISPR systems and species. Therefore, CRISPR-OTE is a simple on-target efficiency prediction framework with better accuracy and generalization performance.

18.
19.
Designing protein sequences that can fold into a given structure is a well‐known inverse protein‐folding problem. One important characteristic to attain for a protein design program is the ability to recover wild‐type sequences given their native backbone structures. The highest average sequence identity accuracy achieved by current protein‐design programs in this problem is around 30%, achieved by our previous system, SPIN. SPIN is a program that predicts sequences compatible with a provided structure using a neural network with fragment‐based local and energy‐based nonlocal profiles. Our new model, SPIN2, uses a deep neural network and additional structural features to improve on SPIN. SPIN2 achieves over 34% in sequence recovery in 10‐fold cross‐validation and independent tests, a 4% improvement over the previous version. The sequence profiles generated from SPIN2 are expected to be useful for improving existing fold recognition and protein design techniques. SPIN2 is available at http://sparks-lab.org .
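Sequence recovery, the metric quoted in this abstract, is commonly computed as the fraction of positions at which the designed sequence carries the same residue as the native one. The sketch below assumes that definition and equal-length, position-matched sequences; the function name `sequence_recovery` is hypothetical and this is not SPIN2's evaluation code.

```python
def sequence_recovery(designed, native):
    """Return the fraction of positions where designed matches native.

    Assumption: both sequences describe the same backbone, so they have the
    same length and position i in one corresponds to position i in the other.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

Under this definition, a recovery of 34% means that roughly one residue in three is predicted identically to the wild type.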

20.
Kaur H  Raghava GP 《FEBS letters》2004,564(1-2):47-57
In this study, an attempt has been made to develop a neural network-based method for predicting segments in proteins containing aromatic-backbone NH (Ar-NH) interactions using multiple sequence alignment. We have analyzed 3121 segments seven residues long containing Ar-NH interactions, extracted from 2298 non-redundant protein structures where no two proteins have more than 25% sequence identity. Two consecutive feed-forward neural networks with a single hidden layer have been trained with standard back-propagation as the learning algorithm. The performance of the method improves from 0.12 to 0.15 in terms of Matthews correlation coefficient (MCC) value when evolutionary information (multiple alignment obtained from PSI-BLAST) is used as input instead of a single sequence. The performance of the method further improves from MCC 0.15 to 0.20 when secondary structure information predicted by PSIPRED is incorporated in the prediction. The final network yields an overall prediction accuracy of 70.1% and an MCC of 0.20 when tested by five-fold cross-validation. Overall the performance is 15.2% higher than the random prediction. The method consists of two neural networks: (i) a sequence-to-structure network which predicts the aromatic residues involved in Ar-NH interaction from multiple alignment of protein sequences and (ii) a structure-to-structure network where the input consists of the output obtained from the first network and predicted secondary structure. Further, the actual position of the donor residue within the 'potential' predicted fragment has been predicted using a separate sequence-to-structure neural network. Based on the present study, a server Ar_NHPred has been developed which predicts Ar-NH interaction in a given amino acid sequence. The web server Ar_NHPred is available at and (mirror site).
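The Matthews correlation coefficient reported in this abstract is computed from the four cells of a binary confusion matrix. A minimal sketch of the standard formula follows; the function name `mcc` is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here rather than something the abstract specifies.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Assumption: return 0.0 when any marginal total is zero.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Because MCC uses all four cells, it can stay low even when accuracy looks reasonable, which is one way a classifier can combine an accuracy near 70% with an MCC near 0.20.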
