Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
The problem of recognition of promoter sites in the DNA sequence has been treated with models of learning neural networks. The maximum network capacity admissible for this problem has been estimated on the basis of the total of experimental data available on the determined promoter sequences. The model of a block neural network has been constructed to satisfy this estimate and rules have been elaborated for its learning and testing. The learning process involves a small (of the order of 10%) part of the total set of promoter sequences. During this procedure the neural network develops a system of distinctive features (key words) to be used as a reference in identifying promoters against the background of random sequences. The learning quality is then tested with the whole set. The efficiency of promoter recognition has been found to amount to 94 to 99%. The probability of an arbitrary sequence being identified as a promoter is 2 to 6%.

2.
3.
A three-layer back-propagation neural network was trained to recognize E. coli promoters of the 17 base spacing class. To this end, the network was presented with 39 promoter sequences and derivatives of those sequences as positive inputs; 60% A + T random sequences and sequences containing two promoter-down point mutations were used as negative inputs. The entire promoter sequence of 58 bases, approximately -50 to +8, was entered as input. The network was asked to associate an output of 1.0 with promoter sequence input and 0.0 with non-promoter input. Generally, after 100,000 input cycles, the network was virtually perfect in classifying the training set. A trained network was about 80% effective in recognizing 'new' promoters which were not in the training set, with a false positive rate below 0.1%. Network searches on pBR322 and on the lambda genome were also performed. Overall the results were somewhat better than the best rule-based procedures. The trained network can be analyzed both for its choice of base and relative weighting, positive and negative, at each position of the sequence. This method, which requires only appropriate input/output training pairs, can be used to define and search for any DNA regulatory sequence for which there are sufficient exemplars.
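
The abstract does not state how the 58 bases were presented to the network. A common choice, assumed here, is one-hot coding with four binary inputs per base, which would give 58 × 4 = 232 input values. The function name, the channel order, and the example window are illustrative and not taken from the paper.

```python
BASES = "ACGT"  # assumed channel order; the paper's encoding is not given


def one_hot_encode(seq: str) -> list[int]:
    """Flatten a DNA sequence into four binary inputs per base."""
    encoded = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        # Exactly one of the four positions is set for each base.
        encoded.extend(1 if base == b else 0 for b in BASES)
    return encoded


# Hypothetical 58-base window: a -35 hexamer, a 17-base spacer, a -10 hexamer,
# and filler bases. It is not a sequence from the training set.
window = "TTGACA" + "A" * 17 + "TATAAT" + "C" * 29
vector = one_hot_encode(window)
print(len(window), len(vector))  # 58 232
```

A target of 1.0 or 0.0 would then be paired with each such vector, as the abstract describes for promoter and non-promoter inputs.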

4.
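This paper proposes a deep learning model based on convolutional and recurrent neural networks that identifies circular RNA splice sites in the human genome by analyzing genomic sequence data. First, using the preprocessed nucleotide sequences, 2 network depths, 8 convolution kernel sizes, and 3 long short-term memory (LSTM) parameter settings were designed, giving 16 models in 8 groups. Second, average pooling and max pooling were compared for the pooling layer, and GC content was added to improve the model's predictive power. Finally, predictions were made for experimentally verified circular RNAs from human seminal plasma. The results show that the model with a 32×4 convolution kernel, a depth of 1, and an LSTM parameter of 32 had the highest recognition rate: 0.9824 on the training set and an accuracy of 0.95 on the test set, with a correct recognition rate of 83% on the experimentally verified data. The model performs well at identifying circular RNA splice sites in humans.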
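
GC content, which the abstract reports adding as an extra feature, is the fraction of G and C bases in a sequence. The sketch below is a minimal illustration; the function name is hypothetical, and the abstract does not say over which window the value was computed or how it was fed to the model.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical six-base example: four of the six bases are G or C.
print(round(gc_content("ATGCGC"), 3))  # 0.667
```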

5.
6.
A new method based on neural networks to cluster proteins into families is described. The network is trained with the Kohonen unsupervised learning algorithm, using matrix pattern representations of the protein sequences as inputs. The components (x, y) of these 20×20 matrix patterns are the normalized frequencies of all pairs xy of amino acids in each sequence. We investigate the influence of different learning parameters in the final topological maps obtained with a learning set of ten proteins belonging to three established families. In all cases, except in those where the synaptic vectors remain nearly unchanged during learning, the ten proteins are correctly classified into the expected families. The classification by the trained network of mutated or incomplete sequences of the learned proteins is also analysed. The neural network gives a correct classification for a sequence mutated in 21.5%±7% of its amino acids and for fragments representing 7.5%±3% of the original sequence. Similar results were obtained with a learning set of 32 proteins belonging to 15 families. These results show that a neural network can be trained following the Kohonen algorithm to obtain topological maps of protein sequences, where related proteins are finally associated to the same winner neuron or to neighboring ones, and that the trained network can be applied to rapidly classify new sequences. This approach opens new possibilities to find rapid and efficient algorithms to organize and search for homologies in the whole protein database.
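
The 20×20 input pattern described above can be sketched as follows. Two points are assumptions, because the abstract does not spell them out: pairs are taken to be adjacent residues, and frequencies are normalized by the total number of counted pairs. The function name and the amino acid ordering are also illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_frequency_matrix(seq: str) -> list[list[float]]:
    """Build a 20x20 matrix of normalized adjacent-pair frequencies."""
    counts = [[0] * 20 for _ in range(20)]
    total = 0
    for x, y in zip(seq, seq[1:]):
        # Pairs containing non-standard residue codes are skipped.
        if x in INDEX and y in INDEX:
            counts[INDEX[x]][INDEX[y]] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no countable residue pairs")
    return [[c / total for c in row] for row in counts]


# Hypothetical toy sequence with pairs AC, CA, AC.
matrix = pair_frequency_matrix("ACAC")
print(round(matrix[INDEX["A"]][INDEX["C"]], 3))  # 0.667
```

Because every sequence maps to a matrix of the same size regardless of its length, unaligned proteins can be compared directly, which fits the use of a Kohonen map on the patterns.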

7.
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.

8.
Kaur H, Raghava GP. FEBS Letters, 2004, 564(1-2): 47-57
In this study, an attempt has been made to develop a neural network-based method for predicting segments in proteins containing aromatic-backbone NH (Ar-NH) interactions using multiple sequence alignment. We have analyzed 3121 segments seven residues long containing Ar-NH interactions, extracted from 2298 non-redundant protein structures where no two proteins have more than 25% sequence identity. Two consecutive feed-forward neural networks with a single hidden layer have been trained with standard back-propagation as the learning algorithm. The performance of the method improves from 0.12 to 0.15 in terms of Matthews correlation coefficient (MCC) value when evolutionary information (multiple alignment obtained from PSI-BLAST) is used as input instead of a single sequence. The performance of the method further improves from MCC 0.15 to 0.20 when secondary structure information predicted by PSIPRED is incorporated in the prediction. The final network yields an overall prediction accuracy of 70.1% and an MCC of 0.20 when tested by five-fold cross-validation. Overall the performance is 15.2% higher than the random prediction. The method consists of two neural networks: (i) a sequence-to-structure network which predicts the aromatic residues involved in Ar-NH interaction from multiple alignment of protein sequences and (ii) a structure-to-structure network where the input consists of the output obtained from the first network and predicted secondary structure. Further, the actual position of the donor residue within the 'potential' predicted fragment has been predicted using a separate sequence-to-structure neural network. Based on the present study, a server Ar_NHPred has been developed which predicts Ar-NH interaction in a given amino acid sequence. The web server Ar_NHPred is available at and (mirror site).
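
For reference, the Matthews correlation coefficient quoted above is computed from the four cells of a binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below uses that standard definition; the function name and the example counts are hypothetical and are not data from the study.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: report 0 when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts illustrating the range of the measure.
print(mcc(50, 50, 0, 0))    # 1.0, perfect prediction
print(mcc(25, 25, 25, 25))  # 0.0, no better than chance
```

An MCC of 0.20 therefore indicates a weak but positive correlation between predicted and observed interactions, which is why it is reported alongside the 70.1% accuracy.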

9.
Designing protein sequences that can fold into a given structure is a well‐known inverse protein‐folding problem. One important characteristic to attain for a protein design program is the ability to recover wild‐type sequences given their native backbone structures. The highest average sequence identity accuracy achieved by current protein‐design programs in this problem is around 30%, achieved by our previous system, SPIN. SPIN is a program that predicts sequences compatible with a provided structure using a neural network with fragment‐based local and energy‐based nonlocal profiles. Our new model, SPIN2, uses a deep neural network and additional structural features to improve on SPIN. SPIN2 achieves over 34% in sequence recovery in 10‐fold cross‐validation and independent tests, a 4% improvement over the previous version. The sequence profiles generated from SPIN2 are expected to be useful for improving existing fold recognition and protein design techniques. SPIN2 is available at http://sparks-lab.org .
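
Sequence recovery, as used above, is usually understood as the fraction of positions at which the designed residue matches the wild-type residue. The sketch below assumes that definition and equal-length, position-matched sequences; the function name and the toy sequences are hypothetical and unrelated to SPIN2's own code.

```python
def sequence_recovery(predicted: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native one."""
    if not native or len(predicted) != len(native):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


# Hypothetical four-residue example: three of four positions agree.
print(sequence_recovery("ACDE", "ACDF"))  # 0.75
```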

10.
Neural network optimization for E. coli promoter prediction. Total citations: 9 (self-citations: 5, citations by others: 4)
Methods for optimizing the prediction of Escherichia coli RNA polymerase promoter sequences by neural networks are presented. A neural network was trained on a set of 80 known promoter sequences combined with different numbers of random sequences. The conserved -10 region and -35 region of the promoter sequences and a combination of these regions were used in three independent training sets. The prediction accuracy of the resulting weight matrix was tested against a separate set of 30 known promoter sequences and 1500 random sequences. The effects of the network's topology, the extent of training, the number of random sequences in the training set and the effects of different data representations were examined and optimized. Accuracies of 100% on the promoter test set and 98.4% on the random test set were achieved with the optimal parameters.

11.
A sensitive technique for protein sequence motif recognition based on neural networks has been developed. It involves three major steps. (1) At each appropriate alignment position of a set of N matched sequences, a set of N aligned oligopeptides is specified with preselected window length. N neural nets are subsequently and successively trained on N-1 amino acid spans after eliminating each ith oligopeptide. A test for recognition of each of the ith spans is performed. The average neural net recognition over N such trials is used as a measure of conservation for the particular windowed region of the multiple alignment. This process is repeated for all possible spans of given length in the multiple alignment. (2) The M most conserved regions are regarded as motifs and the oligopeptides within each are used to train intensively M individual neural networks. (3) The M networks are then applied in a search for related primary structures in a databank of known protein sequences. The oligopeptide spans in the database sequence with strongest neural net output for each of the M networks are saved and then scored according to the output signals and the proper combination that follows the expected N- to C-terminal sequence order. The motifs from the database with highest similarity scores can then be used to retrain the M neural nets, which can be subsequently utilized for further searches in the databank, thus providing even greater sensitivity to recognize distant familial proteins. This technique was successfully applied to the integrase, DNA-polymerase and immunoglobulin families.

12.
This paper describes a method that combines near-infrared spectroscopy with a three-layer back-propagation artificial neural network to identify official and unofficial rhubarbs. Thirty-three samples were taken as the training set, and 62 samples as the test set. The effects of input node number, learning rate and momentum on the final error and recognition accuracy for the training set, and on prediction accuracy for the test set were determined. A neural network with eight input nodes, a 0.5 learning rate, and a momentum of 0.3 can achieve a recognition accuracy of 100% for the training set and a prediction accuracy of 96.8% for the test set. The method described offers a quick and efficient means of identifying rhubarbs.

13.
The fungal transamidase complex that executes glycosylphosphatidylinositol (GPI) lipid anchoring of precursor proteins has overlapping but distinct sequence specificity compared with the animal system. Therefore, a taxon-specific prediction tool for the recognition of the C-terminal signal in fungal sequences is necessary. We have collected a learning set of fungal precursor protein sequences from the literature and fungal proteomes. Although the general four segment scheme of the recognition signal is maintained also in fungal precursors, there are taxon specificities in details. A fungal big-Pi predictor has been developed for the assessment of query sequence concordance with fungi-specific recognition signal requirements. The sensitivity of this predictor is close to 90%. The rate of false positive prediction is in the range of 0.1%. The fungal big-Pi tool successfully predicts the Gas1 mutation series described by C. Nuoffer and co-workers, and recognizes that the human PLAP C terminus is not a target for the fungal transamidase complex. Lists of potentially GPI lipid anchored proteins for five fungal proteomes have been generated and the hits have been functionally classified. The fungal big-Pi prediction WWW server as well as precursor lists are available at

14.
The majority of the promoter elements of mycobacteria do not function well in other eubacterial systems, and analysis of their sequences has established the presence of only a single conserved sequence located at the -10 position. Additional sequences for the appropriate functioning of these promoters have been proposed but not characterized, probably due to the absence of a sufficient number of strong mycobacterial promoters. In the current study, we have isolated functional promoter-like sequences of mycobacteria from the pool of random DNA sequences. Based on the promoter activity in Mycobacterium smegmatis and the score assigned by a neural network promoter prediction program, we selected one of these promoter sequences, namely A37, for characterization in order to understand the structure of housekeeping promoters of mycobacteria. A37-RNAP complexes were subjected to DNase I footprinting and subsequent mutagenesis. Our results demonstrate that in addition to -10 sequences, the DNA sequence at the -35 site can also influence the activity of mycobacterial promoters by modulating the promoter recognition by RNA polymerase and subsequent formation of open complex. We also provide evidence that despite exhibiting similarities in -10 and -35 sequences, promoter regions of mycobacteria and Escherichia coli differ from each other due to differences in their requirement of spacer sequences between the two positions.

15.
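To explore how well deep neural network models can identify conodont images, this study selected eight kinds of Ordovician conodonts as research objects. A total of 1188 conodont images were acquired with a stereomicroscope and 778 were collected from published literature; the image data set was divided into a training set and a test set. Rotating, flipping, and filter-based enhancement of the training images addressed the shortage of training samples. Five residual neural network models (ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152) were trained with transfer learning to obtain model parameters. Their Top-1 test accuracies were 85.37%, 85.85%, 83.90%, 81.95%, and 80.00%, and their Top-2 accuracies were 94.63%, 94.63%, 94.15%, 93.17%, and 93.66%, respectively, showing that the models identify conodont images well. The comparison shows that ResNet-34 has the highest accuracy, which indicates that for conodont taxa with simple features, increasing network depth does not necessarily improve accuracy, whereas choosing a model of suitable depth can both improve accuracy and save computing resources. Comparing transfer-learning training with training from scratch for ResNet-34 shows that transfer learning not only achieves relatively high accuracy but can also more quickly obtain the model param...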
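
Top-1 and Top-2 accuracy, as reported above, count a prediction as correct when the true class is among the one or two highest-scoring classes. The sketch below shows that computation on made-up scores; the function name and the data are hypothetical and do not come from the study.

```python
def top_k_accuracy(scores: list[list[float]], labels: list[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for three images over three classes.
scores = [[0.70, 0.20, 0.10], [0.35, 0.40, 0.25], [0.20, 0.50, 0.30]]
labels = [0, 0, 2]
print(round(top_k_accuracy(scores, labels, 1), 3))  # 0.333
print(top_k_accuracy(scores, labels, 2))            # 1.0
```

Top-2 accuracy can never be lower than Top-1 accuracy on the same predictions, which matches the pattern in the reported figures.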

16.
Promoters are DNA sequences located upstream of the gene region and play a central role in gene expression. Computational techniques show good accuracy in gene prediction but are less successful in predicting promoters, primarily because of the high number of false positives that reflect characteristics of the promoter sequences. Many machine learning methods have been used to address this issue. Neural Networks (NN) have been successfully used in this field because of their ability to recognize imprecise and incomplete patterns characteristic of promoter sequences. In this paper, NN was used to predict and recognize promoter sequences in two data sets: (i) one based on nucleotide sequence information and (ii) another based on stability sequence information. The accuracy was approximately 80% for simulation (i) and 68% for simulation (ii). In the rules extracted, biological consensus motifs were important parts of the NN learning process in both simulations.

17.
Saha S, Raghava GP. Proteins, 2006, 65(1): 40-48
B-cell epitopes play a vital role in the development of peptide vaccines, in diagnosis of diseases, and also for allergy research. Experimental methods used for characterizing epitopes are time consuming and demand large resources. The availability of epitope prediction method(s) can rapidly aid experimenters in simplifying this problem. The standard feed-forward (FNN) and recurrent neural network (RNN) have been used in this study for predicting B-cell epitopes in an antigenic sequence. The networks have been trained and tested on a clean data set, which consists of 700 non-redundant B-cell epitopes obtained from the Bcipep database and an equal number of non-epitopes obtained randomly from the Swiss-Prot database. The networks have been trained and tested with different input window lengths and numbers of hidden units. Maximum accuracy has been obtained using a recurrent neural network (Jordan network) with a single hidden layer of 35 hidden units for a window length of 16. The final network yields an overall prediction accuracy of 65.93% when tested by fivefold cross-validation. The corresponding sensitivity, specificity, and positive prediction values are 67.14, 64.71, and 65.61%, respectively. It has been observed that RNN (JE) was more successful than FNN in the prediction of B-cell epitopes. The length of the peptide is also important in the prediction of B-cell epitopes from antigenic sequences. The webserver ABCpred is freely available at www.imtech.res.in/raghava/abcpred/.
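
The four measures quoted above follow from the cells of a binary confusion matrix: sensitivity is TP/(TP+FN), specificity is TN/(TN+FP), positive predictive value is TP/(TP+FP), and accuracy is (TP+TN) divided by all predictions. The sketch below applies these standard definitions to hypothetical counts; the function name and the numbers are illustrative and are not the study's confusion matrix.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard evaluation measures for a two-class predictor."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true epitopes recovered
        "specificity": tn / (tn + fp),  # share of non-epitopes rejected
        "ppv": tp / (tp + fp),          # share of positive calls that are right
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical balanced set: 100 positives and 100 negatives.
metrics = binary_metrics(tp=80, tn=70, fp=30, fn=20)
print(metrics["sensitivity"], metrics["specificity"], metrics["accuracy"])  # 0.8 0.7 0.75
```

On a balanced data set such as the one described, accuracy is the mean of sensitivity and specificity, which is consistent with the reported 65.93% lying between 67.14% and 64.71%.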

18.
19.
20.