Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Bayesian segmentation of protein secondary structure.   Cited by: 12 (self-citations: 0, citations by others: 12)
We present a novel method for predicting the secondary structure of a protein from its amino acid sequence. Most existing methods predict each position in turn based on a local window of residues, sliding this window along the length of the sequence. In contrast, we develop a probabilistic model of protein sequence/structure relationships in terms of structural segments, and formulate secondary structure prediction as a general Bayesian inference problem. A distinctive feature of our approach is the ability to develop explicit probabilistic models for alpha-helices, beta-strands, and other classes of secondary structure, incorporating experimentally and empirically observed aspects of protein structure such as helical capping signals, side chain correlations, and segment length distributions. Our model is Markovian in the segments, permitting efficient exact calculation of the posterior probability distribution over all possible segmentations of the sequence using dynamic programming. The optimal segmentation is computed and compared to a predictor based on marginal posterior modes, and the latter is shown to provide significant improvement in predictive accuracy. The marginalization procedure provides exact secondary structure probabilities at each sequence position, which are shown to be reliable estimates of prediction uncertainty. We apply this model to a database of 452 nonhomologous structures, achieving accuracies as high as the best currently available methods. We conclude by discussing an extension of this framework to model nonlocal interactions in protein structures, providing a possible direction for future improvements in secondary structure prediction accuracy.
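The segment-level dynamic programming described in this abstract can be illustrated with a minimal sketch. It searches only for the single best segmentation (a Viterbi-style recursion) and does not compute the posterior distribution or the marginal posterior modes that the abstract reports as more accurate. The function `segment_viterbi`, the callback `seg_score`, and the toy scoring rule are assumptions made for illustration, not the authors' model.

```python
def segment_viterbi(seq, states, seg_score, max_len=None):
    """Best-scoring segmentation of seq into labelled segments.

    seg_score(state, seq, i, j) is a hypothetical callback returning the
    log-score of labelling seq[i:j] as one segment of the given state.
    Adjacent segments must carry different states.
    """
    n = len(seq)
    max_len = max_len or n
    neg_inf = float("-inf")
    # best[j][s]: best score for seq[:j] when the last segment has state s
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for i in range(max(0, j - max_len), j):
                if i == 0:
                    prev_state, prev_score = None, 0.0
                else:
                    # Best predecessor whose state differs from s.
                    prev_state = max((t for t in states if t != s),
                                     key=lambda t: best[i][t], default=None)
                    if prev_state is None or best[i][prev_state] == neg_inf:
                        continue
                    prev_score = best[i][prev_state]
                cand = prev_score + seg_score(s, seq, i, j)
                if cand > best[j][s]:
                    best[j][s] = cand
                    back[j][s] = (i, prev_state)
    # Trace back from the best final state.
    state = max(states, key=lambda t: best[n][t])
    total, j, segments = best[n][state], n, []
    if total == neg_inf:
        return [], total
    while j > 0:
        i, prev_state = back[j][state]
        segments.append((state, i, j))
        j, state = i, prev_state
    return segments[::-1], total


def toy_score(state, seq, i, j):
    # Toy rule (assumption): "H" favours A, "C" favours G, -0.5 per segment.
    favoured = {"H": "A", "C": "G"}[state]
    return sum(1.0 if aa == favoured else -1.0 for aa in seq[i:j]) - 0.5


segments, total = segment_viterbi("AAAGGG", ["H", "C"], toy_score)
print(segments, total)
```

Because the score of a segmentation is a sum over segments, the recursion only needs the best score for each (end position, last state) pair. A real model would replace `toy_score` with segment likelihoods and length priors.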

2.
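By studying algorithms based on neural network weight matrices, this work explores the intrinsic relationships between protein secondary structure and amino acid sequence, with the aim of improving the accuracy of secondary structure prediction from the primary sequence. Neural networks perform well at feature classification, and the matrix of neuron connection weights obtained after training captures the intrinsic features and regularities of the samples. The study uses the neural network weight matrix to score and predict, applies a shifted-alignment method to find sensitive amino acid neighborhoods, and analyzes the common behavior of the test set under different window lengths. Experiments show that prediction performance changes markedly at a sliding window length of L=7, and that the amino acid residue at neighborhood position P=4 strengthens prediction performance. The approach offers a new algorithm design for protein secondary structure prediction based on local sequence features.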
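As a rough illustration of scoring a sliding window with a weight matrix, the sketch below sums per-position weights over a window of length 7 and picks the highest-scoring state for the central residue. The names `windows` and `predict_states`, the padding character, and the toy weights are hypothetical; the abstract does not specify how the trained weights are laid out.

```python
def windows(seq, length, pad="-"):
    """One window of odd `length` per residue, padded at both ends."""
    half = length // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + length] for i in range(len(seq))]


def predict_states(seq, weights, length=7):
    """Assign each residue the state with the highest summed window score.

    weights[state] maps (offset, amino_acid) to a float; missing entries
    count as zero. This layout is an assumption made for illustration.
    """
    predicted = []
    for window in windows(seq, length):
        scores = {
            state: sum(table.get((offset, aa), 0.0)
                       for offset, aa in enumerate(window))
            for state, table in weights.items()
        }
        predicted.append(max(scores, key=scores.get))
    return "".join(predicted)


# Toy weights: only the central position (offset 3 of 7) carries signal.
toy_weights = {"H": {(3, "A"): 1.0}, "C": {(3, "G"): 1.0}}
print(predict_states("AAGG", toy_weights))
```

Changing `length` and comparing accuracy on a held-out set is the kind of window-length comparison the abstract describes.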

3.
Hu L, Cui W, He Z, Shi X, Feng K, Ma B, Cai YD. PLoS ONE 2012, 7(6): e39369
Amyloid fibrillar aggregates of polypeptides are associated with many neurodegenerative diseases. Short peptide segments in protein sequences may trigger aggregation. Identifying these stretches and examining their behavior in longer protein segments is critical for understanding these diseases and obtaining potential therapies. In this study, we combined machine learning and structure-based energy evaluation to examine and predict amyloidogenic segments. Our feature selection method discovered that windows consisting of long amino acid segments of ~30 residues, instead of the commonly used short hexapeptides, provided the highest accuracy. Weighted contributions of an amino acid at each position in a 27-residue window revealed three cooperative regions of short stretches, resembling the β-strand-turn-β-strand motif in the A-β peptide amyloid and the β-solenoid structure of the HET-s(218-289) prion (C). Using an in-house energy evaluation algorithm, the interaction energy between two short stretches in a long segment is computed and incorporated as an additional feature. The algorithm successfully predicted and classified amyloid segments with an overall accuracy of 75%. Our study revealed that genome-wide amyloid segments are dependent not only on short high-propensity stretches, but also on nearby residues.

4.
Transmembrane helices predicted at 95% accuracy.   Cited by: 27 (self-citations: 1, citations by others: 27)
We describe a neural network system that predicts the locations of transmembrane helices in integral membrane proteins. By using evolutionary information as input to the network system, the method significantly improved on a previously published neural network prediction method that had been based on single sequence information. The input data were derived from multiple alignments for each position in a window of 13 adjacent residues: amino acid frequency, conservation weights, number of insertions and deletions, and position of the window with respect to the ends of the protein chain. Additional input was the amino acid composition and length of the whole protein. A rigorous cross-validation test on 69 proteins with experimentally determined locations of transmembrane segments yielded an overall two-state per-residue accuracy of 95%. About 94% of all segments were predicted correctly. When applied to known globular proteins as a negative control, the network system incorrectly predicted fewer than 5% of globular proteins as having transmembrane helices. The method was applied to all 269 open reading frames from the complete yeast VIII chromosome. For 59 of these, at least two transmembrane helices were predicted. Thus, the prediction is that about one-fourth of all proteins from yeast VIII contain one transmembrane helix, and some 20%, more than one.

5.
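Oligomeric proteins take part in a wide range of biological processes, so predicting them is of considerable importance. Starting from the protein sequence, this paper proposes a multi-strategy sliding scalable-window feature extraction method and uses a one-versus-one multi-class classification strategy to predict protein homo-oligomers. Under the jackknife test, a weighted feature set that combines the multi-strategy sliding scalable-window features with amino acid composition, classified with a support vector machine, reached a best overall accuracy of 75.37%. This is 10.05% higher than amino acid composition alone and 3.82% higher than BG_Zhang, the best feature reported in the reference literature. The results indicate that the multi-strategy sliding scalable-window method is a very effective feature extraction method for classifying protein homo-oligomers.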

6.
In the post-genome era, there is a great need for protein-specific affinity reagents to explore the human proteome. Antibodies are suitable as reagents, but generation of antibodies with low cross-reactivity to other human proteins requires careful selection of antigens. Here we show the results from a proteome-wide effort to map linear epitopes based on uniqueness relative to the entire human proteome. The analysis was based on a sliding window sequence similarity search using short windows (8, 10, and 12 amino acid residues). A comparison of exact string matching (Hamming distance) and a heuristic method (BLAST) was performed, showing that the heuristic method combined with a grid strategy allows for whole proteome analysis with high accuracy and feasible run times. The analysis shows that it is possible to find unique antigens for a majority of the human proteins, with relatively strict rules involving low sequence identity of the possible linear epitopes. The implications for human antibody-based proteomics efforts are discussed.
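The exact-matching side of this comparison can be sketched as a brute-force search: for every window of the query, take the smallest Hamming distance to any same-length window in the other sequences. The function names below are hypothetical, and the cost grows with the product of the sequence lengths, which is presumably why the abstract reports a heuristic search as the practical route for a whole proteome.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, width):
    """All overlapping windows of the given width."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def uniqueness_profile(query, others, width):
    """Minimum Hamming distance of each query window to any other window.

    A larger value means the window is more distinct from the background.
    If the background has no window of this width, `width` is returned.
    """
    background = [k for seq in others for k in kmers(seq, width)]
    return [
        min((hamming(q, k) for k in background), default=width)
        for q in kmers(query, width)
    ]


print(uniqueness_profile("ACDE", ["ACDF"], 2))
```

Windows whose minimum distance exceeds a chosen cutoff would be candidates for unique linear epitopes; the cutoff itself is a design choice not given in the abstract.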

7.
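A novel scheme for visualizing protein structural information is proposed. Building on the sliding window method, each natural amino acid is described by 48 property parameters selected from an amino acid index database. The parameters of all residues within a given window form a matrix, which is converted into a square matrix by a matrix transformation. Sliding the window along the sequence then yields an eigenvalue matrix that collects the eigenvalues of all window matrices for the whole protein. Plotting the elements of the eigenvalue matrix gives a series of eigenvalue curves whose profile does not change with the window; these are called the characteristic curves of the protein. To choose a suitable window width, eigenvalue matrices were compared for the same type of protein at different window widths and for different types of protein at the same window width, and potential applications are discussed.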

8.
The release window for a given dismount from the asymmetric bars is the period of time within which release results in a successful dismount. Larger release windows are likely to be associated with more consistent performance because they allow a greater margin for error in timing the release. A computer simulation model was used to investigate optimum technique for maximizing release windows in asymmetric bars dismounts. The model comprised four rigid segments with the elastic properties of the gymnast and bar modeled using damped linear springs. Model parameters were optimized to obtain a close match between simulated and actual performances of three gymnasts in terms of rotation angle (1.5 degrees), bar displacement (0.014 m), and release velocities (<1%). Three optimizations to maximize the release window were carried out for each gymnast involving no perturbations, 10-ms perturbations, and 20-ms perturbations in the timing of the shoulder and hip joint movements preceding release. It was found that the optimizations robust to 20-ms perturbations produced release windows similar to those of the actual performances whereas the windows for the unperturbed optimizations were up to twice as large. It is concluded that robustness considerations must be included in optimization studies in order to obtain realistic results and that elite performances are likely to be robust to timing perturbations of the order of 20 ms.

9.
We present a new method for predicting the secondary structure of globular proteins based on non-linear neural network models. Network models learn from existing protein structures how to predict the secondary structure of local sequences of amino acids. The average success rate of our method on a testing set of proteins non-homologous with the corresponding training set was 64.3% on three types of secondary structure (alpha-helix, beta-sheet, and coil), with correlation coefficients of C alpha = 0.41, C beta = 0.31 and C coil = 0.41. These quality indices are all higher than those of previous methods. The prediction accuracy for the first 25 residues of the N-terminal sequence was significantly better. We conclude from computational experiments on real and artificial structures that no method based solely on local information in the protein sequence is likely to produce significantly better results for non-homologous proteins. The performance of our method on homologous proteins is much better than for non-homologous proteins, but is not as good as simply assuming that homologous sequences have identical structures.

10.
A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (≥ 90 residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm (probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for any prediction method.

11.
12.
Secondary structure prediction from the primary sequence of a protein is fundamental to understanding its structure and folding properties. Although several prediction methodologies are in vogue, their performances are far from being completely satisfactory. Among these, non-linear neural networks have been shown to be relatively effective, especially for predicting beta-turns, where dominant interactions are local, arising from four sequence-contiguous residues. Most 3(10)-helices in proteins are also short, comprising three sequence-contiguous residues and two capping residues. In order to understand the extent of local interactions in these 3(10)-helices, we have applied a neural network model with varying window size to predict 3(10)-helices in proteins. We found the prediction accuracy of 3(10)-helices (approximately 14%), as judged by the Matthews correlation coefficient, to be less than that of beta-turns (approximately 20%). The optimal window size for the prediction of 3(10)-helices was about 9 residues. The significance and implications of these results in understanding the occurrence of 3(10)-helices and preferences of amino acid residues in 3(10)-helices are discussed.
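For reference, the Matthews correlation coefficient used as the quality measure here is computed from the four cells of a two-class confusion matrix. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are choices made for this example.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention (assumption): undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_cc(50, 50, 0, 0))
print(matthews_cc(25, 25, 25, 25))
```

Unlike plain per-residue accuracy, this measure stays near zero for a predictor that always outputs the majority class, which matters for rare states such as 3(10)-helices.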

13.
The GOR program for predicting protein secondary structure is extended to include triple correlation. A scoring system for a residue pair to be in a given conformational state is derived from the conditional weight matrix describing amino acid frequencies at each position of a window flanking the pair, conditioned on the pair being in that state. A program using this scoring system to predict protein secondary structure is established. After training the model with a learning set created from PDB_SELECT, the program is tested with two test sets. As a method that uses a single sequence to predict secondary structure, the approach achieves a high accuracy of nearly 70%.

14.
Flavors of protein disorder   Cited by: 1 (self-citations: 0, citations by others: 1)
Intrinsically disordered proteins are characterized by long regions lacking 3-D structure in their native states, yet they have been so far associated with 28 distinguishable functions. Previous studies showed that protein predictors trained on disorder from one type of protein often achieve poor accuracy on disorder of proteins of a different type, thus indicating significant differences in sequence properties among disordered proteins. Important biological problems are identifying different types, or flavors, of disorder and examining their relationships with protein function. Innovative use of computational methods is needed in addressing these problems due to relative scarcity of experimental data and background knowledge related to protein disorder. We developed an algorithm that partitions protein disorder into flavors based on competition among increasing numbers of predictors, with prediction accuracy determining both the number of distinct predictors and the partitioning of the individual proteins. Using 145 variously characterized proteins with long (>30 amino acids) disordered regions, 3 flavors, called V, C, and S, were identified by this approach, with the V subset containing 52 segments and 7743 residues, C containing 39 segments and 3402 residues, and S containing 54 segments and 5752 residues. The V, C, and S flavors were distinguishable by amino acid compositions, sequence locations, and biological function. For the sequences in SwissProt and 28 genomes, their protein functions exhibit correlations with the commonness and usage of different disorder flavors, suggesting different flavor-function sets across these protein groups. Overall, the results herein support the flavor-function approach as a useful complement to structural genomics as a means for automatically assigning possible functions to sequences.

15.
Vocal individuality has been documented in a variety of mammalian species and it has been proposed that this individuality can be used as a vocal fingerprint to monitor individuals. Here we provide and test a classification method using Mel-frequency cepstral coefficients (MFCCs) to extract features from Bornean gibbon female calls. Our method is semi-automated as it requires manual pre-processing to identify and extract calls from the original recordings. We compared two methods of MFCC feature extraction: (1) averaging across all time windows and (2) creating a standardized number of time windows for each call. We analysed 376 calls from 33 gibbon females and, using linear discriminant analysis, found that we were able to improve female identification accuracy from 95.7% with spectrogram features to 98.4% accuracy when averaging MFCCs across time windows, and 98.9% accuracy when using a standardized number of windows. We divided our data randomly into training and test data-sets, and tested the accuracy of support vector machine (SVM) predictions over 100 iterations. We found that we could predict female identity in the test data-set with a 98.8% accuracy. Using SVM on our entire data-set, we were able to predict female identity with 99.5% accuracy (validated by leave-one-out cross-validation). Lastly, we used the method presented here to classify four females recorded during three or more recording seasons using SVM with limited success. We provide evidence that MFCC feature extraction is effective for distinguishing between female Bornean gibbons, and make suggestions for future vocal fingerprinting applications.

16.
Transmembrane helix (TMH) topology prediction is becoming a focal problem in bioinformatics because the structure of TM proteins is difficult to determine using experimental methods. Therefore, methods that can computationally predict the topology of helical membrane proteins are highly desirable. In this paper we introduce TMHindex, a method for detecting TMH segments using only the amino acid sequence information. Each amino acid in a protein sequence is represented by a Compositional Index, which is deduced from a combination of the difference in amino acid occurrences in TMH and non-TMH segments in training protein sequences and the amino acid composition information. Furthermore, a genetic algorithm was employed to find the optimal threshold value for the separation of TMH segments from non-TMH segments. The method successfully predicted 376 out of the 378 TMH segments in a dataset consisting of 70 test protein sequences. The sensitivity and specificity for classifying each amino acid in every protein sequence in the dataset were 0.901 and 0.865, respectively. To assess the generality of TMHindex, we also tested the approach on another standard 73-protein 3D helix dataset. TMHindex correctly predicted 91.8% of proteins based on TM segments. The level of accuracy achieved using TMHindex in comparison to other recent approaches for predicting the topology of TM proteins is a strong argument in favor of our proposed method. Availability: The datasets and software, together with supplementary materials, are available at: http://faculty.uaeu.ac.ae/nzaki/TMHindex.htm.

17.
Cheng Y, Oldfield CJ, Meng J, Romero P, Uversky VN, Dunker AK. Biochemistry 2007, 46(47): 13468-13477
Previously described algorithms for mining alpha-helix-forming molecular recognition elements (MoREs), described by Oldfield et al. (Oldfield, C. J., Cheng, Y., Cortese, M. S., Brown, C. J., Uversky, V. N., and Dunker, A. K. (2005) Comparing and combining predictors of mostly disordered proteins, Biochemistry 44, 1989-2000), also known as molecular recognition features (MoRFs) (Mohan, A., Oldfield, C. J., Radivojac, P., Vacic, V., Cortese, M. S., Dunker, A. K., and Uversky, V. N. (2006) Analysis of Molecular Recognition Features (MoRFs), J. Mol. Biol. 362, 1043-1059), revealed that regions undergoing disorder-to-order transition are involved in many molecular recognition events and are crucial for protein-protein interactions. However, these algorithms were developed using a training data set of a limited size. Here we propose to improve the prediction algorithms by (1) including additional alpha-MoRF examples and their cross species homologues in the positive training set, (2) carefully extracting monomer structure chains from the Protein Data Bank (PDB) as the negative training set, (3) including attributes from recently developed disorder predictors, secondary structure predictions, and amino acid indices, and (4) constructing neural network based predictors and performing validation. Over 50 regions which undergo disorder-to-order transition that were identified in the PDB together with a set of corresponding cross species homologues of each structure-based example were included in a new positive training set. Over 1500 attributes, including disorder predictions, secondary structure predictions, and amino acid indices, were evaluated by the conditional probability method. The top attributes, including VSL2 and VL3 disorder predictions and several physicochemical propensities of amino acid residues, were used to develop the feed forward neural networks. 
The sensitivity, specificity, and accuracy of the resulting predictor, alpha-MoRF-PredII, were 0.87 ± 0.10, 0.87 ± 0.11, and 0.87 ± 0.08 over 10 cross validations, respectively. We present the results of these analyses and validation examples to discuss the potential improvement of the alpha-MoRF-PredII prediction accuracy.

18.
19.
We propose a binary word encoding to improve protein secondary structure prediction. A binary word encoding maps a local amino acid sequence to a binary word, which consists of 0s and 1s. We use an encoding function to map each amino acid to 0 or 1. Using the binary word encoding, we can statistically extract multiresidue information, which depends on more than one residue. We combine the binary word encoding with the GOR method, with a modified version of GOR that shows better accuracy, and with the neural network method. The binary word encoding improves the accuracy of GOR by 2.8%. We obtain a similar improvement when we combine it with the modified GOR method and the neural network method. When we use multiple sequence alignment data, the binary word encoding similarly improves the accuracy. The accuracy of our best combined method is 68.2%. In this paper we show improvement only for the GOR and neural network methods, so we cannot say that the encoding improves other methods. However, the improvement obtained with the encoding suggests that multiresidue interactions affect the formation of secondary structure. In addition, we find that the optimal encoding function obtained by simulated annealing relates to nonpolarity. This means that nonpolarity is important to the multiresidue interaction. Proteins 27:36–46 © 1997 Wiley-Liss, Inc.
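A small sketch may make the encoding concrete: each residue is mapped to one bit, and the bits of a local window are packed into an integer "word". The nonpolar residue set below is an illustrative assumption, not the optimized encoding function found by simulated annealing in the paper, and the function names are hypothetical.

```python
# Illustrative assumption: treat these residues as nonpolar (bit = 1).
NONPOLAR = set("AVLIMFWPG")


def encode_residue(aa):
    """Map one amino acid letter to 0 or 1."""
    return 1 if aa in NONPOLAR else 0


def binary_words(seq, width):
    """Pack each window of `width` residues into an integer binary word."""
    words = []
    for i in range(len(seq) - width + 1):
        value = 0
        for aa in seq[i:i + width]:
            value = (value << 1) | encode_residue(aa)
        words.append(value)
    return words


print(binary_words("AKLV", 2))
```

Because a window of width w has only 2**w possible words rather than 20**w residue combinations, word frequencies per structural state can be estimated from far less data, which is what makes multiresidue statistics tractable.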

20.
Transmission of long duration EEG signals without loss of information is essential for telemedicine based applications. In this work, a lossless compression scheme for EEG signals based on neural network predictors using the concept of correlation dimension (CD) is proposed. EEG signals, which are considered as irregular time series of chaotic processes, can be characterized by the non-linear dynamic parameter CD, which is a measure of the correlation among the EEG samples. The EEG samples are first divided into segments of 1 s duration and for each segment, the value of CD is calculated. Blocks of EEG samples are then constructed such that each block contains segments with closer CD values. By arranging the EEG samples in this fashion, the accuracy of the predictor is improved as it makes use of highly correlated samples. As a result, the magnitude of the prediction error decreases, leading to fewer bits for transmission. Experiments are conducted using EEG signals recorded under different physiological conditions. Different neural network predictors as well as classical predictors are considered. Experimental results show that the proposed CD based preprocessing scheme improves the compression performance of the predictors significantly.
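The predictive-coding idea behind this scheme can be shown with a deliberately simple stand-in. The sketch below swaps the neural network predictor for a "previous sample" predictor and omits the CD-based block arrangement entirely; it only demonstrates that transmitting prediction errors is lossless and that the errors are small when neighbouring samples are correlated. The function names and the integer sample values are assumptions.

```python
def encode(samples):
    """Replace each integer sample by its prediction error.

    The predictor here is simply the previous sample (0 before the first),
    a stand-in for the neural network predictors of the abstract.
    """
    residuals, prev = [], 0
    for x in samples:
        residuals.append(x - prev)
        prev = x
    return residuals


def decode(residuals):
    """Invert `encode` by re-running the same predictor."""
    samples, prev = [], 0
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples


signal = [10, 12, 13, 13]
print(encode(signal))
print(decode(encode(signal)) == signal)
```

The residuals would then be passed to an entropy coder; the better the predictor, the smaller their magnitude and the fewer bits needed.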
