Protein chemical shifts encode detailed structural information that is difficult and computationally costly to describe at a fundamental level. Statistical and machine learning approaches have been used to infer correlations between chemical shifts and secondary structure from experimental chemical shifts. These methods range from simple statistics such as the chemical shift index to complex methods using neural networks. Notwithstanding their higher accuracy, more complex approaches tend to obscure the relationship between secondary structure and chemical shift and often involve many parameters that need to be trained. We present hidden Markov models (HMMs) with Gaussian emission probabilities to model the dependence between protein chemical shifts and secondary structure. The continuous emission probabilities are modeled as conditional probabilities for a given amino acid and secondary structure type. Using these distributions as outputs of first‐ and second‐order HMMs, we achieve a prediction accuracy of 82.3%, which is competitive with existing methods for predicting secondary structure from protein chemical shifts. Incorporation of sequence‐based secondary structure prediction into our HMM improves the prediction accuracy to 84.0%. Our findings suggest that an HMM with correlated Gaussian distributions conditioned on the secondary structure provides an adequate generative model of chemical shifts. Proteins 2013; © 2012 Wiley Periodicals, Inc.
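As a rough illustration of the decoding step such a model implies, the sketch below runs Viterbi decoding over a first-order HMM with one univariate Gaussian emission per (state, amino acid) pair. It is not the authors' implementation: the abstract describes correlated Gaussians and second-order models, whereas this sketch assumes a single chemical shift per residue, and the names `log_gauss`, `viterbi`, `emis`, `log_trans` and `log_start` are hypothetical.

```python
import math

STATES = ("H", "E", "C")  # helix, strand, coil (assumed three-state alphabet)


def log_gauss(x, mean, sd):
    """Log density of a univariate Gaussian with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(residues, shifts, emis, log_trans, log_start):
    """Return the most probable state path as a string.

    residues  : amino acid letters, one per position
    shifts    : one observed chemical shift per position
    emis      : {(state, amino_acid): (mean, sd)}  (hypothetical parameters)
    log_trans : {(previous_state, state): log probability}
    log_start : {state: log probability}
    """
    # Initialise with the start distribution and the first emission.
    score = {
        s: log_start[s] + log_gauss(shifts[0], *emis[(s, residues[0])])
        for s in STATES
    }
    back = []  # one dict of back-pointers per position after the first
    for aa, x in zip(residues[1:], shifts[1:]):
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            pointers[s] = prev
            new_score[s] = (
                score[prev] + log_trans[(prev, s)] + log_gauss(x, *emis[(s, aa)])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

In practice the emission means and standard deviations would be estimated from a database of assigned shifts, and the transition matrix from observed secondary structure runs; neither is given in the abstract.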

Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new Kazal, fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences. Proteins: 28:405–420, 1997. © 1997 Wiley-Liss, Inc.

The detection of outer membrane proteins (OMPs) in whole genomes is a topical problem, and their sequence characteristics have therefore been studied intensively. This class of proteins displays a common beta-barrel architecture, formed by adjacent antiparallel strands. However, owing to the lack of available structures, few structural studies have been made on this class of proteins. Here we propose a novel investigation of OMP local structure based on a structural alphabet approach, i.e., the decomposition of 3D structures using a library of four-residue protein fragments. The optimal decomposition of structures using a hidden Markov model results in a specific structural alphabet of 20 fragments, six of them dedicated to the decomposition of beta-strands. This optimal alphabet, called SA20-OMP, is analyzed in detail in terms of local structures and transitions between fragments. It highlights a particular and strong organization of beta-strands as series of regular canonical structural fragments. The comparison with alphabets learned on globular structures indicates that the internal organization of OMP structures is more constrained than that of globular structures. The analysis of OMP structures using SA20-OMP reveals some recurrent structural patterns. The preferred location of fragments in the distinct regions of the membrane is investigated. The study of pairwise specificity of fragments reveals that some contacts between structural fragments in beta-sheets are clearly favored whereas others are avoided. This contact specificity is stronger in OMPs than in globular structures. Moreover, SA20-OMP also captures sequence information, which can be integrated into a scoring function for ranking structural models, with very promising results.

Loops are regions of nonrepetitive conformation connecting regular secondary structures. We identified 2,024 loops of one to eight residues in length, with acceptable main-chain bond lengths and peptide bond angles, from a database of 223 protein and protein-domain structures. Each loop is characterized by its sequence, main-chain conformation, and relative disposition of its bounding secondary structures as described by the separation between the tips of their axes and the angle between them. Loops, grouped according to their length and type of their bounding secondary structures, were superposed and clustered into 161 conformational classes, corresponding to 63% of all loops. Of these, 109 (51% of the loops) were populated by at least four nonhomologous loops or four loops sharing a low sequence identity. Another 52 classes, including 12% of the loops, were populated by at least three loops of low sequence similarity from three or fewer nonhomologous groups. Loop class suprafamilies resulting from variations in the termini of secondary structures are discussed in this article. Most previously described loop conformations were found among the classes. New classes included a 2:4 type IV hairpin, a helix-capping loop, and a loop that mediates dinucleotide-binding. The relative disposition of bounding secondary structures varies among loop classes, with some classes such as beta-hairpins being very restrictive. For each class, sequence preferences were identified as key residues; those found more frequently at these conserved positions than in proteins in general were Gly, Asp, Pro, Phe, and Cys. Most of these residues are involved in stabilizing loop conformation, often through a positive phi conformation or secondary structure capping. Identification of helix-capping residues and beta-breakers among the highly conserved positions supported our decision to group loops according to their bounding secondary structures.
Several of the identified loop classes were associated with specific functions, and all of the member loops had the same function; key residues were conserved for this purpose, as is the case for the parvalbumin-like calcium-binding loops. A significant number, but not all, of the member loops of other loop classes had the same function, as is the case for the helix-turn-helix DNA-binding loops. This article provides a systematic and coherent conformational classification of loops, covering a broad range of lengths and all four combinations of bounding secondary structure types, and supplies a useful basis for modelling of loop conformations where the bounding secondary structures are known or reliably predicted.

胡始昌  江弋  林琛  邹权 《生物信息学》2012,10(2):112-115
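The protein folding problem has been named an important topic of "21st-century biophysics". It is a major biological problem that the central dogma of molecular biology leaves unsolved, so predicting protein folding patterns is complex, difficult, and challenging work. To address this problem, we introduce classifier ensembles: this paper combines three classifiers (LMT, RandomForest, SMO) and uses 188-dimensional combined physicochemical features to predict protein classes. Experiments show that the method effectively characterizes protein folding patterns and classifies protein sequence data accurately; both cross-validation and independent testing show a prediction accuracy above 70%, nearly 10 percentage points higher than previous work.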
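The abstract does not say how the three classifiers are combined. One common choice is plurality voting, sketched below under that assumption; `majority_vote` is a hypothetical helper, and the tie-breaking rule (prefer the earliest classifier in the list) is also an assumption rather than something taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists from several classifiers by plurality vote.

    predictions : list of equal-length label lists, one per classifier.
    On a tie, the label proposed by the earliest classifier in the list wins.
    """
    combined = []
    for votes in zip(*predictions):  # votes for one sample, in classifier order
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined
```

Each inner list would hold the fold labels predicted by one trained classifier for the same test proteins, in the same order.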

Karchin R  Cline M  Karplus K 《Proteins》2004,55(3):508-518
Residue burial, which describes a protein residue's exposure to solvent and neighboring atoms, is key to protein structure prediction, modeling, and analysis. We assessed 21 alphabets representing residue burial, according to their predictability from amino acid sequence, conservation in structural alignments, and utility in one fold-recognition scenario. This follows upon our previous work in assessing nine representations of backbone geometry [1]. The alphabet found to be most effective overall has seven states and is based on a count of C(beta) atoms within a 14 Å radius sphere centered at the C(beta) of a residue of interest. When incorporated into a hidden Markov model (HMM), this alphabet gave us a 38% performance boost in fold recognition and 23% in alignment quality.
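The burial measure described here can be sketched in a few lines. The version below counts neighbouring C(beta) atoms within 14 Å of each residue's C(beta) and maps the counts to seven letters. Two points are assumptions, since the abstract does not state them: the residue's own atom is excluded from the count, and the cut points passed as `boundaries` are hypothetical. The function names are illustrative only.

```python
import math
from bisect import bisect_right


def cbeta_counts(coords, radius=14.0):
    """For each residue, count the other C-beta atoms within `radius` angstroms.

    coords : list of (x, y, z) C-beta coordinates, one per residue.
    """
    counts = []
    for i, a in enumerate(coords):
        n = sum(
            1
            for j, b in enumerate(coords)
            if j != i and math.dist(a, b) <= radius  # self excluded (assumption)
        )
        counts.append(n)
    return counts


def to_alphabet(counts, boundaries, letters="ABCDEFG"):
    """Map counts to burial letters.

    boundaries : six ascending cut points (hypothetical values); a count equal
    to a cut point falls into the higher bin.
    """
    if len(boundaries) != len(letters) - 1:
        raise ValueError("need one fewer boundary than letters")
    return "".join(letters[bisect_right(boundaries, c)] for c in counts)
```

The resulting letter string can then be treated like any other sequence alphabet, for example as an extra emission track in an HMM.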

夏彬彬  王军 《生物工程学报》2021,37(11):3863-3879
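With the massive accumulation of protein sequence and structure data, researchers urgently need to work out how to use these data effectively once large amounts of descriptive information are available: how to extract information efficiently from existing data and apply it to downstream tasks. Protein design frees the development of new proteins from the limits of experimental conditions, which is of great significance for drug target prediction, drug discovery, materials design, and related fields. Deep learning, as an efficient method for extracting features from data, can be used to model protein data and then to design proteins by incorporating prior information. Protein design based on deep learning has therefore become a research field with broad prospects. This article mainly describes deep learning based methods for modeling and designing protein sequence and structure data, detailing their strategies, principles, scope of application, and application examples. It also discusses the prospects and limitations of deep learning methods in this field, with the aim of providing a reference for related research.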

Structural alignment of proteins is widely used in various fields of structural biology. In order to further improve the quality of alignment, we describe an algorithm for structural alignment based on text modelling techniques. The technique first superimposes the secondary structure elements of two proteins and then models the 3D structure of each protein as a sequence of letters. These sequences are used by a step-by-step sequence alignment procedure to align the two protein structures. A benchmark test was carried out on a set of 200 non-homologous proteins to evaluate the program and compare it with state-of-the-art programs, e.g. CE, SAL, TM-align and 3D-BLAST. On average, in all-against-all structure comparison the program achieves accuracy competitive with CE and TM-align while running at a speed comparable to that of 3D-BLAST.

Based on NMR spectroscopy data, the conformation of the HIV-RF gp120 protein V3 loop, which gives rise to the principal neutralizing determinant of the virus and to determinants of cell tropism and syncytium formation, was calculated by computer modeling approaches. Elements of the HIV-RF V3 loop secondary structure and conformational states of its irregular stretches were determined. The calculated structure was compared with the conformation of the homologous stretch of the HIV-Thailand protein gp120 V3 loop, and structural elements preserved in the two viral strains were identified. Conserved elements of the HIV-1 V3 loop structure are considered to be promising targets for deriving chemically modified forms of this loop with enhanced immunogenicity and cross-reactivity of neutralizing antibodies, and for creating effective antiviral drugs on this basis.

G protein-coupled receptors constitute a large family of homologous transmembrane proteins that represents one of the most important classes of confirmed drug targets. For novel drug discovery, the 3D structure of the target protein is indispensable. To construct hypothetical 3D structures of G protein-coupled receptors, several prediction methods have been proposed, but none of them has confirmed a correct ligand binding site. In this study we constructed the 3D structure of bovine rhodopsin using the prediction method proposed by Donnelly et al., with some modification. We found that our 3D model showed good agreement with the reported retinal binding site. Using a similar method, we constructed the 3D structure of the P2Y1 receptor, one of the G protein-coupled receptors, and showed a binding site of an endogenous ligand, ADP, on the basis of the 3D model and in vitro experimental data. These results should be valuable for the design of a specific antagonist for the P2Y1 receptor.

S Hayward  J F Collins 《Proteins》1992,14(3):372-381
Using a backpropagation neural network model we have found a limit for secondary structure prediction from local sequence. By including only sequences from whole alpha-helix and non-alpha-helix structures in our training and test sets--sequences spanning boundaries between these two structures were excluded--it was possible to investigate directly the relationship between sequence and structure for alpha-helix. A group of non-alpha-helix sequences, that was disrupting overall prediction success, was indistinguishable to the network from alpha-helix sequences. These sequences were found to occur at regions adjacent to the termini of alpha-helices with statistical significance, suggesting that potentially longer alpha-helices are disrupted by global constraints. Some of these regions spanned more than 20 residues. On these whole structure sequences, 10 residues in length, a comparatively high prediction success of 78% with a correlation coefficient of 0.52 was achieved. In addition, the structure of the input space, the distribution of beta-sheet in this space, and the effect of segment length were also investigated.

The peptide backbones in folded native proteins contain distinctive secondary structures, alpha-helices, beta-sheets, and turns, with significant frequency. One question that arises in folding is how the stability of this secondary structure relates to that of the protein as a whole. To address this question, we substituted the alpha-helix-stabilizing alanine side chain at 16 selected sites in the sequence of sperm whale myoglobin, 12 at helical sites on the surface of the protein, and 4 at obviously internal sites. Substitution of alanine for bulky side chains at internal sites destabilizes the protein, as expected if packing interactions are disrupted. Alanine substitutions do not uniformly stabilize the protein, either in capping positions near the ends of helices or at mid-helical sites near the surface of myoglobin. When corrected for the extent of exposure of each side chain replaced by alanine at a mid-helix position, alanine replacement still has no clear effect in stabilizing the native structure. Thus linkage between the stabilization of secondary structure and tertiary structure in myoglobin cannot be demonstrated, probably because of the relatively small free energy differences between side chains in stabilizing an isolated helix. By contrast, about 80% of the variance in free energy observed can be accounted for by the loss in buried surface area of the native residue substituted by alanine. The differential free energy of helix stabilization does not account for any additional variation.

Zhu J  Xie L  Honig B 《Proteins》2006,65(2):463-479
In this article, we present an iterative, modular optimization (IMO) protocol for the local structure refinement of protein segments containing secondary structure elements (SSEs). The protocol is based on three modules: a torsion-space local sampling algorithm, a knowledge-based potential, and a conformational clustering algorithm. Alternative methods are tested for each module in the protocol. For each segment, random initial conformations were constructed by perturbing the native dihedral angles of loops (and SSEs) of the segment to be refined while keeping the protein body fixed. Two refinement procedures based on molecular mechanics force fields - using either energy minimization or molecular dynamics - were also tested but were found to be less successful than the IMO protocol. We found that DFIRE is a particularly effective knowledge-based potential and that clustering algorithms that are biased by the DFIRE energies improve the overall results. Results were further improved by adding an energy minimization step to the conformations generated with the IMO procedure, suggesting that hybrid strategies that combine both knowledge-based and physical effective energy functions may prove to be particularly effective in future applications.

A new model for calculating the solvation energy of proteins is developed and tested for its ability to identify the native conformation as the global energy minimum among a group of thousands of computationally generated compact non-native conformations for a series of globular proteins. In the model (called the WZS model), solvation preferences for a set of 17 chemically derived molecular fragments of the 20 amino acids are learned by a training algorithm based on maximizing the solvation energy difference between native and non-native conformations for a training set of proteins. The performance of the WZS model confirms the success of this learning approach; the WZS model misrecognizes (as more stable than native) only 7 of 8,200 non-native structures. Possible applications of this model to the prediction of protein structure from sequence are discussed.

Fuzzy cluster analysis has been applied to the 20 amino acids by using 65 physicochemical properties as a basis for classification. The clustering products, the fuzzy sets (i.e., classical sets with associated membership functions), have provided a new measure of amino acid similarities for use in protein folding studies. This work demonstrates that fuzzy sets of simple molecular attributes, when assigned to amino acid residues in a protein's sequence, can predict the secondary structure of the sequence with reasonable accuracy. An approach is presented for discriminating standard folding states, using near-optimum information splitting in half-overlapping segments of the sequence of assigned membership functions. The method is applied to a nonredundant set of 252 proteins and yields approximately 73% matching for correctly predicted and correctly rejected residues with approximately 60% overall success rate for the correctly recognized ones in three folding states: alpha-helix, beta-strand, and coil. The most useful attributes for discriminating these states appear to be related to size, polarity, and thermodynamic factors. Van der Waals volume, apparent average thickness of surrounding molecular free volume, and a measure of dimensionless surface electron density can explain approximately 95% of prediction results. Hydrogen bonding and hydrophobicity indices do not yet enable clear clustering and prediction.

Chang JM  Su EC  Lo A  Chiu HS  Sung TY  Hsu WL 《Proteins》2008,72(2):693-710
Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is considered as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weighting scheme of gapped-dipeptides is calculated according to a position specific score matrix, which includes sequence evolutionary information. Then, PLSA is applied for feature reduction, and reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. It has been reported that there is a strong correlation between sequence homology and subcellular localization (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can be classified into low- or high-homology data sets. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, and it compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve a high precision at specified levels of recall rates. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% in precision, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that the specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy.
In addition, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
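The gapped-dipeptide representation can be illustrated with a short counting routine. This is a simplified sketch, not PSLDoc itself: it uses raw occurrence counts where the abstract describes weights derived from a position specific score matrix, it interprets the gap as the number of residues lying between the pair (starting at one), and both `gapped_dipeptides` and the `max_gap` parameter are hypothetical.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap=2):
    """Count gapped-dipeptide terms in a protein sequence.

    Each term is (first_residue, gap, second_residue), where `gap` is the
    number of residues between the two (an assumed reading of the definition).
    """
    terms = Counter()
    for gap in range(1, max_gap + 1):
        # The last valid start index leaves room for the gap and the partner.
        for i in range(len(sequence) - gap - 1):
            terms[(sequence[i], gap, sequence[i + gap + 1])] += 1
    return terms
```

The resulting term counts form a sparse document-style vector per protein, which is the kind of input a topic model such as PLSA reduces before classification.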

The Saccharomyces cerevisiae adhesion protein alpha-agglutinin is expressed by cells of alpha mating type. On the basis of sequence similarities, alpha-agglutinin has been proposed to contain variable-type immunoglobulin-like (IgV) domains. The low level of sequence similarity to IgV domains of known structure made homology modeling using standard sequence-based alignment algorithms impossible. We have therefore developed a secondary structure-based method that allowed homology modeling of alpha-agglutinin domain III, the domain most similar to IgV domains. The model was assessed and where necessary refined to accommodate information obtained by biochemical and molecular genetic approaches, including the positions of a disulfide bond, glycosylation sites, and proteolytic sites. The model successfully predicted surface exposure of glycosylation and proteolytic sites, as well as identifying residues essential for binding activity. One side of the domain was predicted to be covered by carbohydrate residues. Surface accessibility and volume packing analyses showed that the regions of the model that have greatest sequence dissimilarity from the IgV consensus sequence are poorly structured in the biophysical sense. Nonetheless, the utility of the model suggests that these alignment and testing techniques should be of general use for building and testing of models of proteins that share limited sequence similarity with known structures.
