Similar literature
Found 20 similar articles; search took 31 ms
1.
Abstract

The Protein Data Bank (PDB) is the preeminent source of protein structural information. PDB contains over 32,500 experimentally determined 3-D structures solved using X-ray crystallography or nuclear magnetic resonance spectroscopy. Intrinsically disordered regions fail to form a fixed 3-D structure under physiological conditions. In this study, we compare the amino-acid sequences of proteins whose structures are determined by X-ray crystallography with the corresponding sequences from the Swiss-Prot database. The analyzed dataset includes 16,370 structures, which represent 18,101 PDB chains and 5,434 different proteins from 910 different organisms (2,793 eukaryotic, 2,109 bacterial, 288 viral, and 244 archaeal). In this dataset, on average, each Swiss-Prot protein is represented by 7 PDB chains with 76% of the crystallized regions being represented by more than one structure. Intriguingly, the complete sequences of only ~7% of proteins are observed in the corresponding PDB structures, and only ~25% of the total dataset have >95% of their lengths observed in the corresponding PDB structures. This suggests that the vast majority of PDB proteins are shorter than their corresponding Swiss-Prot sequences and/or contain numerous residues that are not observed in electron density maps. To determine the prevalence of disordered regions in PDB, the residues in the Swiss-Prot sequences were grouped into four general categories, “Observed” (which correspond to structured regions), “Not observed” (regions with missing electron density, potentially disordered), “Uncharacterized,” and “Ambiguous,” depending on their appearance in the corresponding PDB entries. This non-redundant set of residues can be viewed as a ‘fragment’ or empirical domain database that contains a set of experimentally determined structured regions or domains and a set of experimentally verified disordered regions or domains.
We studied the propensities and properties of residues in these four categories and analyzed their relations to the predictions of disorder using several algorithms. “Not observed,” “Ambiguous,” and “Uncharacterized” regions were shown to possess the amino acid compositional biases typical of intrinsically disordered proteins. The application of four different disorder predictors (PONDR® VL-XT, VL3-BA, VSL1P, and IUPred) revealed that the vast majority of residues in the “Observed” dataset are ordered, and that the “Not observed” regions are mostly disordered. The “Uncharacterized” regions possess some tendency toward order, whereas the predictions for the short “Ambiguous” regions are genuinely ambiguous. Long “Ambiguous” regions (>70 amino acid residues) are mostly predicted to be ordered, suggesting that they are likely to be “wobbly” domains.

Overall, we showed that completely ordered proteins are not highly abundant in PDB and many PDB sequences have disordered regions. In fact, in the analyzed dataset ~10% of the PDB proteins contain regions of consecutive missing or ambiguous residues longer than 30 amino acids and ~40% of the proteins possess short regions (≥10 and <30 amino acids long) of missing and ambiguous residues.
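The four-way grouping of residues described in this abstract can be illustrated with a small sketch. The abstract does not give the exact assignment rules, so the encoding and the rules below are assumptions made for illustration: each PDB chain is a string aligned to the Swiss-Prot sequence, with `O` for a residue with coordinates, `M` for a residue in the crystallized construct but missing from the structure, and `-` for a residue outside the construct. The function name `categorize_residues` is hypothetical.

```python
from typing import List


def categorize_residues(chain_maps: List[str]) -> List[str]:
    """Assign each Swiss-Prot position to one of four categories.

    Assumed encoding (not from the source): 'O' = observed, 'M' = missing
    from the structure, '-' = not covered by the chain at all.
    """
    if not chain_maps:
        raise ValueError("at least one chain map is required")
    length = len(chain_maps[0])
    for chain in chain_maps:
        if len(chain) != length:
            raise ValueError("all chain maps must have the same length")
        if not set(chain) <= set("OM-"):
            raise ValueError("chain maps may only contain 'O', 'M' and '-'")

    categories = []
    for i in range(length):
        # States seen at this position among the chains that cover it.
        states = {chain[i] for chain in chain_maps} - {"-"}
        if not states:
            categories.append("Uncharacterized")  # no chain covers it
        elif states == {"O"}:
            categories.append("Observed")         # seen in every covering chain
        elif states == {"M"}:
            categories.append("Not observed")     # missing in every covering chain
        else:
            categories.append("Ambiguous")        # chains disagree
    return categories


if __name__ == "__main__":
    print(categorize_residues(["OOM-", "OMM-"]))
```

Under these assumed rules, a position is "Ambiguous" only when at least two chains covering it disagree, which matches the sense in which the abstract uses the term.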

2.
Abstract

Short and long disordered regions of proteins have different preference for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of DisProt annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to DisProt and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Matthews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP9 blind prediction and is one of the top servers according to the official ranking.
In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies. The server and databases are available at http://sparks.informatics.iupui.edu/.
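The residue-level measures quoted in this abstract (sensitivity, specificity, and the Matthews correlation coefficient) follow from a standard confusion matrix once the three-state output is collapsed to two states. The sketch below is only an illustration of those definitions, not SPINE-D's code; the state letters `O`, `S`, `L` and both function names are assumptions.

```python
import math
from typing import Dict, List, Sequence


def collapse_three_state(states: Sequence[str]) -> List[int]:
    """Map assumed labels 'O' (ordered), 'S' (short disorder), 'L' (long disorder)
    to a two-state label: 0 = ordered, 1 = disordered."""
    return [0 if s == "O" else 1 for s in states]


def binary_metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and MCC, with disorder (1) as the positive class."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


if __name__ == "__main__":
    truth = [1, 1, 0, 0, 0]
    pred = collapse_three_state("LOOOO")
    print(binary_metrics(truth, pred))
```

Because specificity depends only on ordered residues and sensitivity only on disordered ones, the MCC is the measure here that reacts to the order/disorder balance of the test set, which is the dependence the abstract highlights.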

4.
Missing regions in X‐ray crystal structures in the Protein Data Bank (PDB) have played a foundational role in the study of intrinsically disordered protein regions (IDPRs), especially in the development of in silico predictors of intrinsic disorder. However, a missing region is only a weak indication of intrinsic disorder, and this uncertainty is compounded by the presence of ambiguous regions, where more than one structure of the same protein sequence “disagrees” in terms of the presence or absence of missing residues. The question is this: are these ambiguous regions intrinsically disordered, or are they the result of static disorder that arises from experimental conditions, ensembles of structures, or domain wobbling? A novel way of looking at ambiguous regions in terms of the pattern between multiple PDB structures has been demonstrated. It was found that the propensity for intrinsic disorder increases as the level of ambiguity decreases. However, it is also shown that ambiguity is more likely to occur as the protein region is placed within different environmental conditions, and even the most ambiguous regions as a set display compositional bias that suggests flexibility. The results suggested that ambiguity is a natural result for many IDPRs crystallized under different conditions and that static disorder and wobbling domains are relatively rare. Instead, it is more likely that ambiguity arises because many of these regions were conditionally or partially disordered.

5.
Biologically active proteins without stable ordered structure (i.e., intrinsically disordered proteins) are attracting increased attention. Functional repertoires of ordered and disordered proteins are very different, and the ability to differentiate whether a given function is associated with intrinsic disorder or with a well-folded protein is crucial for modern protein science. However, there is a large gap between the number of proteins experimentally confirmed to be disordered and their actual number in nature. As a result, studies of functional properties of confirmed disordered proteins, while helpful in revealing the functional diversity of protein disorder, provide only a limited view. To overcome this problem, a bioinformatics approach for comprehensive study of functional roles of protein disorder was proposed in the first paper of this series (Xie, H.; Vucetic, S.; Iakoucheva, L. M.; Oldfield, C. J.; Dunker, A. K.; Obradovic, Z.; Uversky, V. N. Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J. Proteome Res. 2007, 5, 1882-1898). Applying this novel approach to Swiss-Prot sequences and functional keywords, we found over 238 and 302 keywords to be strongly positively or negatively correlated, respectively, with long intrinsically disordered regions. This paper describes approximately 90 Swiss-Prot keywords attributed to the cellular components, domains, technical terms, developmental processes, and coding sequence diversities possessing strong positive and negative correlation with long disordered regions.

6.
Most protein chains interact with only one ligand but a small number of protein chains can bind several ligands, and many examples are available in the protein-ligand complex database of PDB. Among these proteins, some show preferences for the ligands or types of ligands they bind; however, so far we have only poor understanding of what determines protein-ligand binding and its specificity. Here we investigate the structural and functional properties of proteins in protein-ligand complexes. Analysis of the protein-ligand complex dataset from the PDB structure database reveals that proteins with more interactions have more disordered contact residues. Those proteins containing few disordered contact residues that bind multiple ligands have a tendency to consist of several domains. Analysis of physicochemical properties of hub contact residues binding multiple ligands indicates that they are enriched for hydrophilic, charged, polar and His-Asp catalytic triad residues. Finally, in order to differentiate proteins binding different classes of ligands, we mapped the three most prominent classes of ligands onto different superfamily domains. Our results demonstrate that contact residue disorder and ordered multiple domains are complementary factors that play a crucial role in determining ligand binding specificity and promiscuity.

7.
Intrinsically disordered regions serve as molecular recognition elements, which play an important role in the control of many cellular processes and signaling pathways. It is useful to be able to predict positions of disordered regions in protein chains. The statistical analysis of disordered residues was done considering 34,464 unique protein chains taken from the PDB database. In this database, 4.95% of residues are disordered (i.e. invisible in X-ray structures). The statistics were obtained separately for the N- and C-termini as well as for the central part of the protein chain. It has been shown that frequencies of occurrence of disordered residues of 20 types at the termini of protein chains differ from the ones in the middle part of the protein chain. Our systematic analysis of disordered regions in PDB revealed 109 disordered patterns of different lengths. Each of them has disordered occurrences in at least five protein chains with identity less than 20%. The vast majority of all occurrences of each disordered pattern are disordered. This allows one to use the library of disordered patterns for predicting the status of a residue of a given protein to be ordered or disordered. We analyzed the occurrence of the selected patterns in three eukaryotic and three bacterial proteomes.

8.
Intrinsically disordered proteins (IDPs) do not adopt stable three-dimensional structures in physiological conditions, yet these proteins play crucial roles in biological phenomena. In most cases, intrinsic disorder manifests itself in segments or domains of an IDP, called intrinsically disordered regions (IDRs), but fully disordered IDPs also exist. Although IDRs can be detected as missing residues in protein structures determined by X-ray crystallography, no protocol has been developed to identify IDRs from structures obtained by Nuclear Magnetic Resonance (NMR). Here, we propose a computational method to assign IDRs based on NMR structures. We compared missing residues of X-ray structures with residue-wise deviations of NMR structures for identical proteins, and derived a threshold deviation that gives the best correlation of ordered and disordered regions of both structures. The obtained threshold of 3.2 Å was applied to proteins whose structures were only determined by NMR, and the resulting IDRs were analyzed and compared to those of X-ray structures with no NMR counterpart in terms of sequence length, IDR fraction, protein function, cellular location, and amino acid composition, all of which suggest distinct characteristics. The structural knowledge of IDPs is still inadequate compared with that of structured proteins. Our method can collect and utilize IDRs from structures determined by NMR, potentially enhancing the understanding of IDPs.
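A threshold rule of this kind can be sketched as follows. The abstract does not define "residue-wise deviation", so this example assumes it is the root-mean-square distance of each residue's Cα atom from its mean position across the models of an NMR ensemble, and that the models are already superimposed. The function names are hypothetical; only the 3.2 Å default comes from the abstract.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def per_residue_deviation(models: List[List[Coord]]) -> List[float]:
    """RMS deviation (in Å) of each residue from its mean position.

    models[m][i] is the assumed Cα coordinate of residue i in model m;
    the models are assumed to be superimposed beforehand.
    """
    if not models:
        raise ValueError("at least one model is required")
    n_models = len(models)
    n_res = len(models[0])
    if any(len(m) != n_res for m in models):
        raise ValueError("all models must have the same number of residues")

    deviations = []
    for i in range(n_res):
        points = [m[i] for m in models]
        mean = tuple(sum(p[k] for p in points) / n_models for k in range(3))
        msd = sum(
            sum((p[k] - mean[k]) ** 2 for k in range(3)) for p in points
        ) / n_models
        deviations.append(math.sqrt(msd))
    return deviations


def assign_idr(models: List[List[Coord]], threshold: float = 3.2) -> List[bool]:
    """Flag residues whose deviation exceeds the threshold as disordered."""
    return [d > threshold for d in per_residue_deviation(models)]


if __name__ == "__main__":
    ensemble = [
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    ]
    print(assign_idr(ensemble))
```

A different definition of deviation (for example, pairwise distances between models) would shift the numerical scale, so the 3.2 Å value is only meaningful together with the definition used in the original study.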

9.
The function of a protein molecule is greatly influenced by its three-dimensional (3D) structure and therefore structure prediction will help identify its biological function. We have updated Sequence, Motif and Structure (SMS), the database of structurally rigid peptide fragments, by combining amino acid sequences and the corresponding 3D atomic coordinates of non-redundant (25%) and redundant (90%) protein chains available in the Protein Data Bank (PDB). SMS 2.0 provides information pertaining to the peptide fragments of length 5-14 residues. The entire dataset is divided into three categories, namely, same sequence motifs having similar, intermediate or dissimilar 3D structures. Further, options are provided to facilitate structural superposition using the program structural alignment of multiple proteins (STAMP) and the popular JAVA plug-in (Jmol) is deployed for visualization. In addition, functionalities are provided to search for the occurrences of the sequence motifs in other structural and sequence databases like PDB, Genome Database (GDB), Protein Information Resource (PIR) and Swiss-Prot. The updated database along with the search engine is available over the World Wide Web through the following URL http://cluster.physics.iisc.ernet.in/sms/.

10.
Proteins evolve through point mutations as well as by insertions and deletions (indels). During the last decade it has become apparent that protein regions that do not fold into three-dimensional structures, i.e. intrinsically disordered regions, are quite common. Here, we have studied the relationship between protein disorder and indels using HMM–HMM pairwise alignments in two sets of orthologous eukaryotic protein pairs. First, we show that disordered residues are much more frequent among indel residues than among aligned residues and are also more prevalent among indels than in coils. Second, we observed that disordered residues are particularly common in longer indels. Disordered indels of short-to-medium size are prevalent in the non-terminal regions of proteins while the longest indels, ordered and disordered alike, occur toward the termini of the proteins where new structural units are comparatively well tolerated. Finally, while disordered regions often evolve faster than ordered regions and disorder is common in indels, there are some previously recognized protein families where the disordered region is more conserved than the ordered region. We find that these rare proteins are often involved in information processes, such as RNA processing and translation. This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.

11.
There has been an increased interest in computational methods for amyloid and (or) aggregate prediction, due to the prevalence of these aggregates in numerous diseases and their recently discovered functional importance. To evaluate these methods, several datasets have been compiled. Typically, aggregation‐prone regions of proteins, which form aggregates or amyloids in vivo, are more than 15 residues long and intrinsically disordered. However, the number of such experimentally established amyloid forming and non‐forming sequences is limited, not exceeding one hundred entries in existing databases. In this work, we parsed all available NMR‐resolved protein structures from the PDB and assembled a new, sevenfold larger, dataset of unfolded sequences, soluble at high concentrations. We propose to use these sequences as a negative set for evaluating methods for predicting aggregation in vivo. We also present the results of benchmarking cutting edge tools for the prediction of aggregation versus solubility propensity.

12.
The amino acid sequences of soluble, ordered proteins with stable structures have evolved due to biological and physical requirements, thus distinguishing them from random sequences. Previous analyses have focused on extracting the features that frequently appear in protein substructures, such as α‐helix and β‐sheet, but the universal features of protein sequences have not been addressed. To clarify the differences between native protein sequences and random sequences, we analyzed 7368 soluble, ordered protein sequences by comparing the observed occurrences of 400 amino acid pairs in local proximity, up to 10 residues apart along the sequence, with their expected occurrences in random sequences. We found that hydrophobic residue pairs and polar residue pairs are significantly decreased, whereas pairs between a hydrophobic residue and a polar residue are increased. This trend was universally observed regardless of the secondary structure content but was not observed in protein sequences that include intrinsically disordered regions, indicating that it can be a general rule of protein foldability. The possible benefits of this rule are discussed from the viewpoints of protein aggregation and disorder, which are both caused by low‐complexity regions of hydrophobic or polar residues.
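The observed-versus-expected comparison of residue pairs can be sketched as below. The abstract does not specify the random model, so this example assumes the simplest one: residues drawn independently from the overall amino acid composition of the dataset, so that the expected count of an ordered pair (a, b) at a given separation is the number of position pairs times the product of the two residue frequencies. The function name `pair_ratios` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pair_ratios(sequences: Iterable[str], separation: int) -> Dict[Tuple[str, str], float]:
    """Observed/expected ratio for each ordered residue pair (a, b) where b
    follows a by `separation` positions along the sequence.

    Expected counts assume independent residues drawn from the dataset-wide
    composition (an assumption, not necessarily the model used in the study).
    """
    if separation < 1:
        raise ValueError("separation must be at least 1")

    composition: Counter = Counter()
    observed: Counter = Counter()
    total_pairs = 0
    for seq in sequences:
        composition.update(seq)
        for i in range(len(seq) - separation):
            observed[(seq[i], seq[i + separation])] += 1
            total_pairs += 1

    total_residues = sum(composition.values())
    ratios = {}
    for (a, b), n_obs in observed.items():
        freq_a = composition[a] / total_residues
        freq_b = composition[b] / total_residues
        expected = total_pairs * freq_a * freq_b
        ratios[(a, b)] = n_obs / expected
    return ratios


if __name__ == "__main__":
    for pair, ratio in sorted(pair_ratios(["ALAL"], 1).items()):
        print(pair, round(ratio, 3))
```

A ratio below 1 means the pair occurs less often than composition alone would predict, which is the sense in which the abstract reports hydrophobic-hydrophobic and polar-polar pairs as decreased. Pairs that never occur are absent from the returned dictionary rather than reported as 0.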

13.
The dominant view in protein science is that a three-dimensional (3-D) structure is a prerequisite for protein function. In contrast to this dominant view, there are many counterexample proteins that fail to fold into a 3-D structure, or that have local regions that fail to fold, and yet carry out function. Protein without fixed 3-D structure is called intrinsically disordered. Motivated by anecdotal accounts of higher rates of sequence evolution in disordered protein than in ordered protein we are exploring the molecular evolution of disordered proteins. To test whether disordered protein evolves more rapidly than ordered protein, pairwise genetic distances were compared between the ordered and the disordered regions of 26 protein families having at least one member with a structurally characterized region of disorder of 30 or more consecutive residues. For five families, there were no significant differences in pairwise genetic distances between ordered and disordered sequences. The disordered region evolved significantly more rapidly than the ordered region for 19 of the 26 families. The functions of these disordered regions are diverse, including binding sites for protein, DNA, or RNA and also including flexible linkers. The functions of some of these regions are unknown. The disordered regions evolved significantly more slowly than the ordered regions for the two remaining families. The functions of these more slowly evolving disordered regions include sites for DNA binding. More work is needed to understand the underlying causes of the variability in the evolutionary rates of intrinsically ordered and disordered protein.

14.
Identifying relationships between function, amino acid sequence, and protein structure represents a major challenge. In this study, we propose a bioinformatics approach that identifies functional keywords in the Swiss-Prot database that correlate with intrinsic disorder. A statistical evaluation is employed to rank the significance of these correlations. Protein sequence data redundancy and the relationship between protein length and protein structure were taken into consideration to ensure the quality of the statistical inferences. Over 200,000 proteins from the Swiss-Prot database were analyzed using this approach. The predictions of intrinsic disorder were carried out using PONDR VL3E predictor of long disordered regions that achieves an accuracy of above 86%. Overall, out of the 710 Swiss-Prot functional keywords that were each associated with at least 20 proteins, 238 were found to be strongly positively correlated with predicted long intrinsically disordered regions, whereas 302 were strongly negatively correlated with such regions. The remaining 170 keywords were ambiguous without strong positive or negative correlation with the disorder predictions. These functions cover a large variety of biological activities and imply that disordered regions are characterized by a wide functional repertoire. Our results agree well with literature findings, as we were able to find at least one illustrative example of functional disorder or order shown experimentally for the vast majority of keywords showing the strongest positive or negative correlation with intrinsic disorder. This work opens a series of three papers, which enriches the current view of protein structure-function relationships, especially with regards to functionalities of intrinsically disordered proteins, and provides researchers with a novel tool that could be used to improve the understanding of the relationships between protein structure and function. 
The first paper of the series describes our statistical approach, outlines the major findings, and provides illustrative examples of biological processes and functions positively and negatively correlated with intrinsic disorder.

15.
In their natural environment, three-dimensional structures of proteins undergo significant fluctuations and are often partially or completely disordered. This phenomenon recently became the focus of much attention, as many proteins, especially from higher organisms, were shown to contain large intrinsically disordered regions. Such disordered regions may become ordered only under very specific circumstances, if at all, and can be recognized by specific amino acid composition and sequence signatures. Here, we suggest that the balance between order and disorder is much more subtle in that many regions are very close to the order/disorder boundary. Specifically, analysis of redundant sets of experimental models of protein structures, where emphasis is put on comparison of structures of identical proteins solved in different conditions and functional states, shows hundreds of fragments captured in two states: ordered and disordered. We show that such fragments, which we call here "dual personality" (DP) fragments, have distinctive features that differentiate them from both regularly folded and intrinsically disordered fragments. We hypothesize, and show on several examples, that such fragments are often targets of regulation, either by allostery or posttranslational modifications.

16.
Cryo-electron microscopy (cryo-EM) can provide important structural information of large macromolecular assemblies in different conformational states. Recent years have seen an increase in structures deposited in the Protein Data Bank (PDB) by fitting a high-resolution structure into its low-resolution cryo-EM map. A commonly used protocol for accommodating the conformational changes between the X-ray structure and the cryo-EM map is rigid body fitting of individual domains. With the emergence of different flexible fitting approaches, there is a need to compare and revise these different protocols for the fitting. We have applied three diverse automated flexible fitting approaches on a protein dataset for which rigid domain fitting (RDF) models have been deposited in the PDB. In general, a consensus is observed in the conformations, which indicates a convergence from these theoretically different approaches to the most probable solution corresponding to the cryo-EM map. However, the result shows that the convergence might not be observed for proteins with complex conformational changes or with missing densities in the cryo-EM map. In contrast, RDF structures deposited in the PDB can represent conformations that differ not only from the consensus obtained by flexible fitting but also from X-ray crystallography. Thus, this study emphasizes that a "consensus" achieved by the use of several automated flexible fitting approaches can provide a higher level of confidence in the modeled configurations. Following this protocol not only increases the confidence level of fitting, but also highlights protein regions with uncertain fitting. Hence, this protocol can lead to better interpretation of cryo-EM data.

17.
Helices, strands and coils in proteins of known three-dimensional structure, corresponding to heptapeptide and larger sequences (‘probe’ peptides), were scanned against peptide sequences of variable length, comprising seven or more residues that correspond to a different conformation (‘target’ peptides) in protein crystal structures available from the Protein Data Bank (PDB). Where the ‘probe’ and ‘target’ peptide sequences exactly match, they correspond to ‘chameleon’ sequences in protein structures. We observed ∼548 heptapeptide and larger chameleon sequences that included peptides in the coil conformation from 53,794 PDB files that were analyzed. However, after excluding several chameleon peptides based on the quality of protein structure data, redundancy and peptides associated with cloning artifacts, such as histidine-tags, we observed only ten chameleon peptides in structurally different proteins, and the maximum length comprised seven amino acid residues. Our analysis suggests that the quality of protein structure data is important for identifying the possible ‘true chameleons’ in PDB. The majority of the chameleon sequences correspond to an entire strand in one protein that is observed as part of a helix sequence in another protein. The heptapeptide chameleons are characterized by a high propensity of alanine, leucine and valine amino acid residues. The total hydropathy values range between −11.2 and 22.9, the difference in solvent accessibility between 2.0 Å² and 373 Å², and the difference in total number of residue neighbor contacts between 0 and 7 residues. Our work identifies for the first time heptapeptide and larger sequences that correspond to a single complete helix, strand or coil, which adopt entirely different secondary structures in another protein.
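The probe-versus-target scan for chameleon sequences can be illustrated with a simplified sketch. It assumes each chain comes with a per-residue secondary-structure string (`H` helix, `E` strand, `C` coil) and treats a heptapeptide as a candidate only when all seven residues share one state. This is looser than the study's criterion, which requires the peptide to span a single complete element and applies further quality and redundancy filters; the function name `find_chameleons` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_chameleons(chains: Iterable[Tuple[str, str]], k: int = 7) -> Dict[str, Set[str]]:
    """Return k-mers seen with more than one uniform secondary-structure state.

    chains: (sequence, secondary_structure) pairs of equal length, using the
    assumed alphabet 'H', 'E', 'C' for the structure string.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and secondary structure lengths differ")
        for i in range(len(seq) - k + 1):
            window_states = set(ss[i:i + k])
            # Keep only windows that sit entirely in one conformation.
            if len(window_states) == 1:
                seen[seq[i:i + k]].add(ss[i])
    return {pep: states for pep, states in seen.items() if len(states) > 1}


if __name__ == "__main__":
    toy_chains = [
        ("KAVLAALGR", "HHHHHHHHH"),
        ("TAVLAALGS", "CEEEEEEEC"),
    ]
    print(find_chameleons(toy_chains))
```

In the toy input, only `AVLAALG` is uniformly helical in one chain and uniformly strand in the other, so it is the single peptide reported. The sequences are invented for illustration and are not taken from the study.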

18.
Singh GP  Ganapathi M  Sandhu KS  Dash D 《Proteins》2006,62(2):309-315
The study of unfolded protein regions has gained importance because of their prevalence and important roles in various cellular functions. These regions have characteristically high net charge and low hydrophobicity. The amino acid sequence determines the intrinsic unstructuredness of a region and, therefore, efforts are ongoing to delineate the sequence motifs that might contribute to protein disorder. We find that PEST motifs are enriched in the characterized disordered regions as compared with globular ones. Analysis of representative PDB chains revealed very few structures containing PEST sequences and the majority of them lacked regular secondary structure. A proteome-wide study in completely sequenced eukaryotes with predicted unfolded and folded proteins shows that PEST proteins make up a larger fraction of the unfolded dataset than of the folded proteins. Our data also reveal the prevalence of PEST proteins in eukaryotic proteomes (approximately 25%). Functional classification of the PEST-containing proteins shows an over- and under-representation in proteins involved in regulation and metabolism, respectively. Furthermore, our analysis shows that predicted PEST regions do not exhibit any preference to be localized in the C termini of proteins, as reported earlier.

19.
Inherent flexibility and conformational heterogeneity in proteins can often result in the absence of loops and even entire domains in structures determined by X-ray crystallographic or NMR methods. X-ray solution scattering offers the possibility of obtaining complementary information regarding the structures of these disordered protein regions. Methods are presented for adding missing loops or domains by fixing a known structure and building the unknown regions to fit the experimental scattering data obtained from the entire particle. Simulated annealing was used to minimize a scoring function containing the discrepancy between the experimental and calculated patterns and the relevant penalty terms. In low-resolution models where interface location between known and unknown parts is not available, a gas of dummy residues represents the missing domain. In high-resolution models where the interface is known, loops or domains are represented as interconnected chains (or ensembles of residues with spring forces between the Cα atoms), attached to known position(s) in the available structure. Native-like folds of missing fragments can be obtained by imposing residue-specific constraints. After validation in simulated examples, the methods have been applied to add missing loops or domains to several proteins where partial structures were available.

20.
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
