Found 20 similar articles.
2.
Protein chemical shifts encode detailed structural information that is difficult and computationally costly to describe at a fundamental level. Statistical and machine learning approaches have been used to infer correlations between chemical shifts and secondary structure from experimental chemical shifts. These methods range from simple statistics such as the chemical shift index to complex methods using neural networks. Notwithstanding their higher accuracy, more complex approaches tend to obscure the relationship between secondary structure and chemical shift and often involve many parameters that need to be trained. We present hidden Markov models (HMMs) with Gaussian emission probabilities to model the dependence between protein chemical shifts and secondary structure. The continuous emission probabilities are modeled as conditional probabilities for a given amino acid and secondary structure type. Using these distributions as outputs of first‐ and second‐order HMMs, we achieve a prediction accuracy of 82.3%, which is competitive with existing methods for predicting secondary structure from protein chemical shifts. Incorporation of sequence‐based secondary structure prediction into our HMM improves the prediction accuracy to 84.0%. Our findings suggest that an HMM with correlated Gaussian distributions conditioned on the secondary structure provides an adequate generative model of chemical shifts. Proteins 2013; © 2012 Wiley Periodicals, Inc.
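The idea of decoding secondary structure from chemical shifts with a Gaussian-emission HMM can be illustrated with a minimal first-order Viterbi sketch. Everything numeric here is an assumption made for illustration: the three states (H, E, C), the emission means and standard deviations, and the transition probabilities are toy values, not parameters from the abstract. The sketch also uses one univariate Gaussian per state, whereas the abstract describes correlated Gaussians conditioned on both amino acid and secondary structure type.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def viterbi(observations, states, log_start, log_trans, emission):
    """Most likely state path for a first-order HMM with Gaussian emissions.

    emission maps each state to a (mean, std) pair.
    """
    best = [{s: log_start[s] + gaussian_logpdf(observations[0], *emission[s])
             for s in states}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + gaussian_logpdf(x, *emission[s])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy parameters (hypothetical, for illustration only).
states = ["H", "E", "C"]
emission = {"H": (3.0, 1.0), "E": (-1.5, 1.0), "C": (0.0, 1.0)}
log_start = {s: math.log(1 / 3) for s in states}
log_trans = {p: {s: math.log(0.8 if p == s else 0.1) for s in states} for p in states}

shifts = [3.1, 2.8, 3.3, -1.4, -1.7, -1.2]  # made-up secondary shifts
print("".join(viterbi(shifts, states, log_start, log_trans, emission)))
```

A second-order model, as mentioned in the abstract, would condition each transition on the two preceding states; this sketch covers only the first-order case.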
3.
To study the distinct influences of structure and function on evolution, we propose a minimalist model for proteins with binding pockets, called functional model proteins, based on a shifted-HP model on a two-dimensional square lattice. These model proteins are not maximally compact and contain an empty lattice site surrounded by at least three nearest neighbors, thus providing a binding pocket. Functional model proteins possess a unique native state, cooperative folding and tolerance to mutation. Due to the explicit functionality in these models (by design), we have been able to explore their fitness or evolutionary landscapes, as characterized by the size and distribution of homologous families and by the complexity of the inter-relatedness of the functional model proteins. Mindful that these minimalist models are highly idealized and two-dimensional, functional model proteins should nevertheless provide a useful means for exploring the constraints of maintaining structure and function on the evolution of proteins.
5.
From protein sequence comparison data found in the literature, a library was organized using peptide fragment sequences which are common to related proteins. Each of the fragments was then examined for its occurrence in all the protein superfamilies defined by the NBRF-PIR data base. We have selected those fragment peptides that appear exclusively in one or a few superfamilies, and thus made a library of fragment peptides that characterize specific superfamilies. Such characteristic peptides are, in general, five to seven residues long and contain unusually high proportions of glycine and cysteine. This collection is a useful resource for the classification and functional prediction of protein molecules.
6.
We present the MOlecular NETwork (MONET) ontology as a model to integrate data from different networks that govern cell function. To achieve this, different existing ontologies were analyzed and an integrated ontology was built in a way to make it possible to share and reuse knowledge, support interoperability between systems, and also allow the formulation of hypotheses through inferences. By studying the cell as an entity of a myriad of elements and networks of interactions, we aim to offer a means to understand the large-scale characteristics responsible for the behavior of the cell and to enable new biological insights.
7.
Protein structure prediction methods typically use statistical potentials, which rely on statistics derived from a database of known protein structures. In the vast majority of cases, these potentials involve pairwise distances or contacts between amino acids or atoms. Although some potentials beyond pairwise interactions have been described, the formulation of a general multibody potential is seen as intractable due to the perceived limited amount of data. In this article, we show that it is possible to formulate a probabilistic model of higher order interactions in proteins, without arbitrarily limiting the number of contacts. The success of this approach is based on replacing a naive table‐based approach with a simple hierarchical model involving suitable probability distributions and conditional independence assumptions. The model captures the joint probability distribution of an amino acid and its neighbors, local structure and solvent exposure. We show that this model can be used to approximate the conditional probability distribution of an amino acid sequence given a structure using a pseudo‐likelihood approach. We verify the model by decoy recognition and site‐specific amino acid predictions. Our coarse‐grained model is compared to state‐of‐the‐art methods that use full atomic detail. This article illustrates how the use of simple probabilistic models can lead to new opportunities in the treatment of nonlocal interactions in knowledge‐based protein structure prediction and design. Proteins 2013; 81:1340–1350. © 2013 Wiley Periodicals, Inc.
8.
Residue coevolution has recently emerged as an important concept, especially in the context of protein structures. While a multitude of different functions for quantifying it have been proposed, not much is known about their relative strengths and weaknesses. Also, subtle algorithmic details have discouraged implementing and comparing them. We addressed this issue by developing an integrated online system that enables comparative analyses with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Variation (ELSC), mutual information and correlation-based methods. A set of data preprocessing options is provided for improving the sensitivity and specificity of coevolution signal detection, including sequence weighting, residue grouping and the filtering of sequences, sites and site pairs. A total of more than 100 scoring variations are available. The system also provides facilities for studying the relationship between coevolution scores and inter-residue distances from a crystal structure if provided, which may help in understanding protein structures. AVAILABILITY: The system is available at http://coevolution.gersteinlab.org. The source code and JavaDoc API can also be downloaded from the web site.
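Of the scoring functions listed above, mutual information is the simplest to write down. The following is a minimal sketch of plain mutual information between two alignment columns, assuming unweighted sequences and no gap handling; it is not the implementation used by the online system, and the function name `mutual_information` and the toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Two toy columns that covary perfectly, and one pair that varies independently.
print(mutual_information("AAGG", "LLVV"))
print(mutual_information("AAGG", "LVLV"))
```

Options such as sequence weighting or residue grouping, mentioned in the abstract, would change how the counts are accumulated before this formula is applied.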
9.
Patients with laryngeal cancer with early relapse usually have a poor prognosis. In this study, we aimed to identify a multi-gene signature to improve relapse prediction in laryngeal cancer. One microarray data set, GSE27020 (training set, N = 109), and one RNA-sequencing data set (validation set, N = 85) were included in the analysis. In the training set, the microarray expression profile was re-annotated into an mRNA-long noncoding RNA (lncRNA) biphasic profile. Then, a LASSO Cox regression model identified nine relapse-related RNAs (eight mRNAs and one lncRNA), and a risk score was calculated for each sample according to the model coefficients. High-risk patients showed poorer relapse-free survival than low-risk patients (hazard ratio (HR): 6.189, 95% confidence interval (CI): 3.075-12.460, P < 0.0001). The risk score demonstrated good accuracy in predicting relapse (area under the time-dependent receiver-operating characteristic curve (AUC): 0.859 at 1 year, 0.822 at 3 years, and 0.815 at 5 years). The results were validated in the validation set (HR: 3.762, 95% CI: 1.594-8.877, P = 0.011; AUC: 0.770 at 1 year, 0.769 at 3 years, and 0.728 at 5 years). The multivariate analysis reached consistent results after adjustment for multiple confounders. When compared with a 27-gene signature, a 2-lncRNA signature, and Tumor-Node-Metastasis stage, the risk score also showed better performance (P < 0.05). In conclusion, we successfully developed a robust mRNA-lncRNA signature that can accurately predict relapse in laryngeal cancer.
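A risk score computed "according to the model coefficients" is conventionally a weighted sum of expression values, with one weight per selected RNA. The sketch below shows that calculation under stated assumptions: the gene names and coefficients are hypothetical placeholders (the abstract does not list them), and splitting patients at the median score is an assumed cutoff, not one confirmed by the abstract.

```python
from statistics import median

def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over the signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)

def stratify(scores):
    """Label each sample 'high' or 'low' relative to the median score (assumed cutoff)."""
    cutoff = median(scores.values())
    return {sample: ("high" if score > cutoff else "low")
            for sample, score in scores.items()}

# Hypothetical two-gene signature and three samples, for illustration only.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
samples = {
    "patient_1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "patient_2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "patient_3": {"GENE_A": 0.0, "GENE_B": 4.0},
}
scores = {name: risk_score(expr, coefficients) for name, expr in samples.items()}
print(scores)
print(stratify(scores))
```

In practice the coefficients would come from the fitted LASSO Cox model, and the groups would then be compared on relapse-free survival.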
10.
The genome of the model plant species Arabidopsis thaliana has recently been sequenced. To accelerate its current genome research, we developed a whole-genome, BAC/BIBAC-based, integrated physical, genetic, and sequence map of the A. thaliana ecotype Columbia. This new map was constructed from the clones of a new plant-transformation-competent BIBAC library and is integrated with the existing sequence map. The clones were restriction fingerprinted by DNA sequencing gel-based electrophoresis, assembled into contigs, and anchored to an existing genetic map. The map consists of 194 BAC/BIBAC contigs, spanning 126 Mb of the 130-Mb Arabidopsis genome. A total of 120 contigs, spanning 114 Mb, were anchored to the chromosomes of Arabidopsis. Accuracy of the integrated map was verified using the existing physical and sequence maps and numerous DNA markers. Integration of the new map with the sequence map has enabled gap closure of the sequence map and will facilitate functional analysis of the genome sequence. The method used here has been demonstrated to be sufficient for whole-genome physical mapping from large-insert random bacterial clones and thus is applicable to rapid development of whole-genome physical maps for other species.
12.
Modeling of the properties of biochemical components is gaining increasing interest due to its potential for further application within the area of biochemical process development. Generally, protein solution properties such as protein solubility are expressed through component activity coefficients, which are studied here. The original UNIQUAC model is chosen for the representation of protein activity coefficients and, to the best of our knowledge, this is the first time it has been directly applied to protein solutions. Ten different protein-salt-water systems with four different proteins, serum albumin, alpha-chymotrypsin, beta-lactoglobulin and ovalbumin, are investigated. A root-mean-squared deviation of 0.54% is obtained for the model by comparing calculated protein activity coefficients and protein activity coefficients deduced from osmotic measurements through virial expansion. Model predictions are used to analyze the effect of salt concentrations, pH, salt types, and temperature on protein activity coefficients and also on protein solubility and demonstrate consistency with results from other references. © 1997 John Wiley & Sons, Inc. Biotechnol Bioeng 55: 65-71, 1997.
13.
Background: The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse. Results: In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein. Conclusion: Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.
14.
In predicting hierarchical protein function annotations, such as terms in the Gene Ontology (GO), the simplest approach makes predictions for each term independently. However, this approach has the unfortunate consequence that the predictor may assign to a single protein a set of terms that are inconsistent with one another; for example, the predictor may assign a specific GO term to a given protein ('purine nucleotide binding') but not assign the parent term ('nucleotide binding'). Such predictions are difficult to interpret. In this work, we focus on methods for calibrating and combining independent predictions to obtain a set of probabilistic predictions that are consistent with the topology of the ontology. We call this procedure 'reconciliation'. We begin with a baseline method for predicting GO terms from a collection of data types using an ensemble of discriminative classifiers. We apply the method to a previously described benchmark data set, and we demonstrate that the resulting predictions are frequently inconsistent with the topology of the GO. We then consider 11 distinct reconciliation methods: three heuristic methods; four variants of a Bayesian network; an extension of logistic regression to the structured case; and three novel projection methods - isotonic regression and two variants of a Kullback-Leibler projection method. We evaluate each method in three different modes - per term, per protein and joint - corresponding to three types of prediction tasks. Although the principal goal of reconciliation is interpretability, it is important to assess whether interpretability comes at a cost in terms of precision and recall. Indeed, we find that many apparently reasonable reconciliation methods yield reconciled probabilities with significantly lower precision than the original, unreconciled estimates. 
On the other hand, we find that isotonic regression usually performs better than the underlying, unreconciled method, and almost never performs worse; isotonic regression appears to be able to use the constraints from the GO network to its advantage. An exception to this rule is the high precision regime for joint evaluation, where Kullback-Leibler projection yields the best performance.
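The consistency constraint behind reconciliation is that a parent term should never receive a lower probability than any of its descendants. The sketch below shows one simple way to enforce it, by raising every ancestor to the maximum probability found among its descendants. This is an illustrative heuristic chosen for brevity; it is not the isotonic regression or Kullback-Leibler projection evaluated in the abstract, and the abstract does not say whether it matches any of the three heuristic methods. The function name, the probabilities, and the unrelated term are hypothetical.

```python
def reconcile_by_max(probabilities, parents):
    """Raise each ancestor's probability to at least that of every descendant.

    probabilities maps term -> independent prediction;
    parents maps term -> list of direct parent terms (a DAG).
    """
    reconciled = dict(probabilities)

    def raise_ancestors(term, value):
        for parent in parents.get(term, []):
            if reconciled.get(parent, 0.0) < value:
                reconciled[parent] = value
            # Keep climbing: a more distant ancestor may still be too low.
            raise_ancestors(parent, value)

    for term, value in probabilities.items():
        raise_ancestors(term, value)
    return reconciled

# The inconsistent example from the text: the specific term outscores its parent.
predictions = {
    "purine nucleotide binding": 0.8,
    "nucleotide binding": 0.3,
    "unrelated term": 0.1,  # hypothetical term with no parent/child relation here
}
parents = {"purine nucleotide binding": ["nucleotide binding"]}
print(reconcile_by_max(predictions, parents))
```

A heuristic like this only ever raises probabilities, which is one way reconciled estimates can lose precision relative to the original predictions.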
15.
The advent of high-throughput phenotyping technologies has created a deluge of information that is difficult to deal with without the appropriate data management tools. These data management tools should integrate defined workflow controls for genomic-scale data acquisition and validation, data storage and retrieval, and data analysis, indexed around the genomic information of the organism of interest. To maximize the impact of these large datasets, it is critical that they are rapidly disseminated to the broader research community, allowing open access for data mining and discovery. We describe here a system that incorporates such functionalities developed around the Purdue University high-throughput ionomics phenotyping platform. The Purdue Ionomics Information Management System (PiiMS) provides integrated workflow control, data storage, and analysis to facilitate high-throughput data acquisition, along with integrated tools for data search, retrieval, and visualization for hypothesis development. PiiMS is deployed as a World Wide Web-enabled system, allowing for integration of distributed workflow processes and open access to raw data for analysis by numerous laboratories. PiiMS currently contains data on shoot concentrations of P, Ca, K, Mg, Cu, Fe, Zn, Mn, Co, Ni, B, Se, Mo, Na, As, and Cd in over 60,000 shoot tissue samples of Arabidopsis (Arabidopsis thaliana), including ethyl methanesulfonate, fast-neutron and defined T-DNA mutants, and natural accession and populations of recombinant inbred lines from over 800 separate experiments, representing over 1,000,000 fully quantitative elemental concentrations. PiiMS is accessible at www.purdue.edu/dp/ionomics.
17.
AraNet is a functional gene network for the reference plant Arabidopsis and has been constructed in order to identify new genes associated with plant traits. It is highly predictive for diverse biological pathways and can be used to prioritize genes for functional screens. Moreover, AraNet provides a web-based tool with which plant biologists can efficiently discover novel functions of Arabidopsis genes (http://www.functionalnet.org/aranet/). This protocol explains how to conduct network-based prediction of gene functions using AraNet and how to interpret the prediction results. Functional discovery in plant biology is facilitated by combining candidate prioritization by AraNet with focused experimental tests.
18.
MOTIVATION: Short linear peptide motifs mediate protein-protein interaction, cell compartment targeting and represent the sites of post-translational modification. The identification of functional motifs by conventional sequence searches, however, is hampered by the short length of the motifs resulting in a large number of hits of which only a small portion is functional. RESULTS: We have developed a procedure for the identification of functional motifs, which scores pattern conservation in homologous sequences by taking explicitly into account the sequence similarity to the query sequence. For a further improvement of this method, sequence filters have been optimized to mask those sequence regions containing little or no linear motifs. The performance of this approach was verified by measuring its ability to identify 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. Compared to a random selection procedure, the joint application of sequence filters and the novel scoring scheme resulted in a 9-fold enrichment of validated functional motifs on the first rank. In addition, only half as many hits need to be investigated to recover 75% of the functional instances in our dataset. Therefore, this motif-scoring approach should be helpful to guide experiments because it allows focusing on those short linear peptide motifs that have a high probability to be functional.
19.
Because membrane proteins play a key role in drug targeting, the prediction of transmembrane proteins is an active and challenging area of the biological sciences. Location-based prediction of transmembrane proteins is significant for the functional annotation of protein sequences. Hidden Markov model-based methods have been widely applied to transmembrane topology prediction. Here we present a revised and more easily understood model than an existing one for transmembrane protein prediction. MATLAB scripts were written and compiled for parameter estimation of the model, and the model was applied to amino acid sequences to identify transmembrane regions and their adjacent locations. The estimated model of transmembrane topology was based on the TMHMM model architecture. Only 7 super states are defined in the given dataset, which were converted to 96 states on the basis of their length in the sequence. The prediction accuracy of the model was about 74%, which is good enough in the area of transmembrane topology prediction. We therefore conclude that the hidden Markov model plays a crucial role in transmembrane helix prediction on the MATLAB platform and could also be useful for drug discovery strategies. AVAILABILITY: The database is available for free at bioinfonavneet@gmail.com, vinaysingh@bhu.ac.in.
20.
Background: The development of high-throughput technologies has produced several large scale protein interaction data sets for multiple species, and significant efforts have been made to analyze the data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain interactions, it is crucial to understand protein interactions at the level of the domains. The availability of many diverse biological data sets provides an opportunity to discover the underlying domain interactions within protein interactions through an integration of these biological data sets.