Similar articles
 20 similar articles found (search time: 31 ms)
1.
Membrane proteins, which constitute approximately 20% of most genomes, are poorly tractable targets for experimental structure determination, so analysis by prediction and modelling makes an important contribution to their ongoing study. Membrane proteins form two main classes: alpha-helical and beta-barrel transmembrane proteins. Using a method based on Bayesian networks, which provide a flexible and powerful framework for statistical inference, we addressed alpha-helical topology prediction. This method has accuracies of 77.4% for prokaryotic proteins and 61.4% for eukaryotic proteins. The method described here represents an important advance in the computational determination of membrane protein topology and offers a useful, and complementary, tool for the analysis of membrane proteins for a range of applications.

2.
3.
Turn prediction in proteins using a pattern-matching approach   Total citations: 16, self-citations: 0, citations by others: 16
We extend the use of amino acid sequence patterns [Cohen, F.E., Abarbanel, R. M., Kuntz, I. D., & Fletterick, R. J. (1983) Biochemistry 22, 4894-4904] to the identification of turns in globular proteins. The approach uses a conservative strategy, combined with a hierarchical search (strongest patterns first) and length-dependent masking, to achieve high accuracy (95%) on a test set of proteins of known structure. Applying the same procedure to homologous families gives a 90% success rate. Straightforward changes are suggested to improve the predictive power. The computer program, written in Lisp, provides a general pattern-recognition language well suited for a number of investigations of protein and nucleic acid sequences.

4.
MOTIVATION: Mining the biomedical literature for references to genes and proteins always involves a tradeoff between high precision with false negatives, and high recall with false positives. Having a reliable method for assessing the relevance of literature mining results is crucial to finding ways to balance precision and recall, and for subsequently building automated systems to analyze these results. We hypothesize that abstracts and titles that discuss the same gene or protein use similar words. To validate this hypothesis, we built a dictionary- and rule-based system to mine Medline for references to genes and proteins, and used a Bayesian metric for scoring the relevance of each reference assignment. RESULTS: We analyzed the entire set of Medline records from 1966 to late 2001, and scored each gene and protein reference using a Bayesian estimated probability (EP) based on word frequency in a training set of 137837 known assignments from 30594 articles to 36197 gene and protein symbols. Two test sets of 148 and 150 randomly chosen assignments, respectively, were hand-validated and categorized as either good or bad. The distributions of EP values, when plotted on a log-scale histogram, are shown to markedly differ between good and bad assignments. Using EP values, recall was 100% at 61% precision (EP = 2 × 10⁻⁵), 63% at 88% precision (EP = 0.008), and 10% at 100% precision (EP = 0.1). These results show that Medline entries discussing the same gene or protein have similar word usage, and that our method of assessing this similarity using EP values is valid, and enables an EP cutoff value to be determined that accurately and reproducibly balances precision and recall, allowing automated analysis of literature mining results.
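
The precision/recall trade-off at a given EP cutoff, as reported in the abstract above, can be illustrated with a minimal sketch. The function name, the scores, and the good/bad labels below are hypothetical and are not taken from the paper. The sketch assumes that an assignment is accepted when its score is at or above the cutoff, and it adopts the convention that precision is 1.0 when nothing is accepted.

```python
from typing import Sequence, Tuple


def precision_recall_at_cutoff(
    scores: Sequence[float], labels: Sequence[bool], cutoff: float
) -> Tuple[float, float]:
    """Return (precision, recall) when assignments scoring >= cutoff are accepted."""
    # Labels of the assignments that pass the cutoff.
    accepted = [label for score, label in zip(scores, labels) if score >= cutoff]
    true_positives = sum(accepted)   # accepted assignments that are good
    total_good = sum(labels)         # all good assignments in the test set
    # Convention (assumption): precision is 1.0 when nothing is accepted.
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_good if total_good else 0.0
    return precision, recall


# Hypothetical scores and hand-validated labels (True = good assignment).
scores = [0.2, 0.05, 0.009, 0.001, 0.00003, 0.00002]
labels = [True, True, True, False, True, False]

for cutoff in (0.1, 0.008, 0.00002):
    p, r = precision_recall_at_cutoff(scores, labels, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f}, recall={r:.2f}")
```

Raising the cutoff tends to trade recall for precision, which is the pattern the reported EP thresholds describe.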

5.
Heritability is a central element in quantitative genetics. New molecular markers to assess genetic variance and heritability are continually under development. Molecular single nucleotide polymorphism (SNP) markers can be used to estimate variance components and heritability in populations where relationship information is unknown. In this study, we evaluated the capabilities of two Bayesian genomic models to estimate heritability in simulated populations. The populations comprised different family structures of either no or a limited number of relatives, a single quantitative trait, and one of two densities of SNP markers. All individuals were both genotyped and phenotyped. Results showed that the two models were capable of estimating heritability when true heritability was 0.15 or higher and populations had a sample size of 400 or higher. For heritabilities of 0.05, all models had difficulties in estimating the true heritability. The two Bayesian models were compared with a restricted maximum likelihood (REML) approach using a genomic relationship matrix. The comparison showed that the Bayesian approaches performed as well as the REML approach. Differences in family structure were in general not found to influence the estimation of the heritability. For the sample sizes used in this study, a 10-fold increase of SNP density did not improve precision estimates compared with set-ups with a less dense distribution of SNPs. The methods used in this study showed that it was possible to estimate heritabilities on the basis of SNPs in animals with direct measurements. This conclusion is valuable in cases when quantitative traits are either difficult or expensive to measure.

6.
A Bayesian network approach to operon prediction   Total citations: 5, self-citations: 0, citations by others: 5

7.
Recently, a large number of relatively inexpensive in vitro short-term tests have been developed to help predict the carcinogenicity of chemicals. The carcinogenicity prediction and battery selection (CPBS) method utilizes the results of such short-term tests to screen for chemicals that are most likely to cause cancer. The method is an integrated approach for analyzing large, often sparsely filled, data bases containing short-term test results, which often have only marginal representation of known non-carcinogens. The CPBS method is developed for the purpose of (i) determining the reliability and predictive capability of individual and batteries of short-term tests, and (ii) developing a strategy for formulating and selecting optimally preferred batteries of short-term tests for screening chemicals for further testing. The term 'optimally preferred' connotes the best acceptable combination of tests in terms of trade-offs among the multiple attributes of each test and resulting battery (e.g., cost, sensitivity, specificity, etc.). The CPBS method consists of 5 major tasks: (1) data consolidation, (2) parameter estimation, (3) predictivity calculation, (4) battery selection and (5) risk assessment. Although there is a great need for more research and improvement, the CPBS method at its present stage should add an important method to the maze of the thousands of new chemicals that are introduced into drugs, foods, consumer goods and to the environment every year. This method should also provide an enhanced identification procedure for classifying chemicals more accurately as suspected carcinogens or non-carcinogens.

8.
A new multi-model approach (MMA) for sweat loss prediction is proposed to improve prediction accuracy. MMA was computed as the average of sweat loss predicted by two existing thermoregulation models: the rational model SCENARIO and the empirical model Heat Strain Decision Aid (HSDA). Three independent physiological datasets, a total of 44 trials, were used to compare predictions by MMA, SCENARIO, and HSDA. The observed sweat losses were collected under different combinations of uniform ensembles, environmental conditions (15–40°C, RH 25–75%), and exercise intensities (250–600 W). Root mean square deviation (RMSD), residual plots, and paired t tests were used to compare predictions with observations. Overall, MMA reduced RMSD by 30–39% in comparison with either SCENARIO or HSDA, and increased the prediction accuracy to 66% from 34% or 55%. Of the MMA predictions, 70% fell within the range of mean observed value ± SD, while only 43% of SCENARIO and 50% of HSDA predictions fell within the same range. Paired t tests showed that differences between observations and MMA predictions were not significant, but differences between observations and SCENARIO or HSDA predictions were significant for two datasets. Thus, MMA predicted sweat loss more accurately than either of the two single models for the three datasets used. Future work will be to evaluate MMA using additional physiological data to expand the scope of populations and conditions.
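
The two computations named in the abstract above, averaging two model predictions and scoring them by RMSD, can be sketched as follows. This is only an illustration: the function names and all numbers are hypothetical (the abstract gives no trial-level data or units), and the two input lists merely stand in for SCENARIO and HSDA outputs.

```python
import math
from typing import List, Sequence


def multi_model_average(pred_a: Sequence[float], pred_b: Sequence[float]) -> List[float]:
    """Average two models' predictions trial by trial."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def rmsd(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square deviation between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical sweat losses for three trials (arbitrary units).
observed = [500.0, 700.0, 900.0]
model_a = [400.0, 800.0, 1000.0]   # stand-in for a rational model's output
model_b = [600.0, 650.0, 850.0]    # stand-in for an empirical model's output

mma = multi_model_average(model_a, model_b)
for name, pred in (("model A", model_a), ("model B", model_b), ("MMA", mma)):
    print(f"{name}: RMSD = {rmsd(pred, observed):.1f}")
```

In this made-up example the two models err in partly opposite directions, so their average has a lower RMSD than either one. Averaging does not guarantee an improvement in general.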

9.
Identifying the interface between two interacting proteins provides important clues to the function of a protein, and is becoming increasingly relevant to drug discovery. Here, surface patch analysis was combined with a Bayesian network to predict protein-protein binding sites with a success rate of 82% on a benchmark dataset of 180 proteins, improving by 6% on previous work and well above the 36% that would be achieved by a random method. A comparable success rate was achieved even when evolutionary information was missing, a further improvement on our previous method which was unable to handle incomplete data automatically. In a case study of the Mog1p family, we showed that our Bayesian network method can aid the prediction of previously uncharacterised binding sites and provide important clues to protein function. On Mog1p itself a putative binding site involved in the SLN1-SKN7 signal transduction pathway was detected, as was a Ran binding site, previously characterized solely by conservation studies, even though our automated method operated without using homologous proteins. On the remaining members of the family (two structural genomics targets, and a protein involved in the photosystem II complex in higher plants) we identified novel binding sites with little correspondence to those on Mog1p. These results suggest that members of the Mog1p family bind to different proteins and probably have different functions despite sharing the same overall fold. We also demonstrated the applicability of our method to drug discovery efforts by successfully locating a number of binding sites involved in the protein-protein interaction network of papilloma virus infection. In a separate study, we attempted to distinguish between the two types of binding site, obligate and non-obligate, within our dataset using a second Bayesian network. This proved difficult although some separation was achieved on the basis of patch size, electrostatic potential and conservation. Such was the similarity between the two interacting patch types that we were able to use obligate binding site properties to predict the location of non-obligate binding sites and vice versa.

10.

Background

Genomic prediction of breeding values from dense single nucleotide polymorphism (SNP) genotypes is used for livestock and crop breeding, and can also be used to predict disease risk in humans. For some traits, the most accurate genomic predictions are achieved with non-linear estimates of SNP effects from Bayesian methods that treat SNP effects as random effects from a heavy-tailed prior distribution. These Bayesian methods are usually implemented via Markov chain Monte Carlo (MCMC) schemes to sample from the posterior distribution of SNP effects, which is computationally expensive. Our aim was to develop an efficient expectation–maximisation algorithm (emBayesR) that gives estimates of SNP effects and accuracies of genomic prediction similar to those of the MCMC implementation of BayesR (a Bayesian method for genomic prediction), but with greatly reduced computation time.

Methods

emBayesR is an approximate EM algorithm that retains the BayesR model assumption with SNP effects sampled from a mixture of normal distributions with increasing variance. emBayesR differs from other proposed non-MCMC implementations of Bayesian methods for genomic prediction in that it estimates the effect of each SNP while allowing for the error associated with estimation of all other SNP effects. emBayesR was compared to BayesR using simulated data, and real dairy cattle data with 632 003 SNPs genotyped, to determine if the MCMC and the expectation-maximisation approaches give similar accuracies of genomic prediction.

Results

We were able to demonstrate that allowing for the error associated with estimation of other SNP effects when estimating the effect of each SNP in emBayesR improved the accuracy of genomic prediction over emBayesR without including this error correction, with both simulated and real data. When averaged over nine dairy traits, the accuracy of genomic prediction with emBayesR was only 0.5% lower than that from BayesR. However, emBayesR reduced computing time up to 8-fold compared to BayesR.

Conclusions

The emBayesR algorithm described here achieved similar accuracies of genomic prediction to BayesR for a range of simulated and real 630 K dairy SNP data. emBayesR needs less computing time than BayesR, which will allow it to be applied to larger datasets.

Electronic supplementary material

The online version of this article (doi:10.1186/s12711-014-0082-4) contains supplementary material, which is available to authorized users.

11.
Background

Leprosy remains concentrated among the poorest communities in low- and middle-income countries and it is one of the primary infectious causes of disability. Although there have been increasing advances in leprosy surveillance worldwide, leprosy underreporting is still common and can hinder decision-making regarding the distribution of financial and health resources and thereby limit the effectiveness of interventions. In this study, we estimated the proportion of unreported cases of leprosy in Brazilian microregions.

Methodology/Principal findings

Using data collected between 2007 and 2015 from each of the 557 Brazilian microregions, we applied a Bayesian hierarchical model that used the presence of grade 2 leprosy-related physical disabilities as a direct indicator of delayed diagnosis and a proxy for the effectiveness of the local leprosy surveillance program. We also analyzed some relevant factors that influence spatial variability in the observed mean incidence rate in the Brazilian microregions, highlighting the importance of socioeconomic factors and how they affect the levels of underreporting. We corrected leprosy incidence rates for each Brazilian microregion and estimated that, on average, 33,252 (9.6%) new leprosy cases went unreported in the country between 2007 and 2015, with this proportion varying from 8.4% to 14.1% across the Brazilian states.

Conclusions/Significance

The magnitude and distribution of leprosy underreporting were adequately explained by a model using grade 2 disability as a marker for the ability of the system to detect new missing cases. The percentage of missed cases was significant, and efforts are warranted to improve leprosy case detection. Our estimates in Brazilian microregions can be used to guide effective interventions, efficient resource allocation, and target actions to mitigate transmission.

12.

Background  

In high-density arrays, the identification of relevant genes for disease classification is complicated not only by the curse of dimensionality but also by the highly correlated nature of the array data. In this paper, we are interested in the question of how many and which genes should be selected for a disease class prediction. Our work consists of a Bayesian supervised statistical learning approach to refine gene signatures with a regularization that penalizes the correlation between the selected variables.

13.
14.
In equine breeding, the number of spermatozoa ejaculated is considered an important factor in fertility. Methods for predicting the number of spermatozoa have been derived from semen collection procedures. A once-daily collection period for 10 days is a standard recommendation to predict long-term daily sperm output (DSO). The first objective of this study was to determine the precision or repeatability of these DSO predictions. Semen was collected and evaluated daily during four periods for 10 days, for 15 different stallions. The analytical methods utilized hierarchical Bayesian modeling as implemented by Gibbs sampling. The overall population model showed an initial decline in total sperm number of 1.54 billion spermatozoa per day until the observed mean change point of 4.71 days, at which time mean DSO was estimated at 5.28 billion spermatozoa per day. The hierarchical model showed standard deviations in DSO within-stallion of 0.67 billion spermatozoa per day and among-stallion of 1.86 billion spermatozoa per day. The study's second objective was to determine how testicular size affected DSO models. When the model was extended to include testicular size, the optimal prediction of DSO was that DSO = 0.79 + 0.018 × testicular size (in milliliters). Testicular size explained 36.5% of the among-stallion standard deviation in DSO, but was not significantly related to the mean number of collection-days required to reach DSO.
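
The linear prediction quoted in the abstract above can be evaluated directly, as in the short sketch below. The function name and the example testicular size are hypothetical. The sketch also assumes that the predicted DSO is in billions of spermatozoa per day, the unit used elsewhere in the abstract, which the abstract does not state explicitly for this equation.

```python
def predict_dso(testicular_size_ml: float) -> float:
    """Predicted daily sperm output (assumed: billions of spermatozoa per day).

    Uses the relationship quoted in the abstract:
    DSO = 0.79 + 0.018 * testicular size (mL).
    """
    return 0.79 + 0.018 * testicular_size_ml


# Hypothetical stallion with a testicular size of 250 mL.
print(f"Predicted DSO: {predict_dso(250.0):.2f}")
```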

15.
Sethi D, Garg A, Raghava GP. Amino Acids 2008, 35(3): 599-605
The association of structurally disordered proteins with a number of diseases has engendered enormous interest and therefore demands a prediction method that would facilitate their expeditious study at the molecular level. The present study describes the development of a computational method for predicting disordered proteins using sequence and profile compositions as input features for the training of SVM models. First, we developed the amino acid and dipeptide composition-based SVM modules, which yielded sensitivities of 75.6 and 73.2% along with Matthews correlation coefficient (MCC) values of 0.75 and 0.60, respectively. In addition, the use of predicted secondary structure content (coil, sheet and helices) in the form of composition values attained a sensitivity of 76.8% and an MCC value of 0.77. Finally, the training of SVM models using evolutionary information hidden in the multiple sequence alignment profile improved the prediction performance by achieving a sensitivity value of 78% and an MCC of 0.78. Furthermore, when evaluated on an independent dataset of partially disordered proteins, the same SVM module provided a correct prediction rate of 86.6%. Based on the above study, a web server (“DPROT”) was developed for the prediction of disordered proteins, which is available at .
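
The abstract above reports sensitivity and MCC. As a reminder of how these two metrics follow from a confusion matrix, here is a minimal sketch using the standard definitions. The function names and counts are hypothetical and are not taken from the study. Returning 0.0 for MCC when the denominator is zero is a common convention, adopted here as an assumption.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion matrix for a disordered/ordered classifier.
tp, fn, tn, fp = 78, 22, 90, 10
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"MCC         = {matthews_corrcoef(tp, tn, fp, fn):.3f}")
```

Unlike sensitivity, MCC uses all four cells of the confusion matrix, so false positives lower it as well.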

16.

Background  

Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information relevant to the sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM).

17.
J Jiang, Q Zhang, L Ma, J Li, Z Wang, J-F Liu. Heredity 2015, 115(1): 29-36
Predicting organismal phenotypes from genotype data is important for preventive and personalized medicine as well as plant and animal breeding. Although genome-wide association studies (GWAS) for complex traits have discovered a large number of trait- and disease-associated variants, phenotype prediction based on associated variants usually has low accuracy even for a high-heritability trait because these variants typically account for only a limited fraction of total genetic variance. In comparison with GWAS, whole-genome prediction (WGP) methods can increase prediction accuracy by making use of a huge number of variants simultaneously. Among various statistical methods for WGP, the multiple-trait model and the antedependence model show their respective advantages. To take advantage of both strategies within a unified framework, we proposed a novel multivariate antedependence-based method for joint prediction of multiple quantitative traits using a Bayesian algorithm that models a linear relationship between the effect vectors of each pair of adjacent markers. Through both simulation and real-data analyses, our studies demonstrated that the proposed antedependence-based multiple-trait WGP method is more accurate and robust than corresponding traditional counterparts (Bayes A and multi-trait Bayes A) under various scenarios. Our method can be readily extended to deal with missing phenotypes and resequence data with rare variants, offering a feasible way to jointly predict phenotypes for multiple complex traits in human genetic epidemiology as well as plant and livestock breeding.

18.
19.
We developed a software tool (SlidingBayes) for recombination analysis based on Bayesian phylogenetic inference. SlidingBayes provides a powerful approach for detecting potential recombination, especially between highly divergent sequences and complex HIV-1 recombinants for which simpler methods like neighbor joining (NJ) may be less powerful. SlidingBayes guides Markov chain Monte Carlo (MCMC) sampling performed by MrBayes in a sliding window across the alignment (Bayesian scanning). The tool can be used for nucleotide and amino acid sequences and combines all the modeling possibilities of MrBayes with the ability to plot the posterior probability support for clustering of various combinations of taxa.

20.
The probable arrangement of the bacteriorhodopsin molecules in the purple membrane of Halobacterium halobium is in clusters of three, with a 3-fold axis at the centre of each cluster; the axis is at right angles to the plane of the membrane. The proposed arrangement and the results of model calculations together indicate that each protein molecule spans the entire thickness of the membrane. An earlier proposal for the structure had the protein molecules in two layers, and it was symmetric in projection onto the profile-axis. This model is now rejected since it would be difficult to account for the recently discovered function of pumping protons. There remains a discrepancy in that the calculated number of protein molecules in the unit-cell is 3.4 compared to the three expected. The X-ray diffraction patterns from dispersions of the lipids extracted from the red and purple membranes of H. halobium are described. Model calculations are reported, which are based on the bilayer profile calculated for the extracted lipids and on two simple profiles for the protein. The calculations favour a structure for the purple membrane having the lipid molecules in two layers, as in a bilayer, although there may be more of the lipid on one side of the membrane than on the other. Assuming bilayer structure, the diffraction nearest the centre of the oriented pattern suggests that the lipid molecules may be located mainly in a few discrete regions, roughly 20 Å across, between the protein molecules. An uninterrupted monolayer of the lipid on one surface of a sheet of the protein molecules gives poor agreement with the observed profile-diffraction. The X-ray diffraction pattern from the oriented membranes suggested α-helix in the bacteriorhodopsin, and this has been confirmed by recording a 1.5 Å-reflection oriented on the profile-axis. There appear to be at least two segments of α-helix, which are somewhat inclined to one another, and the two may be packed together. Prominent diffraction on the in-plane axis near 10 Å is consistent with the segments lying more or less perpendicular to the plane of the membrane.

