Similar literature
1.
MOTIVATION: Experimental techniques alone cannot keep up with the production rate of protein sequences, while computational techniques for protein structure prediction have matured enough to provide reliable structural characterization of proteins at large scale. Integration of multiple computational tools for protein structure prediction can complement experimental techniques. RESULTS: We present an automated pipeline for protein structure prediction. The centerpiece of the pipeline is our threading-based protein structure prediction system PROSPECT. The pipeline consists of a dozen tools for identification of protein domains and signal peptides, protein triage to determine the protein type (membrane or globular), protein fold recognition, generation of atomic structural models, prediction result validation, etc. Different processing and prediction branches are determined automatically by a prediction pipeline manager based on identified characteristics of the protein. The pipeline has been implemented to run in a heterogeneous computational environment as a client/server system with a web interface. Genome-scale applications on Caenorhabditis elegans, Pyrococcus furiosus and three cyanobacterial genomes are presented. AVAILABILITY: The pipeline is available at http://compbio.ornl.gov/proteinpipeline/

2.
Is it better to combine predictions?   Total citations: 2, self-citations: 0, citations by others: 2
We have compared the accuracy of the individual protein secondary structure prediction methods PHD, DSC, NNSSP and Predator against the accuracy obtained by combining the predictions of the methods. A range of ways of combining predictions were tested: voting, biased voting, linear discrimination, neural networks and decision trees. The combined methods that involve 'learning' (the non-voting methods) were trained using a set of 496 non-homologous domains; this dataset was biased as some of the secondary structure prediction methods had used them for training. We used two independent test sets to compare predictions: the first consisted of 17 non-homologous domains from CASP3 (Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction); the second set consisted of 405 domains that were selected in the same way as the training set, and were non-homologous to each other and the training set. On both test datasets the most accurate individual method was NNSSP, then PHD, DSC and the least accurate was Predator; however, it was not possible to conclusively show a significant difference between the individual methods. Comparing the accuracy of the single methods with that obtained by combining predictions, it was found that it was better to use a combination of predictions. On both test datasets it was possible to obtain an approximately 3% improvement in accuracy by combining predictions. In most cases the combined methods were statistically significantly better (at P = 0.05 on the CASP3 test set, and P = 0.01 on the EBI test set). On the CASP3 test dataset there was no significant difference in accuracy between any of the combined methods of prediction; on the EBI test dataset, linear discrimination and neural networks significantly outperformed voting techniques. We conclude that it is better to combine predictions.
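The simplest combination scheme named in this abstract, plain voting, can be illustrated with a short sketch. It is not the authors' implementation: the function names `majority_vote` and `q3`, the three-state string encoding (H, E, C) and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine equal-length secondary-structure strings by per-residue plurality vote.

    `predictions` is a list of strings such as "HHEC", one per method.
    Ties go to the earliest-listed method that voted for a top-scoring
    state (an arbitrary, hypothetical tie-breaking rule).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    consensus = []
    for column in zip(*predictions):  # one column per residue position
        counts = Counter(column)
        top = max(counts.values())
        winner = next(state for state in column if counts[state] == top)
        consensus.append(winner)
    return "".join(consensus)


def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

Biased voting would weight each method's vote, and the learned combiners (linear discrimination, neural networks, decision trees) would replace the counting step with a trained model; neither is shown here.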

3.
Background

Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence.

Results

We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set.

Conclusions

Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts).


4.
Proteins encoded by newly-emerged genes (‘orphan genes’) share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene prediction pipelines to accurately predict genes in genomes according to phylostratal origin. BRAKER and MAKER are existing, popular ab initio tools that infer gene structures by machine learning. Direct Inference is an evidence-based pipeline we developed to predict gene structures from alignments of RNA-Seq data. The BIND pipeline integrates ab initio predictions of BRAKER and Direct Inference; MIND combines Direct Inference and MAKER predictions. We use highly-curated Arabidopsis and yeast annotations as gold-standard benchmarks, and cross-validate in rice. Each pipeline under-predicts orphan genes (as few as 11 percent, under one prediction scenario). Increasing RNA-Seq diversity greatly improves prediction efficacy. The combined methods (BIND and MIND) yield the best predictions overall, with BIND identifying 68% of annotated orphan genes and 99% of ancient genes, and giving the highest sensitivity score regardless of dataset in Arabidopsis. We provide a lightweight, flexible, reproducible, and well-documented solution to improve gene prediction.

5.
Membrane proteins perform a variety of functions, all crucially dependent on their orientation in the membrane. However, neither the exact number of transmembrane domains (TMDs) nor the topology of most proteins has been experimentally determined. As a result, most scientists rely primarily on prediction algorithms to determine topology and TMD assignments. Since these can give contradictory results, single-algorithm-based predictions are unreliable. To map the extent of potential misanalysis, the predictions of nine algorithms on the yeast proteome are compared, and it is found that they show little agreement when predicting TMD number and termini orientation. To view all predictions in parallel, a webpage called TopologYeast (http://www.weizmann.ac.il/molgen/TopologYeast) was created. Each algorithm is compared with experimental data and poor agreement is found. The analysis suggests that more systematic data on protein topology are required to increase the training sets for prediction algorithms and to have accurate knowledge of membrane protein topology.

6.
For current state-of-the-art methods, the prediction of correct topology of membrane proteins has been reported to be above 80%. However, this performance has only been observed in small and possibly biased data sets obtained from protein structures or biochemical assays. Here, we test a number of topology predictors on an "unseen" set of proteins of known structure and also on four "genome-scale" data sets, including one recent large set of experimentally validated human membrane proteins with glycosylated sites. The set of glycosylated proteins is also used to examine the ability of prediction methods to separate membrane from nonmembrane proteins. The results show that methods utilizing multiple sequence alignments are overall superior to methods that do not. The best performance is obtained by TOPCONS, a consensus method that combines several of the other prediction methods. The best methods to distinguish membrane from nonmembrane proteins belong to the "Phobius" group of predictors. We further observe that the reported high accuracies in the smaller benchmark sets are not quite maintained in larger scale benchmarks. Instead, we estimate the performance of the best prediction methods for eukaryotic membrane proteins to be between 60% and 70%. The low agreement between predictions from different methods questions earlier estimates about the global properties of the membrane proteome. Finally, we suggest a pipeline to estimate these properties using a combination of the best predictors that could be applied in large-scale proteomics studies of membrane proteins.

7.
MOTIVATION: Many biomedical and clinical research problems involve discovering causal relationships between observations gathered from temporal events. Dynamic Bayesian networks are a powerful modeling approach to describe causal or apparently causal relationships, and support complex medical inference, such as future response prediction, automated learning, and rational decision making. Although many engines exist for creating Bayesian networks, most require a local installation and significant data manipulation to be practical for a general biologist or clinician. No software pipeline currently exists for interpretation and inference of dynamic Bayesian networks learned from biomedical and clinical data. RESULTS: miniTUBA is a web-based modeling system that allows clinical and biomedical researchers to perform complex medical/clinical inference and prediction using dynamic Bayesian network analysis with temporal datasets. The software allows users to choose different analysis parameters (e.g. Markov lags and prior topology), and continuously update their data and refine their results. miniTUBA can make temporal predictions to suggest interventions based on an automated learning process pipeline using all data provided. Preliminary tests using synthetic data and laboratory research data indicate that miniTUBA accurately identifies regulatory network structures from temporal data. AVAILABILITY: miniTUBA is available at http://www.minituba.org.

8.
In recent years, hybrid neural network approaches, which combine mechanistic and neural network models, have received considerable attention. These approaches are potentially very efficient for obtaining more accurate predictions of process dynamics by combining mechanistic and neural network models in such a way that the neural network model properly accounts for unknown and nonlinear parts of the mechanistic model. In this work, a full-scale coke-plant wastewater treatment process was chosen as a model system. Initially, a process data analysis was performed on the actual operational data by using principal component analysis. Next, a simplified mechanistic model and a neural network model were developed based on the specific process knowledge and the operational data of the coke-plant wastewater treatment process, respectively. Finally, the neural network was incorporated into the mechanistic model in both parallel and serial configurations. Simulation results showed that the parallel hybrid modeling approach achieved much more accurate predictions with good extrapolation properties as compared with the other modeling approaches even in the case of process upset caused by, for example, shock loading of toxic compounds. These results indicate that the parallel hybrid neural modeling approach is a useful tool for accurate and cost-effective modeling of biochemical processes, in the absence of other reasonably accurate process models.

9.
The integration of multiple predictors promises higher prediction accuracy than the accuracy that can be obtained with a single predictor. The challenge is how to select the best predictor at any given moment. Traditionally, multiple predictors are run in parallel and the one that generates the best result is selected for prediction. In this paper, we propose a novel approach for predictor integration based on the learning of historical predictions. Compared with the traditional approach, it does not require running all the predictors simultaneously. Instead, it uses classification algorithms such as k-Nearest Neighbor (k-NN) and Bayesian classification and dimension reduction technique such as Principal Component Analysis (PCA) to forecast the best predictor for the workload under study based on the learning of historical predictions. Then only the forecasted best predictor is run for prediction. Our experimental results show that it achieved 20.18% higher best predictor forecasting accuracy than the cumulative MSE based predictor selection approach used in the popular Network Weather Service system. In addition, it outperformed the observed most accurate single predictor in the pool for 44.23% of the performance traces.
Renato J. Figueiredo

11.
ABSTRACT: BACKGROUND: Accurate and efficient RNA secondary structure prediction remains an important open problem in computational molecular biology. Historically, advances in computing technology have enabled faster and more accurate RNA secondary structure predictions. Previous parallelized prediction programs achieved significant improvements in runtime, but their implementations were not portable from niche high-performance computers or easily accessible to most RNA researchers. With the increasing prevalence of multi-core desktop machines, a new parallel prediction program is needed to take full advantage of today's computing technology. FINDINGS: We present here the first implementation of RNA secondary structure prediction by thermodynamic optimization for modern multi-core computers. We show that GTfold predicts secondary structure in less time than UNAfold and RNAfold, without sacrificing accuracy, on machines with four or more cores. CONCLUSIONS: GTfold supports advances in RNA structural biology by reducing the timescales for secondary structure prediction. The difference will be particularly valuable to researchers working with lengthy RNA sequences, such as RNA viral genomes.

12.
Parsing a Cognitive Task: A Characterization of the Mind's Bottleneck   Total citations: 1, self-citations: 1, citations by others: 0
Parsing a mental operation into components, characterizing the parallel or serial nature of this flow, and understanding what each process ultimately contributes to response time are fundamental questions in cognitive neuroscience. Here we show how a simple theoretical model leads to an extended set of predictions concerning the distribution of response time and its alteration by simultaneous performance of another task. The model provides a synthesis of psychological refractory period and random-walk models of response time. It merely assumes that a task consists of three consecutive stages (perception, decision based on noisy integration of evidence, and response) and that the perceptual and motor stages can operate simultaneously with stages of another task, while the central decision process constitutes a bottleneck. We designed a number-comparison task that provided a thorough test of the model by allowing independent variations in number notation, numerical distance, response complexity, and temporal asynchrony relative to an interfering probe task of tone discrimination. The results revealed a parsing of the comparison task in which each variable affects only one stage. Numerical distance affects the integration process, which is the only step that cannot proceed in parallel and which makes a major contribution to response time variability. The other stages, mapping the numeral to an internal quantity and executing the motor response, can be carried out in parallel with another task. Changing the duration of these processes has no significant effect on the variance.

13.
With the discovery of diverse roles for RNA, its centrality in cellular functions has become increasingly apparent. A number of algorithms have been developed to predict RNA secondary structure. Their performance has been benchmarked by comparing structure predictions to reference secondary structures. Generally, algorithms are compared against each other and one is selected as best without statistical testing to determine whether the improvement is significant. In this work, it is demonstrated that the prediction accuracies of methods correlate with each other over sets of sequences. One possible reason for this correlation is that many algorithms use the same underlying principles. A set of benchmarks published previously for programs that predict a structure common to three or more sequences is statistically analyzed as an example to show that it can be rigorously evaluated using paired two-sample t-tests. Finally, a pipeline of statistical analyses is proposed to guide the choice of data set size and performance assessment for benchmarks of structure prediction. The pipeline is applied using 5S rRNA sequences as an example.
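Because the accuracies of two methods on the same sequences are correlated, the comparison this abstract recommends is a paired test on per-sequence differences. The sketch below computes only the paired t statistic and its degrees of freedom with the Python standard library. The function name and input layout are assumptions, and the resulting statistic still has to be compared against a t distribution (for example from a table) to obtain a P value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-sequence accuracy scores.

    scores_a[i] and scores_b[i] must be the accuracies of two methods on
    the same i-th benchmark sequence.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per sequence")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(n))
    return t, n - 1
```

Working on the differences removes the variation shared by both methods across sequences, which is why a paired test can detect an improvement that an unpaired comparison of mean accuracies would miss.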

14.
MOTIVATION: A central problem in bioinformatics is the assignment of function to sequenced open reading frames (ORFs). The most common approach is based on inferred homology using a statistically based sequence similarity (SIM) method, e.g. PSI-BLAST. Alternative non-SIM based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP). This is based on combining evidence from amino-acid attributes, predicted structure and phylogenetic patterns, and uses a combination of Inductive Logic Programming data mining and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology. In 2000/1, DMP was used to make public predictions of the function of 1309 Escherichia coli ORFs. Since then biological knowledge has advanced, allowing us to test our predictions. RESULTS: We examined the updated (20.02.02) Riley group genome annotation, and examined the scientific literature for direct experimental derivations of ORF function. Both tests confirmed the DMP predictions. Accuracy varied between rules, and with the detail of prediction, but they were generally significantly better than random. For voting rules, accuracies of 75-100% were obtained. Twenty-one of these DMP predictions have been confirmed by direct experimentation. The DMP rules also have interesting biological explanations. DMP is, to the best of our knowledge, the first non-SIM based prediction method to have been tested directly on new data. AVAILABILITY: We have designed the "Genepredictions" database for protein functional predictions. This is intended to act as an open repository for predictions for any organism and can be accessed at http://www.genepredictions.org

16.
It is widely acknowledged that species respond to climate change by range shifts. Robust predictions of such changes in species’ distributions are pivotal for conservation planning and policy making, and are thus major challenges in ecological research. Statistical species distribution models (SDMs) have been widely applied in this context, though they remain subject to criticism as they implicitly assume equilibrium, and incorporate neither dispersal, demographic processes nor biotic interactions explicitly. In this study, the effects of transient dynamics and ecological properties and processes on the prediction accuracy of SDMs for climate change projections were tested. A spatially explicit multi‐species dynamic population model was built, incorporating species‐specific and interspecific ecological processes, environmental stochasticity and climate change. Species distributions were sampled in different scenarios, and SDMs were estimated by applying generalised linear models (GLMs) and boosted regression trees (BRTs). Resulting model performances were related to prevailing ecological processes and temporal dynamics. SDM performance varied for different range dynamics. Prediction accuracies decreased when abrupt range shifts occurred as species were outpaced by the rate of climate change, and increased again when a new equilibrium situation was realised. When ranges contracted, prediction accuracies increased as the absences were predicted well. Far‐dispersing species were faster in tracking climate change, and were predicted more accurately by SDMs than short‐dispersing species. BRTs mostly outperformed GLMs. The presence of a predator, and the inclusion of its incidence as an environmental predictor, made BRTs and GLMs perform similarly. Results are discussed in light of other studies dealing with effects of ecological traits and processes on SDM performance.
Perspectives are given on further advancements of SDMs and for possible interfaces with more mechanistic approaches in order to improve predictions under environmental change.

17.
Membrane organization describes the relationship of proteins to the membrane, that is, whether the protein crosses the membrane or is integral to the membrane and its orientation with respect to the membrane. Membrane organization is determined primarily by the presence of two features which target proteins to the secretory pathway: the endoplasmic reticulum signal peptide and the α-helical transmembrane domain. In order to generate membrane organization annotation of high quality, confidence and throughput, the Membrane Organization (MemO) pipeline was developed, incorporating consensus feature prediction modules with integration and annotation rules derived from biological observations. The pipeline classifies proteins into six categories based on the presence or absence of predicted features: Soluble, intracellular proteins; Soluble, secreted proteins; Type I membrane proteins; Type II membrane proteins; Multi-span membrane proteins and Glycosylphosphatidylinositol anchored membrane proteins. The MemO pipeline represents an integrated strategy for the application of state-of-the-art bioinformatics tools to the annotation of protein membrane organization, a property which adds biological context to the large quantities of protein sequence information available.

18.
The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on two different datasets and showed that the use of a hidden Markov model/Gaussian mixture model hybrid exploits the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction.

19.
Machine learning (ML), together with the high volume of available satellite images, offers agronomists an alternative for crop yield prediction in decision support systems. This research explored the possibility of using monthly image composites from Sentinel-2 imagery for rice crop yield predictions one month before the harvesting period at the field level using ML techniques in Taiwan. Three ML models, including random forest (RF), support vector machine (SVM), and artificial neural networks (ANN), were designed to address the research question of yield predictions in four consecutive growing seasons from 2019 to 2020 using field survey data. The research findings of yield modeling and predictions showed that SVM slightly outperformed RF and ANN. The results of model validation, obtained from SVM models using the data from transplanting to ripening, showed that the root mean square percentage error (RMSPE) and the mean absolute percentage error (MAPE) values were 5.5% and 4.5% for the 2019 second crop, and 4.7% and 3.5% for the 2020 first crop, respectively. The results of yield predictions (obtained from SVM) for the 2019 second crop and the 2020 first crop evaluated against the government statistics indicated a close agreement between these two datasets, with the RMSPE and MAPE values generally smaller than 11.2% and 9.2%. The SVM model configuration parameters used for rice crop yield predictions indicated satisfactory results. The comparison results between the predicted yields and the official statistics showed slight underestimations, with RMSPE and MAPE values of 9.4% and 7.1% for the 2019 first crop (hindcast), and 11.0% and 9.4% for the 2020 second crop (forecast), respectively. This study demonstrates the validity of our methods for yield modeling and prediction from monthly Sentinel-2 image composites using ML algorithms.
The findings of this work could be useful for agronomists in formulating timely action plans to address national food security issues.
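The two error measures quoted in this abstract can be made concrete with a short sketch. The abstract does not state the exact formulas the authors used, so the commonly used definitions are assumed here: percentage errors are taken relative to the observed yield, MAPE averages their absolute values, and RMSPE is the square root of the mean of their squares. The function names are hypothetical.

```python
import math


def _relative_errors(observed, predicted):
    """Per-field relative errors (observed - predicted) / observed."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("observed values must be non-zero")
    return [(o - p) / o for o, p in zip(observed, predicted)]


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (assumed definition)."""
    errors = _relative_errors(observed, predicted)
    return 100.0 * sum(abs(e) for e in errors) / len(errors)


def rmspe(observed, predicted):
    """Root mean square percentage error, in percent (assumed definition)."""
    errors = _relative_errors(observed, predicted)
    return 100.0 * math.sqrt(sum(e * e for e in errors) / len(errors))
```

Under these definitions RMSPE is never smaller than MAPE for the same data, which is consistent with every RMSPE and MAPE pair reported above.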

20.
We have explored the possibility that consensus predictions of membrane protein topology might provide a means to estimate the reliability of a predicted topology. Using five current topology prediction methods and a test set of 60 Escherichia coli inner membrane proteins with experimentally determined topologies, we find that prediction performance varies strongly with the number of methods that agree, and that the topology of nearly half of all E. coli inner membrane proteins can be predicted with high reliability (>90% correct predictions) by a simple majority-vote approach.
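The majority-vote idea in this abstract, where the number of agreeing methods serves as a reliability indicator, can be sketched as follows. This is an illustration and not the authors' code: the function name and the encoding of a topology as a single label (for example "in-12" for N-terminus inside with 12 transmembrane helices) are assumptions.

```python
from collections import Counter


def consensus_topology(predictions):
    """Return (topology, votes) for a list of per-method topology labels.

    `topology` is the label backed by a strict majority of methods, or
    None when no label reaches a majority. `votes` is the size of the
    largest agreeing group and can be read as a rough reliability score.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    topology, votes = Counter(predictions).most_common(1)[0]
    if votes * 2 > len(predictions):  # strict majority of the methods
        return topology, votes
    return None, votes
```

With five methods, a protein on which all five agree would be placed in the most reliable class, and one with only a three-to-two split in a less reliable class.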
