首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Given the growing amount of biological data, data mining methods have become an integral part of bioinformatics research. Unfortunately, standard data mining tools are often not sufficiently equipped for handling raw data such as e.g. amino acid sequences. One popular and freely available framework that contains many well-known data mining algorithms is the Waikato Environment for Knowledge Analysis (Weka). In the BioWeka project, we introduce various input formats for bioinformatics data and bioinformatics methods like alignments to Weka. This allows users to easily combine them with Weka's classification, clustering, validation and visualization facilities on a single platform and therefore reduces the overhead of converting data between different data formats as well as the need to write custom evaluation procedures that can deal with many different programs. We encourage users to participate in this project by adding their own components and data formats to BioWeka. Availability: The software, documentation and tutorial are available at http://www.bioweka.org.  相似文献   

2.
The question of how the structure of a neuronal network affects its functionality has gained a lot of attention in neuroscience. However, the vast majority of the studies on structure-dynamics relationships consider few types of network structures and assess limited numbers of structural measures. In this in silico study, we employ a wide diversity of network topologies and search among many possibilities the aspects of structure that have the greatest effect on the network excitability. The network activity is simulated using two point-neuron models, where the neurons are activated by noisy fluctuation of the membrane potential and their connections are described by chemical synapse models, and statistics on the number and quality of the emergent network bursts are collected for each network type. We apply a prediction framework to the obtained data in order to find out the most relevant aspects of network structure. In this framework, predictors that use different sets of graph-theoretic measures are trained to estimate the activity properties, such as burst count or burst length, of the networks. The performances of these predictors are compared with each other. We show that the best performance in prediction of activity properties for networks with sharp in-degree distribution is obtained when the prediction is based on clustering coefficient. By contrast, for networks with broad in-degree distribution, the maximum eigenvalue of the connectivity graph gives the most accurate prediction. The results shown for small () networks hold with few exceptions when different neuron models, different choices of neuron population and different average degrees are applied. We confirm our conclusions using larger () networks as well. Our findings reveal the relevance of different aspects of network structure from the viewpoint of network excitability, and our integrative method could serve as a general framework for structure-dynamics studies in biosciences.  相似文献   

3.
MOTIVATION: We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. We investigate two associated problems: first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the putative candidates in order to select the best one. For searching methods where the score distributions are known, p-values are used as confidence measure with great success. For the cases where such theoretical backing is absent, we propose empirical approximations to p-values for searching procedures. RESULTS: As a baseline, we review the performances of different methods for detecting remote protein folds (sequence alignment and threading, with and without sequence profiles, global and local). The analysis is performed on a large representative set of protein structures. For fold recognition, we find that methods using sequence profiles generally perform better than methods using plain sequences, and that threading methods perform better than sequence alignment methods. In order to assess the quality of the predictions made, we establish and compare several confidence measures, including raw scores, z-scores, raw score gaps, z-score gaps, and different methods of p-value estimation. We work our way from the theoretically well backed local scores towards more explorative global and threading scores. The methods for assessing the statistical significance of predictions are compared using specificity--sensitivity plots. For local alignment techniques we find that p-value methods work best, albeit computationally cheaper methods such as those based on score gaps achieve similar performance. For global methods where no theory is available methods based on score gaps work best. By using the score gap functions as the measure of confidence we improve the more powerful fold recognition methods for which p-values are unavailable. AVAILABILITY: The benchmark set is available upon request.  相似文献   

4.
Copy number variation (CNV) has played an important role in studies of susceptibility or resistance to complex diseases. Traditional methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution of genomic regions. Following the emergence of next generation sequencing (NGS) technologies, CNV detection methods based on the short read data have recently been developed. However, due to the relatively young age of the procedures, their performance is not fully understood. To help investigators choose suitable methods to detect CNVs, comparative studies are needed. We compared six publicly available CNV detection methods: CNV-seq, FREEC, readDepth, CNVnator, SegSeq and event-wise testing (EWT). They are evaluated both on simulated and real data with different experiment settings. The receiver operating characteristic (ROC) curve is employed to demonstrate the detection performance in terms of sensitivity and specificity, box plot is employed to compare their performances in terms of breakpoint and copy number estimation, Venn diagram is employed to show the consistency among these methods, and F-score is employed to show the overlapping quality of detected CNVs. The computational demands are also studied. The results of our work provide a comprehensive evaluation on the performances of the selected CNV detection methods, which will help biological investigators choose the best possible method.  相似文献   

5.
ABSTRACT: BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naive Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.  相似文献   

6.
Robots that simulate patients suffering from joint resistance caused by biomechanical and neural impairments are used to aid the training of physical therapists in manual examination techniques. However, there are few methods for assessing such robots. This article proposes two types of assessment measures based on typical judgments of clinicians. One of the measures involves the evaluation of how well the simulator presents different severities of a specified disease. Experienced clinicians were requested to rate the simulated symptoms in terms of severity, and the consistency of their ratings was used as a performance measure. The other measure involves the evaluation of how well the simulator presents different types of symptoms. In this case, the clinicians were requested to classify the simulated resistances in terms of symptom type, and the average ratios of their answers were used as performance measures. For both types of assessment measures, a higher index implied higher agreement among the experienced clinicians that subjectively assessed the symptoms based on typical symptom features. We applied these two assessment methods to a patient knee robot and achieved positive appraisals. The assessment measures have potential for use in comparing several patient simulators for training physical therapists, rather than as absolute indices for developing a standard.  相似文献   

7.
Prediction of β-turns from amino acid sequences has long been recognized as an important problem in structural bioinformatics due to their frequent occurrence as well as their structural and functional significance. Because various structural features of proteins are intercorrelated, secondary structure information has been often employed as an additional input for machine learning algorithms while predicting β-turns. Here we present a novel bidirectional Elman-type recurrent neural network with multiple output layers (MOLEBRNN) capable of predicting multiple mutually dependent structural motifs and demonstrate its efficiency in recognizing three aspects of protein structure: β-turns, β-turn types, and secondary structure. The advantage of our method compared to other predictors is that it does not require any external input except for sequence profiles because interdependencies between different structural features are taken into account implicitly during the learning process. In a sevenfold cross-validation experiment on a standard test dataset our method exhibits the total prediction accuracy of 77.9% and the Mathew's Correlation Coefficient of 0.45, the highest performance reported so far. It also outperforms other known methods in delineating individual turn types. We demonstrate how simultaneous prediction of multiple targets influences prediction performance on single targets. The MOLEBRNN presented here is a generic method applicable in a variety of research fields where multiple mutually depending target classes need to be predicted. Availability: http://webclu.bio.wzw.tum.de/predator-web/.  相似文献   

8.

Background  

Prediction of the transmembrane strands and topology of β-barrel outer membrane proteins is of interest in current bioinformatics research. Several methods have been applied so far for this task, utilizing different algorithmic techniques and a number of freely available predictors exist. The methods can be grossly divided to those based on Hidden Markov Models (HMMs), on Neural Networks (NNs) and on Support Vector Machines (SVMs). In this work, we compare the different available methods for topology prediction of β-barrel outer membrane proteins. We evaluate their performance on a non-redundant dataset of 20 β-barrel outer membrane proteins of gram-negative bacteria, with structures known at atomic resolution. Also, we describe, for the first time, an effective way to combine the individual predictors, at will, to a single consensus prediction method.  相似文献   

9.

Background  

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.  相似文献   

10.
While only a few studies have analysed training methods used on working dogs, a recent survey in 303 Belgian military handlers revealed the use of harsh training methods on military working dogs (MWD). The present work aims at analysing the training methods used on Belgian MWD and the behaviour of handlers to objectify the performances of the dog handlers teams (DH teams) and the welfare of the animals.A standardized evaluation, including obedience and protection work exercises, was conducted on DH teams (n = 33). Every evaluation was done twice to assess the reliability of the observation methods. The behaviours of MWD and handlers were recorded on videotape and subsequently analysed. Results showed that handlers rewarded or punished their dogs intermittently. Stroking and patting the dogs were the most frequently used rewards. Pulling on the leash and hanging dogs by their collars were the most commonly used aversive stimuli.The team's performance was influenced by the training method and by the dog's concentration: (1) low-performance dogs received more aversive stimuli than high-performance dogs; (2) dog's distraction influenced the performance: distracted dogs performed less well.Handlers punished more and rewarded less at the second evaluation than at the first one. This suggests that handlers modified their usual behaviour at the first evaluation in view to present themselves in a positive light. During the second evaluation the dogs reacted to this higher frequency of aversive stimuli as they exhibited a lower posture after aversive stimuli. The authors cannot prove that the welfare of these dogs had been hampered, but there is an indication that it was under threat.Low team performances suggest that DH teams should train more regularly and undertake the usefulness of setting a new training system that would rely on: the use of more positive training methods, an increased training frequency, the elaboration of a course on training principles, and an improvement of dog handler relationship.  相似文献   

11.
The aggregation of proteins or peptides in amyloid fibrils is associated with a number of clinical disorders, including Alzheimer''s, Huntington''s and prion diseases, medullary thyroid cancer, renal and cardiac amyloidosis. Despite extensive studies, the molecular mechanisms underlying the initiation of fibril formation remain largely unknown. Several lines of evidence revealed that short amino-acid segments (hot spots), located in amyloid precursor proteins act as seeds for fibril elongation. Therefore, hot spots are potential targets for diagnostic/therapeutic applications, and a current challenge in bioinformatics is the development of methods to accurately predict hot spots from protein sequences. In this paper, we combined existing methods into a meta-predictor for hot spots prediction, called MetAmyl for METapredictor for AMYLoid proteins. MetAmyl is based on a logistic regression model that aims at weighting predictions from a set of popular algorithms, statistically selected as being the most informative and complementary predictors. We evaluated the performances of MetAmyl through a large scale comparative study based on three independent datasets and thus demonstrated its ability to differentiate between amyloidogenic and non-amyloidogenic polypeptides. Compared to 9 other methods, MetAmyl provides significant improvement in prediction on studied datasets. We further show that MetAmyl is efficient to highlight the effect of point mutations involved in human amyloidosis, so we suggest this program should be a useful complementary tool for the diagnosis of these diseases.  相似文献   

12.
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having high negative recognition rate (Specificity = SP) and low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, usually, the SE is increased at the expense of reducing some amount of the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods would be to increase the SE as high as possible by keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrates that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics, when increasing the SE through class imbalance learning methods. This characteristic of AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics datasets learning.  相似文献   

13.
Meta-DP: domain prediction meta-server   总被引:1,自引:0,他引:1  
SUMMARY: Meta-DP, a domain prediction meta-server provides a simple interface to predict domains in a given protein sequence using a number of domain prediction methods. The Meta-DP is a convenient resource because through accessing a single site, users automatically obtain the results of the various domain prediction methods along with a consensus prediction. The Meta-DP is currently coupled to 10 domain prediction servers and can be extended to include any number of methods. Meta-DP can thus become a centralized repository of available methods. Meta-DP was also used to evaluate the performance of 13 domain prediction methods in the context of CAFASP-DP. AVAILABILITY: The Meta-DP server is freely available at http://meta-dp.bioinformatics.buffalo.edu and the CAFASP-DP evaluation results are available at http://cafasp4.bioinformatics.buffalo.edu/dp/update.html CONTACT: hkaur@bioinformatics.buffalo.edu SUPPLEMENTARY INFORMATION: Available at http://cafasp4.bioinformatics.buffalo.edu/dp/update.html.  相似文献   

14.
Motivated by a clinical prediction problem, a simulation study was performed to compare different approaches for building risk prediction models. Robust prediction models for hospital survival in patients with acute heart failure were to be derived from three highly correlated blood parameters measured up to four times, with predictive ability having explicit priority over interpretability. Methods that relied only on the original predictors were compared with methods using an expanded predictor space including transformations and interactions. Predictors were simulated as transformations and combinations of multivariate normal variables which were fitted to the partly skewed and bimodally distributed original data in such a way that the simulated data mimicked the original covariate structure. Different penalized versions of logistic regression as well as random forests and generalized additive models were investigated using classical logistic regression as a benchmark. Their performance was assessed based on measures of predictive accuracy, model discrimination, and model calibration. Three different scenarios using different subsets of the original data with different numbers of observations and events per variable were investigated. In the investigated setting, where a risk prediction model should be based on a small set of highly correlated and interconnected predictors, Elastic Net and also Ridge logistic regression showed good performance compared to their competitors, while other methods did not lead to substantial improvements or even performed worse than standard logistic regression. Our work demonstrates how simulation studies that mimic relevant features of a specific data set can support the choice of a good modeling strategy.  相似文献   

15.

Background  

Prediction of protein structures is one of the fundamental challenges in biology today. To fully understand how well different prediction methods perform, it is necessary to use measures that evaluate their performance. Every two years, starting in 1994, the CASP (Critical Assessment of protein Structure Prediction) process has been organized to evaluate the ability of different predictors to blindly predict the structure of proteins. To capture different features of the models, several measures have been developed during the CASP processes. However, these measures have not been examined in detail before. In an attempt to develop fully automatic measures that can be used in CASP, as well as in other type of benchmarking experiments, we have compared twenty-one measures. These measures include the measures used in CASP3 and CASP2 as well as have measures introduced later. We have studied their ability to distinguish between the better and worse models submitted to CASP3 and the correlation between them.  相似文献   

16.
17.
Aim This study used data from temperate forest communities to assess: (1) five different stepwise selection methods with generalized additive models, (2) the effect of weighting absences to ensure a prevalence of 0.5, (3) the effect of limiting absences beyond the environmental envelope defined by presences, (4) four different methods for incorporating spatial autocorrelation, and (5) the effect of integrating an interaction factor defined by a regression tree on the residuals of an initial environmental model. Location State of Vaud, western Switzerland. Methods Generalized additive models (GAMs) were fitted using the grasp package (generalized regression analysis and spatial predictions, http://www.cscf.ch/grasp ). Results Model selection based on cross‐validation appeared to be the best compromise between model stability and performance (parsimony) among the five methods tested. Weighting absences returned models that perform better than models fitted with the original sample prevalence. This appeared to be mainly due to the impact of very low prevalence values on evaluation statistics. Removing zeroes beyond the range of presences on main environmental gradients changed the set of selected predictors, and potentially their response curve shape. Moreover, removing zeroes slightly improved model performance and stability when compared with the baseline model on the same data set. Incorporating a spatial trend predictor improved model performance and stability significantly. Even better models were obtained when including local spatial autocorrelation. A novel approach to include interactions proved to be an efficient way to account for interactions between all predictors at once. Main conclusions Models and spatial predictions of 18 forest communities were significantly improved by using either: (1) cross‐validation as a model selection method, (2) weighted absences, (3) limited absences, (4) predictors accounting for spatial autocorrelation, or (5) a factor variable accounting for interactions between all predictors. The final choice of model strategy should depend on the nature of the available data and the specific study aims. Statistical evaluation is useful in searching for the best modelling practice. However, one should not neglect to consider the shapes and interpretability of response curves, as well as the resulting spatial predictions in the final assessment.  相似文献   

18.
Thanks to its chemical plasticity, cysteine (Cys) is a very versatile player in proteins. A major determinant of Cys reactivity is pKa: the ability to predict it is deemed critical in redox bioinformatics. I considered different computational methods for pKa predictions and ultimately applied one (propka, ppka1) to various datasets; for all residues I assessed the effect of (1) hydrogen bonding, electrostatics and solvation on predictions and (2) protein mobility on pKa variability. Particularly for Cys, exposure and H-bond contributions heavily dictated propka predictions. The prominence of H-bond contributions was previously reported: this may explain the effectiveness of ppka1 (with Cys, tested in a benchmark). However ppka1 was also very sensitive to protein mobility; I assessed the effects of mobility on particularly large (compared to previous studies) datasets of structural ensembles; I found that exposed Cys presented the highest pKa variability, ascribable to correspondingly high H-bond fluctuations associated with protein flexibility. The benefit of including protein dynamics in pKa predictions was previously proposed, but empirical methods were never tested in this sense; instead, giving their outstanding speed, they could lend particularly well to this purpose. I devised a strategy combining short range molecular dynamics with ppka1; the protocol aimed to mitigate high ppka1 variability by including a “statistical view” of fast conformational changes. Tested in a benchmark, the strategy lead to improved performances. These results provide new insights on Cys bioinformatics (pKa prediction protocols) and Cys biology (effect of mobility on exposed Cys properties).  相似文献   

19.
Reverse engineering of gene regulatory networks has been an intensively studied topic in bioinformatics since it constitutes an intermediate step from explorative to causative gene expression analysis. Many methods have been proposed through recent years leading to a wide range of mathematical approaches. In practice, different mathematical approaches will generate different resulting network structures, thus, it is very important for users to assess the performance of these algorithms. We have conducted a comparative study with six different reverse engineering methods, including relevance networks, neural networks, and Bayesian networks. Our approach consists of the generation of defined benchmark data, the analysis of these data with the different methods, and the assessment of algorithmic performances by statistical analyses. Performance was judged by network size and noise levels. The results of the comparative study highlight the neural network approach as best performing method among those under study.  相似文献   

20.
生态监管是保障生态保护修复工程有效实施的重要途径,生态保护修复监督评估是我国当前维护生态环境安全的迫切需求和生态监管开展的有力抓手。在对国内外已有生态保护修复成效评估相关理论方法进行分析的基础上,围绕生态保护修复工程实施目标,结合生态监管要求,统筹考虑“山水林田湖草生命共同体”理念、生态保护修复工程类型、指标可操作性和兼容性等因素,综合运用频度分析法、专家咨询法和层次分析法等方法,提出了生态保护修复工程实施成效评估指标体系,构建了包括生态环境成效和工程管理成效的指标体系。其中,生态环境成效指标由共性指标和个性指标组成,共性指标包括生态系统格局、生态系统功能、生态干扰等方面的具体指标,个性指标包括森林保护修复工程、草原保护修复工程、湿地保护修复工程、矿山修复工程、防沙治沙工程、石漠化治理工程、水土流失治理工程、地质灾害治理工程、流域综合治理工程等不同工程类型的具体指标。工程管理成效指标主要包括工程绩效、工程实施及公众满意等方面的具体指标。在对各项指标评估的基础上,通过综合评估计算对生态保护修复工程实施成效进行评估。基于生态系统格局指数、生态系统功能指数、生态干扰指数、工程管理成效指数及不...  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号