Similar Literature
20 similar records found (search time: 15 ms)
1.
2.

Background  

Generally speaking, different classifiers tend to work well for certain types of data; conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons. For example, the order of performance may depend on the performance measure employed for such a comparison. In this work, we present a novel adaptive ensemble classifier, constructed by combining bagging and rank aggregation, that is capable of adaptively changing its performance depending on the type of data being classified. The attractive feature of the proposed classifier is its multi-objective nature: the classification results can be simultaneously optimized with respect to several performance measures, for example accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance, as judged on test samples, than a more naive approach that attempts to directly identify the optimal classifier based on the training-data performances of the individual classifiers.
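To make the bagging-plus-rank-aggregation idea concrete, here is a minimal Python sketch: several candidate classifiers are fitted on bootstrap resamples, each is ranked under accuracy, sensitivity and specificity, and the ranks are aggregated Borda-style. The specific classifiers, the validation split and the mean-rank aggregation are illustrative assumptions, not the authors' exact algorithm.

```python
# Sketch: rank aggregation over several performance measures (assumed scheme).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000), GaussianNB(),
          RandomForestClassifier(random_state=0)]
rng = np.random.default_rng(0)
scores = []
for m in models:
    # Bagging step: fit each candidate on a bootstrap resample.
    idx = rng.integers(0, len(X_tr), len(X_tr))
    m.fit(X_tr[idx], y_tr[idx])
    p = m.predict(X_val)
    scores.append([accuracy_score(y_val, p),
                   recall_score(y_val, p),                 # sensitivity
                   recall_score(y_val, p, pos_label=0)])   # specificity

# Rank aggregation: rank models under each measure, then average the ranks.
ranks = np.argsort(np.argsort(-np.asarray(scores), axis=0), axis=0)
best = models[int(np.mean(ranks, axis=1).argmin())]
print("selected:", type(best).__name__)
```

A full implementation would aggregate over many bagged replicates and could also combine the models' predictions rather than selecting a single winner.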

3.
Does knowing when mental arithmetic judgments are right, and when they are wrong, lead to more accurate judgments over time? We hypothesize that the successful detection of errors (and avoidance of false alarms) may contribute to the development of mental arithmetic performance. Insight into error detection abilities can be gained by examining the “calibration” of mental arithmetic judgments, that is, the alignment between confidence in judgments and the accuracy of those judgments. Calibration may be viewed as a measure of metacognitive monitoring ability. We conducted a developmental longitudinal investigation of the relationship between the calibration of children's mental arithmetic judgments and their performance on a mental arithmetic task. Annually between Grades 5 and 8, children completed a problem verification task in which they rapidly judged the accuracy of arithmetic expressions (e.g., 25+50 = 75) and rated their confidence in each judgment. Results showed that calibration was strongly related to concurrent mental arithmetic performance, that calibration continued to develop even as mental arithmetic accuracy approached ceiling, that poor calibration distinguished children with mathematics learning disability from both low and typically achieving children, and that better calibration in Grade 5 predicted larger gains in mental arithmetic accuracy between Grades 5 and 8. We propose that good calibration supports the implementation of cognitive control, leading to long-term improvement in mental arithmetic accuracy. Because mental arithmetic “fluency” is critical for higher-level mathematics competence, calibration of confidence in mental arithmetic judgments may represent a novel and important developmental predictor of future mathematics performance.
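As a concrete reading of "calibration", the toy snippet below computes a simple calibration index: the mean absolute gap between stated confidence and actual accuracy within confidence bins (0 indicates perfect alignment). The binning and the index itself are generic illustrations; the study's exact calibration statistic may differ.

```python
# Illustrative calibration index: mean |confidence - accuracy| per bin.
import numpy as np

confidence = np.array([0.9, 0.8, 0.95, 0.6, 0.7, 0.99, 0.55, 0.85])  # per judgment
correct    = np.array([1,   1,   1,    0,   1,   1,    0,    0])      # 1 = right

bins = np.digitize(confidence, [0.6, 0.7, 0.8, 0.9])
gaps = []
for b in np.unique(bins):
    m = bins == b
    gaps.append(abs(confidence[m].mean() - correct[m].mean()))
calibration_error = float(np.mean(gaps))
print(f"mean calibration gap: {calibration_error:.2f}")  # 0 = perfectly calibrated
```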

4.
Expanding digital data sources, including social media, online news articles and blogs, provide an opportunity to better understand the context and intensity of human-nature interactions, such as wildlife exploitation. However, online searches encompassing large taxonomic groups can generate vast datasets, which can be overwhelming to filter for relevant content without the use of automated tools. The variety of machine learning models available to researchers, and the need for manually labelled training data with an even balance of labels, can make applying these tools challenging. Here, we implement and evaluate a hierarchical text classification pipeline which brings together three binary classification tasks with increasingly specific relevancy criteria. Crucially, the hierarchical approach facilitates the filtering and structuring of a large dataset, of which relevant sources make up a small proportion. Using this pipeline, we also investigate how the accuracy with which text classifiers identify relevant and irrelevant texts is influenced by the use of different models, training datasets, and the classification task. To evaluate our methods, we collected data from Facebook, Twitter, Google and Bing search engines, with the aim of identifying sources documenting the hunting and persecution of bats (Chiroptera). Overall, the ‘state-of-the-art’ transformer-based models were able to identify relevant texts with an average accuracy of 90%, with some classifiers achieving accuracy of >95%. Whilst this demonstrates that application of more advanced models can lead to improved accuracy, comparable performance was achieved by simpler models when applied to longer documents and less ambiguous classification tasks. Hence, the benefits from using more computationally expensive models are dependent on the classification context. We also found that stratification of training data, according to the presence of key search terms, improved classification accuracy for less frequent topics within datasets, and therefore improves the applicability of classifiers to future data collection. Overall, whilst our findings reinforce the usefulness of automated tools for facilitating online analyses in conservation and ecology, they also highlight that the effectiveness and appropriateness of such tools is determined by the nature and volume of data collected, the complexity of the classification task, and the computational resources available to researchers.
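The cascade idea can be sketched in a few lines: three binary relevance filters trained for increasingly specific criteria, with only the texts accepted by one stage passed to the next. The classifiers, features and toy training examples below are placeholders, not the paper's pipeline.

```python
# Sketch of a hierarchical (cascaded) binary relevance pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def stage(texts, labels):
    """Fit one binary relevance filter (TF-IDF + logistic regression)."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

# Hypothetical tiny training sets, one per level of specificity:
s1 = stage(["weather news today", "bat hunting market"], [0, 1])      # about bats?
s2 = stage(["bat ecology paper", "bats sold as food"], [0, 1])        # exploitation?
s3 = stage(["historic bat trade", "bats hunted here now"], [0, 1])    # current event?

docs = ["bats hunted and sold at the market"]
kept = [d for d in docs if s1.predict([d])[0]]   # stage 1 filter
kept = [d for d in kept if s2.predict([d])[0]]   # stage 2 filter
kept = [d for d in kept if s3.predict([d])[0]]   # stage 3 filter
print(kept)  # only texts passing all three filters survive
```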

5.
We tested different consensus methods, built by combining binary classifiers (mostly machine-learning classifiers), as predictive tools for the presence–absence of marine phytoplankton species. The consensus methods were constructed by considering a combination of four methods (i.e., generalized linear models, random forests, boosting and support vector machines). Six different consensus methods were analyzed by taking into account six different ways of combining single-model predictions. Some of these methods are presented here for the first time. To evaluate the performance of the models, we considered eight phytoplankton species presence–absence data sets and data related to environmental variables. Some of the analyzed species are toxic, whereas others provoke water discoloration, which can cause alarm in the population. Besides the phytoplankton data sets, we tested the models on 10 well-known open-access data sets. We evaluated the models' performances over a test sample. For most (72%) of the data sets, a consensus method was the method with the lowest classification error. In particular, a consensus method that weighted single-model predictions according to single-model performances (the weighted-average prediction-error, or WA-PE, model) was the one that presented the lowest classification error most of the time. For the phytoplankton species, the errors of the WA-PE model ranged from 10% for Akashiwo sanguinea to 38% for Dinophysis acuminata. This study provides novel approaches to improve prediction accuracy in species distribution studies and, in particular, in those concerning marine phytoplankton species.
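A minimal sketch of a WA-PE-style consensus is shown below: each fitted model's probabilistic predictions are averaged with weights proportional to its estimated performance (here, one minus its error). The weighting scheme is an assumption based on the abstract's description; in a real analysis the weights would come from a held-out validation split rather than the test data used here for brevity.

```python
# Sketch: performance-weighted consensus of four binary classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(random_state=1),
          GradientBoostingClassifier(random_state=1),
          SVC(probability=True)]
probs, weights = [], []
for m in models:
    m.fit(X_tr, y_tr)
    probs.append(m.predict_proba(X_te)[:, 1])
    weights.append(1.0 - np.mean(m.predict(X_te) != y_te))  # 1 - error

w = np.array(weights) / np.sum(weights)            # normalized model weights
consensus = (w[:, None] * np.array(probs)).sum(axis=0) > 0.5
print("consensus error:", np.mean(consensus != y_te))
```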

6.
A recently proposed optimal Bayesian classification paradigm addresses optimal error rate analysis for small-sample discrimination, including optimal classifiers, optimal error estimators, and error estimation analysis tools with respect to the probability of misclassification under binary classes. Here, we address multi-class problems and optimal expected risk with respect to a given risk function, which are common settings in bioinformatics. We present Bayesian risk estimators (BRE) under arbitrary classifiers, the mean-square error (MSE) of arbitrary risk estimators under arbitrary classifiers, and optimal Bayesian risk classifiers (OBRC). We provide analytic expressions for these tools under several discrete and Gaussian models and present a new methodology to approximate the BRE and MSE when analytic expressions are not available. Of particular note, we present analytic forms for the MSE under Gaussian models with homoscedastic covariances, which are new even in binary classification.
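For orientation, here is one standard way to write the quantities involved; the notation is ours and follows common usage, not necessarily the paper's exact definitions.

```latex
% Expected risk of classifier \psi under the feature-label distribution
% parameterized by \theta, with cost c(i, y) for predicting class i
% when the true class is y:
R_\theta(\psi) = \sum_{i}\sum_{y} c(i, y)\, P_\theta\!\left(\psi(X) = i,\; Y = y\right)

% The Bayesian risk estimator is the posterior expectation of this risk
% given the sample S_n, and the OBRC minimizes it over classifiers:
\widehat{R}(\psi) = \mathbb{E}\!\left[\, R_\theta(\psi) \mid S_n \right],
\qquad
\psi_{\mathrm{OBRC}} = \arg\min_{\psi}\; \widehat{R}(\psi)
```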

7.
Classification is one of the most widely applied tasks in ecology. Ecologists have to deal with noisy, high-dimensional data that often are non-linear and do not meet the assumptions of conventional statistical procedures. To overcome this problem, machine-learning methods have been adopted as ecological classification methods. We compared five machine-learning based classification techniques (classification trees, random forests, artificial neural networks, support vector machines, and automatically induced rule-based fuzzy models) in a biological conservation context. Our case study was the ocellated turkey (Meleagris ocellata), a bird endemic to the Yucatan peninsula that has suffered considerable decreases in local abundance and distributional area during the last few decades. On a grid of 10 × 10 km cells superimposed on the peninsula, we analysed relationships between environmental and social explanatory variables and ocellated turkey abundance changes between 1980 and 2000. Abundance change was expressed both in three classes (decrease, no change, and increase) and in 14 more detailed classes. Modelling performance varied considerably between methods, with random forests and classification trees being the most efficient ones as measured by overall classification error and the normalised mutual information index. Artificial neural networks yielded the worst results, along with linear discriminant analysis, which was included as a conventional statistical approach. We not only evaluated classification accuracy but also characteristics such as time effort, classifier comprehensibility and method intricacy, aspects that determine the success of a classification technique among ecologists and conservation biologists as well as for communication with managers and decision makers. We recommend the combined use of classification trees and random forests due to the easy interpretability of classifiers and the high comprehensibility of the method.

8.
Data transformations prior to analysis may be beneficial in classification tasks. In this article we investigate a set of such transformations on 2D graph-data derived from facial images and their effect on classification accuracy in a high-dimensional setting. These transformations are low-variance in the sense that each involves only a fixed small number of input features. We show that classification accuracy can be improved when penalized regression techniques are employed, as compared to a principal component analysis (PCA) pre-processing step. In our data example, classification accuracy improves from 47% to 62% when switching from PCA to penalized regression. A second goal is to visualize the resulting classifiers. We develop importance plots highlighting the influence of coordinates in the original 2D space. Features used for classification are mapped to coordinates in the original images and combined into an importance measure for each pixel. These plots assist in assessing the plausibility of classifiers, interpretation of classifiers, and determination of the relative importance of different features.
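The penalized-regression-plus-importance-plot workflow can be sketched as follows: fit an L1-penalized logistic regression on the high-dimensional features, then fold coefficient magnitudes back onto their originating 2D coordinates. The feature-to-pixel mapping and the toy data below are invented for illustration.

```python
# Sketch: L1-penalized classification with per-pixel importance aggregation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, h, w = 80, 16, 16                       # samples and image size
X = rng.normal(size=(n, h * w))            # stand-in for graph-derived features
y = (X[:, 5] + X[:, 40] > 0).astype(int)   # toy signal at two "pixels"

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# Importance plot: absolute coefficient magnitude mapped back to pixel grid.
importance = np.abs(clf.coef_).reshape(h, w)
top = np.unravel_index(importance.argmax(), importance.shape)
print("most influential pixel:", top)
```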

9.

Background

Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task.

Results

We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. Experimental results, based on classification error rates computed using the Random Forest, k-Nearest Neighbor and Support Vector Machine classifiers, show that POS achieves better performance.

Conclusions

A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are used to produce the selected subset of genes.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-274) contains supplementary material, which is available to authorized users.
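In the spirit of POS, the toy function below scores a gene by the proportion of samples falling inside the interval where the two classes' expression ranges overlap: the smaller the proportion, the more discriminative the gene. The exact POS definition, and its outlier-robust masking rules, are given in the paper.

```python
# Toy gene score based on between-class expression overlap.
import numpy as np

def overlap_score(expr, labels):
    """Proportion of samples inside the overlap of the two class ranges."""
    a, b = expr[labels == 0], expr[labels == 1]
    lo, hi = max(a.min(), b.min()), min(a.max(), b.max())
    if lo > hi:                       # disjoint ranges: no overlap at all
        return 0.0
    return float(np.mean((expr >= lo) & (expr <= hi)))

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 20)
gene_good = np.concatenate([rng.normal(0, 1, 20), rng.normal(5, 1, 20)])
gene_bad  = rng.normal(0, 1, 40)      # same distribution in both classes
print(overlap_score(gene_good, labels), overlap_score(gene_bad, labels))
```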

10.
Today's acoustic monitoring devices are capable of recording and storing tremendous amounts of data. Until recently, the classification of animal vocalizations from field recordings has been relegated to qualitative approaches. For large-scale acoustic monitoring studies, qualitative approaches are very time-consuming and suffer from the bias of subjectivity. Recent developments in supervised learning techniques can provide rapid, accurate, species-level classification of bioacoustics data. We compared the classification performances of four supervised learning techniques (random forests, support vector machines, artificial neural networks, and discriminant function analysis) for five different classification tasks using bat echolocation calls recorded by a popular frequency-division bat detector. We found that all classifiers performed similarly in terms of overall accuracy, with the exception of discriminant function analysis, which had the lowest average performance metrics. Random forests had the advantage of high sensitivities, specificities, and predictive powers across the majority of classification tasks, and also provided metrics for determining the relative importance of call features in distinguishing between groups. Overall classification accuracy for each task was slightly lower than reported accuracies using calls recorded by time-expansion detectors. Myotis spp. were particularly difficult to separate; classifiers performed best when members of this genus were combined in genus-level classification and analyzed separately at the level of species. Additionally, we identified and ranked the relative contributions of all predictor features to classifier accuracy and found measurements of frequency, total call duration, and characteristic slope to be the most important contributors to classification success. We provide recommendations to maximize accuracy and efficiency when analyzing acoustic data, and suggest an application of automated bioacoustics monitoring to contribute to wildlife monitoring efforts.

11.
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often gives very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.
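One way to read the spreading construction for linear classifiers is sketched below: placing an isotropic Gaussian of variance s² on each sample point leaves the class means unchanged and adds s²·I to each class covariance, so an LDA-style discriminant can be recomputed analytically for any spread. This is our reading of the idea, not the paper's exact derivation.

```python
# Sketch: linear discriminant recomputed analytically under sample spreading.
import numpy as np

def spread_lda(X0, X1, s2):
    """LDA-style direction after adding spread variance s2 to each class."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    pooled = (np.cov(X0.T) + np.cov(X1.T)) / 2 + s2 * np.eye(X0.shape[1])
    w = np.linalg.solve(pooled, mu1 - mu0)      # discriminant direction
    b = -w @ (mu0 + mu1) / 2                    # midpoint threshold
    return w, b

rng = np.random.default_rng(3)
X0 = rng.normal(0, 1, (30, 2))
X1 = rng.normal(2, 1, (30, 2))
for s2 in (0.0, 1.0, 4.0):                      # increasing spread
    w, b = spread_lda(X0, X1, s2)
    err = np.mean(np.r_[X0 @ w + b > 0, X1 @ w + b <= 0])
    print(f"spread {s2}: training error {err:.2f}")
```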

12.
While feedforward neural networks have been widely accepted as effective tools for solving classification problems, the issue of finding the best network architecture remains unresolved, particularly so in real-world problem settings. We address this issue in the context of credit card screening, where it is important to not only find a neural network with good predictive performance but also one that facilitates a clear explanation of how it produces its predictions. We show that minimal neural networks with as few as one hidden unit provide good predictive accuracy, while having the added advantage of making it easier to generate concise and comprehensible classification rules for the user. To further reduce model size, a novel approach is suggested in which network connections from the input units to this hidden unit are removed by a straightforward pruning procedure. In terms of predictive accuracy, both the minimized neural networks and the rule sets generated from them are shown to compare favorably with other neural network based classifiers. The rules generated from the minimized neural networks are concise and thus easier to validate in a real-life setting.
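A minimal one-hidden-unit network with magnitude-based pruning of input connections might look like the following; the rule used here (drop the smallest input weights, then refit on the surviving inputs) is a common simple scheme standing in for the paper's procedure.

```python
# Sketch: one-hidden-unit network plus magnitude pruning of input connections.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=4)
net = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                    random_state=4).fit(X, y)

# Prune: keep only the input->hidden weights with the largest magnitude,
# then refit the minimal network on the surviving inputs.
w_in = net.coefs_[0].ravel()                  # one weight per input feature
keep = np.argsort(np.abs(w_in))[-4:]          # keep the 4 strongest inputs
pruned = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                       random_state=4).fit(X[:, keep], y)
print("full acc:", net.score(X, y), "pruned acc:", pruned.score(X[:, keep], y))
```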

13.
Errors in eye movements can be corrected during the ongoing saccade through in-flight modifications (i.e., online control), or by programming a secondary eye movement (i.e., offline control). In a reflexive saccade task, the oculomotor system can use extraretinal information (i.e., efference copy) online to correct errors in the primary saccade, and offline retinal information to generate a secondary corrective saccade. The purpose of this study was to examine the error correction mechanisms in the antisaccade task. The roles of extraretinal and retinal feedback in maintaining eye movement accuracy were investigated by presenting visual feedback at the spatial goal of the antisaccade. We found that online control for antisaccades is not affected by the presence of visual feedback: whether or not visual feedback was present, the duration of the deceleration interval was extended and significantly correlated with reduced antisaccade endpoint error. We postulate that the extended duration of deceleration is a feature of online control during volitional saccades that improves their endpoint accuracy. We found that secondary saccades were generated more frequently in the antisaccade task than in the reflexive saccade task. Furthermore, we found evidence for a greater contribution from extraretinal sources of feedback in programming the secondary “corrective” saccades in the antisaccade task. Nonetheless, secondary saccades were more corrective for the remaining antisaccade amplitude error in the presence of visual feedback of the target. Taken together, our results reveal a distinctive online error control strategy through an extension of the deceleration interval in the antisaccade task. Target feedback does not improve online control; rather, it improves the accuracy of secondary saccades in the antisaccade task.

14.
Animal tracking through Argos satellite telemetry has enormous potential to test hypotheses in animal behavior, evolutionary ecology, or conservation biology. Yet the applicability of this technique cannot be fully assessed because no clear picture exists as to the conditions influencing the accuracy of Argos locations. Latitude, type of environment, and transmitter movement are among the main candidate factors affecting accuracy. A posteriori data filtering can remove “bad” locations, but testing is still needed to refine filters. First, we evaluate experimentally the accuracy of Argos locations in a polar terrestrial environment (Nunavut, Canada), with both static and mobile transmitters transported by humans and coupled to GPS transmitters. We report static errors among the lowest published. However, the 68th error percentiles of mobile transmitters were 1.7 to 3.8 times greater than those of static transmitters. Second, we test how different filtering methods influence the quality of Argos location datasets. Accuracy of location datasets was best improved by keeping only locations of the best classes (LC3 and 2), while the Douglas Argos filter and a homemade speed filter yielded similar performance while retaining more locations. All filters effectively reduced the 68th error percentiles. Finally, we assess how location error affected, at six spatial scales, two common estimators of home-range size (a proxy of animal space-use behavior synthesizing movements): the minimum convex polygon and the fixed kernel estimator. Location error led to a sometimes dramatic overestimation of home-range size, especially at very local scales. We conclude that Argos telemetry is appropriate for studying medium-size terrestrial animals in polar environments, but recommend that location errors always be measured and evaluated against research hypotheses, and that data always be filtered before analysis. How movement speed of transmitters affects location error needs additional research.
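A homemade speed filter of the kind mentioned above can be written in a few lines: discard any location implying a travel speed above a threshold relative to the previously retained fix. The threshold, units and data below are invented for illustration.

```python
# Sketch: forward speed filter for telemetry locations.
import numpy as np

def speed_filter(times_h, x_km, y_km, vmax_kmh=10.0):
    """Return indices of locations kept by a simple forward speed filter."""
    kept = [0]
    for i in range(1, len(times_h)):
        j = kept[-1]
        dist = np.hypot(x_km[i] - x_km[j], y_km[i] - y_km[j])
        dt = times_h[i] - times_h[j]
        if dt > 0 and dist / dt <= vmax_kmh:
            kept.append(i)
    return kept

t = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([0.0, 5.0, 80.0, 12.0])   # 80 km in 1 h is implausible on foot
y = np.zeros(4)
print(speed_filter(t, x, y))           # the outlier at index 2 is dropped
```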

15.
Liu Z, Tan M. Biometrics 2008, 64(4):1155–1161.
In medical diagnosis, the diseased and nondiseased classes are usually unbalanced, and one class may be more important than the other depending on the diagnostic purpose. Most standard classification methods, however, are designed to maximize overall accuracy and cannot incorporate different costs for different classes explicitly. In this article, we propose a novel nonparametric method to directly maximize the weighted specificity and sensitivity of the receiver operating characteristic curve. Combining advances in machine learning, optimization theory, and statistics, the proposed method has excellent generalization properties and assigns different error costs to different classes explicitly. We present experiments that compare the proposed algorithms with support vector machines and regularized logistic regression using data from a study on HIV-1 protease as well as six publicly available datasets. Our main conclusion is that the performance of the proposed algorithm is significantly better in most cases than the other classifiers tested. A MATLAB software package is available upon request.
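The objective being maximized can be illustrated by sweeping the decision threshold of an ordinary probabilistic classifier to maximize w·sensitivity + (1−w)·specificity. The paper's method is nonparametric and considerably more sophisticated; this sketch only shows the weighted criterion itself.

```python
# Sketch: maximizing a weighted sum of sensitivity and specificity by
# threshold sweep on a probabilistic classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.85], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

w = 0.7                                    # weight on sensitivity
best_t, best_val = 0.5, -1.0
for t in np.linspace(0.05, 0.95, 19):
    pred = p >= t
    sens = np.mean(pred[y_te == 1])        # true positive rate
    spec = np.mean(~pred[y_te == 0])       # true negative rate
    val = w * sens + (1 - w) * spec
    if val > best_val:
        best_t, best_val = t, val
print(f"threshold {best_t:.2f} gives weighted objective {best_val:.2f}")
```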

16.
Vegetation maps are models of the real vegetation patterns and are considered important tools in conservation and management planning. Maps created through traditional methods can be expensive and time-consuming; thus, new, more efficient approaches are needed. The prediction of vegetation patterns using machine learning shows promise, but many factors may affect its performance. One important factor is the nature of the vegetation–environment relationship assessed and ecological redundancy. We used two datasets with known ecological redundancy levels (strength of the vegetation–environment relationship) to evaluate the performance of four machine learning (ML) classifiers (classification trees, random forests, support vector machines, and nearest neighbor). These models used climatic and soil variables as environmental predictors, with pretreatment of the datasets (principal component analysis and feature selection), and involved three spatial scales. We show that the ML classifiers produced more reliable results in regions where the vegetation–environment relationship is stronger, as opposed to regions characterized by redundant vegetation patterns. The pretreatment of datasets and reduction in prediction scale had a substantial influence on the predictive performance of the classifiers. The use of ML classifiers to create potential vegetation maps shows promise as a more efficient way of vegetation modeling. The difference in performance between areas with poorly versus well-structured vegetation–environment relationships shows that some level of understanding of the ecology of the target region is required prior to their application. Even in areas with poorly structured vegetation–environment relationships, it is possible to improve classifier performance by either pretreating the dataset or reducing the spatial scale of the predictions.

17.
Machine learning and statistical model based classifiers have increasingly been used with more complex and high dimensional biological data obtained from high-throughput technologies. Understanding the impact of various factors associated with large and complex microarray datasets on the predictive performance of classifiers is computationally intensive and underinvestigated, yet vital in determining the optimal number of biomarkers for various classification purposes aimed at improved detection, diagnosis, and therapeutic monitoring of diseases. We investigate the impact of microarray based data characteristics on the predictive performance of various classification rules using simulation studies. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. The optimal number of biomarkers for a classification problem should therefore be estimated taking account of the impact of all these factors. A database of average generalization errors was built for various combinations of these factors. This database can be used for estimating the optimal number of biomarkers for given levels of predictive accuracy as a function of these factors. Examples show that curves from actual biological data resemble those of simulated data with corresponding levels of data characteristics. An R package, optBiomarker, implementing the method is freely available for academic use from the Comprehensive R Archive Network (http://www.cran.r-project.org/web/packages/optBiomarker/).

18.
1. The predictive modelling approach to bioassessment estimates the macroinvertebrate assemblage expected at a stream site if it were in a minimally disturbed reference condition. The difference between expected and observed assemblages then measures the departure of the site from reference condition.
2. Most predictive models employ site classification, followed by discriminant function (DF) modelling, to predict the expected assemblage from a suite of environmental variables. Stepwise DF analysis is normally used to choose a single subset of DF predictor variables with a high accuracy for classifying sites. An alternative is to screen all possible combinations of predictor variables, in order to identify several ‘best’ subsets that yield good overall performance of the predictive model.
3. We applied best-subsets DF analysis to assemblage and environmental data from 199 reference sites in Oregon, U.S.A. Two sets of 66 best DF models containing between one and 14 predictor variables (that is, having model orders from one to 14) were developed, for five-group and 11-group site classifications.
4. Resubstitution classification accuracy of the DF models increased consistently with model order, but cross-validated classification accuracy did not improve beyond seventh- or eighth-order models, suggesting that the larger models were overfitted.
5. Overall predictive model performance at model training sites, measured by the root-mean-squared error of the observed/expected species richness ratio, also improved steadily with DF model order. But high-order DF models usually performed poorly at an independent set of validation sites, another sign of model overfitting.
6. Models selected by stepwise DF analysis showed evidence of overfitting and were outperformed by several of the best-subsets models.
7. The group separation strength of a DF model, as measured by Wilks' Λ, was more strongly correlated with overall predictive model performance at training sites than was DF classification accuracy.
8. Our results suggest improved strategies for developing reliable, parsimonious predictive models. We emphasise the value of independent validation data for obtaining a realistic picture of model performance. We also recommend assessing not just one or two, but several, candidate models based on their overall performance as well as the performance of their DF component.
9. We provide links to our free software for stepwise and best-subsets DF analysis.
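Best-subsets selection with cross-validated scoring can be sketched as below, with linear discriminant analysis standing in for DF modelling: every predictor subset of each order is scored by cross-validated accuracy, and the best subset per order is kept. Note that the exhaustive search grows combinatorially with the number of predictors.

```python
# Sketch: best-subsets discriminant analysis with cross-validated scoring.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=6)
for order in range(1, 4):
    # Score every subset of this size; keep the best by CV accuracy.
    best = max(combinations(range(X.shape[1]), order),
               key=lambda s: cross_val_score(LinearDiscriminantAnalysis(),
                                             X[:, list(s)], y, cv=5).mean())
    acc = cross_val_score(LinearDiscriminantAnalysis(),
                          X[:, list(best)], y, cv=5).mean()
    print(f"order {order}: best subset {best}, CV accuracy {acc:.2f}")
```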

19.
Genome-wide association studies (GWAS) are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could either be used as a “black box” to promote changes in lifestyle and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction typically rank single nucleotide polymorphisms (SNPs) by the p-value of their association with the disease, and use the top-associated SNPs as input to a classification algorithm. However, the predictive power of such methods is relatively poor. To improve the predictive power, we devised BootRank, which uses bootstrapping in order to obtain a robust prioritization of SNPs for use in predictive models. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data and results in a more robust set of SNPs and a larger number of enriched pathways being associated with the different diseases. Finally, we show that combining BootRank with seven different classification algorithms improves performance compared to previous studies that used the WTCCC data. Notably, diseases for which BootRank results in the largest improvements were recently shown to have more heritability than previously thought, likely due to contributions from variants with low minor allele frequency (MAF), suggesting that BootRank can be beneficial in cases where SNPs affecting the disease are poorly tagged or have low MAF. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
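A minimal sketch of bootstrap-based SNP prioritization in the spirit of BootRank: rank SNPs by a simple association statistic within each bootstrap replicate, then aggregate ranks across replicates. The association score and aggregation rule below are illustrative stand-ins; the actual BootRank statistic is defined in the paper.

```python
# Sketch: bootstrap rank aggregation for SNP prioritization.
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 50
G = rng.integers(0, 3, size=(n, p))             # genotypes coded 0/1/2
y = (0.8 * G[:, 3] + rng.normal(size=n) > 1).astype(int)  # SNP 3 is causal

rank_sum = np.zeros(p)
for _ in range(200):                             # bootstrap replicates
    idx = rng.integers(0, n, n)
    Gb, yb = G[idx], y[idx]
    # Association score: absolute difference in mean genotype between classes.
    score = np.abs(Gb[yb == 1].mean(axis=0) - Gb[yb == 0].mean(axis=0))
    rank_sum += np.argsort(np.argsort(-score))   # rank 0 = strongest
priority = np.argsort(rank_sum)                  # smallest summed rank first
print("top SNPs:", priority[:5])                 # SNP 3 should rank near the top
```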

20.
Alignment-free classifiers are especially useful in the functional classification of protein classes with variable homology and different domain structures. Thus, the Topological Indices to BioPolymers (TI2BioP) methodology (Agüero-Chapin et al., 2010), inspired by both the TOPS-MODE and MARCH-INSIDE methodologies, allows the calculation of simple topological indices (TIs) as alignment-free classifiers. These indices were derived from clustering the amino acids into four classes of hydrophobicity and polarity, revealing higher sequence-order information beyond the amino acid composition level. The predictive power of such TIs was evaluated for the first time on the RNase III family, chosen for the high diversity of its members in primary sequence and domain organization. Three non-linear models were developed for RNase III class prediction: a Decision Tree Model (DTM), an Artificial Neural Network (ANN) model and a Hidden Markov Model (HMM). The first two are alignment-free approaches, using TIs as input predictors. Their performances were compared with a non-classical HMM, modified according to our amino acid clustering strategy. The alignment-free models showed similar performances on the training and test sets, reaching values above 90% in overall classification. The non-classical HMM showed the highest classification rates, with values above 95% in training and 100% in test. Despite the higher accuracy of the HMM, the DTM offered a simpler RNase III classification at low computational cost. This simplicity was evaluated with respect to the HMM and ANN models for the functional annotation of a new bacterial RNase III class member, isolated and annotated by our group.
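The underlying alignment-free idea can be illustrated by collapsing a protein sequence into four physicochemical classes and deriving simple sequence-order descriptors from the class string. The class assignments and descriptors below are toy choices, not TI2BioP's actual topological indices.

```python
# Toy alignment-free descriptors from a 4-class amino acid alphabet.
import numpy as np

CLASSES = {  # crude 4-way grouping by hydrophobicity/polarity (assumed)
    **dict.fromkeys("AVLIMFWC", 0),   # hydrophobic
    **dict.fromkeys("STYNQ", 1),      # polar, uncharged
    **dict.fromkeys("KRH", 2),        # basic
    **dict.fromkeys("DE", 3),         # acidic
}

def descriptors(seq):
    s = [CLASSES[a] for a in seq if a in CLASSES]  # G/P ignored in this toy map
    comp = np.bincount(s, minlength=4) / len(s)    # class composition
    trans = np.zeros((4, 4))
    for a, b in zip(s, s[1:]):                     # transitions capture order
        trans[a, b] += 1                           # beyond composition alone
    return np.concatenate([comp, trans.ravel() / max(len(s) - 1, 1)])

print(descriptors("MKTLLVACDE").round(2))
```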
