Similar Literature
 20 similar documents found (search time: 31 ms)
1.
In recent years, more and more high-throughput data sources useful for protein complex prediction have become available (e.g., gene sequence, mRNA expression, and interactions). The integration of these different data sources can be challenging. Recently, it has been recognized that kernel-based classifiers are well suited for this task. However, the different kernels (data sources) are often combined using equal weights. Although several methods have been developed to optimize kernel weights, no large-scale example of an improvement in classifier performance has been shown yet. In this work, we employ an evolutionary algorithm to determine weights for a larger set of kernels by optimizing a criterion based on the area under the ROC curve. We show that setting the right kernel weights can indeed improve performance. We compare this to the existing kernel weight optimization methods (i.e., (regularized) optimization of the SVM criterion or aligning the kernel with an ideal kernel) and find that these do not result in a significant performance improvement and can even cause a decrease in performance. Results also show that an expert approach of assigning high weights to features with high individual performance is not necessarily the best strategy.
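The idea of combining kernels with unequal weights can be illustrated with a minimal sketch. It assumes each data source has already been turned into a Gram matrix of the same size; the function name `combine_kernels`, the toy matrices, and the weight values are hypothetical and are not taken from the paper, and the evolutionary search over weights described in the abstract is not reproduced here.

```python
def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized Gram matrices.

    Weights must be non-negative so the result stays a valid kernel;
    they are normalised to sum to 1 (an illustrative convention).
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += (w / total) * kernel[i][j]
    return combined


# Two toy Gram matrices standing in for two data sources (hypothetical values).
K_sequence = [[1.0, 0.2], [0.2, 1.0]]
K_expression = [[1.0, 0.8], [0.8, 1.0]]

equal_weights = combine_kernels([K_sequence, K_expression], [1, 1])
tuned_weights = combine_kernels([K_sequence, K_expression], [1, 3])
```

A weight optimizer would wrap this function: propose a weight vector, train a classifier on the combined matrix, score it by cross-validated AUC, and keep the best vector.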

2.
Proteins do not carry out their functions alone. Instead, they often act by participating in macromolecular complexes and play different functional roles depending on the other members of the complex. It is therefore interesting to identify co-complex relationships. Although protein complexes can be identified in a high-throughput manner by experimental technologies such as affinity purification coupled with mass spectrometry (APMS), these large-scale datasets often suffer from high false positive and false negative rates. Here, we present a computational method that predicts co-complexed protein pair (CCPP) relationships using kernel methods from heterogeneous data sources. We show that a diffusion kernel based on random walks on the full network topology yields good performance in predicting CCPPs from protein interaction networks. In the setting of direct ranking, a diffusion kernel performs much better than the mutual clustering coefficient. In the setting of SVM classifiers, a diffusion kernel performs much better than a linear kernel. We also show that combining complementary information improves the performance of our CCPP recognizer. A summation of three diffusion kernels based on two-hybrid, APMS, and genetic interaction networks and three sequence kernels achieves better performance than the sequence kernels or diffusion kernels alone. Inclusion of additional features achieves a still better ROC(50) of 0.937. Assuming a negative-to-positive ratio of 600:1, the final classifier achieves 89.3% coverage at an estimated false discovery rate of 10%. Finally, we applied our prediction method to two recently described APMS datasets. We find that our predicted positives are highly enriched with CCPPs that are identified by both datasets, suggesting that our method successfully identifies true CCPPs. An SVM classifier trained from heterogeneous data sources provides accurate predictions of CCPPs in yeast. This computational method thereby provides an inexpensive method for identifying protein complexes that extends and complements high-throughput experimental data.
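A diffusion kernel on a network can be sketched as follows. This is a minimal illustration assuming one common definition, K = exp(-beta * L) with L = D - A the graph Laplacian; the abstract does not state the exact form the authors used, and the function names, the `beta` value, and the toy graphs are hypothetical. The matrix exponential is approximated by a truncated Taylor series, which is adequate only for small graphs and moderate `beta`.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adjacency, beta, terms=40):
    """Approximate exp(-beta * L) for the Laplacian L = D - A."""
    n = len(adjacency)
    # Laplacian: node degree on the diagonal, minus the adjacency matrix.
    laplacian = [[(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j]
                  for j in range(n)] for i in range(n)]
    m = [[-beta * laplacian[i][j] for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    # Taylor series: exp(M) = I + M + M^2/2! + M^3/3! + ...
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Two proteins joined by a single interaction.
K_pair = diffusion_kernel([[0, 1], [1, 0]], beta=0.5)

# A three-protein path 0 - 1 - 2: proteins 0 and 2 never interact directly.
K_path = diffusion_kernel([[0, 1, 0], [1, 0, 1], [0, 1, 0]], beta=0.5)
```

In the path example, proteins 0 and 2 still receive a non-zero similarity because the kernel accounts for indirect paths, which is the property that separates it from a kernel built on direct neighbours only.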

3.
The receiver operating characteristic (ROC) curve is an important tool for the evaluation and comparison of predictive models when the outcome is binary. If the class membership of the outcomes is known, an ROC curve can be constructed for a model, and the ROC curve with the greater area under the curve indicates better performance. In practice, however, reference standards are often imperfect, meaning that the class membership of every data point is not fully determined. This situation is especially prevalent in high-throughput biomedical data because obtaining perfect reference standards for all data points is either too costly or technically impractical. To construct ROC curves for these data, the common practice is to either ignore the uncertainties in the references or remove data points with high uncertainties. Such approaches may bias the ROC curves and generate misleading results in method evaluation. Here we present a framework to incorporate membership uncertainties into the construction of the ROC curve, termed the expected ROC or “eROC” curve. We develop an efficient procedure for the estimation of the eROC curve. The advantages of using the eROC curve are demonstrated using simulated and real data.

4.
Lloyd CJ. Biometrics 2000, 56(3): 862-867
The performance of a diagnostic test is summarized by its receiver operating characteristic (ROC) curve. Under quite natural assumptions about the latent variable underlying the test, the ROC curve is convex. Empirical data on a test's performance often comes in the form of observed true positive and false positive relative frequencies under varying conditions. This paper describes a family of regression models for analyzing such data. The underlying ROC curves are specified by a quality parameter delta and a shape parameter mu and are guaranteed to be convex provided delta > 1. Both the position along the ROC curve and the quality parameter delta are modeled linearly with covariates at the level of the individual. The shape parameter mu enters the model through the link functions log(p^mu) - log(1 - p^mu) of a binomial regression and is estimated either by search or from an appropriate constructed variate. One simple application is to the meta-analysis of independent studies of the same diagnostic test, illustrated on some data of Moses, Shapiro, and Littenberg (1993). A second application, to so-called vigilance data, is given, where ROC curves differ across subjects and modeling of the position along the ROC curve is of primary interest.

5.
The development of high-throughput technology has generated a massive amount of high-dimensional data, much of which is of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms are only applicable to continuous data. In this paper, we propose a novel method for sparse support vector machines (SVMs) with L_p (p < 1) regularization. Efficient algorithms (LpSVM) are developed for learning a classifier that is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated by maximizing the area under the ROC curve (AUC) on the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. The software package in Matlab is available upon request.

6.
7.
8.
Summary This article considers receiver operating characteristic (ROC) analysis for bivariate marker measurements. The research interest is to extend tools and rules from the univariate marker setting to the bivariate marker setting for evaluating the predictive accuracy of markers using a tree‐based classification rule. Using an and–or classifier, an ROC function together with a weighted ROC function (WROC) and their conjugate counterparts are proposed for examining the performance of bivariate markers. The proposed functions evaluate the performance of and–or classifiers among all possible combinations of marker values, and are ideal measures for understanding the predictability of biomarkers in the target population. Specific features of ROC and WROC functions and other related statistics are discussed in comparison with the familiar properties for a univariate marker. Nonparametric methods are developed for estimating ROC‐related functions, the (partial) area under the curve, and the concordance probability. With emphasis on the average performance of markers, the proposed procedures and inferential results are useful for evaluating marker predictability based on single or bivariate marker (or test) measurements with different choices of markers, and for evaluating different and–or combinations in classifiers. The inferential results developed in this article also extend to multivariate markers with a sequence of arbitrarily combined and–or classifiers.

9.
Summary In estimation of the ROC curve, when the true disease status is subject to nonignorable missingness, the observed likelihood involves the missing mechanism given by a selection model. In this article, we proposed a likelihood‐based approach to estimate the ROC curve and the area under the ROC curve when the verification bias is nonignorable. We specified a parametric disease model in order to make the nonignorable selection model identifiable. With the estimated verification and disease probabilities, we constructed four types of empirical estimates of the ROC curve and its area based on imputation and reweighting methods. In practice, a reasonably large sample size is required to estimate the nonignorable selection model in our settings. Simulation studies showed that all four estimators of ROC area performed well, and imputation estimators were generally more efficient than the other estimators proposed. We applied the proposed method to a data set from research in Alzheimer's disease.

10.
An issue for class‐imbalanced learning is which assessment metric should be employed. So far, the precision‐recall curve (PRC) has rarely been used in practice compared with its alternative, the receiver operating characteristic (ROC) curve. This study investigates the performance of the PRC as the evaluation criterion for class‐imbalanced data and focuses on comparing the PRC with the ROC curve. The advantages of the PRC over the ROC curve in assessing class‐imbalanced data are also investigated and tested on our proposed algorithm by tuning all model parameters in simulation studies and real data examples. The results show that the PRC is competitive with the ROC curve as a performance measure for tuning model parameters on class‐imbalanced data. The PRC can be considered an effective alternative assessment for preprocessing skewed data (such as variable selection) and building a classifier in class‐imbalanced learning.
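Why precision-recall and ROC quantities can tell different stories on imbalanced data is easy to see at a single threshold. The sketch below is an illustration only: the function name, the 10-to-1000 class ratio, and the scores are hypothetical and do not come from the study.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (precision, recall, false positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn)            # y-axis of ROC, x-axis of PRC
    fpr = fp / (fp + tn)               # x-axis of ROC
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return precision, recall, fpr


# 10 positives and 1000 negatives: 8 positives and 40 negatives score high.
labels = [1] * 10 + [0] * 1000
scores = [0.9] * 8 + [0.1] * 2 + [0.9] * 40 + [0.1] * 960

precision, recall, fpr = rates_at_threshold(labels, scores, threshold=0.5)
```

Here the ROC operating point looks strong (recall 0.8 at a false positive rate of 0.04), while precision is only 8/48, about 0.17, because the 40 false positives outnumber the 8 true positives.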

11.
Apoptosis, a type of cell death, is necessary for maintaining tissue homeostasis and removing malignant cells. An interrupted apoptosis process contributes to carcinogenesis, developmental defects, autoimmune diseases and neurological disorders. Due to the complexity of the process, the molecular dynamics and relative interactions of individual proteins responsible for the activation or inhibition of apoptosis should be researched systematically. In this study, we integrate known protein interactions from the databases DIP, IntAct, MINT, HPRD and BioGRID using a Naïve Bayes classifier. The receiver operating characteristic (ROC) curve, with an area under the ROC curve (AUC) of 0.797, indicates good predictive performance. We then predict the global human apoptotic protein interaction network. Within it, we not only identify the already known interactions of caspases (caspase-8/-10, caspase-9, caspase-3/-6/-7) and the Bcl-2 family, but also reveal that Bid can interact with casein kinases (CSK21/22/2B, KC1A, KC1E); that both B2LA1 and B2CL2 can interact with Bid, Bax and Bak; and that caspase-8 interacts with autophagic proteins (MLP3B, MLP3A and LRRK2). Consequently, we take an initial step toward developing the web service IntApop, which provides a platform for apoptosis researchers, systems biologists and translational clinician scientists to predict apoptotic protein interactions in humans. In addition, the interaction network can be visualized online, making it a widely applicable systems biology tool for apoptosis and cancer researchers.

12.
MOTIVATION: An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease classification. Thus there is a need for developing statistical methods that can efficiently use such high-throughput genomic data, select biomarkers with discriminant power and construct classification rules. The ROC (receiver operator characteristic) technique has been widely used in disease classification with low-dimensional biomarkers because (1) it does not assume a parametric form of the class probability as required for example in the logistic regression method; (2) it accommodates case-control designs and (3) it allows treating false positives and false negatives differently. However, due to computational difficulties, the ROC-based classification has not been used with microarray data. Moreover, the standard ROC technique does not incorporate built-in biomarker selection. RESULTS: We propose a novel method for biomarker selection and classification using the ROC technique for microarray data. The proposed method uses a sigmoid approximation to the area under the ROC curve as the objective function for classification and the threshold gradient descent regularization method for estimation and biomarker selection. Tuning parameter selection based on the V-fold cross validation and predictive performance evaluation are also investigated. The proposed approach is demonstrated with a simulation study, the Colon data and the Estrogen data. The proposed approach yields parsimonious models with excellent classification performance.
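The sigmoid approximation to the AUC mentioned above can be sketched in a few lines. The empirical AUC counts the fraction of case-control pairs that are ranked correctly, which is a step function of the scores and therefore not differentiable; replacing the step with a sigmoid gives a smooth surrogate. The function names, the bandwidth parameter `h`, and the toy scores are assumptions made for illustration, and the threshold gradient descent step from the abstract is not shown.

```python
import math


def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def empirical_auc(pos_scores, neg_scores):
    """Fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos_scores) * len(neg_scores))


def sigmoid_auc(pos_scores, neg_scores, h=0.1):
    """Smooth surrogate: the pairwise step is replaced by a sigmoid."""
    total = sum(stable_sigmoid((p - n) / h)
                for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


cases = [0.9, 0.8, 0.4]       # hypothetical scores for diseased samples
controls = [0.7, 0.3, 0.2]    # hypothetical scores for healthy samples

exact = empirical_auc(cases, controls)
smooth = sigmoid_auc(cases, controls, h=0.1)
nearly_exact = sigmoid_auc(cases, controls, h=0.001)
```

As `h` shrinks the surrogate approaches the empirical AUC, at the cost of steeper gradients, so in practice the bandwidth trades off fidelity against ease of optimization.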

13.
Evaluation of diagnostic performance is typically based on the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as its summary index. The partial area under the curve (pAUC) is an alternative index focusing on the range of practical/clinical relevance. One of the problems preventing more frequent use of the pAUC is the perceived loss of efficiency in cases of noncrossing ROC curves. In this paper, we investigated statistical properties of comparisons of two correlated pAUCs. We demonstrated that outside of the classic model there are practically reasonable ROC types for which comparisons of noncrossing concave curves would be more powerful when based on a part of the curve rather than the entire curve. We argue that this phenomenon stems in part from the exclusion of noninformative parts of the ROC curves that resemble straight lines. We conducted extensive simulation studies in families of binormal, straight‐line, and bigamma ROC curves. We demonstrated that comparison of pAUCs is statistically more powerful than comparison of full AUCs when ROC curves are close to a “straight line”. For less flat binormal ROC curves, an increase in the integration range often leads to a disproportional increase in the difference between pAUCs, thereby contributing to an increase in statistical power. Thus, efficiency of differences in pAUCs of noncrossing ROC curves depends on the shape of the curves, and for families of ROC curves that are nearly straight‐line shaped, such as bigamma ROC curves, there are multiple practical scenarios in which comparisons of pAUCs are preferable.
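For readers unfamiliar with the pAUC, a minimal sketch of its computation follows. It assumes the ROC curve is given as a list of (false positive rate, true positive rate) points and integrates it by the trapezoidal rule over the range from 0 to a chosen `fpr_max`; the function name and the example curve are hypothetical, and no standardization of the pAUC is applied.

```python
def partial_auc(roc_points, fpr_max=1.0):
    """Trapezoidal area under an ROC curve for FPR in [0, fpr_max].

    roc_points: (fpr, tpr) pairs sorted by fpr, running from (0, 0) to (1, 1).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Cut the last segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


chance_line = [(0.0, 0.0), (1.0, 1.0)]
example_curve = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.9), (1.0, 1.0)]

full_area = partial_auc(example_curve)              # ordinary AUC
low_fpr_area = partial_auc(example_curve, 0.2)      # pAUC for FPR <= 0.2
chance_low_fpr = partial_auc(chance_line, 0.2)      # pAUC of a useless test
```

Restricting the range to FPR at most 0.2 keeps only the steep early part of the example curve, which is where it differs most from the chance line.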

14.
15.

Background  

The receiver operating characteristic (ROC) curve is a fundamental tool for assessing the discriminant performance of not only a single marker but also a score function combining multiple markers. The area under the ROC curve (AUC) for a score function measures the intrinsic ability of the score function to discriminate between controls and cases. Recently, the partial AUC (pAUC) has received more attention than the AUC because it can focus on a suitable range of the false positive rate according to the clinical situation. However, existing pAUC-based methods only handle a few markers and do not take nonlinear combinations of markers into consideration.

16.
Rodenberg C, Zhou XH. Biometrics 2000, 56(4): 1256-1262
A receiver operating characteristic (ROC) curve is commonly used to measure the accuracy of a medical test. It is a plot of the true positive fraction (sensitivity) against the false positive fraction (1-specificity) for increasingly stringent positivity criteria. Bias can occur in estimation of an ROC curve if only some of the tested patients are selected for disease verification and if analysis is restricted only to the verified cases. This bias is known as verification bias. In this paper, we address the problem of correcting for verification bias in estimation of an ROC curve when the verification process and efficacy of the diagnostic test depend on covariates. Our method applies the EM algorithm to ordinal regression models to derive ML estimates for ROC curves as a function of covariates, adjusted for covariates affecting the likelihood of being verified. Asymptotic variance estimates are obtained using the observed information matrix of the observed data. These estimates are derived under the missing-at-random assumption, which means that selection for disease verification depends only on the observed data, i.e., the test result and the observed covariates. We also address the issues of model selection and model checking. Finally, we illustrate the proposed method on data from a two-phase study of dementia disorders, where selection for verification depends on the screening test result and age.
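The definition of the ROC curve given above (true positive fraction against false positive fraction as the positivity criterion varies) can be made concrete with a short sketch. It assumes fully verified data, so it does not include the verification-bias correction that is the subject of the paper; the function name and the toy labels and scores are hypothetical.

```python
def empirical_roc(labels, scores):
    """Return (FPF, TPF) points obtained by sweeping the positivity threshold.

    labels: 1 for diseased, 0 for non-diseased (all assumed verified).
    scores: test results, where higher values suggest disease.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # strictest criterion: nobody is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc = empirical_roc(labels, scores)
```

Each successive point corresponds to a more lenient threshold, so the curve moves from (0, 0) toward (1, 1). Restricting `labels` and `scores` to verified patients only, without adjustment, is exactly the situation in which the bias described in the abstract arises.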

17.

Background  

As in many different areas of science and technology, most important problems in bioinformatics rely on the proper development and assessment of binary classifiers. A generalized assessment of the performance of binary classifiers is typically carried out through the analysis of their receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) constitutes a popular indicator of the performance of a binary classifier. However, the assessment of the statistical significance of the difference between any two classifiers based on this measure is not a straightforward task, since not many freely available tools exist. Most existing software is either not free, difficult to use or not easy to automate when a comparative assessment of the performance of many binary classifiers is intended. This constitutes the typical scenario for the optimization of parameters when developing new classifiers and also for their performance validation through the comparison to previous art.

18.
Hologram quantitative structure-activity relationships (HQSAR) were applied to a data set of 41 cruzain inhibitors. The best HQSAR model (Q^2 = 0.77; R^2 = 0.90), employing Surflex-Sim as the training and test set generator, was obtained using atoms, bonds, and connections as fragment distinctions and 4-7 as fragment size. This model was then used to predict the potencies of 12 test set compounds, giving a satisfactory predictive R^2 value of 0.88. The contribution maps obtained from the best HQSAR model are in agreement with the biological activities of the study compounds. The Trypanosoma cruzi cruzain shares high similarity with the mammalian homolog cathepsin L. The selectivity toward cruzain was checked against a database of 123 compounds, which corresponds to the 41 cruzain inhibitors used in the HQSAR model development plus 82 cathepsin L inhibitors. We screened these compounds by ROCS (Rapid Overlay of Chemical Structures), a Gaussian-shape volume overlap filter that can rapidly identify shapes that match the query molecule. Remarkably, ROCS was able to rank the first 37 hits as being only cruzain inhibitors. In addition, the area under the curve (AUC) obtained with ROCS was 0.96, indicating that the method was very efficient at distinguishing between cruzain and cathepsin L inhibitors.

19.
Combining diagnostic test results to increase accuracy   Total citations: 4, self-citations: 0, citations by others: 4
When multiple diagnostic tests are performed on an individual or multiple disease markers are available it may be possible to combine the information to diagnose disease. We consider how to choose linear combinations of markers in order to optimize diagnostic accuracy. The accuracy index to be maximized is the area or partial area under the receiver operating characteristic (ROC) curve. We propose a distribution-free rank-based approach for optimizing the area under the ROC curve and compare it with logistic regression and with classic linear discriminant analysis (LDA). It has been shown that the latter method optimizes the area under the ROC curve when test results have a multivariate normal distribution for diseased and non-diseased populations. Simulation studies suggest that the proposed non-parametric method is efficient when data are multivariate normal. The distribution-free method is generalized to a smooth distribution-free approach to: (i) accommodate some reasonable smoothness assumptions; (ii) incorporate covariate effects; and (iii) yield optimized partial areas under the ROC curve. This latter feature is particularly important since it allows one to focus on a region of the ROC curve which is of most relevance to clinical practice. Neither logistic regression nor LDA necessarily maximizes partial areas. The approaches are illustrated on two cancer datasets, one involving serum antigen markers for pancreatic cancer and the other involving longitudinal prostate specific antigen data.
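A rank-based search for a linear combination of two markers can be sketched as follows. This is a minimal illustration assuming a plain grid search over a single coefficient `alpha` in the combinations Y1 + alpha*Y2 and alpha*Y1 + Y2, scored by the Mann-Whitney (rank-based) AUC; the grid resolution, function names, and toy marker values are hypothetical, and the smoothed and covariate-adjusted extensions from the abstract are not covered.

```python
def rank_auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of the AUC; ties count one half."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos_scores) * len(neg_scores))


def best_linear_combination(diseased, healthy, steps=40):
    """Grid-search weights (w1, w2) maximising the AUC of w1*Y1 + w2*Y2.

    diseased, healthy: lists of (y1, y2) marker pairs.
    Only the direction of (w1, w2) matters, so one weight is fixed at 1
    and the other is varied over [-1, 1].
    """
    best_auc, best_weights = -1.0, None
    for k in range(steps + 1):
        alpha = -1.0 + 2.0 * k / steps
        for w1, w2 in ((1.0, alpha), (alpha, 1.0)):
            pos = [w1 * y1 + w2 * y2 for y1, y2 in diseased]
            neg = [w1 * y1 + w2 * y2 for y1, y2 in healthy]
            auc = rank_auc(pos, neg)
            if auc > best_auc:
                best_auc, best_weights = auc, (w1, w2)
    return best_auc, best_weights


# Hypothetical two-marker data: neither marker separates the groups alone.
diseased = [(1.0, 0.2), (0.3, 1.1), (0.9, 0.8)]
healthy = [(0.5, 0.4), (0.8, 0.1), (0.2, 0.7)]

marker1_auc = rank_auc([y1 for y1, _ in diseased], [y1 for y1, _ in healthy])
marker2_auc = rank_auc([y2 for _, y2 in diseased], [y2 for _, y2 in healthy])
combined_auc, weights = best_linear_combination(diseased, healthy)
```

Because the grid includes `alpha = 0`, each single marker is one of the candidates, so the combined AUC on the same data can never be lower than that of either marker alone. An honest estimate of the gain would need a held-out sample, since the search is optimistic on the data it was tuned on.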

20.
A critical step for DNA array analysis is data filtration, which can reduce thousands of detected signals to limited sets of genes. Commonly accepted rules for such filtration are still absent. We present a rational approach, based on thresholding of intensities with cutoff levels that are estimated by receiver operating characteristic (ROC) analysis. The technique compares test results with known distributions of positive and negative signals. We apply the method to Atlas cDNA arrays, GeneFilters, and Affymetrix GeneChip. ROC analysis demonstrates similarities in the distribution of false and true positive data for these different systems. We illustrate the estimation of an optimal cutoff level for intensity-based filtration, providing the highest ratio of true to false signals. For GeneChip arrays, we derived filtration thresholds consistent with the reported data based on replicate hybridizations. Intensity-based filtration optimized with ROC combined with other types of filtration (for example, based on significances of differences and/or ratios), should improve DNA array analysis. ROC methodology is also demonstrated for comparison of the performance of different types of arrays, imagers, and analysis software.
