Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study

Authors:	Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, Pedro J Ballester

Affiliation:	Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, China; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; Cancer Research Center of Marseille (Inserm U1068, UM105, IPC), 27 Boulevard Lei Roure, 13009 Marseille, France

Abstract:	Background: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and are thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These functions assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients.

Results: In this study we show that such a simple functional form is detrimental to the prediction performance of a scoring function, and that replacing linear regression with machine learning techniques such as random forest (RF) can improve prediction performance. We investigate the conditions for applying RF in various contexts and find that, given sufficient training samples, RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top-performing empirical scoring function, as the baseline for comparison.

Conclusions: Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvent the fixed functional form relating structural features to binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-291) contains supplementary material, which is available to authorized users.

Keywords:	Molecular docking; Binding affinity; Drug discovery; Machine learning
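
The abstract's central comparison, a fixed additive linear form versus a random forest fitted to the same features, can be illustrated with a small toy experiment. The sketch below is not the authors' code and does not use Cyscore features or binding data. It assumes a synthetic two-feature dataset with a deliberately non-linear target, and a simplified hand-rolled forest (bootstrap samples, one randomly chosen feature per split) written with the Python standard library only. All function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def make_data(n, rng):
    """Synthetic stand-in for (features, affinity): a non-linear target plus noise."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [4 * abs(x0 - 0.5) + (1.0 if x1 > 0.5 else 0.0) + rng.gauss(0, 0.1)
         for x0, x1 in X]
    return X, y


def fit_mlr(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y], solved by Gauss-Jordan elimination.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict_mlr(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def best_split(X, y, feat, min_leaf):
    """Threshold on one feature minimising the summed squared error, or None."""
    n = len(y)
    order = sorted(range(n), key=lambda i: X[i][feat])
    total, total_sq = sum(y), sum(v * v for v in y)
    left_sum = left_sq = 0.0
    best = None
    for pos, i in enumerate(order[:-1], start=1):
        left_sum += y[i]
        left_sq += y[i] ** 2
        nxt = order[pos]
        if pos < min_leaf or n - pos < min_leaf or X[i][feat] == X[nxt][feat]:
            continue
        right_sum = total - left_sum
        sse = (left_sq - left_sum ** 2 / pos) \
            + (total_sq - left_sq - right_sum ** 2 / (n - pos))
        if best is None or sse < best[0]:
            best = (sse, (X[i][feat] + X[nxt][feat]) / 2)
    return best


def build_tree(X, y, rng, depth=0, max_depth=8, min_leaf=3):
    """Regression tree; each split considers a single randomly chosen feature."""
    feat = rng.randrange(len(X[0]))
    split = best_split(X, y, feat, min_leaf) if depth < max_depth else None
    if split is None:
        return statistics.fmean(y)  # leaf: mean target of the samples in it
    thr = split[1]
    left = [i for i in range(len(y)) if X[i][feat] <= thr]
    right = [i for i in range(len(y)) if X[i][feat] > thr]
    return (feat, thr,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       rng, depth + 1, max_depth, min_leaf),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       rng, depth + 1, max_depth, min_leaf))


def predict_tree(node, x):
    while isinstance(node, tuple):
        feat, thr, left, right = node
        node = left if x[feat] <= thr else right
    return node


def fit_forest(X, y, n_trees=30, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]  # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict_forest(forest, x):
    return statistics.fmean([predict_tree(t, x) for t in forest])


def rmse(pred, true):
    return math.sqrt(statistics.fmean([(p - t) ** 2 for p, t in zip(pred, true)]))


rng = random.Random(42)
X_train, y_train = make_data(300, rng)
X_test, y_test = make_data(200, rng)
coef = fit_mlr(X_train, y_train)
forest = fit_forest(X_train, y_train)
mlr_rmse = rmse([predict_mlr(coef, x) for x in X_test], y_test)
rf_rmse = rmse([predict_forest(forest, x) for x in X_test], y_test)
print(f"MLR test RMSE: {mlr_rmse:.3f}")
print(f"RF  test RMSE: {rf_rmse:.3f}")
```

Because the toy target contains an absolute-value term and a step, no additive linear model can represent it, while the averaged trees approximate it piecewise. This mirrors the abstract's argument in miniature only. It says nothing about the size of the effect on real protein-ligand data.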
