Learning a functional grammar of protein domains using natural language word embedding techniques
Authors: Daniel W. A. Buchan, David T. Jones
Affiliation: Department of Computer Science, University College London, London, UK
Abstract: In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are the tokens, which may be considered "words." Using all Pfam (Finn et al. 2016) domain assignments in InterPro (Finn et al. 2017), we observe that the embedding could be used to suggest putative Gene Ontology (GO) assignments for Pfam domains of unknown function.
Keywords: function prediction, machine learning, protein domains, semantic embedding, word2vec
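
The "proteins as sentences" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the protein names, the domain architectures, the `skipgram_pairs` helper, and the window size of 2 are all assumptions made for illustration, and the Pfam accessions serve only as example tokens. The sketch shows how ordered domain identifiers yield the (target, context) pairs that a skip-gram Word2vec model would be trained on.

```python
from collections import Counter

# Hypothetical multi-domain proteins, each written as an ordered "sentence"
# of Pfam accessions. Illustrative only, not data from the paper.
proteins = {
    "protA": ["PF00069", "PF00017", "PF00018"],
    "protB": ["PF00017", "PF00018"],
    "protC": ["PF00069", "PF07714"],
}


def skipgram_pairs(sentence, window=2):
    """Yield (target, context) pairs for tokens within `window` positions."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]


# Each protein contributes one sentence; domain order is preserved.
sentences = list(proteins.values())

# Count how often each (target, context) pair occurs across all sentences.
pair_counts = Counter(
    pair for sentence in sentences for pair in skipgram_pairs(sentence)
)

for (target, context), count in sorted(pair_counts.items()):
    print(target, context, count)
```

Domains that repeatedly share contexts produce overlapping training pairs, which is what pulls their vectors together in the embedding. In practice the same `sentences` list could be passed to an existing Word2vec implementation rather than counted by hand.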