

Clinical and pharmacogenomic data mining: 1. Generalized theory of expected information and application to the development of tools
Authors: Robson Barry
Institution: T. J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598, USA.
Abstract: New scientific problems, arising from the human genome project, are challenging the classical means of using statistics. Yet quantified knowledge in the form of rules and rule strengths based on real relationships in data, as opposed to expert opinion, is urgently required for researcher and physician decision support. The problem is that with many parameters, the space to be analyzed is high-dimensional. That is, the combinations of data to examine are subject to a combinatorial explosion as the number of possible events (entries, items, sub-records) (a), (b), (c), ... per record (a, b, c, ...) increases, and hence much of the space is sparsely populated. These combinatorial considerations are particularly problematic for identifying those associations called "Unicorn Events", which occur so much less often than expected that they are never observed and so cannot be counted. To cope with the combinatorial explosion, a novel numerical "bookkeeping" approach is taken to generate information terms relating to the combinatorial subsets of events (a, b, c, ...), and, most importantly, the zeta (ζ) function is employed. The incomplete zeta function zeta(s, n) with s = 1, in which frequencies of occurrence such as n = n(a, b, c, ...) determine the range of summation n, is argued to be the natural choice of information function. It emerges from Bayesian integration, taken over the distribution of possible values of information measures for sparse and ample data alike. Expected mutual information I(a;b;c) in nats (i.e., natural units analogous to bits but based on the natural logarithm), such as is available to the observer, is measured as, e.g., the difference zeta(s, o(a, b, c, ...)) - zeta(s, e(a, b, c, ...)), where o(a, b, c, ...) and e(a, b, c, ...) are, or relate to, the observed and expected frequencies of occurrence, respectively. For real values of s > 1 the qualitative impact of strongly (positively or negatively) ranked data is preserved despite several numerical approximations. As real s increases, the output of the information functions converges to three values, +1, 0, and -1 nats, representing a trinary logic system. For quantitative data, a useful ad hoc method is demonstrated for reporting sigma-normalized covariations in a manner analogous to mutual information, for significance comparison purposes. Finally, the potential ability to make use of mutual information in a complex biomedical study, and to include Bayesian prior information derived from statistical, tabular, anecdotal, and expert opinion, is briefly illustrated.
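The zeta-difference measure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation. It assumes that zeta(s, n) is the partial sum 1 + 2^(-s) + ... + n^(-s) (zero when n = 0) and that the observed and expected frequencies are passed in directly as integer counts. The abstract only says these arguments "are, or relate to" the frequencies, so the exact arguments used in the paper may differ. The function names `incomplete_zeta` and `expected_information` are hypothetical.

```python
import math


def incomplete_zeta(s: float, n: int) -> float:
    """Partial sum 1 + 2**-s + ... + n**-s; returns 0.0 when n == 0.

    Assumption: this partial sum is what the abstract calls the
    "incomplete Zeta function" zeta(s, n).
    """
    if n < 0:
        raise ValueError("n must be a non-negative integer count")
    return math.fsum(k ** -s for k in range(1, n + 1))


def expected_information(observed: int, expected: int, s: float = 1.0) -> float:
    """Difference zeta(s, observed) - zeta(s, expected), in nats.

    Assumption: raw integer counts are used as the arguments; real expected
    frequencies are usually non-integer and would need separate handling.
    """
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


if __name__ == "__main__":
    # Ample data: the value approaches the familiar log ratio ln(o / e).
    print(expected_information(1000, 500), math.log(1000 / 500))
    # "Unicorn event": never observed although expected three times.
    print(expected_information(0, 3))
    # Large s: outputs collapse towards +1, 0 and -1.
    print([round(expected_information(o, e, s=50.0), 6)
           for o, e in [(5, 0), (5, 7), (0, 7)]])
```

Under these assumptions, with s = 1 the partial sum is a harmonic number, so for large counts the difference behaves like ln(o/e), while for an event that is expected but never observed it stays finite and negative instead of diverging as a plain log ratio would. For large s every non-zero count contributes approximately 1, which reproduces the three-valued behaviour the abstract mentions.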
This article is indexed in PubMed and other databases.