Title of invention METHOD AND SYSTEM FOR DISTRIBUTED LATENT DIRICHLET ALLOCATION COMPUTATION USING ADDITION OF APPROXIMATE COUNTERS
Abstract Described herein is a data-parallel algorithm for topic modeling on a distributed system in which memory and communication bandwidth requirements are streamlined for distributed implementation. According to embodiments, a distributed LDA Gibbs sampling algorithm shares approximate counter values among the nodes of a distributed system. These approximate counter values are repeatedly aggregated and then shared again to perform the distributed LDA Gibbs sampling. To keep the shared counter values as approximate counter values of sixteen bits or less, approximate counter values are summed to produce aggregate approximate counter values. These small aggregate approximate counter values are shared between the nodes of the distributed system. As such, the addition of various types of approximate counters is described herein. Specifically, the addition of binary Morris approximate counters, general Morris approximate counters, and Csürös approximate counters is described in the context of distributed implementations of an LDA Gibbs sampling algorithm.
Publication No. US2017039265(A1) Publication date 2017.02.09
Application No. US201514821511 Filing date 2015.08.07
Applicant Oracle International Corporation Inventors Steele, Jr., Guy L.; Tristan, Jean-Baptiste
Classification G06F17/30; G06F17/27 Main classification G06F17/30
Agency Agent
Principal claim 1. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; a first computing device running an uncollapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising: receiving, from a second computing device, a first approximate counter value that corresponds to a particular counter, adding the first approximate counter value to a second approximate counter value that also corresponds to the particular counter to produce an aggregate approximate counter value, and using the aggregate approximate counter value as the value of the particular counter; and determining, from the sampler result data, one or more sets of correlated words.
Address Redwood Shores CA US
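
Illustrative note: the counter-addition step recited in the principal claim can be sketched for the binary Morris case. The Python below is a minimal sketch under stated assumptions, not the procedure claimed in the patent: it assumes a counter that stores only an exponent x and estimates the count as 2^x - 1, and it assumes a merge rule chosen so that the expected estimate of the result equals the sum of the two input estimates. The names estimate, increment, and add are hypothetical.

```python
import random


def estimate(x: int) -> int:
    """Count represented by a binary Morris counter holding exponent x."""
    return (1 << x) - 1


def increment(x: int, rng=random) -> int:
    """Bump the exponent with probability 2**-x (standard Morris update)."""
    return x + 1 if rng.random() < 2.0 ** -x else x


def add(x: int, y: int, rng=random) -> int:
    """Merge two counters into one aggregate counter (assumed rule).

    Keep the larger exponent hi and bump it by one with probability
    (2**lo - 1) / 2**hi, so that
    E[estimate(result)] == estimate(x) + estimate(y).
    """
    hi, lo = max(x, y), min(x, y)
    p = estimate(lo) / (1 << hi)
    return hi + 1 if rng.random() < p else hi


if __name__ == "__main__":
    local_counter = 0
    for _ in range(1000):
        local_counter = increment(local_counter)
    remote_counter = 9  # value received from another node (made up for the demo)
    merged = add(local_counter, remote_counter)
    print(local_counter, remote_counter, merged, estimate(merged))
```

Because only the exponent is stored and exchanged, the aggregate stays small: an exponent below 256 fits in eight bits, which is consistent with the abstract's goal of keeping shared counter values at sixteen bits or less.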