Title of invention DATA-PARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING
Abstract A novel data-parallel algorithm is presented for topic modeling on highly parallel hardware architectures. The algorithm is a Markov chain Monte Carlo algorithm used to estimate the parameters of the LDA topic model. It is based on a highly parallel partially-collapsed Gibbs sampler, but replaces a stochastic step that draws from a distribution with an optimization step that computes the mean of the distribution directly and deterministically. The algorithm is correct and statistically performant, and it is faster than state-of-the-art algorithms because it can exploit massive parallelism when run on a highly parallel architecture, such as a GPU. Furthermore, the partially-collapsed Gibbs sampler converges about as fast as the collapsed Gibbs sampler and identifies solutions that are as good as, or even better than, those of the collapsed Gibbs sampler.
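The substitution described in the abstract can be illustrated with a minimal sketch, which is not the patented implementation. It assumes the distribution in question is a Dirichlet (as the main claim states) and contrasts a stochastic draw with the deterministic mean. The function names, the prior value `beta`, and the word counts are hypothetical, chosen only for illustration.

```python
import random


def dirichlet_draw(params, rng):
    """Stochastic step: draw one sample from Dirichlet(params)
    by normalizing independent Gamma(a, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]


def dirichlet_mean(params):
    """Greedy step: compute the mean of Dirichlet(params)
    directly and deterministically (a_i / sum of all a)."""
    total = sum(params)
    return [a / total for a in params]


# Hypothetical posterior parameters for one topic over a 4-word vocabulary:
# a symmetric prior (beta) plus illustrative word-topic counts.
beta = 0.1
counts = [12, 0, 3, 5]
posterior = [beta + c for c in counts]

phi_sampled = dirichlet_draw(posterior, random.Random(0))  # varies with the seed
phi_greedy = dirichlet_mean(posterior)                     # same result every run
```

Both functions return a probability vector over the vocabulary. Only the greedy version is free of random number generation, which is one plausible reason the step maps well onto data-parallel hardware.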
Publication number US2016210718(A1) Publication date 2016.07.21
Application number US201514599272 Filing date 2015.01.16
Applicant Oracle International Corporation Inventors Tristan Jean-Baptiste; Steele Guy
Classification G06T1/20 Main classification G06T1/20
Agency Agent
Main claim 1. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising: calculating a mean of the Dirichlet distribution; determining, from the sampler result data, one or more sets of correlated words; wherein the method is performed by one or more computing devices.
Address Redwood Shores CA US