Title of Invention |
DATA-PARALLEL PARAMETER ESTIMATION OF THE LATENT DIRICHLET ALLOCATION MODEL BY GREEDY GIBBS SAMPLING |
Abstract |
A novel data-parallel algorithm is presented for topic modeling on highly parallel hardware architectures. The algorithm is a Markov chain Monte Carlo algorithm used to estimate the parameters of the latent Dirichlet allocation (LDA) topic model. It is based on a highly parallel partially-collapsed Gibbs sampler, but replaces a stochastic step that draws from a distribution with an optimization step that computes the mean of the distribution directly and deterministically. The algorithm is correct, it is statistically performant, and it is faster than state-of-the-art algorithms because it can exploit massive amounts of parallelism when run on a highly parallel architecture, such as a GPU. Furthermore, the partially-collapsed Gibbs sampler converges about as fast as the collapsed Gibbs sampler and identifies solutions that are as good as, or even better than, those of the collapsed Gibbs sampler. |
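The substitution described in the abstract, replacing a random draw with the mean of the distribution, can be illustrated with a minimal sketch. It assumes the distribution in question is a Dirichlet posterior whose parameters are a prior plus observed counts; the function names `dirichlet_draw` and `dirichlet_mean` and the example counts are hypothetical and do not come from the patent.

```python
import random

def dirichlet_draw(params, rng):
    """Stochastic step: draw one sample from Dirichlet(params).

    Uses the standard construction of normalizing independent
    Gamma(params[i], 1) variates.
    """
    gammas = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def dirichlet_mean(params):
    """Deterministic step: the mean of Dirichlet(params) is params[i] / sum(params)."""
    total = sum(params)
    return [p / total for p in params]

# Hypothetical posterior parameters: a symmetric prior of 1.0 plus word counts.
prior = 1.0
counts = [0, 1, 0]
posterior = [prior + c for c in counts]

rng = random.Random(0)
sampled = dirichlet_draw(posterior, rng)   # varies with the random seed
greedy = dirichlet_mean(posterior)         # always the same for these counts
```

Because the mean needs only a sum and an element-wise division, it involves no random number generation and maps naturally onto data-parallel hardware.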
Publication Number |
US2016210718(A1) |
Publication Date |
2016.07.21 |
Application Number |
US201514599272 |
Filing Date |
2015.01.16 |
Applicant |
Oracle International Corporation |
Inventors |
Tristan Jean-Baptiste;Steele Guy |
Classification |
G06T1/20 |
Main Classification |
G06T1/20 |
Agency |
|
Agent |
|
Independent Claim |
1. A method for identifying sets of correlated words comprising:
receiving information for a set of documents;
wherein the set of documents comprises a plurality of words;
running a partially-collapsed Gibbs sampler over a Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprising:
calculating a mean of the Dirichlet distribution;
determining, from the sampler result data, one or more sets of correlated words;
wherein the method is performed by one or more computing devices. |
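The steps recited in the claim can be sketched as a small sequential program. This is an illustrative reading under stated assumptions, not the patented implementation: it assumes documents are lists of integer word ids, symmetric priors `alpha` and `beta`, and a fixed number of topics, and every name here (`greedy_gibbs_lda`, `top_words`, the toy corpus) is hypothetical. Given the per-document and per-topic distributions, each token's topic is sampled independently, which is the property a data-parallel version would exploit; the sketch itself runs serially.

```python
import random

def dirichlet_mean(params):
    """Mean of Dirichlet(params): each parameter divided by their sum."""
    total = sum(params)
    return [p / total for p in params]

def greedy_gibbs_lda(docs, num_topics, vocab_size,
                     alpha=0.1, beta=0.1, iters=50, seed=0):
    """Partially-collapsed sampler where Dirichlet draws are replaced by means."""
    rng = random.Random(seed)
    topics = range(num_topics)
    # Random initial topic assignment for every token.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    theta, phi = [], []
    for _ in range(iters):
        # Count topic usage per document and word usage per topic.
        doc_topic = [[0] * num_topics for _ in docs]
        topic_word = [[0] * vocab_size for _ in topics]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topic[d][z[d][i]] += 1
                topic_word[z[d][i]][w] += 1
        # Greedy step: use the posterior Dirichlet means instead of random draws.
        theta = [dirichlet_mean([alpha + c for c in row]) for row in doc_topic]
        phi = [dirichlet_mean([beta + c for c in row]) for row in topic_word]
        # Stochastic step: resample each token's topic independently of the others.
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                weights = [theta[d][k] * phi[k][w] for k in topics]
                z[d][i] = rng.choices(topics, weights=weights)[0]
    return theta, phi

def top_words(phi, n):
    """For each topic, the n word ids with the highest probability."""
    return [sorted(range(len(row)), key=lambda w: -row[w])[:n] for row in phi]

# Toy corpus (hypothetical): word ids 0 and 1 co-occur, as do 2 and 3.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
theta, phi = greedy_gibbs_lda(docs, num_topics=2, vocab_size=4)
correlated = top_words(phi, 2)
```

Here `correlated` holds one list of word ids per topic, which corresponds to the "sets of correlated words" determined from the sampler result data. Which words land in which topic depends on the random seed.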
Address |
Redwood Shores CA US |