摘要 |
What is disclosed is a novel system and method for analyzing multi-dimensional cluster data sets to identify clusters of related documents in an electronic document storage system. Digital documents, for which multi-dimensional probabilistic relationships are to be determined, are received and then parsed to identify multi-dimensional count data with at least three dimensions. Multi-dimensional tensors representing the count data and estimated cluster membership probabilities are created. The tensors are then iteratively processed using a first and a complementary second tensor factorization model to refine the cluster definition matrices until a convergence criteria has been satisfied. Likely cluster memberships for the count data are determined based upon the refinements made to the cluster definition matrices by the alternating tensor factorization models. The present method advantageously extends to the field of tensor analysis a combination of Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis to decompose non-negative data.
|