Title Systems and methods for scalable topic detection in social media
Abstract Embodiments generally relate to systems and methods for detecting topics in social media data. More particularly, the systems and methods can extract a concept hierarchy from a set of data, wherein the concept hierarchy comprises a plurality of layers. Further, the systems and methods can train topic models based on the content in each of the layers. Still further, the systems and methods can select the most appropriate topic model for social media data by balancing the complexity of the model and the accuracy of the topic detection result. Moreover, the systems and methods can use the most appropriate topic model to detect topics in social media data.
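The selection step in the abstract (balancing model complexity against detection accuracy) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the patent: the `LayerModel` record, the use of topic count as a complexity proxy, the linear `penalty` weight, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerModel:
    level: int        # depth of the hierarchy layer the model was trained on
    accuracy: float   # held-out topic-detection accuracy in [0, 1]
    num_topics: int   # topic count at this layer (assumed complexity proxy)


def select_layer(models, penalty=0.001):
    """Pick the model with the best accuracy-minus-complexity score.

    The linear trade-off is an assumed scoring rule, not the patent's.
    """
    return max(models, key=lambda m: m.accuracy - penalty * m.num_topics)


# Made-up numbers: deeper layers are more accurate but far more complex.
models = [
    LayerModel(level=1, accuracy=0.60, num_topics=10),
    LayerModel(level=2, accuracy=0.80, num_topics=60),
    LayerModel(level=3, accuracy=0.85, num_topics=400),
]
best = select_layer(models)
```

With these illustrative values the middle layer scores highest, which matches the intuition that a mid-level layer can trade a small loss in accuracy for a much simpler model.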
Publication number US9183293(B2) Publication date 2015.11.10
Application number US201113324391 Filing date 2011.12.13
Applicant XEROX CORPORATION Inventors Li Lei; Peng Wei; Sun Tong
Classification G06F17/30; G06Q30/02; G06Q50/00 Main classification G06F17/30
Agency MH2 Technology Law Group LLP Agent MH2 Technology Law Group LLP
Main claim 1. A method of processing data, the method comprising: receiving identification of a plurality of concepts via a user interface, the concepts representing a top level of a hierarchy of topics; processing a data set to extract children of the top level of the hierarchy of topics, wherein at least the children of the hierarchy of topics is based on a hierarchy of the data set identified from a source of the data set; linking a portion of the data set to a subset of the hierarchy of topics, wherein the subset of the hierarchy of topics comprises one or more subtopics; extracting selected terms from the portion of the data set, wherein the selected terms were identified as important based on calculated information retrieval measurements of the portion of the data set; training topic models for the subset of the hierarchy of topics and the one or more subtopics using the selected terms from the portion of the data set and a probabilistic learning technique, wherein for each topic model the training comprises: determining a prior knowledge estimate based on estimated prior knowledge of a portion of the data set belonging to the topic model; determining a plurality of term contribution estimates by processing each term of the selected terms to estimate a measure of evidence that the term contributes to the portion of the data set belonging to the topic model; and combining the prior knowledge estimate and the plurality of term contribution estimates to determine a probability that the portion of the data set belongs to the topic model; evaluating an accuracy and evaluating a complexity of each topic model of the topic models in response to a determination that a topic model has been trained for at least one subtopic; determining, using one or more processors, that the subset of the hierarchy of topics is an appropriate topic for textual data generated via a social networking service by determining that the subset of the hierarchy of topics balances the accuracy and the complexity of the topic models, wherein the subset of the hierarchy of topics is at a median hierarchy level relative to the hierarchy of topics; and detecting one or more appropriate subtopics of the appropriate topic that are most appropriate for the textual data generated via the social networking service by examining the accuracy of each topic model associated with the one or more subtopics of the appropriate topic, wherein the detecting one or more appropriate subtopics of the appropriate topic comprises applying a locality-sensitive hashing (LSH) technique to the textual data generated via the social networking service and the portion of the data set.
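The training step in the claim combines a prior knowledge estimate with per-term contribution estimates to obtain a probability that a text belongs to a topic. The claim does not name a specific technique; one common method with this shape is multinomial naive Bayes, and the sketch below assumes it. The function names `train` and `posterior`, the add-one smoothing, and the toy documents are all hypothetical and for illustration only.

```python
import math
from collections import Counter


def train(docs_by_topic, vocabulary):
    """Estimate a log prior per topic and a log likelihood per (topic, term).

    docs_by_topic maps a topic name to a list of token lists.
    Add-one (Laplace) smoothing is an assumed choice.
    """
    total_docs = sum(len(docs) for docs in docs_by_topic.values())
    log_prior, log_like = {}, {}
    for topic, docs in docs_by_topic.items():
        # Prior knowledge estimate: share of training documents in this topic.
        log_prior[topic] = math.log(len(docs) / total_docs)
        counts = Counter(t for doc in docs for t in doc if t in vocabulary)
        denom = sum(counts.values()) + len(vocabulary)
        # Term contribution estimates: smoothed P(term | topic).
        log_like[topic] = {t: math.log((counts[t] + 1) / denom) for t in vocabulary}
    return log_prior, log_like


def posterior(tokens, log_prior, log_like):
    """Combine prior and term evidence into normalized topic probabilities."""
    scores = {
        topic: log_prior[topic]
        + sum(log_like[topic][t] for t in tokens if t in log_like[topic])
        for topic in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {topic: math.exp(s - top) / norm for topic, s in scores.items()}


# Toy data, invented for this example.
docs = {
    "sports": [["game", "team", "win"], ["team", "score"]],
    "tech": [["phone", "app"], ["app", "update", "phone"]],
}
vocab = {"game", "team", "win", "score", "phone", "app", "update"}
log_prior, log_like = train(docs, vocab)
probs = posterior(["team", "win"], log_prior, log_like)
```

Working in log space keeps the product of many small term probabilities from underflowing; the final normalization turns the summed scores back into probabilities that add up to one.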
Address Norwalk CT US