Title of invention TOPIC EXTRACTION USING CLAUSE SEGMENTATION AND HIGH-FREQUENCY WORDS
Abstract The disclosed embodiments provide a system for processing data. During operation, the system obtains a set of clauses in a first set of content items comprising unstructured data. Next, the system obtains a set of stop words comprising high-frequency words that occur in a second set of content items. The system then automatically extracts a set of topics from the set of clauses by generating a set of n-grams from the set of clauses and excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram. Finally, the system displays the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items.
Publication number US2016314191(A1) Publication date 2016.10.27
Application number US201514807674 Filing date 2015.07.23
Applicant LinkedIn Corporation Inventors Markman Vita G.; Zhang Yongzheng; Martell Craig H.; Finger Lutz T.
Classification G06F17/30; G06F17/27 Main classification G06F17/30
Agency Agent
Main claim 1. A method, comprising: obtaining a set of clauses in a first set of content items comprising unstructured data; obtaining a set of stop words comprising high-frequency words that occur in a second set of content items; and automatically extracting, by one or more computer systems, a set of topics from the set of clauses by: generating a set of n-grams from the set of clauses; and excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram; and displaying, by the one or more computer systems, the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items.
Address Mountain View CA US
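
Illustrative sketch The steps recited in the main claim (deriving stop words from high-frequency words, generating n-grams from clauses, and excluding n-grams with a stop word in a pre-specified position) might look roughly like the Python below. This is a minimal sketch and not the patented implementation. The function names (`high_frequency_words`, `generate_ngrams`, `extract_topics`), whitespace tokenization, the `top_k` cutoff, the maximum n-gram length of 3, and the choice of first and last word as the "pre-specified positions" are all assumptions made for illustration; the record does not specify any of them.

```python
from collections import Counter


def high_frequency_words(content_items, top_k=50):
    """Return the top_k most frequent words in a reference corpus.

    Assumption: "high-frequency" is approximated by a simple top-k cutoff.
    """
    counts = Counter(w for item in content_items for w in item.lower().split())
    return {word for word, _ in counts.most_common(top_k)}


def generate_ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def extract_topics(clauses, stop_words, positions=(0, -1), max_n=3):
    """Count candidate topics across clauses.

    An n-gram is excluded when a stop word sits in any of the given
    positions (assumed here to be the first and last word).
    """
    topics = Counter()
    for clause in clauses:
        tokens = clause.lower().split()  # assumption: whitespace tokenization
        for ngram in generate_ngrams(tokens, max_n):
            if any(ngram[p] in stop_words for p in positions):
                continue
            topics[" ".join(ngram)] += 1
    return topics
```

For example, with the clauses "the job search is slow" and "job search results" and the stop words "the" and "is", the bigram "the job" is dropped because a stop word is in first position, while "search is slow" is kept because its stop word is in the middle. Calling `topics.most_common()` on the result gives a ranked list that could be displayed to a user.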