Title of invention |
TOPIC EXTRACTION USING CLAUSE SEGMENTATION AND HIGH-FREQUENCY WORDS |
Abstract |
The disclosed embodiments provide a system for processing data. During operation, the system obtains a set of clauses in a first set of content items comprising unstructured data. Next, the system obtains a set of stop words comprising high-frequency words that occur in a second set of content items. The system then automatically extracts a set of topics from the set of clauses by generating a set of n-grams from the set of clauses and excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram. Finally, the system displays the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items. |
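The stop-word step in the abstract can be illustrated with a minimal sketch. The abstract does not say how "high-frequency" is decided, so the `top_k` cutoff, the regex tokenizer, and the function name `high_frequency_stop_words` are all assumptions made for illustration, not the method disclosed in the application.

```python
import re
from collections import Counter


def high_frequency_stop_words(reference_items, top_k=2):
    """Return the top_k most frequent words in a reference corpus.

    Assumption: "high-frequency" is modeled as a simple top-k cutoff on
    raw counts; the source does not specify the actual criterion.
    """
    counts = Counter()
    for item in reference_items:
        # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", item.lower()))
    return {word for word, _ in counts.most_common(top_k)}


# The "second set of content items" from which stop words are derived.
reference_items = [
    "the service is great and the staff is friendly",
    "the app is slow",
    "the price is fair",
]
stop_words = high_frequency_stop_words(reference_items, top_k=2)
```

Deriving the stop words from a separate corpus, rather than using a fixed list, lets the excluded vocabulary follow whatever is common in the domain at hand.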
Publication number |
US2016314191(A1) |
Publication date |
2016.10.27 |
Application number |
US201514807674 |
Filing date |
2015.07.23 |
Applicant |
LinkedIn Corporation |
Inventors |
Markman Vita G.; Zhang Yongzheng; Martell Craig H.; Finger Lutz T. |
Classification codes |
G06F17/30; G06F17/27 |
Main classification code |
G06F17/30 |
Patent agency |
|
Patent agent |
|
Main claim |
1. A method, comprising:
obtaining a set of clauses in a first set of content items comprising unstructured data;
obtaining a set of stop words comprising high-frequency words that occur in a second set of content items; and
automatically extracting, by one or more computer systems, a set of topics from the set of clauses by:
    generating a set of n-grams from the set of clauses; and
    excluding a first n-gram in the set of n-grams from the set of topics when the first n-gram contains a word in the set of stop words in a pre-specified position of the first n-gram; and
displaying, by the one or more computer systems, the set of topics to a user to improve understanding of the first set of content items by the user without requiring the user to manually analyze the first set of content items. |
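The extraction step of the claim (generate n-grams per clause, then drop any n-gram with a stop word in a pre-specified position) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: clauses are taken as already segmented, tokens are split on whitespace, and the blocked positions are assumed to be the first and last word of the n-gram. The names `ngrams`, `extract_topics`, and `blocked_positions` are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_topics(clauses, stop_words, n=2, blocked_positions=(0, -1)):
    """Collect n-grams that have no stop word in a blocked position.

    Assumption: the "pre-specified position" is modeled as the first and
    last slot of the n-gram (indices 0 and -1); the claim leaves it open.
    """
    topics = []
    for clause in clauses:
        # N-grams are built within one clause, never across clause boundaries.
        tokens = clause.lower().split()
        for gram in ngrams(tokens, n):
            if any(gram[pos] in stop_words for pos in blocked_positions):
                continue  # excluded: stop word in a blocked position
            topic = " ".join(gram)
            if topic not in topics:
                topics.append(topic)  # keep first-seen order, no duplicates
    return topics


clauses = ["the battery life is short", "i like the battery life"]
stop_words = {"the", "is", "i"}
topics = extract_topics(clauses, stop_words)
```

With these inputs, every bigram except "battery life" begins or ends with a stop word, so it is the only candidate that survives. Checking only specific positions, rather than rejecting any n-gram that contains a stop word, would let longer phrases with an interior stop word remain candidates.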
Address |
Mountain View CA US |