Title of invention: TERM SELECTION FROM A DOCUMENT TO FIND SIMILAR CONTENT
Abstract: Methods, devices, and systems are described for creating and implementing search query vectors for knowledge base articles or other formal articles, the query vectors being automatically created from informal correspondence such as a service request email to an information technology (IT) department. Term frequency-inverse document frequency (TF-IDF) scores are calculated for rare words in the correspondence with respect to a corpus of other service requests. High-scoring terms with the same neighbors as those in the corpus of formal articles are added to the search query vector, while high-scoring terms that do not share the same neighbors are thrown out. The query vector is then used to run a search of the knowledge base for relevant articles.
Publication number: US2016140231(A1) Publication date: 2016.05.19
Application number: US201414546340 Filing date: 2014.11.18
Applicant: Oracle International Corporation Inventor: Agarwal Pranav Kumar
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Independent claim: 1. A method for searching using term selection from a document to find similar content, the method comprising: providing formally written articles; selecting one or more tokens in each article by: identifying candidate root words; calculating, using a processor operatively coupled with a memory, a term frequency-inverse document frequency (TF-IDF) score for each of the candidate root words; and selecting the candidate root words as tokens based on the TF-IDF scores; cataloging neighboring tokens for each selected token into a data structure for each article, where neighboring tokens include tokens that are within a threshold number of words to the selected token in an article; merging the data structures for the articles into a merged data structure; providing a written correspondence; selecting one or more tokens in the correspondence by: identifying candidate root words from the correspondence; computing a TF-IDF score for each of the candidate root words in the correspondence with respect to a corpus of other correspondence; and selecting the candidate root words as tokens based on the TF-IDF scores; ascertaining neighboring tokens for each selected token in the correspondence; finding a match between a token in the correspondence and in the merged data structure; for the matched token, counting how many neighboring tokens in the merged data structure are also neighboring tokens in the correspondence; and adding the matched token to a query vector based on the counting; and performing a search of the formally written articles using the query vector.
Address: Redwood Shores CA US
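
Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be approximated in a few lines of Python. Everything below is an assumption made for illustration, not the applicant's implementation: the function names (`tokenize`, `select_tokens`, `neighbors`, `build_merged_index`, `build_query`), the smoothed TF-IDF formula, the use of lowercased words in place of true root words, and the parameters `top_k`, `window`, and `min_shared`. The final search step is omitted; the sketch stops at the query vector.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased word split; a stand-in for "candidate root words"
    # (assumption: no real stemming is performed here).
    return re.findall(r"[a-z0-9]+", text.lower())


def select_tokens(doc_words, corpus_words, top_k):
    """Return the top_k words of doc_words ranked by a smoothed TF-IDF."""
    n_docs = len(corpus_words)
    # Document frequency: number of corpus documents containing each word.
    df = Counter(w for words in corpus_words for w in set(words))
    tf = Counter(doc_words)
    scores = {
        w: (count / len(doc_words)) * math.log((1 + n_docs) / (1 + df[w]))
        for w, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_k])


def neighbors(doc_words, selected, window):
    """Map each selected token to the selected tokens within `window` words."""
    result = defaultdict(set)
    for i, w in enumerate(doc_words):
        if w not in selected:
            continue
        found = result[w]  # creates the entry even if no neighbor is found
        for other in doc_words[max(0, i - window): i + window + 1]:
            if other in selected and other != w:
                found.add(other)
    return result


def build_merged_index(articles, top_k=5, window=5):
    """Catalog neighbors per article, then merge them into one structure."""
    corpus = [tokenize(a) for a in articles]
    merged = defaultdict(set)
    for words in corpus:
        selected = select_tokens(words, corpus, top_k)
        for tok, nbrs in neighbors(words, selected, window).items():
            merged[tok] |= nbrs
    return merged


def build_query(correspondence, other_correspondence, merged,
                top_k=5, window=5, min_shared=1):
    """Keep tokens that match the merged index and share enough neighbors."""
    words = tokenize(correspondence)
    corpus = [tokenize(c) for c in other_correspondence]
    selected = select_tokens(words, corpus, top_k)
    local = neighbors(words, selected, window)
    query = []
    for tok in sorted(selected):
        if tok in merged and len(local[tok] & merged[tok]) >= min_shared:
            query.append(tok)
    return query
```

In this sketch a token enters the query only if it appears in the merged article index and at least `min_shared` of its neighbors in the correspondence are also its neighbors in the articles, which mirrors the "counting" step of the claim; the threshold itself is a hypothetical choice.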