Title: Accessing documents using predictive word sequences
Abstract: Methods and systems for accessing documents in document collections using predictive word sequences are disclosed. A method for accessing documents using predictive word sequences includes creating a candidate list of word sequences, where respective ones of the word sequences comprise one or more elements derived from the document corpus; expanding the candidate list by adding one or more new word sequences, where each new pattern is created by combining one or more elements derived from the document corpus with one of the word sequences currently in the candidate list; determining a predictive power with respect to the subject for respective ones of entries of the candidate list, where the entries include the word sequences and the new word sequences; pruning from the candidate list ones of said entries with the determined predictive power less than a predetermined threshold; and accessing documents from the document corpus based on the pruned candidate list. The expanding of the candidate list can include creating each new pattern as a gapped sequence, where the gapped sequence comprises one of the word sequences and one of said elements separated by zero or more words. Corresponding system and computer readable media embodiments are also disclosed.
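The gapped sequence mentioned in the abstract can be illustrated with a small matcher. This is a minimal sketch, not the patented implementation: the function name `matches_gapped` and the `max_gap` parameter are hypothetical, and the upper bound on the gap is an assumption, since the abstract only says "zero or more words".

```python
from typing import Sequence


def matches_gapped(tokens: Sequence[str], pattern: Sequence[str], max_gap: int = 2) -> bool:
    """Return True if the elements of `pattern` occur in `tokens` in order,
    with at most `max_gap` other words between consecutive elements.

    `max_gap` is an illustrative assumption; the source does not state a bound.
    """
    if not pattern:
        return True

    def extend(prev_pos: int, pat_idx: int) -> bool:
        # All pattern elements matched.
        if pat_idx == len(pattern):
            return True
        # The next element may sit 0..max_gap words after the previous one.
        window_end = min(len(tokens), prev_pos + 2 + max_gap)
        for pos in range(prev_pos + 1, window_end):
            if tokens[pos] == pattern[pat_idx] and extend(pos, pat_idx + 1):
                return True
        return False

    # The first element may start anywhere in the document.
    return any(
        tokens[i] == pattern[0] and extend(i, 1) for i in range(len(tokens))
    )
```

With `max_gap=0` the matcher reduces to ordinary contiguous phrase matching, which corresponds to the "zero words" case of the gapped sequence.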
Publication number: US9069842(B2); Publication date: 2015.06.30
Application number: US201012892637; Filing date: 2010.09.28
Applicant: The MITRE Corporation; Inventor: Melby Paul Christian
Classification: G06F17/30; Main classification: G06F17/30
Agency: Sterne, Kessler, Goldstein & Fox PLLC; Agent: Sterne, Kessler, Goldstein & Fox PLLC
Principal claim: 1. A method for accessing documents related to a subject from a document corpus, comprising: categorizing documents from the document corpus based on one or more subjects; creating a candidate list of word sequences, wherein respective ones of the word sequences comprise one or more elements derived from the document corpus; expanding the candidate list by adding one or more new word patterns, wherein each new pattern comprises a gapped sequence created by combining one or more elements derived from the document corpus with one of said word sequences; determining a predictive power with respect to the subject for respective ones of entries of the candidate list, wherein the entries comprise said word sequences and said new word patterns; pruning from the candidate list ones of said entries with the determined predictive power less than a predetermined threshold, wherein the predictive power comprises a measure of information gain, and wherein the pruning further comprises pruning from the candidate list ones of said entries with a frequency of occurrence less than a predetermined frequency threshold; accessing documents from the document corpus based on the pruned candidate list; updating the categorization of documents based on the accessing; and iteratively performing the expanding, the determining the predictive power, and the pruning, for increasing entry lengths until at least one of the entries is of a predetermined length.
Address: McLean, VA, US
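The expand, score, and prune loop recited in claim 1 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patentee's implementation: all function and parameter names (`mine_patterns`, `min_gain`, `min_freq`, `max_gap`, `max_len`) are hypothetical, "frequency of occurrence" is taken to mean the number of documents containing an entry, information gain is computed against a binary on-subject label, the gap is bounded by `max_gap`, and the step of updating the categorization is omitted.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a split with `pos` and `neg` items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(hits: List[bool], labels: List[bool]) -> float:
    """Reduction in label entropy from knowing whether a pattern is present."""
    n = len(labels)
    if n == 0:
        return 0.0
    pos = sum(labels)
    with_n = sum(hits)
    with_pos = sum(1 for h, y in zip(hits, labels) if h and y)
    without_n, without_pos = n - with_n, pos - with_pos
    conditional = (with_n / n) * entropy(with_pos, with_n - with_pos) + (
        without_n / n
    ) * entropy(without_pos, without_n - without_pos)
    return entropy(pos, n - pos) - conditional


def contains(tokens: Sequence[str], pattern: Pattern, max_gap: int) -> bool:
    """True if `pattern` occurs in order with at most `max_gap` words between elements."""
    def extend(prev: int, k: int) -> bool:
        if k == len(pattern):
            return True
        stop = min(len(tokens), prev + 2 + max_gap)
        return any(
            tokens[j] == pattern[k] and extend(j, k + 1) for j in range(prev + 1, stop)
        )

    return any(tokens[i] == pattern[0] and extend(i, 1) for i in range(len(tokens)))


def mine_patterns(
    docs: List[List[str]],
    labels: List[bool],
    max_len: int = 2,
    min_gain: float = 0.1,
    min_freq: int = 2,
    max_gap: int = 1,
) -> Dict[Pattern, float]:
    """Grow candidate entries one element at a time, keeping only those that
    pass both the frequency threshold and the information-gain threshold."""

    def score_and_prune(candidates: Set[Pattern]) -> Dict[Pattern, float]:
        kept: Dict[Pattern, float] = {}
        for pat in candidates:
            hits = [contains(d, pat, max_gap) for d in docs]
            if sum(hits) < min_freq:  # frequency pruning (document frequency assumed)
                continue
            gain = information_gain(hits, labels)
            if gain >= min_gain:  # predictive-power pruning
                kept[pat] = gain
        return kept

    vocab = sorted({w for d in docs for w in d})
    kept = score_and_prune({(w,) for w in vocab})  # initial single-word candidates
    frontier, length = set(kept), 1
    while frontier and length < max_len:
        # Expansion: append one corpus element to each surviving entry.
        new = score_and_prune({pat + (w,) for pat in frontier for w in vocab})
        kept.update(new)
        frontier, length = set(new), length + 1
    return kept


def access_documents(docs: List[List[str]], patterns, max_gap: int = 1) -> List[int]:
    """Indices of documents that contain at least one retained pattern."""
    return [i for i, d in enumerate(docs) if any(contains(d, p, max_gap) for p in patterns)]
```

Only entries that survived the previous round are expanded, which keeps the candidate list from growing with the full vocabulary at every length. Note that information gain is symmetric, so an entry that strongly predicts off-subject documents also scores highly; a real system would need to account for the direction of the association before using an entry for retrieval.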