Title of invention Scalable lookup-driven entity extraction from indexed document collections
Abstract A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
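The filtering pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that a token set "covers" an entity string when every token in the set occurs in that string, so any document containing the entity must contain all of those tokens. It also assumes a simple cover heuristic (one single-token set per entity, picking the token with the shortest posting list), whitespace tokenization, and an in-memory index. All function names are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real analyzer.
    return text.lower().split()


def build_inverted_index(docs):
    # Map each token to the set of document identifiers that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def covering_token_sets(entities, index):
    # Assumed heuristic: cover each entity string with a single-token set,
    # choosing its rarest token (ties broken alphabetically).
    cover = set()
    for entity in entities:
        tokens = tokenize(entity)
        if tokens:
            rarest = min(tokens, key=lambda t: (len(index.get(t, ())), t))
            cover.add(frozenset([rarest]))
    return cover


def candidate_doc_ids(cover, index):
    # A document is a candidate if it contains every token of at least one token set.
    ids = set()
    for token_set in cover:
        postings = [index.get(t, set()) for t in token_set]
        ids |= set.intersection(*postings)
    return ids


def contains_entity(text, entity):
    # Exact match on a contiguous token sequence.
    doc, ent = tokenize(text), tokenize(entity)
    n = len(ent)
    return n > 0 and any(doc[i:i + n] == ent for i in range(len(doc) - n + 1))


def filter_documents(docs, entities):
    # Index query first (cheap), then verify candidates against the entity list.
    index = build_inverted_index(docs)
    ids = candidate_doc_ids(covering_token_sets(entities, index), index)
    return {d: docs[d] for d in sorted(ids)
            if any(contains_entity(docs[d], e) for e in entities)}
```

In this sketch the index query only narrows the collection to candidate documents. The final verification pass removes candidates that contain a covering token but not the full entity string, and entity recognition would then run on what remains.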
Publication number US9501475(B2) Publication date 2016.11.22
Application number US201414294791 Filing date 2014.06.03
Applicant Microsoft Technology Licensing, LLC Inventors Agrawal Sanjay;Chakrabarti Kaushik;Chaudhuri Surajit;Ganti Venkatesh
Classification G06F17/00;G06F7/00;G06F17/30;G06F17/27 Main classification G06F17/00
Agency Agent Corie Alin;Swain Sandy;Minhas Micky
Main claim 1. A method for ad-hoc entity extraction, comprising: filtering a first set of documents to generate a second set of documents that includes documents of the first set based at least on a set of token sets that covers entity strings in a list of entity strings, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, the set of tokens generated based on the entity strings in the list of the entity strings; and performing entity recognition on the second set of documents.
Address Redmond WA US