Title: Approximate named-entity extraction
Abstract: According to one embodiment, a method is provided for approximate named-entity extraction from a dictionary that includes entries, where each of the entries includes one or more words. Words are read from the entries of the dictionary, and network resources are searched to determine a frequency of occurrence of the words on the network resources. In view of the frequency of occurrence of the words located on the network resources, domain relevancy of the words in the entries of the dictionary is determined. A domain repository is created using top-ranked words as determined by the domain relevancy of the words. In view of the domain repository, signatures for both the entries of the dictionary and strings of an input document are computed. The strings of the input document are filtered by comparing the signatures of the strings against the signatures of the entries to identify approximate-match entity names.
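The pipeline in the abstract (word frequencies, domain relevancy, domain repository, signatures) can be illustrated with a minimal Python sketch. The abstract does not state how relevancy is derived from the frequencies or how a signature is formed, so the sketch makes two assumptions: relevancy is the ratio of a word's count on domain pages to its count on general pages, and a signature is the sorted set of a string's tokens that appear in the domain repository. The hard-coded counts are hypothetical stand-ins for the results of searching network resources, and all function names are illustrative.

```python
def domain_relevancy(domain_counts, background_counts):
    """Score words by domain count relative to general count (assumed rule)."""
    scores = {}
    for word, domain_hits in domain_counts.items():
        general_hits = background_counts.get(word, 0)
        # Add-one smoothing avoids division by zero for unseen words.
        scores[word] = domain_hits / (general_hits + 1)
    return scores


def build_domain_repository(scores, top_k):
    """Keep the top_k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_k])


def signature(text, repository):
    """Assumed signature: sorted, de-duplicated domain tokens of the text."""
    tokens = text.lower().split()
    return " ".join(sorted({t for t in tokens if t in repository}))


# Hypothetical frequency counts standing in for network search results.
domain_counts = {"acid": 900, "sodium": 800, "chloride": 700,
                 "the": 1000, "of": 950}
background_counts = {"acid": 90, "sodium": 40, "chloride": 20,
                     "the": 100000, "of": 90000}

scores = domain_relevancy(domain_counts, background_counts)
repository = build_domain_repository(scores, top_k=3)

entry_sig = signature("sodium chloride", repository)
string_sig = signature("Solution of Sodium Chloride", repository)
print(entry_sig == string_sig)
```

Under these assumptions, common words such as "the" and "of" score low and stay out of the repository, so a dictionary entry and a document string that differ only in non-domain words receive the same signature.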
Publication number: US9311290(B2) Publication date: 2016.04.12
Application number: US201313970707 Filing date: 2013.08.20
Applicant: International Business Machines Corporation Inventors: Chen Ying; Spangler William S.; Yan Su
Classification: G06F17/27 Main classification: G06F17/27
Agency: Cantor Colburn LLP Agent: Cantor Colburn LLP
Main claim: 1. A method for approximate named-entity extraction from a dictionary comprising a plurality of entries, each of the entries including one or more words, the method comprising: reading a plurality of the words from the entries of the dictionary; searching network resources to determine a frequency of occurrence of the words on the network resources; in view of the frequency of occurrence of the words located on the network resources, determining domain relevancy of the words in the entries of the dictionary; creating a domain repository using top-ranked words as determined by the domain relevancy of the words; in view of the domain repository, computing signatures for both the entries of the dictionary and strings of an input document as representative strings that capture domain-related information based on a domain knowledge base; filtering the strings of the input document by comparing the signatures of the strings against the signatures of the entries to identify approximate-match entity names, the filtering further comprising: storing the signatures for the entries in a signature Bloom filter, wherein comparing the signatures of the strings is performed against the signatures of the entries in the signature Bloom filter; creating a length-based inverted index as a list of unique tokens from the dictionary indicating the entries where the tokens occur, the signatures of the entries, and a number of tokens per entry; and narrowing a search range when filtering the strings using the length-based inverted index to identify approximate matches having matching signatures and a similar length in view of the number of tokens per entry compared to a number of tokens per string of the input document as bounded by a parameter that establishes the search range relative to the number of tokens per string of the input document; and generating a list of the approximate-match entity names as identified based on the filtering.
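The filtering steps of the claim (signature Bloom filter, length-based inverted index, length-bounded search range) can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the signature rule (sorted domain tokens), the hash scheme, the hard-coded domain repository, and the parameter name `k` are all hypothetical choices made for the example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; one byte per bit position for simplicity."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def signature(text, repository):
    """Assumed signature: sorted, de-duplicated domain tokens of the text."""
    return " ".join(sorted({t for t in text.lower().split() if t in repository}))


def build_index(entries, repository):
    """Return (Bloom filter of entry signatures, length-based inverted index)."""
    bloom = BloomFilter()
    index = {}  # token -> list of (entry_id, entry signature, tokens per entry)
    for entry_id, entry in enumerate(entries):
        tokens = entry.lower().split()
        sig = signature(entry, repository)
        bloom.add(sig)
        for tok in set(tokens):
            index.setdefault(tok, []).append((entry_id, sig, len(tokens)))
    return bloom, index


def candidates(string, repository, bloom, index, k=1):
    """Entry ids with a matching signature and a token count within k."""
    tokens = string.lower().split()
    sig = signature(string, repository)
    if sig not in bloom:  # a Bloom filter never gives false negatives
        return set()
    found = set()
    for tok in set(tokens):
        for entry_id, entry_sig, count in index.get(tok, []):
            if entry_sig == sig and abs(count - len(tokens)) <= k:
                found.add(entry_id)
    return found


# Hypothetical domain repository and dictionary entries.
repository = {"sodium", "chloride", "acetic", "acid"}
entries = ["sodium chloride", "acetic acid",
           "sodium chloride aqueous solution grade"]
bloom, index = build_index(entries, repository)

print(candidates("sodium chloride solution", repository, bloom, index, k=1))
```

With `k=1`, the three-token string matches only the two-token entry, because the five-token entry falls outside the length range even though its signature is identical. Raising `k` to 2 widens the search range and admits both entries.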
Address: Armonk, NY, US