Title: Document tagging and retrieval using entity specifiers
Abstract: Techniques for managing big data include tagging documents, and subsequently retrieving them, using per-subject dictionaries in which some entries are specially designated as entities. An entity designation indicates that the term in the entry has special meaning, e.g., as a brand (trademark or service mark), trade name, geographic identifier, or other class of term. A dictionary may include a non-entity entry for a term and one or more entity entries for different entity types. The entries may also include subject-determining-power scores, which indicate the descriptive power of the term with respect to the subject of the dictionary containing the term. The same term may have entries in multiple dictionaries, with a different subject-determining-power score in each. The entity distinctions for a term can then be used in tagging documents and processing retrieval requests.
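The dictionary structure described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Entry` class, the `DICTIONARIES` table, the `lookup` helper, the `"BRAND"` entity type code, and all score values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    term: str
    sdp: float                          # subject-determining-power score (hypothetical value)
    entity_type: Optional[str] = None   # None marks a non-entity entry


# Hypothetical per-subject dictionaries. "apple" appears in two dictionaries,
# and in "computing" it has both a non-entity entry and a BRAND entity entry.
DICTIONARIES = {
    "fruit": [
        Entry("apple", 0.9),
        Entry("orchard", 0.6),
    ],
    "computing": [
        Entry("apple", 0.2),
        Entry("apple", 0.95, entity_type="BRAND"),
        Entry("laptop", 0.7),
    ],
}


def lookup(term, entity_type=None):
    """Return {subject: sdp} for entries matching both the term and the entity type."""
    hits = {}
    for subject, entries in DICTIONARIES.items():
        for entry in entries:
            if entry.term == term and entry.entity_type == entity_type:
                hits[subject] = entry.sdp
    return hits
```

In this sketch, looking up "apple" as a plain term finds entries in both dictionaries with different scores, while looking it up as a `"BRAND"` entity finds only the "computing" entry.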
Publication No.: US9251136(B2) Publication date: 2016.02.02
Application No.: US201314055379 Filing date: 2013.10.16
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: Gattiker Anne Elizabeth; Gebara Fadi H.; Hylick Anthony N.; Kanj Rouwaida N.
Classification: G06F17/30; G06F17/27 Main classification: G06F17/30
Agency: Mitch Harris, Atty at Law, LLC Agents: Mitch Harris, Atty at Law, LLC; Harris Andrew M.; Stock William J.
Main claim: 1. A computer-performed method of organizing a collection of electronic documents, the method comprising:
in a computer system, storing entries in multiple dictionaries separate from and not associated with any particular one of the electronic documents, wherein the multiple dictionaries are data structures stored within the computer system, wherein individual ones of the multiple dictionaries correspond to one of a plurality of different subjects, wherein the entries contain a descriptive term and wherein entries corresponding to an entity contain an entity type code indicating that the entry is an entity entry with respect to a subject of the one of the multiple dictionaries in which the entry is stored and a category of that entity, wherein entity entries are identified as belonging to one or more special categories of terms that have special meaning with respect to their corresponding subjects, and wherein at least some of the descriptive terms are present in two or more of the multiple dictionaries;
responsive to requests within the computer system, accessing the collection of electronic documents by matching terms contained in the electronic documents with descriptive terms in the multiple dictionaries to determine one or more subjects of the electronic documents from subjects of one or more of the multiple dictionaries that contain the descriptive terms matching the terms contained in the electronic documents; and
responsive to the matching detecting a match between a descriptive term in one of the multiple dictionaries, determining whether or not the entry containing the descriptive term has an entity type code;
responsive to determining that the entry has an entity type code, providing an indication of the entity type code in conjunction with an indication of the one or more subjects of the electronic documents along with the one or more subjects determined by the determining in response to the requests;
storing a representation of the one or more subjects determined by the determining one or more subjects of the electronic documents along with the indication of the entity type code in a memory of the computer system as tags associated with the collection of electronic documents wherein the tags describe subjects to which the corresponding electronic documents pertain;
receiving a request to identify one or more of the electronic documents, wherein the request includes at least one search term descriptive of the one or more electronic documents and at least one entity type code identifying the search term as an entity and a type of the entity;
determining whether the at least one search term is present in a given one of the multiple dictionaries;
responsive to determining that the at least one search term is present in the given dictionary, first detecting whether an entry matching the at least one search term is an entity or a non-entity;
responsive to determining that the at least one search term is not present in the given dictionary, second detecting whether or not an entry matching the at least one entity type code is present in the given dictionary;
using a result of the first detecting and a result of the second detecting in determining at least one subject for the at least one search term; and
matching the at least one subject with the tags associated with the collection of electronic documents to obtain the one or more electronic documents to return in response to the request; and
storing a representation of the one or more electronic documents in a memory of the computer system to provide the response to the request to identify the one or more electronic documents.
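The tagging and retrieval steps recited in the claim can be sketched roughly as follows. This is an illustrative reading only, not the patented implementation. The dictionary contents, the `"BRAND"` code, and the function names `tag_document`, `subjects_for_query`, and `retrieve` are all hypothetical. The claim does not say how the results of the first and second detecting are combined, so the sketch assumes one simple policy: prefer dictionaries where the search term has an entry of the requested entity type, and fall back to dictionaries that merely contain some entry of that type.

```python
# Hypothetical dictionaries: subject -> term -> entity type codes.
# A code of None stands for a non-entity entry.
DICTIONARIES = {
    "fruit": {"apple": [None], "orchard": [None]},
    "computing": {"apple": [None, "BRAND"], "laptop": [None]},
    "automotive": {"mustang": ["BRAND"], "engine": [None]},
}


def tag_document(text):
    """Return a set of (subject, entity_code) tags for the words in a document."""
    tags = set()
    for word in text.lower().split():
        for subject, entries in DICTIONARIES.items():
            # Record every entity code the matching term carries (an assumption).
            for code in entries.get(word, []):
                tags.add((subject, code))
    return tags


def subjects_for_query(term, entity_code):
    """Determine subjects for a search term that is flagged with an entity type code."""
    exact, fallback = set(), set()
    for subject, entries in DICTIONARIES.items():
        if term in entries:
            # First detecting: is a matching entry an entity of the requested type?
            if entity_code in entries[term]:
                exact.add(subject)
        else:
            # Second detecting: does any entry in this dictionary carry the code?
            if any(entity_code in codes for codes in entries.values()):
                fallback.add(subject)
    # Assumed combination policy: exact matches win, otherwise use the fallback.
    return exact or fallback


def retrieve(term, entity_code, tag_index):
    """Return ids of documents whose tags include one of the query's subjects."""
    wanted = subjects_for_query(term, entity_code)
    return sorted(
        doc_id
        for doc_id, tags in tag_index.items()
        if any(subject in wanted for subject, _ in tags)
    )
```

A tag index would be built once, for example with `{doc_id: tag_document(text) for doc_id, text in docs.items()}`, and then passed to `retrieve` for each request. A production system would presumably also weigh subject-determining-power scores, which this sketch omits.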
Address: Armonk NY US