Title: Document retrieval using internal dictionary-hierarchies to adjust per-subject match results
Abstract: Techniques for managing big data include retrieval using per-subject dictionaries having multiple levels of sub-classification hierarchy within the subject. Entries may include subject-determining-power (SDP) scores that provide an indication of the descriptive power of the entry term with respect to the subject of the dictionary containing the term. The same term may have entries in multiple dictionaries with different SDP scores in each of the dictionaries. A retrieval request for one or more documents containing search terms descriptive of the one or more documents can be processed by identifying a set of candidate documents tagged with subjects, i.e., identifiers of per-subject dictionaries having entries corresponding to a search term, then using affinity values to adjust the aggregate score for the terms in the dictionaries. Documents are then selected for best match to the subject based on the adjusted scores. Alternatively, the adjustment may be performed after selecting the documents by re-ordering them according to adjusted scores.
Publication number: US9235638(B2) Publication date: 2016.01.12
Application number: US201314077305 Filing date: 2013.11.12
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: Gattiker Anne Elizabeth; Gebara Fadi H.; Hylick Anthony N.; Kanj Rouwaida N.
Classification: G06F17/30 Main classification: G06F17/30
Agency: Mitch Harris, Atty at Law, LLC Agent: Mitch Harris, Atty at Law, LLC; Harris Andrew M.; Stock William J.
Principal claim: 1. A computer-performed method of retrieving documents from a collection of documents, the method comprising: receiving, by at least one processor within a computer system, a search request including search terms descriptive of documents in the collection; first matching, by the at least one processor, the search terms to descriptive terms of entries in multiple dictionaries to determine multiple subjects specified by the search terms, the multiple dictionaries each having an associated unique subject, wherein entries in the multiple dictionaries contain descriptive terms, wherein at least some of the descriptive terms are present in two or more of the multiple dictionaries, and wherein the first matching generates scores indicating the relative strength of a relationship between the search terms and the multiple subjects; second matching, by the at least one processor, multiple subjects determined by the first matching to documents in the collection of documents to obtain a subset of the collection of documents that are relevant to multiple subjects; within the computer system, maintaining records of hierarchy of classification for the entries within the multiple dictionaries for the multiple dictionaries, wherein the hierarchy records encode or store affinity values showing a strength of relationship between the entries within a corresponding dictionary; adjusting, by the at least one processor, a result of the first matching or the second matching using the records of hierarchy of classification for the multiple subjects determined by the first matching; and returning, by the at least one processor, at least a portion of the subset of the collection of documents obtained by the second matching.
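Illustrative sketch (not part of the patent text): the steps of the claim can be pictured with a small Python example. Everything in it is an assumption made for illustration: the names DICTIONARIES, HIERARCHY, DOCUMENT_TAGS, affinity, score_subjects and retrieve, the sample data, the choice to derive affinity from the shared prefix of two sub-classification paths, and the adjustment formula (aggregate SDP score multiplied by one plus the mean pairwise affinity of the matched terms). The record does not specify any of these details.

```python
from itertools import combinations

# Hypothetical per-subject dictionaries: term -> SDP score for that subject.
# "virus" and "host" appear in both dictionaries with different scores.
DICTIONARIES = {
    "medicine": {"virus": 0.9, "vaccine": 0.8, "host": 0.3},
    "computing": {"virus": 0.6, "host": 0.5, "firewall": 0.9},
}

# Hypothetical hierarchy records: term -> sub-classification path in its dictionary.
HIERARCHY = {
    "medicine": {
        "virus": ("infectious", "pathogens"),
        "vaccine": ("infectious", "prevention"),
        "host": ("biology", "organisms"),
    },
    "computing": {
        "virus": ("security", "malware"),
        "firewall": ("security", "network"),
        "host": ("networking", "nodes"),
    },
}

# Hypothetical collection: document id -> subjects it is tagged with.
DOCUMENT_TAGS = {
    "doc1": {"medicine"},
    "doc2": {"computing"},
    "doc3": {"medicine", "computing"},
    "doc4": {"cooking"},
}


def affinity(path_a, path_b):
    """Assumed affinity: fraction of leading hierarchy levels two entries share."""
    longest = max(len(path_a), len(path_b))
    if longest == 0:
        return 0.0
    shared = 0
    for level_a, level_b in zip(path_a, path_b):
        if level_a != level_b:
            break
        shared += 1
    return shared / longest


def score_subjects(search_terms):
    """First matching plus adjustment: subject -> affinity-adjusted aggregate score."""
    terms = list(dict.fromkeys(search_terms))  # drop duplicates, keep order
    scores = {}
    for subject, entries in DICTIONARIES.items():
        matched = [term for term in terms if term in entries]
        if not matched:
            continue
        aggregate = sum(entries[term] for term in matched)
        paths = HIERARCHY[subject]
        pairs = list(combinations(matched, 2))
        mean_affinity = (
            sum(affinity(paths[a], paths[b]) for a, b in pairs) / len(pairs)
            if pairs
            else 0.0
        )
        # Assumed adjustment: boost subjects whose matched terms sit close
        # together in that subject's internal hierarchy.
        scores[subject] = aggregate * (1.0 + mean_affinity)
    return scores


def retrieve(search_terms, limit=10):
    """Second matching: rank documents tagged with any matched subject."""
    scores = score_subjects(search_terms)
    ranked = []
    for doc_id, tags in DOCUMENT_TAGS.items():
        total = sum(scores.get(subject, 0.0) for subject in tags)
        if total > 0:
            ranked.append((doc_id, total))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

With this sample data, retrieve(["virus", "vaccine"]) ranks doc3 first, then doc1, then doc2, and omits doc4. Both search terms match the "medicine" dictionary and share the top hierarchy level "infectious", so that subject's aggregate score is boosted, while "computing" matches only one term and keeps its unadjusted score. The alternative described in the abstract, adjusting after document selection, would apply the same boost when re-ordering an already selected subset.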
Address: Armonk, NY, US