Title Annotating entities using cross-document signals
Abstract Techniques for annotating an entity in a document corpus using cross-document signals. A method includes determining which documents in a document corpus mention an entity of interest, clustering the documents that mention an entity of interest according to a temporal signal, a structural signal and/or a content signal, thereby forming at least one cluster of documents, and annotating at least one document in the at least one cluster of documents by marking each occurrence of the entity in the at least one document.
Publication number US9465865(B2) Publication date 2016.10.11
Application number US201213587011 Filing date 2012.08.16
Applicant International Business Machines Corporation Inventors De Sushovan; Singh Amit K.; Visweswariah Karthik
Classification G06F17/30 Main classification G06F17/30
Agency Ryan, Mason & Lewis, LLP Agent Ryan, Mason & Lewis, LLP
Main claim 1. A method for annotating an entity in a document corpus using cross-document signals, the method comprising: determining which documents in a document corpus of multiple documents mention an entity of interest; clustering the documents that mention an entity of interest according to similarities across a temporal signal, a structural signal and a content signal, thereby forming multiple clusters of documents; annotating each document in the multiple clusters of documents with an annotation by marking each occurrence of the entity in each document; calculating a confidence measure for each occurrence of the entity in each document in each of the multiple clusters, wherein said confidence measure comprises the sum of (i) a measure of similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of the given document via $\sum_{j} x(i,j)\,\mathrm{sim}(i,j) + \sum_{k} \sum_{j \neq NA} x(i,j)\,x(k,j)\,\mathrm{sim}(i,k)$, wherein x(i, j) indicates mention i being assigned to entity j, sim(i, j) indicates a similarity of mention i to entity j, sim(i, k) indicates a document similarity of mention i and mention k, and NA represents a non-applicable designation; creating a graph for each of the multiple clusters, wherein each of multiple nodes of each graph represents a mention of the entity of interest, and wherein said creating comprises placing an edge between each respective pair of nodes that share a mention, wherein an edge weight attributed to each edge is equal to the similarity between the shared mention; removing said annotation from one or more documents in the multiple clusters of documents by removing said marking for each occurrence of the entity in each document that corresponds to a confidence measure below a given value; and outputting (i) each annotated document and (ii) each created graph; wherein said determining, said clustering, said annotating, said calculating, said creating, said removing, and said outputting are carried out by a computer device.
Address Armonk NY US
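
The confidence measure and the removal step recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each mention is assigned to exactly one entity (so x(i, j) is 1 for the assigned entity and 0 otherwise), that the sum over k runs over the other mentions only (k ≠ i), and that missing similarity values count as 0. The names `confidence`, `prune`, `pair_sim`, and the sample data are hypothetical.

```python
NA = "NA"  # non-applicable designation, as named in the claim


def pair_sim(sim_doc, a, b):
    """Look up document similarity for a pair of mentions in either order.

    Assumption: pairs with no stored value have similarity 0.0.
    """
    return sim_doc.get((a, b), sim_doc.get((b, a), 0.0))


def confidence(i, assignment, sim_entity, sim_doc):
    """Confidence of mention i under a one-entity-per-mention assignment."""
    j = assignment[i]
    # First term: sum_j x(i,j) * sim(i,j). Only the assigned entity contributes.
    score = sim_entity.get((i, j), 0.0)
    # Second term: sum_k sum_{j != NA} x(i,j) * x(k,j) * sim(i,k).
    # It is nonzero only for mentions k assigned to the same non-NA entity.
    if j != NA:
        for k, jk in assignment.items():
            if k != i and jk == j:  # k != i is an assumption of this sketch
                score += pair_sim(sim_doc, i, k)
    return score


def prune(assignment, sim_entity, sim_doc, threshold):
    """Return the mentions whose confidence is at or above the threshold."""
    return [
        i for i in assignment
        if confidence(i, assignment, sim_entity, sim_doc) >= threshold
    ]


# Hypothetical sample data: three mentions, two assigned to the same entity.
assignment = {"m1": "IBM", "m2": "IBM", "m3": NA}
sim_entity = {("m1", "IBM"): 0.9, ("m2", "IBM"): 0.4, ("m3", NA): 0.1}
sim_doc = {("m1", "m2"): 0.5}

kept = prune(assignment, sim_entity, sim_doc, threshold=0.5)
```

With this sample data, `m1` and `m2` reinforce each other through their document similarity, while `m3` falls below the threshold and would have its marking removed.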