Principal claim |
1. A method for annotating an entity in a document corpus using cross-document signals, the method comprising:
determining which documents in a document corpus of multiple documents mention an entity of interest; clustering the documents that mention an entity of interest according to similarities across a temporal signal, a structural signal and a content signal, thereby forming multiple clusters of documents; annotating each document in the multiple clusters of documents with an annotation by marking each occurrence of the entity in each document; calculating a confidence measure for each occurrence of the entity in each document in each of the multiple clusters, wherein said confidence measure comprises the sum of (i) a measure of similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of the given document via $\sum_{j} x(i,j)\,\mathrm{sim}(i,j) + \sum_{k} \sum_{j \neq \mathrm{NA}} x(i,j)\,x(k,j)\,\mathrm{sim}(i,k)$, wherein x(i, j) indicates mention i being assigned to entity j, sim(i, j) indicates a similarity of mention i to entity j, sim(i, k) indicates a document similarity of mention i and mention k, and NA represents a non-applicable designation;
creating a graph for each of the multiple clusters, wherein each of multiple nodes of each graph represents a mention of the entity of interest, and wherein said creating comprises placing an edge between each respective pair of nodes that share a mention, wherein an edge weight attributed to each edge is equal to the similarity between the shared mention; removing said annotation from one or more documents in the multiple clusters of documents by removing said marking for each occurrence of the entity in each document that corresponds to a confidence measure below a given value; and outputting (i) each annotated document and (ii) each created graph; wherein said determining, said clustering, said annotating, said calculating, said creating, said removing, and said outputting are carried out by a computer device. |
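
The confidence measure and the removal step recited in the claim can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `confidence`, `keep_confident`, `assignment`, `entity_sim`, `doc_sim`, and `threshold` are hypothetical, x(i, j) is treated as 1 when mention i is assigned to entity j and 0 otherwise, NA is modelled as `None`, and any similarity missing from the lookup tables is assumed to be 0.

```python
from typing import Dict, List, Optional, Tuple

NA = None  # assumption: stand-in for the claim's non-applicable designation


def confidence(
    i: str,
    assignment: Dict[str, Optional[str]],
    entity_sim: Dict[Tuple[str, Optional[str]], float],
    doc_sim: Dict[Tuple[str, str], float],
) -> float:
    """Evaluate sum_j x(i,j) sim(i,j) + sum_k sum_{j != NA} x(i,j) x(k,j) sim(i,k)."""
    j = assignment[i]
    # First term: x(i, j) is nonzero only for the entity that i is assigned to.
    first = entity_sim.get((i, j), 0.0)
    # Second term: only mentions k assigned to the same non-NA entity contribute.
    second = 0.0
    if j is not NA:
        for k, entity_of_k in assignment.items():
            if entity_of_k == j:
                second += doc_sim.get((i, k), 0.0)
    return first + second


def keep_confident(
    assignment: Dict[str, Optional[str]],
    entity_sim: Dict[Tuple[str, Optional[str]], float],
    doc_sim: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[str]:
    """Return mentions whose confidence is not below the threshold."""
    return [
        i
        for i in assignment
        if confidence(i, assignment, entity_sim, doc_sim) >= threshold
    ]


# Hypothetical data: two mentions assigned to entity "e1", one assigned to NA.
assignment = {"m1": "e1", "m2": "e1", "m3": NA}
entity_sim = {("m1", "e1"): 0.9, ("m2", "e1"): 0.4}
doc_sim = {("m1", "m2"): 0.5, ("m2", "m1"): 0.5}
kept = keep_confident(assignment, entity_sim, doc_sim, threshold=1.0)
```

With these made-up values, `m1` scores roughly 0.9 + 0.5, `m2` roughly 0.4 + 0.5, and `m3` scores 0 because it is assigned to NA, so only `m1` keeps its annotation at a threshold of 1.0.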