Title of invention: Method and system for constructing a document redundancy graph
Abstract: A system and method for constructing a document redundancy graph with respect to a document set. The redundancy graph can be constructed with a node for each paragraph associated with the document set such that each node in the redundancy graph represents a unique cluster of information. The nodes can be linked in an order with respect to the information provided in the document set and bundles of redundant information from the document set can be mapped to individual nodes. A data structure (e.g., a hash table) of a paragraph identifier associated with a probability value can be constructed for eliminating inconsistencies with respect to node redundancy. Additionally, a sequence of unique nodes can also be integrated into the graph construction process. The nodes can be connected to the paragraphs associated with the document set via a hyperlink and/or via a label with respect to each node.
Publication number: US8914720(B2) Publication date: 2014.12.16
Application number: US200912533901 Filing date: 2009.07.31
Applicant: Xerox Corporation Inventor: Harrington Steven J.
Classification: G06F17/00;G06F17/30;G06F17/22 Main classification: G06F17/00
Agency: Ortiz & Lopez, PLLC Agents: Lopez Kermit D.;Ortiz Luis M.;Ortiz & Lopez, PLLC
Principal claim: 1. A method for constructing a document redundancy graph, said method comprising: representing each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph; providing said each paragraph with a unique paragraph identifier; constructing a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph; merging said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and combining said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.
Address: Norwalk CT US
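
Illustrative sketch (not part of the patent record): the merging step recited in claim 1 can be pictured as follows, under several assumptions. Each paragraph starts as its own node, candidate matches are held in a hash table keyed by a pair of paragraph identifiers with a probability value, and matches are processed in order of decreasing certainty. The record does not spell out what counts as an "inconsistency"; this sketch assumes it means a merge that would make one node both precede and follow another in reading order. The names build_redundancy_graph, documents and matches are hypothetical, and reachability is recomputed on demand rather than stored in a table as the claim describes.

```python
from collections import defaultdict


def build_redundancy_graph(documents, matches):
    """Merge matching paragraphs into shared nodes (illustrative only).

    documents: dict of doc_id -> list of paragraph ids in reading order.
    matches:   dict of (pid_a, pid_b) -> probability of common content.
    Returns (bundles, edges): bundles maps a node id to the paragraph ids
    it represents; edges is a set of (node, next_node) ordering links.
    """
    # Every paragraph starts as its own node (union-find parent table).
    parent = {pid: pid for paras in documents.values() for pid in paras}

    def find(pid):
        # Follow parent links to the node that currently represents pid.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    def successors(node):
        # Nodes that directly follow `node` in any document.
        out = set()
        for paras in documents.values():
            for a, b in zip(paras, paras[1:]):
                if find(a) == node:
                    out.add(find(b))
        return out

    def reachable(src, dst):
        # Depth-first search along the reading-order links.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors(node))
        return False

    # Most certain matches first, so a weaker conflicting match loses.
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _prob in ranked:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same node
        if reachable(ra, rb) or reachable(rb, ra):
            continue  # assumed inconsistency: merge would create a cycle
        parent[rb] = ra

    bundles = defaultdict(list)
    for pid in parent:
        bundles[find(pid)].append(pid)
    edges = {
        (find(a), find(b))
        for paras in documents.values()
        for a, b in zip(paras, paras[1:])
    }
    return dict(bundles), edges
```

For example, with two documents whose paragraphs are matched in crossed order, only the more certain match is merged:

```python
docs = {"d1": ["a1", "a2"], "d2": ["b1", "b2"]}
pairs = {("a1", "b2"): 0.9, ("a2", "b1"): 0.7}
bundles, edges = build_redundancy_graph(docs, pairs)
```

Here a1 and b2 share a node, while the 0.7 match is rejected because b1 already precedes that node and a2 already follows it.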