Title Clustering of forms from large-scale scanned-document collection
Abstract Techniques for identifying documents sharing common underlying structures in a large collection of documents and processing the documents using the identified structures are disclosed. Images of the document collection are processed to detect occurrences of a predetermined set of image features that are common or similar among forms. The images are then indexed in an image index based on the detected image features. A graph of nodes is built. Nodes in the graph represent images and are connected to nodes representing similar document images by edges. Documents sharing common underlying structures are identified by gathering strongly inter-connected nodes in the graph. The identified documents are processed based at least in part on the resulting clusters.
Publication number US8744183(B2) Publication date 2014.06.03
Application number US201313937118 Filing date 2013.07.08
Applicant Google Inc. Inventors Urbach Shlomo;Fink Eyal;Yadid Tal;Netzer Yuval
Classification G06K9/00 Main classification G06K9/00
Agency Agent
Main claim 1. A computer-implemented method of identifying documents sharing at least one common underlying structure, comprising: detecting, by at least one computer, occurrences of a plurality of predetermined image features in a plurality of document images, wherein at least one of the plurality of predetermined image features is common among instances of a form; indexing, by the at least one computer, the plurality of document images in an image index based on the detected image features; building, by the at least one computer, a graph of connected nodes for the plurality of document images by searching the image index, wherein nodes representing instances of a predefined document type are connected by edges in the graph; and identifying, by the at least one computer, the documents sharing common underlying structures using the graph.
Address Mountain View CA US
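
Illustrative sketch (not part of the patent record). The pipeline named in the abstract and main claim (detect features, index images, build a graph by searching the index, gather connected nodes) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: each image is assumed to be already reduced to a set of integer feature IDs, two images are linked when they share at least `min_shared` features, and "strongly inter-connected nodes" is simplified to plain connected components. The names `build_index`, `build_graph`, `find_clusters`, `cluster_forms`, and `min_shared` are hypothetical.

```python
from collections import defaultdict


def build_index(features_by_image):
    """Inverted index: feature ID -> set of image IDs containing it."""
    index = defaultdict(set)
    for image_id, features in features_by_image.items():
        for feature in features:
            index[feature].add(image_id)
    return index


def build_graph(features_by_image, index, min_shared=2):
    """Link two images when they share at least min_shared features.

    The sharing threshold is an assumed stand-in for the similarity
    test; the patent record does not specify one.
    """
    graph = {image_id: set() for image_id in features_by_image}
    for image_id, features in features_by_image.items():
        shared = defaultdict(int)
        # Query the index for every other image with a common feature.
        for feature in features:
            for other in index[feature]:
                if other != image_id:
                    shared[other] += 1
        for other, count in shared.items():
            if count >= min_shared:
                graph[image_id].add(other)
                graph[other].add(image_id)
    return graph


def find_clusters(graph):
    """Connected components via iterative depth-first search.

    Simplification: the abstract speaks of strongly inter-connected
    nodes, which could call for a stricter density criterion.
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


def cluster_forms(features_by_image, min_shared=2):
    """Run the index -> graph -> cluster steps end to end."""
    index = build_index(features_by_image)
    graph = build_graph(features_by_image, index, min_shared)
    return find_clusters(graph)
```

Calling `cluster_forms` with a dictionary that maps image IDs to feature-ID sets returns a list of sets, one per group of images that are linked directly or transitively. A real system would replace the feature-ID sets with detected image features and would likely apply a stricter clustering criterion than connected components.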