Title of invention METHOD AND SYSTEM FOR PAGE CONSTRUCT DETECTION BASED ON SEQUENTIAL REGULARITIES
Abstract Disclosed is a method and system that generates a page construct structure associated with a sequentially ordered set of pages, each page being characterized by a set of page construct features. N-grams, i.e., sequences of n features, are computed from the page construct features of n contiguous pages, and the n-grams that are repetitive are selected. Pages matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is applied iteratively to this new sequence. The output is an ordered set of trees.
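The n-gram selection described in the abstract can be illustrated with a minimal Python sketch. It rests on several simplifying assumptions that are not part of the record: each page's feature set is reduced to a single label, "repetitive" is read as "an n-gram immediately followed by an identical n-gram", and similarity is exact equality. The function name `repetitive_ngrams` and the sample page labels are hypothetical.

```python
from collections import Counter

def repetitive_ngrams(seq, n):
    """Count n-grams that are immediately followed by an identical n-gram.

    Assumption: one hashable label per page, and "similar" means "equal".
    """
    counts = Counter()
    for i in range(len(seq) - 2 * n + 1):
        gram = tuple(seq[i:i + n])
        if gram == tuple(seq[i + n:i + 2 * n]):
            counts[gram] += 1
    return counts

# Hypothetical page labels: a cover, alternating left/right pages, an index.
pages = ["cover", "L", "R", "L", "R", "L", "R", "index"]
counts = repetitive_ngrams(pages, 2)
most_frequent = counts.most_common(1)[0][0]
```

With these sample labels the left/right bigram is the most frequent repetitive n-gram, so it is the one whose matching pages would be grouped under a new node.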
Publication number US2015178256(A1) Publication date 2015.06.25
Application number US201314140075 Filing date 2013.12.24
Applicant Xerox Corporation Inventor Déjean Harvé
Classification G06F17/22;G06F17/21 Main classification G06F17/22
Agency Agent
Principal claim 1. A computer implemented method of generating a page construct sequential hierarchical structure associated with a sequence of pages associated with a digital version of a document comprising: a) obtaining a sequence of pages representing the document; b) defining a set of page construct features associated with each page of the sequence of pages, each page construct feature defined as a document element which recurrently occurs at regular positions outside a running content of the sequence of pages, the construct feature defined by a feature value type; c) computing a set of feature values associated with the set of page construct features for each page of the sequence; d) generating a set of n-grams from the sequence of pages, an n-gram including an ordered sequence of n page construct features provided by a sequence of n pages; e) electing sequential n-grams from the set of n-grams, the sequential n-grams defined as similar contiguous n-grams; f) selecting the most frequent sequential n-gram from the elected sequential n-grams; and g) generating a new sequence of the pages by matching the selected most frequent sequential n-gram against the sequence of pages associated with the document, replacing matched pages of the sequence of pages with a respective node, and associating the matched pages of the sequence of pages as children of the respective node, the new sequence of pages representing the page construct hierarchical sequential structure associated with the document.
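Steps d) to g) of the claim, applied repeatedly, can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the claimed implementation: each page is reduced to one label, "similar" n-grams are taken to be equal n-grams, n ranges from 2 to a hypothetical `max_n` (so every replacement shortens the sequence and the loop terminates), and matching is greedy from left to right. All names (`Node`, `label_of`, `group_matches`, `build_structure`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: tuple                      # the n-gram this node stands for
    children: list = field(default_factory=list)

def label_of(item):
    # A node is compared through its n-gram label, a page through its own label.
    return item.label if isinstance(item, Node) else item

def most_frequent_sequential(seq, max_n):
    """Steps d-f: most frequent n-gram directly followed by an equal n-gram."""
    labels = [label_of(x) for x in seq]
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(labels) - 2 * n + 1):
            gram = tuple(labels[i:i + n])
            if gram == tuple(labels[i + n:i + 2 * n]):
                counts[gram] += 1
    return counts.most_common(1)[0][0] if counts else None

def group_matches(seq, gram):
    """Step g: replace each match of gram with a node holding the matched items."""
    n, out, i = len(gram), [], 0
    while i < len(seq):
        window = seq[i:i + n]
        if tuple(label_of(x) for x in window) == gram:
            out.append(Node(label=gram, children=window))
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out

def build_structure(pages, max_n=3):
    """Repeat the selection and grouping until no sequential n-gram remains."""
    seq = list(pages)
    while (gram := most_frequent_sequential(seq, max_n)) is not None:
        seq = group_matches(seq, gram)
    return seq

# Hypothetical page labels: two chapters, each made of A/B page pairs and a C page.
trees = build_structure(list("ABABCABABC"))
```

On this sample input the first pass groups each A/B pair under a node, and the second pass groups each "pair, pair, C" run under a higher node, leaving an ordered list of two trees.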
Address Norwalk CT US