Title of invention: Categorization of multi-page documents by anisotropic diffusion
Abstract: A computer implemented system and method are provided for refining category scores for pages of a sequence of document pages that potentially includes document boundaries. The method uses initial category scores provided by a categorizer that considers one page at a time or concatenated pairs of pages (called bipages). The category scores represent the probability that a page belongs to a particular category. The method uses anisotropic diffusion to refine the initial page category scores using the scores of neighboring pages as a function of the probability that there is a boundary between the pages. The method may be performed iteratively.
Publication number: US8892562(B2) | Publication date: 2014.11.18
Application number: US201213558814 | Filing date: 2012.07.26
Applicant: Xerox Corporation | Inventors: Renders Jean-Michel; Ragnet François; Cramet Damien
Classification: G06F17/30 | Main classification: G06F17/30
Agency: Fay Sharpe LLP | Agent: Fay Sharpe LLP
Main claim: 1. A computer implemented categorization method comprising: receiving a sequence of pages to be categorized; for each of a plurality of pages in the sequence as a current page: computing a page category score for each of a set of categories for the current page; computing a first bipage category score for each of the set of categories for a first bipage comprising a preceding page and the current page; computing a second bipage category score for each of the set of categories for a second bipage comprising a subsequent page and the current page; computing a first boundary probability that there is a document boundary between the preceding page and the current page; and computing a second boundary probability that there is a document boundary between the subsequent page and the current page; with a computer processor, for at least one iteration, for each of the plurality of pages, computing a refined page category score for each of the set of categories for the current page as a function of: the first bipage category scores weighted by a first weighting factor, the first weighting factor being based on the first boundary probability; the second bipage category scores weighted by a second weighting factor, the second weighting factor being based on the second boundary probability; and the page category scores of the current page; and outputting information based on the refined page category scores for each of the plurality of pages.
Address: Norwalk, CT, US
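
The refinement step recited in the main claim can be illustrated with a minimal sketch. This record does not give the actual formula, so everything specific here is an assumption made for illustration: the function name `refine_page_scores`, the weighting factor chosen as one minus the boundary probability, and the normalized weighted average used to combine the scores. The sketch performs a single refinement pass; the claim only says "at least one iteration" and the record does not say how later iterations reuse the refined scores.

```python
from typing import List

Scores = List[float]  # one score per category


def refine_page_scores(
    page_scores: List[Scores],
    bipage_scores: List[Scores],
    boundary_probs: List[float],
) -> List[Scores]:
    """One hypothetical refinement pass over a page sequence.

    page_scores[i]    : category scores for page i alone
    bipage_scores[i]  : category scores for the bipage (page i, page i + 1)
    boundary_probs[i] : probability of a document boundary between
                        page i and page i + 1
    """
    n = len(page_scores)
    if len(bipage_scores) != max(n - 1, 0) or len(boundary_probs) != max(n - 1, 0):
        raise ValueError("expected n - 1 bipage scores and boundary probabilities")

    refined: List[Scores] = []
    for i in range(n):
        total = list(page_scores[i])  # the page's own scores, weight 1
        norm = 1.0

        # Assumed weighting: a likely boundary (probability near 1)
        # suppresses the contribution of the bipage spanning it.
        if i > 0:  # first bipage: preceding page + current page
            w_prev = 1.0 - boundary_probs[i - 1]
            total = [t + w_prev * b for t, b in zip(total, bipage_scores[i - 1])]
            norm += w_prev
        if i < n - 1:  # second bipage: current page + subsequent page
            w_next = 1.0 - boundary_probs[i]
            total = [t + w_next * b for t, b in zip(total, bipage_scores[i])]
            norm += w_next

        # Normalize so the result stays a weighted average of its inputs.
        refined.append([t / norm for t in total])
    return refined
```

With this choice of weights, a page next to a near-certain boundary keeps scores close to its own, while a page inside a document is pulled toward the scores of the bipages it shares with its neighbors, which is the qualitative behavior the abstract attributes to anisotropic diffusion.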