Title of invention |
Categorization of multi-page documents by anisotropic diffusion |
Abstract |
A computer implemented system and method are provided for refining category scores for pages of a sequence of document pages that potentially includes document boundaries. The method uses initial category scores provided by a categorizer that considers one page at a time or concatenated pairs of pages (called bipages). The category scores represent the probability that a page belongs to a particular category. The method uses anisotropic diffusion to refine the initial page category scores using the scores of neighboring pages as a function of the probability that there is a boundary between the pages. The method may be performed iteratively. |
Publication number |
US8892562(B2) |
Publication date |
2014.11.18 |
Application number |
US201213558814 |
Filing date |
2012.07.26 |
Applicant |
Xerox Corporation |
Inventors |
Renders Jean-Michel;Ragnet François;Cramet Damien |
Classification |
G06F17/30 |
Main classification |
G06F17/30 |
Agency |
Fay Sharpe LLP |
Agent |
Fay Sharpe LLP |
Principal claim |
1. A computer implemented categorization method comprising:
receiving a sequence of pages to be categorized;
for each of a plurality of pages in the sequence as a current page:
computing a page category score for each of a set of categories for the current page;
computing a first bipage category score for each of the set of categories for a first bipage comprising a preceding page and the current page;
computing a second bipage category score for each of the set of categories for a second bipage comprising a subsequent page and the current page;
computing a first boundary probability that there is a document boundary between the preceding page and the current page; and
computing a second boundary probability that there is a document boundary between the subsequent page and the current page;
with a computer processor, for at least one iteration, for each of the plurality of pages, computing a refined page category score for each of the set of categories for the current page as a function of:
the first bipage category scores weighted by a first weighting factor, the first weighting factor being based on the first boundary probability;
the second bipage category scores weighted by a second weighting factor, the second weighting factor being based on the second boundary probability; and
the page category scores of the current page; and
outputting information based on the refined page category scores for each of the plurality of pages. |
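The refinement step recited in the claim can be illustrated with a minimal Python sketch of a single iteration. The claim only requires that each weighting factor be "based on" the corresponding boundary probability; the specific choices here (weight = 1 minus the boundary probability, and a normalized weighted average) are assumptions made for illustration, not the formula used in the patent. The function name `refine_page_scores` and its argument layout are likewise hypothetical.

```python
from typing import List

Scores = List[List[float]]  # scores[i][c]: score of category c for item i


def refine_page_scores(
    page_scores: Scores,
    bipage_scores: Scores,
    boundary_probs: List[float],
) -> Scores:
    """One refinement pass over a page sequence (illustrative sketch only).

    page_scores[i][c]   : page category score for page i, category c
    bipage_scores[i][c] : score for the bipage made of pages i and i + 1
    boundary_probs[i]   : probability of a document boundary between
                          pages i and i + 1
    """
    n = len(page_scores)
    expected = max(n - 1, 0)
    if len(bipage_scores) != expected or len(boundary_probs) != expected:
        raise ValueError("need one bipage score row and one boundary "
                         "probability per adjacent page pair")

    refined: Scores = []
    for i, current in enumerate(page_scores):
        # Assumed weighting: a likely boundary (probability near 1) gives a
        # weight near 0, so little information crosses that boundary.
        w_prev = 1.0 - boundary_probs[i - 1] if i > 0 else 0.0
        w_next = 1.0 - boundary_probs[i] if i < n - 1 else 0.0
        norm = 1.0 + w_prev + w_next

        row = []
        for c, score in enumerate(current):
            total = score  # the page category score of the current page
            if i > 0:
                total += w_prev * bipage_scores[i - 1][c]  # first bipage
            if i < n - 1:
                total += w_next * bipage_scores[i][c]      # second bipage
            row.append(total / norm)
        refined.append(row)
    return refined


# Made-up example: 3 pages, 2 categories, a certain boundary before page 3.
pages = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
bipages = [[0.8, 0.2], [0.3, 0.7]]
boundaries = [0.0, 1.0]
refined = refine_page_scores(pages, bipages, boundaries)
```

In this made-up example the middle page is pulled toward the first bipage, because no boundary is expected before it, while the last page keeps its own scores, because the boundary in front of it is certain. How the scores are updated across further iterations is not specified in the claim and is left out of the sketch.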
地址 |
Norwalk CT US |