Title of invention: Table of contents extraction with improved robustness
Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of contents entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of contents entries. During the identifying, the number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to the distribution of the linked text fragments.
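The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not specify the linking measure, the reduction criterion, or the validation criterion, so the sketch assumes string similarity for linking, a maximum fragment length as the reduction criterion, and a minimum spread of linked fragments across the document as the validation criterion. All function names, parameters, and thresholds (`find_toc`, `min_sim`, `max_len`, `min_spread`) are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_run(fragments, start, end, candidates, min_sim):
    """Link each entry in fragments[start:end] to a later candidate fragment.

    Links must lie after the run and appear in ascending order.
    Returns a list of (entry_index, linked_index) pairs, or None on failure.
    """
    links = []
    last = end - 1
    for i in range(start, end):
        target = next(
            (j for j in candidates
             if j > last and similarity(fragments[i], fragments[j]) >= min_sim),
            None,
        )
        if target is None:
            return None
        links.append((i, target))
        last = target
    return links


def validate(links, n_fragments, min_spread):
    """Assumed validation criterion: linked fragments must be spread out."""
    targets = [t for _, t in links]
    return (targets[-1] - targets[0]) / n_fragments >= min_spread


def find_toc(fragments, min_sim=0.8, max_len=80, min_spread=0.3):
    """Return the longest validated contiguous run of entries with its links."""
    n = len(fragments)
    # Assumed reduction criterion: long fragments are body text, not headings.
    candidates = [i for i in range(n) if len(fragments[i]) <= max_len]
    best = []
    for start in range(n):
        for end in range(start + 2, n + 1):
            links = link_run(fragments, start, end, candidates, min_sim)
            if links is None:
                break  # a longer run from the same start cannot succeed either
            if len(links) > len(best) and validate(links, n, min_spread):
                best = links
    return best


sample = [
    "My Report",
    "Introduction",
    "Methods",
    "Results",
    "Introduction",
    "This report describes the experiment in detail. " * 3,
    "Methods",
    "We measured every sample twice and averaged the values. " * 3,
    "Results",
    "The averaged values agree with the predicted ones. " * 3,
]
toc = find_toc(sample)
for entry, linked in toc:
    print(f"entry {entry} ({sample[entry]!r}) -> fragment {linked}")
```

In this sketch, lowering `max_len` shrinks the candidate set and speeds up linking, at the risk of discarding long headings; raising `min_spread` rejects groupings whose linked fragments cluster in one part of the document.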
Publication number: EP1826684(A1)  Publication date: 2007.08.29
Application number: EP20070102800  Filing date: 2007.02.21
Applicant: XEROX CORPORATION  Inventors: MEUNIER, JEAN-LUC; DEJEAN, HERVE
Classification: G06F17/27  Main classification: G06F17/27
Agency:  Agent:
Principal claim:
Address: