Title of invention: Method and apparatus for structuring documents based on layout, content and collection
Abstract: A method and apparatus are provided for converting a document in a first format, essentially comprising a flat layout structure, into a structured document in hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered (16) into selective sets having similar document attributes. The clustered sets are validated (18) with reference to common textual properties and organizational content shared by documents in the collection. The clustered sets are then categorized (20) into predetermined categories comprising structured elements of the structured document format, and the document content elements are organized (22) by hierarchical dependency from the predetermined categories, wherein the organized document elements comprise the desired structured document format.
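The five steps named in the abstract (fragment, cluster, validate, categorize, organize) can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it is hypothetical, and it assumes that the only layout attributes are a font size and a bold flag, that the only textual property checked during validation is a leading section number, and that the target structure has just two categories, "heading" and "paragraph".

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Element:
    text: str
    font_size: int  # assumed layout attribute
    bold: bool      # assumed layout attribute


@dataclass
class Node:
    category: str
    text: str
    children: list = field(default_factory=list)


# Assumed textual property: headings start with a number such as "1" or "2.3".
HEADING_PATTERN = re.compile(r"^\d+(\.\d+)*\s+\S")


def fragment(lines):
    """Split the flat input into content elements carrying layout attributes."""
    return [Element(text, size, bold) for text, size, bold in lines]


def cluster(elements):
    """Group elements that share the same layout attributes."""
    groups = defaultdict(list)
    for el in elements:
        groups[(el.font_size, el.bold)].append(el)
    return dict(groups)


def validate(clusters):
    """Flag a cluster as heading-like if every member matches the numbering pattern."""
    return {
        key: all(HEADING_PATTERN.match(el.text) for el in members)
        for key, members in clusters.items()
    }


def categorize(clusters, heading_like):
    """Map each layout key to one of the predetermined structural categories."""
    return {key: "heading" if heading_like[key] else "paragraph" for key in clusters}


def organize(elements, categories):
    """Attach each paragraph to the most recent heading, in reading order."""
    root = Node("document", "")
    current = root
    for el in elements:
        category = categories[(el.font_size, el.bold)]
        node = Node(category, el.text)
        if category == "heading":
            root.children.append(node)
            current = node
        else:
            current.children.append(node)
    return root


def structure(lines):
    """Run the five steps in sequence on (text, font_size, bold) tuples."""
    elements = fragment(lines)
    clusters = cluster(elements)
    categories = categorize(clusters, validate(clusters))
    return organize(elements, categories)
```

Calling `structure` on a list of `(text, font_size, bold)` tuples returns a `Node` tree whose top-level children are the headings. A fuller treatment would validate clusters against statistics gathered across the whole document collection, which this sketch does not attempt.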
Publication number: EP1679625(A2)  Publication date: 2006.07.12
Application number: EP20060250073  Filing date: 2006.01.06
Applicant: XEROX CORPORATION  Inventors: DEJEAN, HERVE; LUX, VERONIKA; RIBEAU, SANDRINE
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Main claim:
Address: