Title of invention: Document processing method, system and medium
Abstract: A technique for extracting meaningful text blocks from a document in which tables, itemized lists, multiple columns, etc. are arbitrarily laid out. A document that is laid out using blanks or the like is input, and symbols associated with spatial coordinates in the document are acquired. Runs of consecutive characters of the same type are extracted from the symbols to generate tokens and spaces. Streams are generated from spaces that are consecutive in the column direction, and text blocks are generated from the streams and tokens. Links are generated between the text blocks to form a document graph. The validity of each connection (link) between text blocks in the document graph is evaluated using a language model, and the text blocks are merged if the connection is valid.
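The first two steps of the abstract (grouping same-type characters into tokens and spaces, then finding spaces that line up in the column direction) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names `char_type`, `tokenize`, and `blank_columns`, the four character classes, and the rule that a column counts only when it is blank on every row are all assumptions made for illustration. Text-block construction, the document graph, and the language-model check are not covered.

```python
from itertools import groupby


def char_type(ch):
    """Coarse character class (hypothetical categories, chosen for this sketch)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def tokenize(lines):
    """Split each line into runs of same-type characters.

    Returns tuples (row, start_col, end_col, kind, text), so every token
    and every space keeps its spatial coordinates.
    """
    runs = []
    for row, line in enumerate(lines):
        col = 0
        for kind, group in groupby(line, key=char_type):
            text = "".join(group)
            runs.append((row, col, col + len(text), kind, text))
            col += len(text)
    return runs


def blank_columns(lines):
    """Columns that are blank on every row.

    A simplified stand-in (assumption) for streams of spaces that are
    consecutive in the column direction.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    return [c for c in range(width) if all(row[c] == " " for row in padded)]


if __name__ == "__main__":
    table = ["Name   Qty", "Apple  12 ", "Pear   7  "]
    print(tokenize(table[:1]))
    print(blank_columns(table))
```

In this sketch, the blank columns mark a likely gutter between two table columns; a fuller version would track streams that span only some rows and use them, together with the tokens, to delimit text blocks.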
Publication number: US7046847(B2) Publication date: 2006.05.16
Application number: US20010891080 Filing date: 2001.06.25
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: HURST MATTHEW F.; NASUKAWA TETSUYA
Classification: G06F17/21; G06K9/34; G06F17/20; G06F17/30; G06T11/60 Main classification: G06F17/21
Agency: Agent:
Main claim:
Address: