Title of invention: Method and apparatus for identifying logical blocks of text in a document
Abstract: A computer implemented method and apparatus for identifying logical blocks of text in a document where document structure information is absent. The method comprises accessing a document, wherein the document comprises a plurality of words; identifying word information for each word in the plurality of words; creating a plurality of text lines based on the word information, wherein each text line in the plurality of text lines comprises one or more words in the plurality of words; and creating a plurality of text blocks derived from the plurality of text lines.
Publication number: US9223756(B2); Publication date: 2015.12.29
Application number: US201313800242; Filing date: 2013.03.13
Applicant: ADOBE SYSTEMS INCORPORATED; Inventor: Agrawal Ram Bhushan
Classification: G06F17/21; Main classification: G06F17/21
Agency: Keller Jolley Preece; Agent: Keller Jolley Preece
Main claim: 1. A computer implemented method comprising: accessing a document, wherein the document comprises a plurality of words; calculating a threshold horizontal gap between words based only on the average vertical height of the plurality of words, wherein horizontal and vertical are orthogonal directions within the document; creating a plurality of text lines from the plurality of words based on the threshold horizontal gap between words based on the average vertical height of the plurality of words, wherein each text line in the plurality of text lines comprises one or more words in the plurality of words; and creating a plurality of text blocks derived from the plurality of text lines.
Address: San Jose CA US
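
Illustrative sketch (not part of the patent record): the steps of claim 1 can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. The record does not say how the threshold is derived from the average word height, nor how lines are merged into blocks, so the `factor` multiplier, the vertical-overlap test, and the line-merging rule in `build_blocks` are all hypothetical choices made for illustration. The `Word` type, its bounding-box fields, and every function name are likewise invented here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """A word with its bounding box (hypothetical type; y grows downward)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def threshold_gap(words, factor=1.0):
    # Threshold horizontal gap derived only from the average word height.
    # The multiplier `factor` is an assumption; the claim gives no formula.
    return factor * mean(w.y1 - w.y0 for w in words)


def overlaps(a0, a1, b0, b1):
    # True when the intervals [a0, a1] and [b0, b1] share positive length.
    return min(a1, b1) - max(a0, b0) > 0


def build_lines(words, gap):
    # Visit words left to right; append a word to the first line whose last
    # word overlaps it vertically and lies within the threshold gap.
    lines = []
    for w in sorted(words, key=lambda w: (w.x0, w.y0)):
        for line in lines:
            last = line[-1]
            if overlaps(last.y0, last.y1, w.y0, w.y1) and w.x0 - last.x1 <= gap:
                line.append(w)
                break
        else:
            lines.append([w])
    # Order lines top to bottom, then left to right.
    return sorted(lines, key=lambda ln: (min(w.y0 for w in ln), ln[0].x0))


def bbox(line):
    # Bounding box (x0, y0, x1, y1) enclosing all words of a line.
    return (min(w.x0 for w in line), min(w.y0 for w in line),
            max(w.x1 for w in line), max(w.y1 for w in line))


def build_blocks(lines, line_gap):
    # Assumed merging rule: a line joins a block when it overlaps the block's
    # last line horizontally and sits within `line_gap` below it.
    blocks = []
    for line in lines:
        x0, y0, x1, _ = bbox(line)
        for block in blocks:
            bx0, _, bx1, by1 = bbox(block[-1])
            if overlaps(bx0, bx1, x0, x1) and y0 - by1 <= line_gap:
                block.append(line)
                break
        else:
            blocks.append([line])
    return blocks


# Made-up sample page: two columns plus a distant footer.
words = [
    Word("Hello", 0, 0, 40, 10), Word("world", 45, 0, 85, 10),
    Word("Right", 200, 0, 240, 10), Word("side", 245, 0, 280, 10),
    Word("second", 0, 12, 50, 22), Word("line", 55, 12, 85, 22),
    Word("Footer", 0, 100, 50, 110),
]
gap = threshold_gap(words)
lines = build_lines(words, gap)
blocks = build_blocks(lines, gap)
for block in blocks:
    print([" ".join(w.text for w in ln) for ln in block])
```

Reusing the same threshold for the vertical gap between lines is a convenience of this sketch; a real implementation would likely tune the two gaps separately.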