Title of invention: LIST RECOGNIZING METHOD AND LIST RECOGNIZING SYSTEM
Abstract: A list recognizing method and system, comprising: parsing and analyzing metadata information within an original fixed-layout document, and extracting basic elements within a page; segmenting the basic elements and extracting segmented text lines within the page to obtain fragments; building an undirected graph with respect to the fragments; detecting indent features of a bullet according to features of the basic elements; training a learning model according to the indent features, local features of the fragments, and neighborhood relation features among the fragments, obtaining model parameters, and establishing a list recognizing model; and invoking the list recognizing model to perform list recognition on a document to be recognized, so as to obtain a recognition result. This machine learning method can recognize not only a list but also the contextual relationship between the first line of a list item and its subsequent lines, and thereby supports analyzing and understanding the list layout of a fixed-layout document. The accuracy of list recognition on a fixed-layout document can be improved even when the bullets on the first lines of list items vary.
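The graph-building step in the abstract can be illustrated with a minimal sketch. The record does not say how fragments are represented or which fragments count as neighbors, so everything here is an assumption made for illustration: the `Fragment` bounding-box type, the `build_neighbor_graph` helper, and the rule that two vertically consecutive text lines are connected when the gap between them is at most `max_gap`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A segmented text line with its bounding box (hypothetical representation)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def build_neighbor_graph(fragments, max_gap=20.0):
    """Return an undirected adjacency map: fragment index -> set of neighbor indices.

    Assumed neighborhood rule: fragments that are consecutive in top-to-bottom
    order are linked when the vertical gap between them is at most max_gap.
    """
    graph = {i: set() for i in range(len(fragments))}
    order = sorted(range(len(fragments)), key=lambda i: fragments[i].y0)
    for upper, lower in zip(order, order[1:]):
        gap = fragments[lower].y0 - fragments[upper].y1
        if gap <= max_gap:
            # Add the edge in both directions so the graph is undirected.
            graph[upper].add(lower)
            graph[lower].add(upper)
    return graph


fragments = [
    Fragment("• First item that wraps", 50, 100, 300, 112),
    Fragment("onto a second line", 62, 114, 250, 126),
    Fragment("• Second item", 50, 128, 200, 140),
    Fragment("Unrelated footer", 50, 700, 200, 712),
]
graph = build_neighbor_graph(fragments)
```

In this sketch the three list lines form a chain, while the distant footer stays isolated. A trained model could then use the edges to relate the first line of a list item to its continuation lines.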
Publication number: US2015095022(A1); Publication date: 2015.04.02
Application number: US201314096431; Filing date: 2013.12.04
Applicant: Founder Apabi Technology Limited; Peking University Founder Group Co., Ltd.; Inventors: XU Canhui; TANG Zhi; XU Jianbo; TAO Xin
Classification: G06F17/27; Main classification: G06F17/27
Agency: ; Agent:
Main claim: 1. A list recognizing method, comprising: parsing and analyzing metadata information within an original fixed-layout document, and extracting basic elements within a page; segmenting the basic elements, extracting segmented text lines within the page to obtain fragments; building an undirected graph with respect to the fragments; detecting indent features of a bullet according to features of the basic elements; training a learning model according to the indent features, local features of the fragments and neighborhood relation features among the fragments, obtaining model parameters, and establishing a list recognizing model; and invoking the list recognizing model to perform list recognizing on the required document, so as to get recognition results.
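The indent-feature step of the claim can be sketched as follows. The claim does not define the features, so this sketch assumes a simple set: whether a text line starts with a bullet-like marker, what the marker is, and how far the line's left edge sits from the page's left text margin. The `BULLET_RE` pattern and the `indent_features` function are hypothetical names introduced only for this example.

```python
import re

# Assumed bullet markers: symbol bullets, "1." / "1)" / "(1)", and "a." / "a)" / "(a)".
BULLET_RE = re.compile(r"^\s*(?:[•◦▪\-\*]|\(?\d+[.)]|\(?[a-zA-Z][.)])\s+")


def indent_features(text, left_x, page_left_x):
    """Return a small feature dict for one text line (illustrative only).

    text        -- the line's text content
    left_x      -- x coordinate of the line's left edge
    page_left_x -- x coordinate of the page's left text margin
    """
    match = BULLET_RE.match(text)
    return {
        "has_bullet": match is not None,
        "bullet": match.group(0).strip() if match else "",
        "indent": left_x - page_left_x,
    }


first_line = indent_features("• First item that wraps", 50, 50)
continuation = indent_features("onto a second line", 62, 50)
numbered = indent_features("(2) Second item", 50, 50)
```

Feature dicts like these could be combined with local and neighborhood features as input to a learning model. A continuation line typically shows no bullet but a positive indent, which is the kind of cue that helps link it to the first line of its list item.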
Address: Beijing, CN