Title of invention |
LIST RECOGNIZING METHOD AND LIST RECOGNIZING SYSTEM |
Abstract |
A list recognizing method and system, comprising: parsing and analyzing metadata information within an original fixed-layout document, and extracting basic elements within a page; segmenting the basic elements and extracting segmented text lines within the page to obtain fragments; building an undirected graph with respect to the fragments; detecting indent features of a bullet according to features of the basic elements; training a learning model according to the indent features, local features of the fragments and neighborhood relation features among the fragments, obtaining model parameters, and establishing a list recognizing model; and invoking the list recognizing model to perform list recognition on the required document, so as to obtain recognition results. This machine learning method can recognize not only a list but also the contextual relationship between the first line of a list and its subsequent lines, and ultimately enables analysis and understanding of the list layout of the fixed-layout document. The accuracy of list recognition on a fixed-layout document can be improved even when the bullets on the first lines of lists vary. |
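The graph-building step can be illustrated with a minimal Python sketch. The record does not say how edges are chosen, so the rule used here (link vertically consecutive text-line fragments whose gap is below a threshold) is an assumption, as are the names `Fragment`, `build_neighbor_graph` and `max_gap`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A segmented text line with its bounding box (hypothetical layout)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward in this sketch)
    x1: float  # right edge
    y1: float  # bottom edge


def build_neighbor_graph(fragments, max_gap=20.0):
    """Return an undirected graph as {index: set of neighbor indices}.

    Assumed rule: fragments that follow each other top to bottom are
    linked when the vertical gap between them is at most max_gap.
    """
    order = sorted(range(len(fragments)), key=lambda i: fragments[i].y0)
    graph = {i: set() for i in range(len(fragments))}
    for a, b in zip(order, order[1:]):
        gap = fragments[b].y0 - fragments[a].y1
        if gap <= max_gap:
            # Add the edge in both directions so the graph is undirected.
            graph[a].add(b)
            graph[b].add(a)
    return graph


fragments = [
    Fragment("• First item that wraps", 72, 100, 300, 112),
    Fragment("onto a second line", 90, 114, 260, 126),
    Fragment("• Second item", 72, 128, 200, 140),
    Fragment("Footer", 72, 700, 120, 712),
]
graph = build_neighbor_graph(fragments)
```

In this toy page the three list lines form a chain and the distant footer stays isolated; the edges are where neighborhood relation features between fragments would be computed.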
Application publication number |
US2015095022(A1) |
Application publication date |
2015.04.02 |
Application number |
US201314096431 |
Filing date |
2013.12.04 |
Applicant |
Founder Apabi Technology Limited; Peking University Founder Group Co., Ltd. |
Inventor |
XU Canhui; TANG Zhi; XU Jianbo; TAO Xin |
Classification number |
G06F17/27 |
Main classification number |
G06F17/27 |
Agency |
|
Agent |
|
Principal claim |
1. A list recognizing method, comprising:
parsing and analyzing metadata information within an original fixed-layout document, and extracting basic elements within a page;
segmenting the basic elements, extracting segmented text lines within the page to obtain fragments;
building an undirected graph with respect to the fragments;
detecting indent features of a bullet according to features of the basic elements;
training a learning model according to the indent features, local features of the fragments and neighborhood relation features among the fragments, obtaining model parameters, and establishing a list recognizing model; and
invoking the list recognizing model to perform list recognizing on the required document, so as to get recognition results. |
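As a rough illustration of the bullet indent step, the Python sketch below derives a few per-line features. The claim does not define the features, so the bullet pattern `BULLET_RE`, the function `indent_features`, its `left_margin` parameter and the feature names are all assumptions made for this example.

```python
import re

# Assumed bullet shapes: symbols (•, ◦, ▪, -, *), numbers such as "1." or
# "(12)", and single letters such as "a)"; each must be followed by a space.
BULLET_RE = re.compile(r"^\s*(?:[•◦▪\-\*]|\(?\d+[.)]|\(?[a-zA-Z][.)])\s+")


def indent_features(text, x0, left_margin):
    """Return hypothetical indent features for one text-line fragment.

    text        -- the line's characters
    x0          -- left edge of the line on the page
    left_margin -- left edge of the enclosing text column
    """
    match = BULLET_RE.match(text)
    return {
        "has_bullet": match is not None,
        "indent": x0 - left_margin,
        "bullet_width_chars": len(match.group(0).strip()) if match else 0,
    }


first_line = indent_features("• First item that wraps", 72.0, 72.0)
continuation = indent_features("onto a second line", 90.0, 72.0)
numbered = indent_features("(12) Twelfth item", 72.0, 72.0)
```

Here the first line carries a bullet at zero indent, while its continuation has no bullet and is indented by 18 units; a contrast of this kind is what a trained model could use to tie subsequent lines to the first line of a list item.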
Address |
Beijing CN |