Title of invention: Methods and apparatus for identifying tables in digital files
Abstract: A method for identifying a table in a digital file includes extracting lines from a layout of the digital file, wherein the lines comprise horizontal lines and vertical lines. The method also includes identifying intersected line groups, wherein each intersected line group comprises a horizontal line of the extracted horizontal lines and a vertical line of the extracted vertical lines, the horizontal line and the vertical line intersecting with each other. The method further includes determining whether the number of intersected lines in each intersected line group is larger than a first threshold. If yes, the method further includes identifying an area in which the intersected line groups are located as a table area. If no, the method further includes performing vertical projection on characters in the area, and identifying the area as a table area based on results of the vertical projection.
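The line-based branch of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the segment representation (`(y, x1, x2)` for horizontal lines, `(x, y1, y2)` for vertical lines), the reading of an "intersected line group" as a connected set of mutually crossing lines, the function names, and the default `first_threshold=4` are all assumptions made for illustration.

```python
from collections import defaultdict

def intersects(h, v):
    """h = (y, x1, x2) is a horizontal segment; v = (x, y1, y2) is a vertical one."""
    y, x1, x2 = h
    x, y1, y2 = v
    return x1 <= x <= x2 and y1 <= y <= y2

def intersected_line_groups(h_lines, v_lines):
    """Group lines that cross each other (assumed: connected components)."""
    lines = [("h", h) for h in h_lines] + [("v", v) for v in v_lines]
    parent = list(range(len(lines)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n_h = len(h_lines)
    touched = set()  # indices of lines that cross at least one other line
    for i, h in enumerate(h_lines):
        for j, v in enumerate(v_lines):
            if intersects(h, v):
                parent[find(i)] = find(n_h + j)
                touched.update((i, n_h + j))

    groups = defaultdict(list)
    for i in touched:
        groups[find(i)].append(lines[i])
    return list(groups.values())

def table_areas(h_lines, v_lines, first_threshold=4):
    """Return bounding boxes (x_min, y_min, x_max, y_max) of groups whose
    line count is larger than first_threshold (a hypothetical default)."""
    areas = []
    for group in intersected_line_groups(h_lines, v_lines):
        if len(group) > first_threshold:
            xs, ys = [], []
            for kind, seg in group:
                if kind == "h":
                    y, x1, x2 = seg
                    xs += [x1, x2]
                    ys.append(y)
                else:
                    x, y1, y2 = seg
                    xs.append(x)
                    ys += [y1, y2]
            areas.append((min(xs), min(ys), max(xs), max(ys)))
    return areas
```

Under these assumptions, a 3-by-3 grid of ruled lines forms one group of six lines and is reported as a table area, while an isolated line that crosses nothing is ignored. Groups at or below the threshold would be handed to the vertical-projection branch.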
Publication number: US9348848(B2)    Publication date: 2016.05.24
Application number: US201313871862    Filing date: 2013.04.26
Applicant: Peking University Founder Group Co., Ltd.; Beijing Founder Apabi Technology Ltd.    Inventor: Dong Ning; Huang Wenjuan
Classification: G06F17/30; G06F17/24; G06K9/00    Main classification: G06F17/30
Agency: Finnegan, Henderson, Farabow, Garrett & Dunner, LLP    Agent: Finnegan, Henderson, Farabow, Garrett & Dunner, LLP
Main claim: 1. A method for identifying a table in a digital file, comprising:
  extracting lines from a layout of the digital file, the lines comprising horizontal lines and vertical lines;
  identifying intersected line groups, each intersected line group comprising a horizontal line of the extracted horizontal lines and a vertical line of the extracted vertical lines, the horizontal line and the vertical line intersecting with each other; and
  determining whether the number of intersected lines in each intersected line group is larger than a first threshold, and
    if it is determined that the number of intersected lines in each intersected line group is larger than the first threshold, identifying an area in which the intersected line groups are located as a table area;
    if it is determined that the number of intersected lines in each intersected line group is not larger than the first threshold:
      performing vertical projection on characters in the area; and
      identifying the area as a table area based on a result of the vertical projection,
  wherein performing vertical projection on the characters comprises:
    obtaining a distance between any two characters in the area;
    identifying neighboring characters based on the distance;
    combining the neighboring characters to form a text block; and
    performing vertical projection on the combined text block; and
  wherein identifying the area as the table area based on the result of the vertical projection comprises:
    determining an interval range of each projected text block in a column direction;
    determining the number of rows in each column based on the determined interval range; and
    determining whether the number of rows is larger than or equal to 2, and if it is determined that the number of rows is larger than or equal to 2, determining the area to be a table area.
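The vertical-projection branch of the claim can be sketched in Python as follows. This is one possible reading under stated assumptions, not the claimed implementation: characters are given as bounding boxes `(x0, y0, x1, y1)`, characters on the same text row share the same top coordinate, "neighboring" means a horizontal gap of at most a hypothetical `max_gap`, a column is a merged interval of overlapping projections on the x-axis, and the row test is applied to every column. All function names are illustrative.

```python
def combine_into_text_blocks(chars, max_gap=5.0):
    """Merge character boxes on the same row whose horizontal gap is <= max_gap.

    Assumes characters on one text row share the same y0, so sorting by
    (y0, x0) visits them row by row, left to right.
    """
    blocks = []
    for x0, y0, x1, y1 in sorted(chars, key=lambda c: (c[1], c[0])):
        if blocks:
            bx0, by0, bx1, by1 = blocks[-1]
            same_row = y0 < by1 and by0 < y1  # vertical extents overlap
            if same_row and x0 - bx1 <= max_gap:
                blocks[-1] = (bx0, min(by0, y0), max(bx1, x1), max(by1, y1))
                continue
        blocks.append((x0, y0, x1, y1))
    return blocks

def project_columns(blocks):
    """Project blocks onto the x-axis; overlapping intervals form one column.

    Returns a list of [x_start, x_end, row_count] entries, where row_count
    is the number of text blocks that fell into the column's interval.
    """
    columns = []
    for x0, _, x1, _ in sorted(blocks, key=lambda b: b[0]):
        if columns and x0 <= columns[-1][1]:
            columns[-1][1] = max(columns[-1][1], x1)
            columns[-1][2] += 1
        else:
            columns.append([x0, x1, 1])
    return columns

def is_table_area(chars, max_gap=5.0, min_rows=2):
    """True if every projected column holds at least min_rows text blocks."""
    columns = project_columns(combine_into_text_blocks(chars, max_gap))
    return bool(columns) and all(rows >= min_rows for _, _, rows in columns)
```

With this reading, four two-character cells arranged in two rows and two columns yield two projected columns with two rows each, so the area is accepted. A single line of text yields one column with one row and is rejected. The claim states only the "larger than or equal to 2" row test, so any further filtering (for example, a minimum column count) would be an additional design choice.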
Address: Beijing CN