Title: Method and system for converting hypertext markup language web page to plain text
Abstract: A method for converting an HTML web page to plain text includes extracting from HTML source code of the HTML web page a portion containing a plurality of character strings and tags, calculating length and position of each character string in the extracted portion so as to find a first predetermined percentage of the character strings with the longest lengths, analyzing a number of position intervals between adjacent ones of the character strings belonging to the first predetermined percentage of the character strings with the longest lengths, labeling the corresponding character strings as belonging to a same block if the number of position intervals is not greater than a second predetermined value so as to find a largest character string block, and deleting the tags in the largest character string block so as to obtain main content of the HTML web page in plain text.
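The steps named in the abstract can be sketched in Python as follows. This is only an illustrative reading of the abstract, not the patented implementation: the function name `extract_main_text`, the parameters `top_fraction` and `max_gap`, their default values, the use of the `<body>` element as the extracted portion, the use of a string's index as its "position", and the choice of total character count as the measure of the "largest" block are all assumptions made for the example.

```python
import re
from html import unescape

# A token is either a tag (<...>) or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def extract_main_text(html, top_fraction=0.2, max_gap=3):
    """Return the presumed main content of an HTML page as plain text.

    top_fraction and max_gap stand in for the abstract's "first
    predetermined percentage" and "second predetermined value";
    the defaults are arbitrary (assumption).
    """
    # Step 1: extract a portion holding strings and tags.
    # Assumption: use the <body> element when there is one.
    m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    portion = m.group(1) if m else html
    # Assumption: script and style contents are not page text.
    portion = re.sub(r"<(script|style)\b.*?</\1>", "", portion,
                     flags=re.S | re.I)

    # Step 2: collect the character strings. A string's position is
    # taken to be its index in this list (assumption).
    strings = [unescape(t).strip()
               for t in TOKEN_RE.findall(portion)
               if not t.startswith("<")]
    strings = [s for s in strings if s]
    if not strings:
        return ""

    # Step 3: positions of the longest top_fraction of the strings.
    k = max(1, round(len(strings) * top_fraction))
    by_length = sorted(range(len(strings)),
                       key=lambda i: len(strings[i]), reverse=True)
    top_positions = sorted(by_length[:k])

    # Step 4: adjacent top strings belong to the same block when the
    # interval between their positions is not greater than max_gap.
    blocks = [[top_positions[0]]]
    for pos in top_positions[1:]:
        if pos - blocks[-1][-1] <= max_gap:
            blocks[-1].append(pos)
        else:
            blocks.append([pos])

    # Step 5: pick the largest block, measured here by total characters
    # between its first and last string (assumption), and emit it
    # without tags.
    def block_size(block):
        return sum(len(s) for s in strings[block[0]:block[-1] + 1])

    best = max(blocks, key=block_size)
    return "\n".join(strings[best[0]:best[-1] + 1])
```

Calling `extract_main_text(page_source)` on a page whose navigation and footer consist of short link texts would, under these assumptions, return only the long paragraphs that sit close together. Both thresholds would need tuning for real pages.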
Publication number: US8196036(B2)  Publication date: 2012.06.05
Application number: US20080031855  Filing date: 2008.02.15
Applicant: HUANG TZU-KUEI; TSAI HONG-YANG; ESOBI, INC.  Inventor: HUANG TZU-KUEI; TSAI HONG-YANG
Classification: G06F17/00  Main classification: G06F17/00
Agency:  Agent:
Main claim:
Address: