Title of Invention Method and System for Extracting Post Contents From Forum Web Page
Abstract The present application discloses a method and a system for extracting post contents from a forum web page. The method includes: acquiring a forum web page; converting the forum web page into a DOM tree, wherein the DOM tree at least includes a root node and at least one child node attached to the root node; generating frequent patterns for the root node and the at least one child node in a one-to-one correspondence mode; determining a node corresponding to information contents in the forum web page according to a frequent pattern, satisfying a preset condition, in the frequent patterns; and extracting the information contents in the forum web page from the node corresponding to the information contents in the forum web page based on a preset common sub-tree algorithm.
Publication No. US2014156799(A1) Publication Date 2014.06.05
Application No. US201314093157 Filing Date 2013.11.29
Applicant PEKING UNIVERSITY FOUNDER GROUP CO., LTD.; BEIJING FOUNDER ELECTRONICS CO., LTD.; PEKING UNIVERSITY Inventor Zhang Tao; Yang Jianwu; Yu Xiaoming
Classification No. H04L12/24 Main Classification No. H04L12/24
Agency Agent
Main Claim 1. A method for extracting post contents from a forum web page, comprising: acquiring a forum web page; converting the forum web page into a DOM (Document Object Model) tree, wherein the DOM tree at least comprises a root node and at least one child node attached to the root node; generating frequent patterns for the root node and the at least one child node in a one-to-one correspondence mode; determining a node corresponding to information contents in the forum web page according to a frequent pattern, satisfying a preset condition, in the frequent patterns; and extracting the information contents in the forum web page from the node corresponding to the information contents in the forum web page based on a preset common sub-tree algorithm.
Address Beijing CN
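
The steps recited in the main claim can be illustrated with a minimal Python sketch. This record does not define how a frequent pattern is computed, what the preset condition is, or how the common sub-tree algorithm works, so the sketch substitutes simple stand-ins that are assumptions only: a node's frequent pattern is taken to be its most repeated child signature (tag plus class), the preset condition is a minimum repeat count combined with the largest amount of matched text, and the final extraction step simply collects the text of the matching children instead of running a real common sub-tree comparison. All names (`Node`, `DomBuilder`, `frequent_pattern`, `extract_posts`, `min_repeat`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, cls="", parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.items = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def signature(self):
        # Assumed pattern key: tag name plus class attribute.
        return f"{self.tag}.{self.cls}" if self.cls else self.tag

    def all_text(self):
        parts = [i.all_text() if isinstance(i, Node) else i for i in self.items]
        return " ".join(p for p in parts if p)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class DomBuilder(HTMLParser):
    """Builds a Node tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def frequent_pattern(node):
    """Return (signature, count) for the most repeated child signature."""
    counts = Counter(c.signature() for c in node.children)
    return counts.most_common(1)[0] if counts else (None, 0)


def extract_posts(html, min_repeat=2):
    """Pick the node whose repeated children carry the most text."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0
    for node in builder.root.walk():
        sig, count = frequent_pattern(node)
        if count < min_repeat:  # assumed preset condition, part 1
            continue
        matched = [c for c in node.children if c.signature() == sig]
        score = sum(len(c.all_text()) for c in matched)
        if score > best_score:  # assumed preset condition, part 2
            best, best_score = matched, score
    # Stand-in for the common sub-tree step: return each match's text.
    return [c.all_text() for c in best] if best else []


if __name__ == "__main__":
    page = (
        '<div><div class="post"><p>Hello forum.</p></div>'
        '<div class="post"><p>Welcome aboard.</p></div></div>'
    )
    for post in extract_posts(page):
        print(post)
```

Scoring by matched text length is one way to keep short repeated structures such as navigation menus from winning over the post list; a production extractor would likely need a more robust condition and a true sub-tree alignment between sibling posts.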