Title of Invention |
Method and System for Extracting Post Contents From Forum Web Page |
Abstract |
The present application discloses a method and a system for extracting post contents from a forum web page. The method includes: acquiring a forum web page; converting the forum web page into a DOM tree, wherein the DOM tree at least includes a root node and at least one child node attached to the root node; generating frequent patterns for the root node and the at least one child node in a one-to-one correspondence mode; determining a node corresponding to information contents in the forum web page according to a frequent pattern, satisfying a preset condition, in the frequent patterns; and extracting the information contents in the forum web page from the node corresponding to the information contents in the forum web page based on a preset common sub-tree algorithm. |
Application Publication Number |
US2014156799(A1) |
Application Publication Date |
2014.06.05 |
Application Number |
US201314093157 |
Filing Date |
2013.11.29 |
Applicant |
PEKING UNIVERSITY FOUNDER GROUP CO., LTD.; BEIJING FOUNDER ELECTRONICS CO., LTD.; PEKING UNIVERSITY |
Inventor |
Zhang Tao; Yang Jianwu; Yu Xiaoming |
Classification Number |
H04L12/24 |
Main Classification Number |
H04L12/24 |
Agency |
|
Agent |
|
Principal Claim |
1. A method for extracting post contents from a forum web page, comprising:
acquiring a forum web page; converting the forum web page into a DOM (Document Object Model) tree, wherein the DOM tree at least comprises a root node and at least one child node attached to the root node; generating frequent patterns for the root node and the at least one child node in a one-to-one correspondence mode; determining a node corresponding to information contents in the forum web page according to a frequent pattern, satisfying a preset condition, in the frequent patterns; and extracting the information contents in the forum web page from the node corresponding to the information contents in the forum web page based on a preset common sub-tree algorithm.
|
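The steps recited in claim 1 can be illustrated with a minimal Python sketch. This record does not define "frequent pattern", the "preset condition", or the "preset common sub-tree algorithm", so the sketch substitutes simple stand-ins and should not be read as the claimed method. It assumes a node's frequent pattern is its most repeated child signature (tag plus class), that the preset condition is a minimum repeat count, and that the common sub-tree is the set of text-bearing paths shared by every repeated child. All names (Node, DomBuilder, find_container, extract_posts, PAGE) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag, class attribute, parent, children, direct text."""

    def __init__(self, tag, cls="", parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.children, self.text = [], []

    def signature(self):
        return f"{self.tag}.{self.cls}" if self.cls else self.tag


class DomBuilder(HTMLParser):
    """Step 2: convert the page into a tree with a root and child nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def frequent_pattern(node):
    """Step 3 (assumed definition): most repeated child signature and its count."""
    counts = Counter(child.signature() for child in node.children)
    return counts.most_common(1)[0] if counts else (None, 0)


def find_container(root, min_repeat=2):
    """Step 4 (assumed condition): node whose pattern repeats most, at least min_repeat times."""
    patterns = [(node, frequent_pattern(node)) for node in walk(root)]
    candidates = [(n, p) for n, p in patterns if p[1] >= min_repeat]
    if not candidates:
        return None, None
    node, (sig, _count) = max(candidates, key=lambda item: item[1][1])
    return node, sig


def text_paths(node, prefix=()):
    """Map each signature path inside `node` to the first text found at that path."""
    path = prefix + (node.signature(),)
    found = {path: " ".join(node.text)} if node.text else {}
    for child in node.children:
        for p, t in text_paths(child, path).items():
            found.setdefault(p, t)
    return found


def extract_posts(html, min_repeat=2):
    """Step 5 (assumed algorithm): keep only the paths common to every repeated child."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    container, sig = find_container(builder.root, min_repeat)
    if container is None:
        return []
    records = [text_paths(c) for c in container.children if c.signature() == sig]
    common = set.intersection(*(set(r) for r in records))
    ordered = [p for p in records[0] if p in common]
    return [{"/".join(p): r[p] for p in ordered} for r in records]


# Step 1 stand-in: a hard-coded page instead of a fetched forum page.
PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a></div>
  <div class="thread">
    <div class="post"><span class="author">alice</span><p>First post</p></div>
    <div class="post"><span class="author">bob</span><p>Second post</p><em>edited</em></div>
    <div class="post"><span class="author">carol</span><p>Third post</p></div>
  </div>
</body></html>
"""

for post in extract_posts(PAGE):
    print(post)
```

In this toy page the thread container is selected because its child signature div.post repeats three times, and the "edited" marker is dropped because it does not appear in every post. A real forum page would likely need a more tolerant pattern definition than exact tag-and-class matching.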
Address |
Beijing CN |