Title of invention: Method and system for extracting web page information
Abstract: A method of extracting web page information includes analyzing the document object model (DOM) structure of a sample page to obtain the position of the information to be extracted. The node in the DOM structure that corresponds to this position is taken as the target node. Starting from the target node, relative position information is recorded recursively, moving up the tree until the root node is reached, to create candidate paths; together the candidate paths form a path set. The DOM structure of a page to be extracted is then analyzed, and information is located in it by following the paths in the path set from the root node, which yields a candidate set of extracted nodes. The node with the highest robustness in the candidate set is selected as the final extracted node, and the extracted information is obtained from it.
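The steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not define "relative position information" or "robustness", so the sketch assumes that each step of a path describes a node either by its index among same-tag siblings or by its `class` attribute, and that a path is more robust the more class-based steps it contains. All function names (`parent_map`, `step_variants`, `build_paths`, `locate`, `robustness`, `extract`) and both sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def step_variants(node, parent):
    """Describe node relative to its parent (assumed forms: index or class)."""
    same_tag = [c for c in parent if c.tag == node.tag]
    variants = [("index", node.tag, same_tag.index(node))]
    if node.get("class"):
        variants.append(("class", node.tag, node.get("class")))
    return variants


def build_paths(node, parents):
    """Recurse from node up to the root; return candidate paths, root first."""
    parent = parents.get(node)
    if parent is None:  # reached the root node
        return [[]]
    prefixes = build_paths(parent, parents)
    return [p + [s] for p in prefixes for s in step_variants(node, parent)]


def locate(root, path):
    """Follow one path down from the root; None unless every step is unique."""
    node = root
    for kind, tag, value in path:
        same_tag = [c for c in node if c.tag == tag]
        if kind == "index":
            matches = same_tag[value:value + 1]
        else:
            matches = [c for c in same_tag if c.get("class") == value]
        if len(matches) != 1:
            return None
        node = matches[0]
    return node


def robustness(path):
    """Assumed score: class-based steps survive inserted siblings better."""
    return sum(1 for kind, _, _ in path if kind == "class")


def extract(sample_root, target, page_root):
    """Build the path set on the sample page, then extract from a new page."""
    paths = build_paths(target, parent_map(sample_root))
    candidates = [(robustness(p), locate(page_root, p)) for p in paths]
    candidates = [(score, n) for score, n in candidates if n is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1].text


sample_root = ET.fromstring(
    '<html><body><div class="item"><span class="title">Phone</span>'
    '<span class="price">100</span></div></body></html>'
)
target = sample_root.find(".//span[@class='price']")

# The new page has an extra block and an extra span, so index-only paths drift.
page_root = ET.fromstring(
    '<html><body><div class="ad">Sale</div><div class="item">'
    '<span class="title">Laptop</span><span class="new">New</span>'
    '<span class="price">900</span></div></body></html>'
)

result = extract(sample_root, target, page_root)
print(result)
```

In this example the index-only path fails on the new page, a mixed path lands on the wrong span, and the fully class-based path scores highest and returns the price. A real system would need an HTML-tolerant parser and a more careful robustness measure than this assumed count.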
Publication No.: JP5944985(B2)   Publication date: 2016.07.05
Application No.: JP20140515962   Filing date: 2012.06.13
Applicant: アリババ・グループ・ホールディング・リミテッド (ALIBABA GROUP HOLDING LIMITED)   Inventors: カイ ボーヤン; チアン チー
Classification: G06F17/30   Main classification: G06F17/30
Agency:   Agent:
Principal claim:
Address: