Title of invention: Data Extraction Method, Computer Program Product and System
Abstract: Disclosed is a method of automatically extracting data from a target web page, comprising selecting (302) data in a source web page; determining (304) the respective DOM (document object model) trees of the source and target web pages, and identifying the one or more nodes comprising the selected data in the source web page DOM tree; determining (306) matching paths in the respective DOM trees; for selected data in a node of an unmatched branch of the source web page DOM tree, identifying (308) the nearest matched path in the source web page; identifying (310) the unmatched branch nearest to the corresponding matched path in the target web page; determining (312) whether said identified unmatched branch in the target web page DOM tree comprises a target node matching the selected data node; and if so, extracting (322) data from the target node if the mismatch between the respective unmatched branches does not exceed a predefined threshold. A computer program product and a system implementing this method are also disclosed.
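The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation; several choices are assumptions made only for illustration: pages are well-formed markup parsed with `xml.etree.ElementTree`, a "path" is a tuple of (tag, position among same-tag siblings), the "nearest matched path" is taken to be the longest matched prefix of the selected node's path, and the "mismatch" is one minus the `difflib.SequenceMatcher` ratio of the tag sequences below that prefix. The names `index_paths`, `extract`, `SOURCE`, `TARGET` and `SELECTED` are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def index_paths(root):
    """Map every element's path to the element.

    A path is a tuple of (tag, position among same-tag siblings) pairs.
    """
    paths = {}

    def walk(elem, prefix):
        paths[prefix] = elem
        seen = {}
        for child in elem:
            n = seen.get(child.tag, 0)
            seen[child.tag] = n + 1
            walk(child, prefix + ((child.tag, n),))

    walk(root, ((root.tag, 0),))
    return paths


def extract(source_markup, target_markup, selected_path, threshold=0.5):
    """Return the target text corresponding to selected_path, or None."""
    src = index_paths(ET.fromstring(source_markup))   # step 304
    tgt = index_paths(ET.fromstring(target_markup))
    matched = src.keys() & tgt.keys()                 # step 306

    if selected_path in matched:
        return tgt[selected_path].text

    # Step 308 (assumed reading): nearest matched path = longest matched prefix.
    anchor = selected_path
    while anchor and anchor not in matched:
        anchor = anchor[:-1]
    if not anchor:
        return None
    src_tail = [tag for tag, _ in selected_path[len(anchor):]]
    wanted_tag = src[selected_path].tag

    # Steps 310 and 312: search unmatched target branches under the same anchor
    # for a node with the same tag, keeping the one with the smallest mismatch.
    best, best_mismatch = None, None
    for path, elem in tgt.items():
        if path in matched or path[:len(anchor)] != anchor:
            continue
        if elem.tag != wanted_tag:
            continue
        tgt_tail = [tag for tag, _ in path[len(anchor):]]
        mismatch = 1.0 - SequenceMatcher(None, src_tail, tgt_tail).ratio()
        if best_mismatch is None or mismatch < best_mismatch:
            best, best_mismatch = elem, mismatch

    # Step 322: extract only if the mismatch stays within the threshold.
    if best is not None and best_mismatch <= threshold:
        return best.text
    return None


SOURCE = "<html><body><div><h1>Title A</h1><p><b>Price A</b></p></div></body></html>"
TARGET = "<html><body><div><h1>Title B</h1><span><b>Price B</b></span></div></body></html>"
SELECTED = (("html", 0), ("body", 0), ("div", 0), ("p", 0), ("b", 0))

print(extract(SOURCE, TARGET, SELECTED, threshold=0.6))  # prints "Price B"
```

In this example the selected `b` node sits in a `p` branch that has no counterpart in the target page, so the sketch falls back to the matched `div` path and accepts the `span/b` node because its mismatch (0.5) is below the chosen threshold of 0.6. A stricter threshold such as 0.2 would reject it and return `None`.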
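Publication number: US2012059859(A1)   Publication date: 2012.03.08
Application number: US200913258480   Filing date: 2009.11.25
Applicant: JIAO LI-MEI; XIONG YUHONG   Inventor: JIAO LI-MEI; XIONG YUHONG
Classification: G06F17/30   Main classification: G06F17/30
Agency:   Agent:
Main claim:
Address: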