Title: EXTRACTING DATA CONTENT ITEMS USING TEMPLATE MATCHING
Abstract: Systems and methods for extracting data content items from a web page are provided. A template is created by labeling the data content items of interest on a web page and generating a template Document Object Model (DOM) tree from the labeled page. DOM trees are also generated for additional web pages that contain data content items to be extracted. These DOM trees are compared to the template DOM tree to determine how their nodes align with it. The aligned data content items may then be extracted from the additional web pages and indexed, as desired. Labeling the data content items of interest before generating the template DOM tree allows the desired items to be specified and extracted more accurately from related and/or similarly structured web pages.
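The flow described in the abstract (label a page, build a template DOM tree, align other pages against it, extract the aligned items) can be illustrated with a minimal sketch. This is not the method claimed in the application: the `data-label` attribute used as the labeling convention, the names `Node`, `DomBuilder`, `labeled_paths`, `follow`, and `extract`, and the alignment rule (an exact match on the path of tag names and sibling positions from the root) are all assumptions chosen for illustration. The sketch also assumes well-formed HTML in which every opened element is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One element of a simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes every opened tag is closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Attach stripped text to the innermost open element.
        self.stack[-1].text += data.strip()


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def labeled_paths(node, path=()):
    """Map each label in the template tree to its path from the root.

    A path is a tuple of (tag, index among same-tag siblings) steps.
    """
    paths = {}
    label = node.attrs.get("data-label")  # assumed labeling convention
    if label:
        paths[label] = path
    counts = {}
    for child in node.children:
        index = counts.get(child.tag, 0)
        counts[child.tag] = index + 1
        paths.update(labeled_paths(child, path + ((child.tag, index),)))
    return paths


def follow(node, path):
    """Return the node reached by the path, or None if the trees diverge."""
    for tag, index in path:
        matches = [c for c in node.children if c.tag == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node


def extract(template_html, page_html):
    """Extract the text aligned with each labeled template node."""
    page = parse(page_html)
    result = {}
    for label, path in labeled_paths(parse(template_html)).items():
        node = follow(page, path)
        if node is not None:
            result[label] = node.text
    return result


TEMPLATE = (
    '<html><body><div><h1 data-label="title">Sample</h1>'
    '<span data-label="price">$1</span></div></body></html>'
)
PAGE = (
    "<html><body><div><h1>Blue Widget</h1>"
    "<span>$9.99</span></div></body></html>"
)

if __name__ == "__main__":
    print(extract(TEMPLATE, PAGE))
```

Exact path matching only works when the additional pages share the template's structure node for node. A more tolerant tree alignment would be needed for pages that are merely similar in structure, and the abstract does not specify how that comparison is performed.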
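Publication number: US2009063500(A1)  Publication date: 2009.03.05
Application number: US20070848987  Filing date: 2007.08.31
Applicant: MICROSOFT CORPORATION  Inventors: ZHAI YANHONG;LI YI;QIAN RICHARD;GAO HONG;TAN LEI
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Main claim:
Address: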