Title of invention: Domain constraint based data record extraction
Abstract: Embodiments for a Mining Data Records based on Anchor Trees (MiBAT) process are disclosed. In accordance with at least one embodiment, the MiBAT process extracts data records containing user-generated content from web documents. The web document is processed into a Document Object Model (DOM) tree in which sub-trees of the DOM tree represent the data records of the web document. Domain constraints are used to locate structured portions of the DOM tree. Anchor trees are then located as being sets of sibling sub-trees which contain the domain constraints. The anchor trees are then used to determine a record boundary (i.e., the start offset and length) of the data records. Finally, the data records are extracted based on the anchor trees and the record boundaries.
Publication number: US8983980(B2) Publication date: 2015.03.17
Application number: US201012945517 Filing date: 2010.11.12
Applicant: Microsoft Technology Licensing, LLC Inventors: Song Xinying;Cao Yunbo;Lin Chin-Yew
Classification: G06F17/30;G06F7/00;G06F17/22 Main classification: G06F17/30
Agency: Agents: Choi Dan;Yee Judy;Minhas Micky
Principal claim: 1. A system comprising: one or more processors; memory; a parser module maintained in the memory and executable by the one or more processors to process a document into a Document Object Model (DOM) tree that includes at least two nodes corresponding to user-generated content within data records of the document; and a record extraction module maintained in the memory and executable by the one or more processors to locate two or more anchor trees in the DOM tree as being a first set of sibling sub-trees in the DOM tree that each include a domain constraint associated with a structured portion of individual ones of the data records, determine a minimal distance as a minimum among distances between any two anchor trees of the two or more anchor trees determined based at least in part on a number of sibling sub-trees in the DOM tree that are between the any two anchor trees, determine a record boundary based at least in part on the minimal distance, the record boundary being a second set of sibling sub-trees in the DOM tree around individual ones of the anchor trees that include at least a portion of the user-generated content, and extract the data records around at least one of the anchor trees based at least in part on the record boundary.
Address: Redmond WA US
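
The steps named in the abstract and principal claim (locate anchor trees among sibling sub-trees, take the minimal distance between them, then cut records around each anchor) can be illustrated with a minimal Python sketch. It is not the patented implementation. The `Node` class, the date regular expression used as the domain constraint, the choice to measure distance as the difference in sibling index, and the helper names `anchor_indices`, `minimal_distance`, and `extract_records` are all assumptions made for illustration; how a real system would choose the start offset and length is outside this sketch.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node (not a real DOM API)."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def all_text(self) -> str:
        # Concatenate this node's text with the text of its whole sub-tree.
        return " ".join([self.text] + [c.all_text() for c in self.children])


# Assumed domain constraint: each record carries a post date like 2010-11-12.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def anchor_indices(parent: Node) -> list[int]:
    """Indices of sibling sub-trees under `parent` that satisfy the constraint."""
    return [i for i, c in enumerate(parent.children) if DATE.search(c.all_text())]


def minimal_distance(anchors: list[int]) -> int:
    """Smallest distance between any two anchors.

    Assumption: distance is the difference in sibling index, i.e. the number
    of sub-trees between the two anchors plus one. With sorted indices the
    minimum over all pairs is reached by some adjacent pair.
    """
    if len(anchors) < 2:
        raise ValueError("need at least two anchor trees")
    return min(b - a for a, b in zip(anchors, anchors[1:]))


def extract_records(parent: Node, anchors: list[int],
                    offset: int, length: int) -> list[list[Node]]:
    """Cut one record per anchor: `length` siblings starting `offset` from it.

    Expects offset <= 0 < offset + length so each record contains its anchor.
    """
    records = []
    for a in anchors:
        start = max(0, a + offset)
        records.append(parent.children[start:a + offset + length])
    return records


# Toy page: three posts, each a date header, a body paragraph, and a separator.
page = Node("body", children=[
    Node("div", "Posted 2010-11-10"), Node("p", "First post"), Node("hr"),
    Node("div", "Posted 2010-11-11"), Node("p", "Second post"), Node("hr"),
    Node("div", "Posted 2010-11-12"), Node("p", "Third post"),
])

anchors = anchor_indices(page)
dist = minimal_distance(anchors)
records = extract_records(page, anchors, offset=0, length=dist)
for rec in records:
    print([n.tag for n in rec])
```

In this toy page the minimal distance bounds the record length, so no record can swallow a neighbouring anchor. The last record is shorter because the page ends before a third separator appears.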