Title of invention: Robust wrappers for web extraction
Abstract: A computer-implemented method to determine a robust wrapper includes developing a model indicative of the temporal history of a document, such as a web document written in a markup language. Based on the developed model, robustness characteristics are determined for a plurality of different wrappers representing associated paths to the data item in a representation of the document. Based on a result of the determining operation, a result wrapper of the plurality of wrappers is provided. The result wrapper has a desired robustness characteristic.
Publication number: US8762829(B2) Publication date: 2014.06.24
Application number: US200812344076 Filing date: 2008.12.24
Applicant: Yahoo! Inc. Inventors: Dalvi Nilesh; Bohannon Philip; Sha Fei
Classification: G06F17/22 Main classification: G06F17/22
Agency: Weaver Austin Villeneuve & Sampson LLP Agent: Weaver Austin Villeneuve & Sampson LLP
Principal claim: 1. A computer-implemented method to determine a robust wrapper representing a data item of a plurality of data items in a document represented by a markup language, comprising: based on archival data representative of a temporal history of the document, developing a model indicative of the temporal history; based on the developed model, determining robustness characteristics for a plurality of different wrappers representing associated paths to the data item in a representation of the document, the robustness characteristics for each wrapper representing a likelihood that the corresponding wrapper will continue to be effective for extracting the data item when the document changes; and based on a result of the determining operation, providing, as a result wrapper, one of the plurality of wrappers that has a desired robustness characteristic; wherein the representation of the document relates to a tree having a plurality of nodes; wherein the temporal history of the document relates at least to an original tree and a plurality of changed trees indicative of trees that appeared during the temporal history of the document that are different from the original tree; wherein the developing operation includes: obtaining a plurality of pairs, each one of the plurality of pairs including a first element (T) and a second element (T′), wherein T is the original tree and T′ is a different one of the plurality of changed trees; determining a plurality of different change operations indicative of changes made to change T into T′ for each one of the plurality of pairs; and associating each one of the plurality of different change operations with a probability value indicative of a probability that the associated change operation is applied to the document.
Address: Sunnyvale, CA, US
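
Illustrative sketch (not part of the patent record): the principal claim describes pairing an original tree T with each changed tree T′, determining the change operations between them, associating each operation with a probability, and then choosing the wrapper most likely to keep working. The Python below is a minimal sketch of that flow under strong simplifying assumptions, and it is not the patented method. Trees are hypothetical `(tag, children)` tuples; the diff is a crude tag-count comparison standing in for a real tree edit script; a wrapper is reduced to the set of tags its path depends on; and robustness is assumed to be the product of the survival probabilities of those tags, treating deletions as independent. All function names and the sample data are invented for illustration.

```python
from collections import Counter

def tag_counts(tree):
    """Count tag occurrences in a (tag, children) tree."""
    tag, children = tree
    counts = Counter([tag])
    for child in children:
        counts += tag_counts(child)
    return counts

def change_operations(original, changed):
    """Return the set of (kind, tag) operations turning `original` into `changed`.

    Assumption: a tag-count difference approximates the edit script.
    """
    before, after = tag_counts(original), tag_counts(changed)
    ops = {("delete", tag) for tag in (before - after)}
    ops |= {("insert", tag) for tag in (after - before)}
    return ops

def operation_probabilities(original, changed_trees):
    """Fraction of (T, T') pairs in which each operation was observed."""
    seen = Counter()
    for changed in changed_trees:
        seen.update(change_operations(original, changed))
    return {op: n / len(changed_trees) for op, n in seen.items()}

def robustness(wrapper_tags, probs):
    """Probability that none of the tags the wrapper relies on is deleted."""
    score = 1.0
    for tag in wrapper_tags:
        score *= 1.0 - probs.get(("delete", tag), 0.0)
    return score

def most_robust(wrappers, probs):
    """Name of the wrapper with the highest robustness score."""
    return max(wrappers, key=lambda name: robustness(wrappers[name], probs))

# Hypothetical archive: the original tree T and four later snapshots T'.
T = ("html", [("body", [("div", [("table", [("td", [])])])])])
snapshots = [
    ("html", [("body", [("table", [("td", [])])])]),                          # div removed
    ("html", [("body", [("p", []), ("div", [("table", [("td", [])])])])]),    # p inserted
    ("html", [("body", [("section", [("table", [("td", [])])])])]),           # div replaced
    ("html", [("body", [("div", [("span", []), ("table", [("td", [])])])])]), # span inserted
]

# Two candidate wrappers for the same td, keyed by an XPath-like name.
wrappers = {
    "/html/body/div/table/td": ("html", "body", "div", "table", "td"),
    "//table/td": ("table", "td"),
}

probs = operation_probabilities(T, snapshots)
best = most_robust(wrappers, probs)
```

In this toy archive the `div` element disappears in two of the four snapshots, so the absolute path that depends on it scores lower than the shorter `//table/td` path, which is the kind of preference the robustness characteristic is meant to capture.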