Title of invention: System of generating new schema based on selective HTML elements
Abstract: The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
Publication number: US9460231(B2) Publication date: 2016.10.04
Application number: US201113637483 Filing date: 2011.03.28
Applicant: BRITISH TELECOMMUNICATIONS public limited company Inventors: Thompson Simon G;Nguyen Duong T;Thint Marcus Alfred;Gharib Hamid
Classification: G06F17/30;G06F17/00;G06F17/22 Main classification: G06F17/30
Agency: Nixon & Vanderhye P.C. Agent: Nixon & Vanderhye P.C.
Main claim: 1. A method of automatically generating a mark-up language schema, the method comprising the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
Address: London GB
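
Illustrative sketch (not part of the patent record): steps a) to e) of claim 1 describe a generate, compare, and retry loop, which might be sketched in Python as below. The record does not say how a training sample or a schema is represented, so everything concrete here is an assumption made for illustration: samples are slash-separated element paths such as "html/body/li.item[2]/span.price", a schema is a tuple of path steps, two schemas "match" when they are equal, and a "further" schema is produced by generalising the path (first dropping positional indices, then class names). The names generate_schema, select_schema, induce_schema and MAX_LEVEL are hypothetical. The claim iterates until a schema is selected; this sketch instead stops after its most general level and returns None.

```python
import re
from typing import List, Optional, Tuple

Schema = Tuple[str, ...]

# Assumption: only three generalisation levels exist in this sketch.
MAX_LEVEL = 2


def generate_schema(sample_path: str, level: int) -> Schema:
    """Steps b) and d): derive a candidate schema from one training sample.

    level 0 keeps the path as given, level 1 drops positional indices
    such as [3], level 2 additionally drops class names such as .item.
    """
    steps = sample_path.strip("/").split("/")
    if level >= 1:
        steps = [re.sub(r"\[\d+\]", "", step) for step in steps]
    if level >= 2:
        steps = [step.split(".")[0] for step in steps]
    return tuple(steps)


def select_schema(candidates: List[Schema], threshold: float) -> Optional[Schema]:
    """Step c): pick a candidate that matches enough of the other candidates."""
    for i, candidate in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        if not others:
            # Assumption: a lone sample is accepted as its own schema.
            return candidate
        matches = sum(1 for other in others if other == candidate)
        if matches / len(others) > threshold:
            return candidate
    return None


def induce_schema(samples: List[str], threshold: float = 0.6) -> Optional[Schema]:
    """Steps a) to e): retry with more general schemas until one is selected."""
    for level in range(MAX_LEVEL + 1):
        candidates = [generate_schema(sample, level) for sample in samples]
        chosen = select_schema(candidates, threshold)
        if chosen is not None:
            return chosen
    return None  # no level produced sufficient agreement


if __name__ == "__main__":
    samples = [
        "html/body/div.list/li.item[1]/span.price",
        "html/body/div.list/li.item[2]/span.price",
        "html/body/div.list/li.item[3]/span.price",
        "html/body/div.ad/p[1]",
    ]
    print(induce_schema(samples))
```

With the four sample paths shown, no two exact paths agree at level 0, so the loop falls through to level 1, where the three list-item paths collapse to the same schema and two of the three remaining candidates match the first one, which exceeds the 0.6 threshold. The threshold value and the equality-based matching rule are placeholders; the claim only requires a "predetermined threshold" and some notion of schema match.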