Title of invention |
System of generating new schema based on selective HTML elements |
Abstract |
The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis. |
Publication number |
US9460231(B2) |
Publication date |
2016.10.04 |
Application number |
US201113637483 |
Filing date |
2011.03.28 |
Applicant |
BRITISH TELECOMMUNICATIONS public limited company |
Inventors |
Thompson Simon G;Nguyen Duong T;Thint Marcus Alfred;Gharib Hamid |
Classification codes |
G06F17/30;G06F17/00;G06F17/22 |
Main classification code |
G06F17/30 |
Agency |
Nixon & Vanderhye P.C. |
Agent |
Nixon & Vanderhye P.C. |
Main claim |
1. A method of automatically generating a mark-up language schema, the method comprising the steps of:
a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema. |
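Steps a) to e) of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the record does not say how a schema is represented, how two schemas are compared, or how a "further" schema is produced. The sketch assumes a schema is a path of (tag, class) pairs leading to the selected element, that two schemas match when they are equal, and that a further schema is obtained by dropping the outermost ancestor from every candidate. All names (`candidate_schema`, `select_schema`, `generalise`, `generate_schema`, `max_rounds`) and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (tag name, class attribute or None)
Schema = Tuple[Step, ...]         # path from an ancestor down to the target element


def candidate_schema(sample: List[Step]) -> Schema:
    """Step b): derive one candidate schema from one training sample."""
    return tuple(sample)


def select_schema(candidates: List[Schema], threshold: float) -> Optional[Schema]:
    """Step c): return a candidate that matches more than `threshold` of the others."""
    for i, cand in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        if not others:
            return cand  # a single sample has nothing to disagree with
        matches = sum(1 for other in others if other == cand)  # assumed: match = equality
        if matches / len(others) > threshold:
            return cand
    return None


def generalise(schema: Schema) -> Schema:
    """Step d), assumed strategy: drop the outermost ancestor of the path."""
    return schema[1:] if len(schema) > 1 else schema


def generate_schema(samples: List[List[Step]], threshold: float = 0.5,
                    max_rounds: int = 10) -> Optional[Schema]:
    """Steps a) to e): iterate until some candidate matches enough of the others."""
    candidates = [candidate_schema(s) for s in samples]
    for _ in range(max_rounds):
        chosen = select_schema(candidates, threshold)
        if chosen is not None:
            return chosen
        relaxed = [generalise(c) for c in candidates]
        if relaxed == candidates:  # nothing left to relax; give up
            return None
        candidates = relaxed
    return None


# Hypothetical training samples: four comment paragraphs in different page regions.
samples = [
    [("html", None), ("body", None), ("div", "main"), ("p", "comment")],
    [("html", None), ("body", None), ("div", "sidebar"), ("p", "comment")],
    [("html", None), ("body", None), ("div", "main"), ("p", "comment")],
    [("html", None), ("body", None), ("div", "footer"), ("p", "comment")],
]
result = generate_schema(samples, threshold=0.5)
```

With these samples no full path is shared by more than half of the other candidates, so the loop relaxes the candidates three times and settles on the single step `("p", "comment")`. The `max_rounds` cap and the early exit when nothing can be relaxed further are safeguards added for the sketch; the claim itself simply reiterates until a candidate matches.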
Address |
London GB |