Title of invention: Method for extracting relevant content from a markup language file, in particular from a HTML file
Abstract: A computer-implemented method allowing automated extraction of relevant content from a markup language file is proposed. The method allocates a respective dimension to markup instructions and creates a data structure based on the retrieved markup language file. The data structure contains an item for each content element of the file. For each content element, a relative or absolute coordinate in terms of a one-dimensional markup space is derived from the dimension(s) allocated to markup instruction(s) and associated with the item created for that content element. A grouping criterion is used for grouping content elements. The grouping criterion includes at least a condition that takes into account the distance between adjacent content elements in terms of the one-dimensional markup space. Content clusters are defined using the data structure by:
- creating groups of content elements using the grouping criterion;
- creating a separate cluster for each group that satisfies a clustering criterion;
- associating with each cluster information including at least the size of each content element in the given cluster and the number of content elements grouped in the given cluster.
A numerical expression of relevancy for each cluster is calculated using a first function which increases with an increasing size of the content element(s) in the cluster and decreases with an increasing number of content elements in the cluster and a second function which increases with an increasing sum of the nth power of the size of each content element in the cluster and decreases with an increasing number of content elements in the cluster and/or a third function which decreases with increasing distance in terms of the one-dimensional markup space between the content elements in the cluster. The content of certain clusters is subsequently selected, combined and output as being relevant on the basis of the cluster relevancy.
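The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the per-tag dimensions in `TAG_DIMENSION`, the thresholds `max_gap` and `min_size`, the concrete forms of the three functions, and their combination as a product are all assumptions chosen for illustration, since the abstract gives no concrete values or formulas. Every function and variable name is hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed dimensions: block-level tags move far along the markup axis,
# inline tags do not move at all. The abstract gives no concrete values.
TAG_DIMENSION = {"div": 10, "table": 10, "ul": 5, "li": 3, "p": 2,
                 "a": 0, "b": 0, "i": 0, "span": 0}
DEFAULT_DIMENSION = 1


@dataclass
class Item:
    text: str   # the content element
    coord: int  # absolute coordinate in the one-dimensional markup space


class MarkupSpaceParser(HTMLParser):
    """Builds one Item per non-empty text node, tracking a running coordinate."""

    def __init__(self):
        super().__init__()
        self.position = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.position += TAG_DIMENSION.get(tag, DEFAULT_DIMENSION)

    def handle_endtag(self, tag):
        self.position += TAG_DIMENSION.get(tag, DEFAULT_DIMENSION)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(Item(text, self.position))


def group_items(items, max_gap):
    """Grouping criterion: adjacent elements at most max_gap apart share a group."""
    groups = []
    for item in items:
        if groups and item.coord - groups[-1][-1].coord <= max_gap:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def relevancy(cluster, n=2):
    """Combine three assumed functions into one score (the product is an assumption)."""
    sizes = [len(item.text) for item in cluster]
    count = len(cluster)
    f1 = sum(sizes) / count                  # grows with size, shrinks with count
    f2 = sum(s ** n for s in sizes) / count  # nth-power sum of sizes over count
    span = cluster[-1].coord - cluster[0].coord
    f3 = 1 / (1 + span)                      # shrinks as elements spread apart
    return f1 * f2 * f3


def extract_relevant(html, max_gap=4, min_size=20, keep=1):
    parser = MarkupSpaceParser()
    parser.feed(html)
    parser.close()
    groups = group_items(parser.items, max_gap)
    # Assumed clustering criterion: a group needs at least min_size characters.
    clusters = [g for g in groups if sum(len(i.text) for i in g) >= min_size]
    top = sorted(clusters, key=relevancy, reverse=True)[:keep]
    top.sort(key=lambda c: c[0].coord)       # restore document order
    return "\n".join(" ".join(i.text for i in c) for c in top)


SAMPLE_HTML = """<html><body>
<div><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div><p>The first paragraph of the article is a fairly long run of text.</p>
<p>The second paragraph continues the story with <b>more</b> detail.</p></div>
<div><a href="/about">About</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_relevant(SAMPLE_HTML))
```

In this sketch the navigation links and the footer link are separated from the article text by large steps in markup space (the `div`, `ul` and `li` tags), so they fall into small groups that fail the assumed size criterion, while the two paragraphs lie close together and form a single cluster. The sketch does not filter `script` or `style` content, which a practical version would need to do.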
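Publication number: EP2096561(A1) Publication date: 2009.09.02
Application number: EP20080152105 Filing date: 2008.02.28
Applicant: THE EUROPEAN COMMUNITY, REPRESENTED BY THE EUROPEAN COMMISSION Inventor: VAN DER GOOT, ERIK
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address: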