Title of invention: Methods, apparatus and computer programs for characterizing web resources
Abstract: Methods, apparatus and computer programs are provided for characterizing Web-based information resources based on their interactions. A Web-based information resource is a single Web document or a collection of related Web documents. Unlike simple text documents, Web documents contain hyperlinks and other HTML tags. Different types of interactions associated with a Web-based information resource, including inbound hyperlinks, outbound hyperlinks and internal links, are used to characterize that resource. A DOM tree representing the tag structure of a Web-based information resource is used to identify text items likely to be useful as context for a hyperlink's anchor text, and the anchor text is combined with that context to generate a representation. The resulting interaction-based representation of Web-based information resources can be used for clustering and classification, and in Web mining applications such as query disambiguation and automatic taxonomy generation.
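Illustrative sketch (not part of the patent record): the abstract describes walking a DOM tree to find text that can serve as context for a hyperlink's anchor text, then combining the two. The abstract does not say how the context is chosen, so the Python sketch below makes a loud assumption: the context is simply all text inside the anchor's parent element. The names Node, TreeBuilder, anchors and describe_links are hypothetical helpers introduced only for this illustration, and the tree is a simplified stand-in for a full DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM-like node: a tag, its attributes, and ordered children."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.children = []  # each child is either a Node or a str

    def text(self):
        """Concatenate all text found under this node, in document order."""
        parts = []
        for child in self.children:
            parts.append(child if isinstance(child, str) else child.text())
        return " ".join(p.strip() for p in parts if p.strip())


class TreeBuilder(HTMLParser):
    """Builds a simplified tag tree from HTML (assumption: well-nested input)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def anchors(node):
    """Yield every <a href=...> node below the given node."""
    for child in node.children:
        if isinstance(child, str):
            continue
        if child.tag == "a" and "href" in child.attrs:
            yield child
        yield from anchors(child)


def describe_links(html):
    """Return (href, anchor_text, context) for each outbound hyperlink.

    Assumption: context = text of the anchor's parent element, which
    includes the anchor text itself.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [(a.attrs["href"], a.text(), a.parent.text())
            for a in anchors(builder.root)]
```

Under these assumptions, calling describe_links on a list whose first item contains a link returns that link's target, its anchor text, and the text of the enclosing list item, while the unrelated sibling item is left out of the context. A real system would likely need a more selective rule for choosing context nodes than "the parent element".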
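Publication number: US2006026496(A1); Publication date: 2006.02.02
Application number: US20040901275; Filing date: 2004.07.28
Applicant: JOSHI SACHINDRA;KRISHNAPURAM RAGHURAM;ROY SHOURYA; Inventor: JOSHI SACHINDRA;KRISHNAPURAM RAGHURAM;ROY SHOURYA
Classification: G06F17/21; Main classification: G06F17/21
Agency: ; Agent:
Principal claim:
Address: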