Title of Invention: Method and Apparatus for Retrieving and Indexing Hidden Pages
Abstract: A method and system are provided for autonomously downloading and indexing Hidden Web pages from Websites having site-specific search interfaces. The method may be implemented using a crawler program or the like to autonomously cull Hidden Web content. The method includes the steps of selecting a query term and issuing a query to a site-specific search interface containing Hidden Web pages. A results index is then acquired and the Hidden Web pages are downloaded from the results index. A plurality of potential query terms is then identified from the downloaded Hidden Web pages. The efficiency of each potential query term is then estimated and a next query term is selected from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency. A query is then issued to the site-specific search interface using the next query term. The process is repeated until all or most of the Hidden Web pages are discovered. In one aspect of the invention, the efficiency of each potential query term is expressed as the ratio of the number of new documents returned for the potential query term to the cost associated with issuing the potential query.
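The loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the in-memory `SITE` corpus, the `search` function standing in for a site-specific search interface, the uniform query cost, and above all `estimate_new_docs`, which uses the count of already-downloaded pages containing a term as a crude proxy for the number of new documents that term would return. The sketch only shows the greedy select-by-efficiency structure, not the estimator the inventors actually claim.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for a Hidden Web site: document id -> page text.
SITE: Dict[int, str] = {
    1: "apple banana",
    2: "banana cherry",
    3: "cherry date",
    4: "date elderberry",
    5: "fig",
}


def search(term: str) -> List[int]:
    """Simulated search interface: ids of pages whose text contains the term."""
    return [doc_id for doc_id, text in SITE.items() if term in text.split()]


def estimate_new_docs(term: str, downloaded: Dict[int, str]) -> float:
    """Assumed proxy estimator: how many downloaded pages contain the term."""
    return float(sum(term in text.split() for text in downloaded.values()))


def crawl(
    seed: str,
    query_cost: Callable[[str], float] = lambda term: 1.0,  # assumed uniform cost
    max_queries: int = 10,
) -> Dict[int, str]:
    """Greedy crawl: issue a term, download results, pick the most efficient next term."""
    downloaded: Dict[int, str] = {}
    issued: Set[str] = set()
    term = seed
    for _ in range(max_queries):
        issued.add(term)
        # Issue the query and download every page listed in the results index.
        for doc_id in search(term):
            downloaded.setdefault(doc_id, SITE[doc_id])
        # Potential query terms are the words seen so far that were not yet issued.
        candidates = {w for text in downloaded.values() for w in text.split()} - issued
        if not candidates:
            break
        # Efficiency = estimated new documents / cost; sorting makes ties deterministic.
        term = max(
            sorted(candidates),
            key=lambda t: estimate_new_docs(t, downloaded) / query_cost(t),
        )
    return downloaded
```

Calling `crawl("apple")` on this toy corpus walks from one page to the next through shared terms. The page containing only "fig" is never reached because no downloaded page mentions that term, which mirrors the abstract's wording that the process discovers "all or most" of the pages.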
Publication No.: US2008097958(A1)  Publication Date: 2008.04.24
Application No.: US20050570330  Filing Date: 2005.05.27
Applicant: THE REGENTS OF THE UNIVERSITY OF CALIFORNIA  Inventors: NTOULAS ALEXANDROS; CHO JUNGHOO; ZERFOS PETROS
Classification: G06F15/16  Main Classification: G06F15/16
Agency:  Agent:
Principal Claim:
Address: