Abstract |
Path-based ranking of unvisited web pages for WWW crawling is provided by: identifying all paths that begin at a "seed" URL and lead to visited relevant web pages as the "good-path set" and, for each unvisited web page, identifying the paths that begin at the "seed" URL and lead to it as its "partial-path set"; classifying all visited web pages and labeling each web page with the label of the class or classes to which it belongs; training a statistical model to generalize the common patterns among the paths in the "good-path set"; and evaluating each "partial-path set" with the statistical model and ranking the unvisited web pages according to the evaluation results.
|
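The steps summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical model is used, so the sketch assumes a first-order Markov model over class labels with add-one smoothing. It also assumes that each path is represented as the sequence of class labels of the visited pages along it. All names (`START`, `train_transition_model`, `path_score`, `rank_unvisited`), the labels, and the example URLs are hypothetical.

```python
import math
from collections import defaultdict

START = "<seed>"  # hypothetical marker standing for the seed URL at the head of every path


def train_transition_model(good_paths):
    """Count label-to-label transitions over all paths in the good-path set."""
    counts = defaultdict(lambda: defaultdict(int))
    labels = set()
    for path in good_paths:
        prev = START
        for label in path:
            counts[prev][label] += 1
            labels.add(label)
            prev = label
    return counts, labels


def path_score(path, counts, labels):
    """Average log-probability of a label path under the model (add-one smoothing)."""
    if not path:
        return float("-inf")
    vocab = len(labels) + 1  # one extra slot reserves probability mass for unseen labels
    total = 0.0
    prev = START
    for label in path:
        row = counts.get(prev, {})
        row_total = sum(row.values())
        total += math.log((row.get(label, 0) + 1) / (row_total + vocab))
        prev = label
    return total / len(path)


def rank_unvisited(partial_paths, counts, labels):
    """Rank each unvisited page by the best score among its partial paths."""
    scored = {
        url: max((path_score(p, counts, labels) for p in paths), default=float("-inf"))
        for url, paths in partial_paths.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Assumed toy data: label sequences of paths that led to relevant pages.
good_paths = [
    ["hub", "topic"],
    ["hub", "hub", "topic"],
    ["hub", "topic", "topic"],
]
# Assumed toy data: for each unvisited URL, the label sequences of paths leading to it.
partial_paths = {
    "http://example.com/b": [["other", "other"]],
    "http://example.com/a": [["hub", "topic"]],
}

counts, labels = train_transition_model(good_paths)
ranking = rank_unvisited(partial_paths, counts, labels)
print(ranking)  # the page reached via hub -> topic is ranked first
```

Averaging the log-probability over the path length is a design choice of this sketch. It keeps long partial paths from being penalized merely for their length. Any model that assigns a score to a label sequence could be substituted for the transition counts.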