Abstract |
A method of classifying URLs in which each URL discovered by a crawler is analyzed and matched against a set of words corresponding to each class, such as pornography, archive, obituary, business news, politics, or terrorism. When a URL matches a class, a count associating the prefix of the URL with that class is updated, and an action is performed with respect to electronic documents on the computer system based on the count. The action may be blocking the computer system from being crawled, or adjusting the frequency with which the computer system is crawled.
|
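
The following is a minimal sketch of the counting scheme described above, not the claimed implementation. Several details are assumptions made only for illustration: the word lists in `CLASS_WORDS`, the choice of scheme plus host as the URL "prefix", the threshold of 3, and the action names `"block"`, `"crawl_less_often"`, and `"default"` are all hypothetical and do not come from the abstract.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical word lists: the abstract names classes but not their words.
CLASS_WORDS = {
    "obituary": {"obituary", "obituaries"},
    "archive": {"archive", "archives"},
    "politics": {"politics", "election"},
}


def url_prefix(url):
    """Assumption: the prefix is the scheme plus host of the URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def classify(url):
    """Return every class that has at least one word among the URL's tokens."""
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    return {cls for cls, words in CLASS_WORDS.items() if tokens & words}


def update_counts(urls):
    """Build a prefix -> class -> count table from crawler-discovered URLs."""
    counts = defaultdict(lambda: defaultdict(int))
    for url in urls:
        prefix = url_prefix(url)
        for cls in classify(url):
            counts[prefix][cls] += 1
    return counts


def choose_action(class_counts, blocked=frozenset({"obituary"}), threshold=3):
    """Pick a crawl action for one prefix (threshold and names are assumed)."""
    if any(class_counts.get(cls, 0) >= threshold for cls in blocked):
        return "block"
    if class_counts.get("archive", 0) >= threshold:
        return "crawl_less_often"
    return "default"


urls = [
    "http://news.example.com/obituaries/2020/john-doe",
    "http://news.example.com/obituary/jane",
    "http://news.example.com/obituaries/index.html",
    "http://blog.example.org/archive/2019/01",
]
counts = update_counts(urls)
for prefix, class_counts in counts.items():
    print(prefix, dict(class_counts), choose_action(class_counts))
```

Keeping the count per prefix rather than per URL means that a decision made from a few discovered URLs can apply to every document under the same prefix, which is what allows the crawler to block or slow down a whole computer system.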