Title of invention Focused crawling to identify potentially malicious sites using Bayesian URL classification and adaptive priority calculation
Abstract For each page of a set, a Bayesian classification of the URL associated with the page is performed, and a maliciousness probability is assigned to the URL based on the Bayesian classification. A traversal priority is assigned to each page of the set, the assigned traversal priorities initially directing a breadth first traversal of the set of pages. The assigned traversal priorities of a subset of the pages of the set are modified to direct higher priority traversals, responsive to the maliciousness probabilities of the URLs corresponding to the pages of the subset. Each page of the set is traversed in the order specified by the traversal priorities, and analyzed during traversal to determine whether the page is malicious.
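The priority scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `traversal_order`, the `(url, depth, malicious_prob)` tuple layout, and the `boost_threshold` cutoff used to pick the promoted subset are all assumptions made for the example. Each page starts with a breadth first priority equal to its depth, and pages whose URL probability reaches the cutoff are moved ahead of every unpromoted page.

```python
import heapq

def traversal_order(pages, boost_threshold=0.5):
    """Return URLs in traversal order.

    pages: list of (url, depth, malicious_prob) tuples (hypothetical layout).
    boost_threshold: assumed cutoff that selects the subset to promote.
    """
    heap = []
    for seq, (url, depth, prob) in enumerate(pages):
        # Initial priority is breadth first: shallower pages come first.
        priority = float(depth)
        # Promote suspicious URLs: a negative priority sorts before any
        # depth, and a higher probability sorts earlier among promoted pages.
        if prob >= boost_threshold:
            priority = -prob
        # seq breaks ties so equal priorities keep their input order.
        heapq.heappush(heap, (priority, seq, url))

    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)  # a real crawler would fetch and scan the page here
    return order
```

With a cutoff that no page can reach, the function degenerates to a plain breadth first ordering, which matches the initial state the abstract describes.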
Publication number US9298824(B1) Publication date 2016.03.29
Application number US201012832062 Filing date 2010.07.07
Applicant Symantec Corporation Inventors Vinnik Alex;Gubin Maxim;Kislyuk Oleg
Classification H04L29/06;G06F17/30 Main classification H04L29/06
Agency Patent Law Works LLP Agent Patent Law Works LLP
Principal claim 1. A computer implemented method for expediting maliciousness analysis using focused crawling to reorder breadth first crawling when analyzing pages for maliciousness, the reordering based on fast classifications involving just page URLs (Universal Resource Locators), the method comprising: during a first pass of analyzing pages for maliciousness, at least one computer system: ordering a set of URL pages to analyze for maliciousness using a breadth first ordering which is based on proximity in a page level to a root level URL page; during a second pass of analyzing pages for maliciousness, the at least one computer system: training a Bayesian classifier with known malicious URLs and known benign URLs; classifying just a URL of each page, without examining the content, with a Bayesian classifier of conditional probabilities based on the training, wherein maliciousness of each page is unknown; assigning a maliciousness probability for each page based on the Bayesian classification of just the URL of each page; receiving a number of pages to represent the degree that focused crawling modifies breadth first ordering, the number of pages being a subset of the number of the set of URL pages being analyzed; and adjusting the ordering of breadth first with focused crawling to account for malicious probability according to the degree defined by the received number of pages, wherein the content of each page is scanned according to the adjusted ordering in order to analyze the content for maliciousness.
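The second pass of the claim (train on known URLs, classify each URL without fetching content, then adjust the breadth first order by a received number of pages) might look like the sketch below. It is one possible reading under stated assumptions: the claim does not specify the features or the exact adjustment rule, so the token-based naive Bayes model with add-one smoothing, the class `UrlNaiveBayes`, and the helper `reorder` (which moves the `n` highest-probability URLs to the front) are hypothetical choices. Both training lists are assumed to be non-empty and the URLs in the crawl order are assumed to be unique.

```python
import math
import re
from collections import Counter

def tokens(url):
    # Split a URL into lowercase alphanumeric tokens (assumed feature set).
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

class UrlNaiveBayes:
    """Hypothetical naive Bayes classifier over URL tokens."""

    def __init__(self, malicious_urls, benign_urls):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: len(malicious_urls), False: len(benign_urls)}
        for label, urls in ((True, malicious_urls), (False, benign_urls)):
            for u in urls:
                self.counts[label].update(tokens(u))
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_score(self, toks, label):
        total = sum(self.counts[label].values())
        # Log prior from the share of training URLs carrying this label.
        score = math.log(self.docs[label] / (self.docs[True] + self.docs[False]))
        for t in toks:
            # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
            score += math.log(
                (self.counts[label][t] + 1) / (total + len(self.vocab) + 1)
            )
        return score

    def probability(self, url):
        # Posterior probability that the URL is malicious, computed from
        # log scores with the maximum subtracted for numerical stability.
        toks = tokens(url)
        lm = self._log_score(toks, True)
        lb = self._log_score(toks, False)
        m = max(lm, lb)
        em, eb = math.exp(lm - m), math.exp(lb - m)
        return em / (em + eb)

def reorder(bfs_order, classifier, n):
    # Move the n most suspicious URLs to the front; the remaining pages
    # keep their breadth first order. n plays the role of the received
    # number of pages that sets how far focused crawling departs from BFS.
    ranked = sorted(bfs_order, key=classifier.probability, reverse=True)
    promoted = ranked[:n]
    chosen = set(promoted)
    return promoted + [u for u in bfs_order if u not in chosen]
```

Setting `n` to 0 leaves the breadth first order untouched, and setting it to the size of the set orders every page purely by URL probability, so `n` acts as the dial between the two strategies. Only the ordering changes: every page is still scanned for content.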
Address Mountain View CA US