Title of invention |
Efficient string search |
Abstract |
Some embodiments of an efficient string search have been presented. In one embodiment, a string of bytes representing content written in a non-delimited language is received, wherein the content has been classified into a predetermined category. In a single pass through the string of bytes, a set of N-grams is searched for simultaneously. Statistical information on occurrences of the N-grams, if any, in the string of bytes is collected. In some embodiments, a model is generated based on the statistical information, where the model is usable by a content filter to classify content. |
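As an illustration of the single-pass, simultaneous search described in the abstract, the following is a minimal sketch that assumes an Aho-Corasick style automaton over raw bytes. The record does not name a specific algorithm; the function names `build_automaton` and `count_ngrams` and the sample keywords are hypothetical and chosen only for this example.

```python
from collections import Counter, deque


def build_automaton(ngrams):
    """Build an Aho-Corasick style automaton from byte-string N-grams.

    This is an assumed technique for illustration, not the patent's
    stated implementation.
    """
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # N-grams recognized on entering each state
    for gram in set(ngrams):
        state = 0
        for b in gram:
            nxt = goto[state].get(b)
            if nxt is None:
                nxt = len(goto)
                goto[state][b] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(gram)
    # Breadth-first traversal: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            # Inherit N-grams that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_ngrams(data, automaton):
    """Scan the bytes once and count every occurrence of every N-gram."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for b in data:
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for gram in out[state]:
            counts[gram] += 1
    return counts


# Hypothetical keywords in a non-delimited language, searched as UTF-8 bytes.
keywords = ["足球".encode("utf-8"), "比赛".encode("utf-8")]
text = "足球比赛和足球新闻".encode("utf-8")
stats = count_ngrams(text, build_automaton(keywords))
```

Because the scan works on bytes rather than on words, it needs no segmentation step, which fits content written without word delimiters.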
Publication number |
US8775164(B2) |
Publication date |
2014.07.08 |
Application number |
US201313973859 |
Filing date |
2013.08.22 |
Applicant |
SonicWALL, Inc. |
Inventors |
Raffill Thomas E.; Zhu Shunhui; Yanovsky Roman; Yanovsky Boris; Gmuender John |
Classification |
G06F17/27 |
Main classification |
G06F17/27 |
Agency |
Lewis Roca Rothgerber LLP |
Agent |
Lewis Roca Rothgerber LLP |
Principal claim |
1. A method for searching a string of bytes, the method comprising:
receiving a document comprising a non-delimited language over a communication network, wherein the received document has been pre-classified as being of a content type;
executing instructions stored in memory, wherein execution of the instructions by a processor:
searches the received document using a set of a plurality of N-grams,
wherein each N-gram corresponds to a pre-selected keyword for identifying content of the content type of the received document, and
wherein the search proceeds in a single pass through the received document using a finite state machine having a plurality of states, wherein the plurality of states are coupled to each other via one or more paths and the plurality of states are based on the plurality of N-grams,
determines statistical information based on occurrence of one or more N-grams found in the received document, and
generates a model for classifying as the type incoming strings of bytes representing non-segmented text written in the non-delimited language, the model generated based on the determined statistical information, wherein the model includes a predetermined number of conditions for classifying the incoming strings as the type; and
making the model available over a communication network to one or more content filters for use in classifying documents as being of the content type. |
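The model-generation step of the claim can be sketched as follows. The claim does not say what a "condition" is or how the conditions are selected, so everything here is an assumption made for illustration: a condition is taken to be "this N-gram occurs at least once", the conditions kept are the N-grams found in the most pre-classified documents, and a string is classified as the type when enough conditions hold. The names `generate_model`, `classify`, `num_conditions`, and `min_matches` are hypothetical.

```python
from collections import Counter


def generate_model(doc_counts, num_conditions, min_matches):
    """Derive a model from per-document N-gram occurrence counts.

    doc_counts: one Counter per document pre-classified as the type.
    num_conditions: the predetermined number of conditions to keep.
    min_matches: assumed threshold of satisfied conditions.
    """
    doc_freq = Counter()
    for counts in doc_counts:
        # Count each N-gram once per document in which it occurs.
        doc_freq.update(g for g, c in counts.items() if c > 0)
    # Most widely occurring N-grams first; ties broken by byte order.
    ranked = sorted(doc_freq, key=lambda g: (-doc_freq[g], g))
    return {"conditions": ranked[:num_conditions], "min_matches": min_matches}


def classify(model, counts):
    """Return True when enough of the model's N-grams occur in the input."""
    hits = sum(1 for g in model["conditions"] if counts.get(g, 0) > 0)
    return hits >= model["min_matches"]


# Hypothetical statistics collected from three pre-classified documents.
training = [
    Counter({b"aa": 2, b"bb": 1}),
    Counter({b"aa": 1, b"cc": 3}),
    Counter({b"aa": 1, b"bb": 1}),
]
model = generate_model(training, num_conditions=2, min_matches=2)
```

A content filter receiving such a model would only need the N-gram counts of an incoming string, which it could collect in one pass, to evaluate the conditions.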
Address |
San Jose CA US |