Title: Efficient string search
Abstract: Some embodiments of an efficient string search are presented. In one embodiment, a string of bytes representing content written in a non-delimited language is received, wherein the content has been classified into a predetermined category. In a single pass through the string of bytes, a set of N-grams is searched for simultaneously. Statistical information on occurrences of the N-grams, if any, in the string of bytes is collected. In some embodiments, a model is generated based on the statistical information, where the model is usable by a content filter to classify content.
Publication number: US8775164(B2)
Publication date: 2014.07.08
Application number: US201313973859
Filing date: 2013.08.22
Applicant: SonicWALL, Inc.
Inventors: Raffill Thomas E.; Zhu Shunhui; Yanovsky Roman; Yanovsky Boris; Gmuender John
Classification: G06F17/27
Main classification: G06F17/27
Agency: Lewis Roca Rothgerber LLP
Agent: Lewis Roca Rothgerber LLP
Independent claim: 1. A method for searching a string of bytes, the method comprising: receiving a document comprising a non-delimited language over a communication network, wherein the received document has been pre-classified as being of a content type; executing instructions stored in memory, wherein execution of the instructions by a processor: searches the received document using a set of a plurality of N-grams, wherein each N-gram corresponds to a pre-selected keyword for identifying content of the content type of the received document, and wherein the search proceeds in a single pass through the received document using a finite state machine having a plurality of states, wherein the plurality of states are coupled to each other via one or more paths and the plurality of states are based on the plurality of N-grams, determines statistical information based on occurrence of one or more N-grams found in the received document, and generates a model for classifying as the type incoming strings of bytes representing non-segmented text written in the non-delimited language, the model generated based on the determined statistical information, wherein the model includes a predetermined number of conditions for classifying the incoming strings as the type; and making the model available over a communication network to one or more content filters for use in classifying documents as being of the content type.
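The single-pass, multi-N-gram search recited in the claim can be illustrated with a minimal sketch. This record does not name a specific algorithm, so the sketch assumes an Aho-Corasick automaton as one well-known finite state machine whose states are derived from the keyword set. The function names `build_automaton` and `count_ngrams` are hypothetical, and the occurrence counts stand in for the "statistical information" only in the simplest possible form.

```python
from collections import Counter, deque


def build_automaton(ngrams):
    """Build an Aho-Corasick automaton over byte strings.

    Assumption: this is one possible finite state machine for the
    claimed single-pass search, not necessarily the patented design.
    """
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # fail[state] -> fallback state on mismatch
    out = [[]]    # out[state] -> N-grams that end at this state

    # Phase 1: insert every non-empty N-gram into a byte-level trie.
    for gram in set(ngrams):
        if not gram:
            continue
        state = 0
        for b in gram:
            nxt = goto[state].get(b)
            if nxt is None:
                nxt = len(goto)
                goto[state][b] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(gram)

    # Phase 2: breadth-first pass to set failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_ngrams(data, automaton):
    """Scan `data` (bytes) once and count occurrences of every N-gram."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for b in data:
        # Follow failure links until a transition on this byte exists.
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for gram in out[state]:
            counts[gram] += 1
    return counts
```

Under these assumptions, a caller would encode the keywords and the document to bytes (for example with UTF-8), call `build_automaton` once per keyword set, and then call `count_ngrams` on each pre-classified document. Because the scan works on raw bytes, no word segmentation of the non-delimited text is needed, and overlapping matches are all counted in the same pass.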
Address: San Jose, CA, US