Title Efficient string search
Abstract Some embodiments of an efficient string search have been presented. In one embodiment, a string of bytes representing content written in a non-delimited language is received, wherein the content has been classified into a predetermined category. In a single pass through the string of bytes, a set of N-grams is searched for simultaneously. Statistical information on occurrences of the N-grams, if any, in the string of bytes is collected. In some embodiments, a model is generated based on the statistical information, where the model is usable by a content filter to classify content.
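The single-pass, simultaneous N-gram search described in the abstract can be illustrated with a minimal sketch. The record does not name a specific algorithm, so the Aho-Corasick automaton used here is an assumption, chosen because it is a well-known finite state machine for multi-pattern search. The function names `build_fsm` and `count_ngrams` and the sample N-grams are hypothetical.

```python
from collections import Counter, deque


def build_fsm(ngrams):
    """Build an Aho-Corasick automaton over byte values (assumed technique)."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # N-grams recognised on entering each state
    for gram in set(ngrams):
        state = 0
        for b in gram:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(gram)
    # Breadth-first pass fills in failure links, shallowest states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_ngrams(data, fsm):
    """Count every N-gram occurrence in one left-to-right pass over the bytes."""
    goto, fail, out = fsm
    counts = Counter()
    state = 0
    for b in data:
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for gram in out[state]:
            counts[gram] += 1
    return counts


# Hypothetical N-grams and content, encoded as UTF-8 bytes.
ngrams = ["免费".encode("utf-8"), "费用".encode("utf-8")]
content = "免费用户的费用免费".encode("utf-8")
stats = count_ngrams(content, build_fsm(ngrams))
```

Because the automaton works on raw bytes, it needs no word segmentation, which is what makes this kind of search suitable for non-delimited languages. Overlapping occurrences (such as the two sample N-grams sharing a character) are found in the same pass.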
Publication No. US9542387(B2) Publication Date 2017.01.10
Application No. US201414326230 Filing Date 2014.07.08
Applicant DELL SOFTWARE INC. Inventors Raffill Thomas E.;Zhu Shunhui;Yanovsky Roman;Yanovsky Boris;Gmuender John
Classification G10L21/00;G06F17/28;G06F17/27;G06F17/30 Main Classification G10L21/00
Agency Polsinelli LLP Agent Polsinelli LLP
Main Claim 1. A method for classifying content written in a non-delimited language, the method comprising: receiving a string of bytes at a finite state machine (FSM), wherein the string of bytes is received by electronic hardware associated with the FSM after a user attempts to access information related to the string of bytes; performing a string search on the string of bytes, wherein the string search identifies that the string of bytes includes a set of N-grams that match one or more states in a set of states, wherein the FSM connects the one or more states in the set of states; collecting statistical information regarding the set of N-grams received in the string of bytes, wherein the collected statistical information corresponds to a condition in a model stored in a model repository; receiving the model from the model repository; identifying that the one or more states and that one or more N-grams in the set of N-grams correspond to the received model; identifying that a length of the received string of bytes also corresponds to the condition in the received model when the length of the received string of bytes is of a certain length; classifying content of the one or more N-grams as being prohibited according to the one or more states, the one or more N-grams, and the length that corresponds to the condition; and denying access to the content when the classification of the content is prohibited, wherein denying access to the content prevents the content from being displayed on a display accessible to the user after the user attempts to access the information relating to the string of bytes.
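The classification and access-denial steps of the claim can be sketched as follows. The claim does not define what a model "condition" looks like, so everything here is an assumption made for illustration: the `Model` structure, the rule that each listed N-gram must reach a minimum count, the `min_length` threshold, and the function names `is_prohibited` and `handle_request` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Hypothetical model as it might be read from a model repository."""
    category: str
    # Assumed condition: N-gram -> minimum number of occurrences required.
    required_counts: dict = field(default_factory=dict)
    # Assumed condition: content shorter than this is not classified.
    min_length: int = 0


def is_prohibited(counts, length, model):
    """Return True when both the N-gram statistics and the length meet the condition."""
    if length < model.min_length:
        return False
    return all(counts.get(gram, 0) >= minimum
               for gram, minimum in model.required_counts.items())


def handle_request(data, counts, model):
    """Deny access when the content is classified as prohibited, otherwise allow it."""
    return "deny" if is_prohibited(counts, len(data), model) else "allow"


# Hypothetical model, content, and previously collected N-gram counts.
model = Model(category="prohibited", required_counts={b"spam": 2}, min_length=10)
content = b"spam and more spam"
decision = handle_request(content, {b"spam": 2}, model)
```

In this sketch the N-gram counts are passed in as a plain mapping, so the string search that produces them stays independent of the decision logic.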
Address Round Rock TX US