Title: Apparatus and method for identifying similarity via dynamic decimation of token sequence n-grams
Abstract: An apparatus for identifying related code variants or text samples includes processing circuitry configured to execute instructions for receiving query binary code, processing the query binary code to generate one or more query code fingerprints comprising compressed representations of respective functional components of the query binary code, generating token sequence n-grams of the fingerprints, hashing the n-grams, partitioning samples by length to compare selected samples based on length, and identifying similarity via dynamic decimation of token sequence n-grams.
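The n-gram and hashing steps named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `ngram_hashes`, the n-gram length of 3, the separator byte, and the choice of a 64-bit BLAKE2b digest are all assumptions made for illustration.

```python
import hashlib


def ngram_hashes(tokens, n=3):
    """Return the set of 64-bit hash values of every n-gram in `tokens`.

    Assumptions (not from the patent): tokens are strings, n defaults to 3,
    and BLAKE2b truncated to 8 bytes stands in for the unspecified hash.
    """
    hashes = set()
    for i in range(len(tokens) - n + 1):
        # Join with a separator that is unlikely to occur inside a token,
        # so that ("ab", "c") and ("a", "bc") hash differently.
        gram = "\x1f".join(tokens[i:i + n]).encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes
```

Because the result is a set, an n-gram that occurs several times in the token sequence contributes a single hash value, which matches the claim's wording "the set of n-grams present in the sequence".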
Publication No.: US9111095(B2) Publication Date: 2015.08.18
Application No.: US201414248622 Filing Date: 2014.04.09
Applicant: The Johns Hopkins University Inventor: Cohen Jonathan D.
Classification: G06F11/00;G06F12/04;G06F12/16;G06F7/04;H04N7/16;G06F21/56 Main Classification: G06F11/00
Agency: Agent: Goepel James E.
Main Claim: 1. An apparatus for identifying similarity via dynamic decimation of token sequence n-grams comprising: processing circuitry configured to execute instructions for:
  receiving query binary code;
  processing the query binary code to generate one or more query code fingerprints comprising compressed representations of respective functional components of the query binary code;
  generating token sequence n-grams of the fingerprints;
  hashing the n-grams;
  partitioning samples by length to compare selected samples based on length; and
  identifying similarity via dynamic decimation of token sequence n-grams, the dynamic decimation comprising:
    accepting a score threshold T;
    calculating a library cell function I( ) that maps a range of sizes to a library cell;
    calculating a decimation factor function K( ) that maps a library cell to a non-negative number;
    receiving a plurality of reference samples;
    processing each reference sample of the plurality of reference samples via operations including:
      producing a sequence of reference tokens from the reference sample;
      producing a full reference signature, the full reference signature comprising the hash values of the set of n-grams present in the sequence of reference tokens;
      choosing a library cell equal to I( ) applied to the size of the full reference signature;
      choosing a reference decimation factor equal to K( ) applied to the library cell;
      decimating the full reference signature by the decimation factor to produce a decimated reference signature; and
      recording the decimated reference signature in the library cell together with an identifier of the reference sample;
    receiving one or more test samples;
    processing each test sample via operations including:
      producing a sequence of test tokens from the test sample;
      producing a full test signature, the full test signature comprising the hash values of the set of n-grams present in the sequence of test tokens;
      choosing a set of library cells on the basis of I( ) applied to the size of the full test signature;
      for each library cell of the set of library cells:
        choosing a test decimation factor equal to K( ) applied to the library cell;
        decimating the full test signature by the test decimation factor to produce a decimated test signature;
        for each decimated reference signature in the library cell, scoring the decimated test signature against the decimated reference signature and reporting the resulting score and identifier stored with the decimated reference signature in the event that the score meets or exceeds T.
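The reference and test processing steps of the claim can be sketched in Python as follows. The claim leaves the concrete forms of I( ), K( ), the decimation rule, and the score open, so everything below is an assumption chosen for illustration: `cell_of` (I) is floor(log2(size)), `factor_of` (K) is a power of two that grows with the cell, decimation keeps hash values divisible by the factor, the score is Jaccard similarity, and the candidate cells for a test sample cover signature sizes between T·s and s/T. The names `signature`, `decimate`, `score`, and `Library` are likewise hypothetical.

```python
import hashlib
import math
from collections import defaultdict


def signature(tokens, n=3):
    """Full signature: set of 64-bit hashes of the n-grams in `tokens`."""
    out = set()
    for i in range(len(tokens) - n + 1):
        gram = "\x1f".join(tokens[i:i + n]).encode("utf-8")
        out.add(int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big"))
    return out


def cell_of(size):
    """I( ): map a signature size to a library cell (assumed: floor(log2(size)))."""
    return max(size, 1).bit_length() - 1


def factor_of(cell):
    """K( ): map a library cell to a decimation factor (assumed: 2**(cell - 6), at least 1)."""
    return 1 << max(0, cell - 6)


def decimate(sig, k):
    """Keep only hashes divisible by k (assumed rule); k <= 1 keeps everything."""
    return set(sig) if k <= 1 else {h for h in sig if h % k == 0}


def score(a, b):
    """Jaccard similarity of two decimated signatures (assumed scoring function)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class Library:
    def __init__(self):
        self.cells = defaultdict(list)  # cell -> [(identifier, decimated signature)]

    def add(self, ident, tokens):
        """Reference path: sign, pick cell via I, decimate via K, record with identifier."""
        full = signature(tokens)
        cell = cell_of(len(full))
        self.cells[cell].append((ident, decimate(full, factor_of(cell))))

    def query(self, tokens, threshold):
        """Test path: return (score, identifier) pairs whose score meets `threshold`."""
        full = signature(tokens)
        s = len(full)
        # Assumed cell selection: only sizes in [T*s, s/T] can reach Jaccard >= T.
        lo = cell_of(math.ceil(threshold * s))
        hi = cell_of(math.floor(s / threshold))
        hits = []
        for cell in range(lo, hi + 1):
            test_sig = decimate(full, factor_of(cell))
            for ident, ref_sig in self.cells.get(cell, []):
                sc = score(test_sig, ref_sig)
                if sc >= threshold:
                    hits.append((sc, ident))
        return hits
```

A short usage example under the same assumptions: after `lib = Library()` and `lib.add("ref", [str(i) for i in range(20)])`, calling `lib.query([str(i) for i in range(20)], 0.5)` reports the reference with a score of 1.0. Decimating the test signature once per candidate cell, with that cell's own factor, is what keeps the test and reference signatures comparable even though samples of different sizes are stored at different decimation rates.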
Address: Baltimore MD US