发明名称 Serial and parallel methods for I/O efficient suffix tree construction
摘要 System and method for suffix tree creation for large input data/text streams. The methodology leverages the structure of suffix trees to build a suffix tree by simultaneously tiling accesses to both the input string as well as the partially constructed suffix tree. The end result enables the indexing of very large input strings and at the same time maintain a bounded working set size and a fixed memory footprint. The method is employed for serial processing. Further, a scalable parallel suffix tree construction is realized that is suitable for implementation on parallel distributed memory systems that use effective collective communication and in-network caching. The methodology is also applied for suffix link recovery in both serial and parallel implementations.
申请公布号 US8914415(B2) 申请公布日期 2014.12.16
申请号 US201012697159 申请日期 2010.01.29
申请人 International Business Machines Corporation 发明人 Ghoting Amol N.;Makarychev Konstantin
分类号 G06F17/30 主分类号 G06F17/30
代理机构 Scully, Scott, Murphy & Presser, P.C. 代理人 Scully, Scott, Murphy & Presser, P.C. ;Percello, Esq. Louis J.
主权项 1. A method for building a suffix tree for a string of text or data of size n, said method comprising: providing a processing device having an associated memory storage device, allocating a fixed amount of memory within said associated memory storage device for storing an output sub-tree of the suffix tree and a portion of the input string during suffix tree building, wherein the length n of said input string exceeds the allocated fixed memory amount; constructing, using said processing device, a set of prefixes (p) for said input string; building a suffix sub-tree (Tp) for each prefix (p) of said constructed prefix set in said fixed memory amount utilizing a constructor method, said constructor method including: partitioning said string into a plurality of equal sized sub-string blocks of size B, and iteratively building a fixed portion of the output suffix sub-tree, wherein during each iteration i, where 0<i<n/B: accessing a sub-string block i; inserting each suffix starting with the prefix p into the suffix tree such that the input string references from the edges in the partially constructed suffix sub-tree lie in the ith block; and iteratively accessing each remaining equal sized sub-string block j>i of said plurality, one sub-string block at a time, to process suffixes in each block starting with said prefix (p) that have a string index in said equal sized block i and inserting those suffixes in said tree such that during suffix sub-tree building at each iteration, said equal sized sub-string blocks being referenced by said partially constructed suffix sub-tree, and said suffixes being inserted into the sub-tree are maintained within said fixed memory amount and reused; and, merging said suffix sub-trees to form said suffix tree.
地址 Armonk NY US