Title of invention: Methods and systems for generation of document structures based on sequential constraints
Abstract: Disclosed is a method that structures a sequentially ordered set of elements, each characterized by a set of features. N-grams (sequences of n features) are computed from the features of n contiguous elements, and the n-grams that are repetitive (Kleene cross) are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is applied iteratively to this new sequence. The output is an ordered set of trees.
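The iteration described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation, and it makes several assumptions that the record does not state: each element carries a single feature value, an n-gram counts as repetitive when it occurs at least twice back to back, and each maximal run of such occurrences is grouped flat under one new node. The names `Node`, `repetitive_ngrams`, `group_runs`, `structure`, and `max_n` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    feature: object                      # value used when building n-grams
    children: list = field(default_factory=list)


def ngrams(seq, n):
    """All n-grams of feature values over n contiguous elements."""
    return [tuple(e.feature for e in seq[i:i + n])
            for i in range(len(seq) - n + 1)]


def repetitive_ngrams(seq, max_n):
    """Count n-grams that are immediately followed by an identical n-gram."""
    counts = Counter()
    for n in range(1, max_n + 1):
        grams = ngrams(seq, n)
        for i in range(len(grams) - n):
            if grams[i] == grams[i + n]:
                counts[grams[i]] += 1
    return counts


def group_runs(seq, gram):
    """Replace each run of two or more back-to-back matches with one node."""
    n, out, i = len(gram), [], 0
    while i < len(seq):
        k = 0  # number of consecutive occurrences of gram starting at i
        while tuple(e.feature for e in seq[i + k * n:i + (k + 1) * n]) == gram:
            k += 1
        if k >= 2:
            # Assumption: the new node's feature encodes the repeated pattern.
            out.append(Node(("+", gram), seq[i:i + k * n]))
            i += k * n
        else:
            out.append(seq[i])
            i += 1
    return out


def structure(features, max_n=3):
    """Group the most frequent repetitive n-gram until none remains."""
    seq = [Node(f) for f in features]
    while True:
        counts = repetitive_ngrams(seq, max_n)
        if not counts:
            return seq                   # ordered list of trees
        best = max(counts, key=lambda g: (counts[g], len(g)))
        seq = group_runs(seq, best)


# Usage: "T" might stand for a title line and "p" for a paragraph line.
trees = structure(list("TpppTpp"))
```

In this example the first pass groups the two runs of `p` elements, and the second pass groups the resulting title/paragraph-group pairs, leaving a single tree. Every pass replaces at least two elements with one node, so the loop terminates.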
Publication number: US9524274(B2) Publication date: 2016.12.20
Application number: US201313911452 Filing date: 2013.06.06
Applicant: Xerox Corporation Inventor: Déjean Hervé
Classification: G06F17/00;G06F17/21;G06K9/62;G06K9/72;G06F17/22;G06K9/00 Main classification: G06F17/00
Agency: Fay Sharpe LLP Agent: Fay Sharpe LLP
Main claim: 1. A computer implemented method of hierarchically segmenting a sequence of elements associated with a digital version of a document comprising: a) obtaining a sequence of elements representing the document; b) defining a set of named features associated with each element of the sequence of elements, each named feature defined by a feature value type; c) computing a set of feature values associated with the set of named features for each element of the sequence; d) generating a set of n-grams from the sequence of elements, an n-gram including an ordered sequence of n features provided by a sequence of n named elements; e) electing sequential n-grams from the set of n-grams, the sequential n-grams defined as similar contiguous n-grams; f) selecting the most frequent sequential n-gram from the elected sequential n-grams; g) generating a new sequence of the elements by matching the selected most frequent sequential n-gram against the sequence of elements associated with the document, replacing matched elements of the sequence of elements with a respective node, and associating the matched elements of the sequence of elements as children of the respective node; and h) iteratively repeating steps d)-g) on the new sequence of elements generated in step g) until all sequential n-grams associated with the sequence of elements are matched against the sequence of elements associated with the document, the respective matched elements of the sequence of elements are replaced with a respective node, and the respective matched elements of the sequence of elements are associated as children of the respective node, wherein step d) includes: d1) calibrating the set of named feature values for each element of the sequence by assigning equal feature values to named features which are fuzzily equal; and d2) generating a set of n-grams from the sequence of elements and calibrated set of named feature values, an n-gram including an ordered sequence of n named features provided by a sequence of n elements; and wherein step g) includes: g1) matching the selected most frequent sequential n-gram against the sequence of elements to determine all sub-sequences of n contiguous elements which are matched by the selected n-gram; g2) enriching the determined sub-sequences by generating n-grams for each sub-sequence; and g3) generating a new sequence of elements by replacing each sub-sequence of contiguous matched elements with a respective node and associating the matched elements of the sequence of elements as children of the respective node.
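Step d1) of the claim assigns equal values to feature values that are fuzzily equal, without this record saying how fuzzy equality is decided. The Python sketch below shows one possible reading for a numeric feature such as a left margin. The function name `calibrate`, the tolerance parameter `tol`, and the greedy clustering rule are all assumptions made for illustration, not details taken from the patent.

```python
def calibrate(values, tol=1.0):
    """Map each numeric value to the anchor of its cluster.

    Assumption: values are scanned in sorted order, and a value joins the
    current cluster when it lies within `tol` of that cluster's smallest
    value (the anchor); otherwise it starts a new cluster.
    """
    mapping = {}
    anchor = None
    for v in sorted(set(values)):
        if anchor is None or v - anchor > tol:
            anchor = v                   # start a new cluster at v
        mapping[v] = anchor
    return [mapping[v] for v in values]


# Usage: left-margin positions that differ only by small layout noise.
margins = [72.0, 72.4, 90.1, 71.8, 90.0]
calibrated = calibrate(margins, tol=1.0)
# calibrated == [71.8, 71.8, 90.0, 71.8, 90.0]
```

After calibration, elements whose margins differed only by noise share one feature value, so the n-grams built from them in step d2) can match exactly.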
Address: Norwalk CT US