Main claim |
1. A computer implemented method of hierarchically segmenting a sequence of elements associated with a digital version of a document comprising:
a) obtaining a sequence of elements representing the document;
b) defining a set of named features associated with each element of the sequence of elements, each named feature defined by a feature value type;
c) computing a set of feature values associated with the set of named features for each element of the sequence;
d) generating a set of n-grams from the sequence of elements, an n-gram including an ordered sequence of n features provided by a sequence of n named elements;
e) electing sequential n-grams from the set of n-grams, the sequential n-grams defined as similar contiguous n-grams;
f) selecting the most frequent sequential n-gram from the elected sequential n-grams;
g) generating a new sequence of the elements by matching the selected most frequent sequential n-gram against the sequence of elements associated with the document, replacing matched elements of the sequence of elements with a respective node, and associating the matched elements of the sequence of elements as children of the respective node; and
h) iteratively repeating steps d)-g) on the new sequence of elements generated in step g) until all sequential n-grams associated with the sequence of elements are matched against the sequence of elements associated with the document, the respective matched elements of the sequence of elements are replaced with a respective node, and the respective matched elements of the sequence of elements are associated as children of the respective node, wherein step d) includes:
d1) calibrating the set of named feature values for each element of the sequence by assigning equal feature values to named features which are fuzzily equal; and
d2) generating a set of n-grams from the sequence of elements and calibrated set of named feature values, an n-gram including an ordered sequence of n named features provided by a sequence of n elements; and wherein step g) includes:
g1) matching the selected most frequent sequential n-gram against the sequence of elements to determine all sub-sequences of n contiguous elements which are matched by the selected n-gram;
g2) enriching the determined sub-sequences by generating n-grams for each sub-sequence; and
g3) generating a new sequence of elements by replacing each sub-sequence of contiguous matched elements with a respective node and associating the matched elements of the sequence of elements as children of the respective node. |
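
The claimed steps can be illustrated with a minimal Python sketch. It is one possible reading of the claim, not the patented implementation, and it rests on several assumptions: each element carries a single numeric feature (for example a font size), "fuzzily equal" means "within a tolerance `tol`", a sequential n-gram is one that repeats back to back at least twice, and each run of repeats is wrapped in one node whose children are per-occurrence nodes. The enrichment of step g2) is not modeled. The names `Node`, `calibrate`, `sequential_ngrams`, `group_runs`, and `segment` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    feature: object                      # leaf: calibrated value; inner node: matched n-gram
    children: list = field(default_factory=list)


def calibrate(values, tol):
    """Step d1 (assumed form): give fuzzily equal values one shared value."""
    canon, out = [], []
    for v in values:
        # Reuse the first earlier representative within tol, if any.
        rep = next((c for c in canon if abs(c - v) <= tol), None)
        if rep is None:
            canon.append(v)
            rep = v
        out.append(rep)
    return out


def sequential_ngrams(seq, max_n):
    """Steps d2 and e (assumed form): count n-grams immediately followed by themselves."""
    feats = [node.feature for node in seq]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(feats) - 2 * n + 1):
            gram = tuple(feats[i:i + n])
            if gram == tuple(feats[i + n:i + 2 * n]):
                counts[gram] += 1
    return counts


def group_runs(seq, gram):
    """Step g (assumed form): wrap each occurrence, then each run of >= 2 occurrences."""
    n, out, i = len(gram), [], 0
    while i < len(seq):
        occurrences, j = [], i
        while tuple(x.feature for x in seq[j:j + n]) == gram:
            occurrences.append(Node(gram, seq[j:j + n]))
            j += n
        if len(occurrences) >= 2:
            out.append(Node(gram, occurrences))   # matched elements become children
            i = j
        else:
            out.append(seq[i])                    # isolated match: left untouched here
            i += 1
    return out


def segment(values, tol=0.2, max_n=3):
    """Steps a-h (sketch): iterate until no sequential n-gram remains."""
    seq = [Node(v) for v in calibrate(values, tol)]
    while True:
        counts = sequential_ngrams(seq, max_n)
        if not counts:
            return seq
        gram, _ = counts.most_common(1)[0]        # step f: most frequent sequential n-gram
        seq = group_runs(seq, gram)


# Hypothetical input: font sizes of nine lines (one heading followed by two body lines, three times).
tree = segment([12.0, 10.0, 10.1, 12.0, 9.9, 10.0, 12.1, 10.0, 10.0])
```

In this sketch every pass replaces at least one run of two or more elements with a single node, so the sequence shrinks on each iteration and the loop terminates. For the sample input, the first pass groups the repeated body lines and the second pass groups the three heading-plus-body blocks under a single root node.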