Title of invention |
METHODS AND SYSTEMS FOR GENERATION OF DOCUMENT STRUCTURES BASED ON SEQUENTIAL CONSTRAINTS |
Abstract |
Disclosed is a method that structures a sequentially ordered set of elements, each characterized by a set of features. N-grams (sequences of n features) are computed from each run of n contiguous elements, and the n-grams that repeat contiguously (Kleene cross) are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is then applied iteratively to this new sequence. The output is an ordered set of trees. |
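Publication number |
US2014365872(A1) |
Publication date |
2014.12.11 |
Application number |
US201313911452 |
Filing date |
2013.06.06 |
Applicant |
Xerox Corporation |
Inventor |
Déjean Hervé |
Classification number |
G06F17/21 |
Main classification number |
G06F17/21 |
Agency |
|
Agent |
|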
Main claim |
1. A computer implemented method of hierarchically segmenting a sequence of elements associated with a digital version of a document comprising:
a) obtaining a sequence of elements representing the document;
b) defining a set of named features associated with each element of the sequence of elements, each named feature defined by a feature value type;
c) computing a set of feature values associated with the set of named features for each element of the sequence;
d) generating a set of n-grams from the sequence of elements, an n-gram including an ordered sequence of n features provided by a sequence of n named elements;
e) electing sequential n-grams from the set of n-grams, the sequential n-grams defined as similar contiguous n-grams;
f) selecting the most frequent sequential n-gram from the elected sequential n-grams; and
g) generating a new sequence of the elements by matching the selected most frequent sequential n-gram against the sequence of elements associated with the document, replacing matched elements of the sequence of elements with a respective node, and associating the matched elements of the sequence of elements as children of the respective node. |
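One pass of steps d) through g) of the claim can be illustrated with the minimal Python sketch below. It is not the patented implementation and rests on several simplifying assumptions: each element carries a single feature value, "similar" n-grams are taken to mean exactly equal ones, n ranges from 1 to a hypothetical `max_n`, and each match becomes its own node. The names `Node`, `feature_of`, `sequential_ngrams`, and `segment_once`, as well as the toy input, are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node type: replaces one matched n-gram in the sequence."""
    feature: tuple                      # the n-gram this node stands for
    children: list = field(default_factory=list)


def feature_of(item):
    # Leaf elements are (text, feature_value) pairs; a Node carries its own feature.
    return item.feature if isinstance(item, Node) else item[1]


def sequential_ngrams(seq, max_n):
    """Steps d-e: count n-grams that are immediately followed by an equal n-gram."""
    feats = [feature_of(x) for x in seq]
    counts = Counter()
    for n in range(max_n, 0, -1):
        for i in range(len(feats) - 2 * n + 1):
            gram = tuple(feats[i:i + n])
            if gram == tuple(feats[i + n:i + 2 * n]):
                counts[gram] += 1
    return counts


def segment_once(seq, max_n=3):
    """Steps f-g: group every match of the most frequent sequential n-gram."""
    counts = sequential_ngrams(seq, max_n)
    if not counts:
        return list(seq)                # nothing repetitive: sequence unchanged
    gram = counts.most_common(1)[0][0]
    n, out, i = len(gram), [], 0
    while i < len(seq):
        window = seq[i:i + n]
        if tuple(feature_of(x) for x in window) == gram:
            # Matched elements become children of a new node.
            out.append(Node(feature=gram, children=list(window)))
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


# Toy input: each element is (text, feature value); the values are made up.
lines = [("1 Intro", "heading"), ("Some text", "para"),
         ("2 Method", "heading"), ("More text", "para"),
         ("3 Results", "heading"), ("Final text", "para"),
         ("Page 1", "footer")]
new_seq = segment_once(lines)
```

In this toy input the pair (heading, para) is the most frequent contiguous repetition, so the new sequence holds three nodes followed by the unmatched footer element. Applying `segment_once` again to `new_seq` corresponds to the iteration described in the abstract; a stopping rule would be needed, since this sketch does not define one.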
Address |
Norwalk CT US |