Title of invention |
METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS |
Abstract |
A computer-implemented method, computer program product, and data processing system for identifying common structures shared across a plurality of formatted text documents. The common structure is represented as a sequence of landmarks, each of which has a starting marker and an ending marker that delimit a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidate markers for landmarks. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which is an advantage over the current practice of concatenating content.
|
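The counting step described in the abstract can be illustrated with a minimal sketch. It is not the claimed method: it assumes that a "segment" is a whitespace-trimmed non-empty line, and the names `split_segments`, `marker_candidates`, and the `min_support` threshold are hypothetical, introduced only for illustration.

```python
from collections import Counter


def split_segments(document):
    """Split a document into non-empty, whitespace-trimmed lines (an assumed segmentation)."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def marker_candidates(documents, min_support=0.8):
    """Return adjacent segment pairs found in at least min_support of the documents."""
    pair_counts = Counter()
    for document in documents:
        segments = split_segments(document)
        # Count each adjacent pair once per document, so the count is a document frequency.
        pair_counts.update(set(zip(segments, segments[1:])))
    threshold = min_support * len(documents)
    return sorted(pair for pair, count in pair_counts.items() if count >= threshold)


# Three toy documents that share a layout but differ in body text.
docs = [
    "<h1>Overview</h1>\n<p>\nAlpha text\n</p>\n<h1>Details</h1>",
    "<h1>Overview</h1>\n<p>\nBeta text\n</p>\n<h1>Details</h1>",
    "<h1>Overview</h1>\n<p>\nGamma text\n</p>\n<h1>Details</h1>",
]
candidates = marker_candidates(docs)
print(candidates)
```

In this toy input, only the pairs that surround the varying body text survive the threshold, which is the intuition behind treating them as starting and ending markers of a landmark. Extracting styling rules and merging content are not covered by this sketch.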
Publication number |
US2011137900(A1) |
Publication date |
2011.06.09 |
Application number |
US20090634176 |
Filing date |
2009.12.09 |
Applicant |
INTERNATIONAL BUSINESS MACHINES CORPORATION |
Inventors |
CHANG YUAN-CHI;MUKHERJEE DEBDOOT;SINHA VIBHA SINGHAL;SRIVASTAVA BIPLAV |
Classification numbers |
G06F17/00;G06F17/21;G06F17/30 |
Main classification number |
G06F17/00 |
Agency |
|
Agent |
|
Main claim |
|
Address |
|