Invention title: METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS
Abstract: A computer-implemented method, computer program product, and data processing system for identifying common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting marker and an ending marker that describe the borders of a span of text. The common structure is identified by counting the occurrences of repeating text segments across the documents. Adjacent segments that frequently co-occur become candidates for landmark markers. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which offers an advantage over the current practice of content concatenation.
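The counting step described in the abstract can be illustrated with a minimal sketch. It is not the patented method itself: the segmentation rule (splitting on markup tags), the function names `split_segments` and `candidate_markers`, and the `min_support` threshold are all assumptions introduced here for illustration.

```python
import re
from collections import Counter


def split_segments(document):
    """Split a formatted document into tag and text segments (assumed rule)."""
    parts = re.findall(r"<[^>]+>|[^<]+", document)
    return [p.strip() for p in parts if p.strip()]


def candidate_markers(documents, min_support=1.0):
    """Return adjacent segment pairs found in at least min_support of the documents.

    min_support is a hypothetical parameter: the fraction of documents in
    which a pair must occur to be kept as a marker candidate.
    """
    pair_counts = Counter()
    for document in documents:
        segments = split_segments(document)
        # Count each adjacent pair at most once per document.
        pair_counts.update(set(zip(segments, segments[1:])))
    threshold = min_support * len(documents)
    return sorted(pair for pair, n in pair_counts.items() if n >= threshold)


docs = [
    "<h1>Summary</h1><p>Alpha text</p><h1>Details</h1><p>More alpha</p>",
    "<h1>Summary</h1><p>Beta text</p><h1>Details</h1><p>More beta</p>",
]
markers = candidate_markers(docs)
```

In this sketch, pairs such as `("<h1>", "Summary")` survive because they recur in every document, while pairs that include document-specific text such as `"Alpha text"` are dropped. Turning the surviving pairs into starting and ending markers of landmarks would require further steps that the abstract does not detail.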
Publication number: US2011137900(A1)  Publication date: 2011.06.09
Application number: US20090634176  Filing date: 2009.12.09
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION  Inventors: CHANG YUAN-CHI; MUKHERJEE DEBDOOT; SINHA VIBHA SINGHAL; SRIVASTAVA BIPLAV
Classification: G06F17/00; G06F17/21; G06F17/30  Main classification: G06F17/00
Agency:  Agent:
Main claim:
Address: