Title of invention: Method and system for motif extraction in electronic documents
Abstract: A method, system, and computer program product for extracting text motifs from electronic documents are disclosed. A user provides a largest-maximal repeat or a super-maximal repeat as a first text block. Occurrences of the first text block are detected, and second text blocks in the vicinity of those occurrences are identified on the basis of pre-defined parameters. The text motifs are determined by combining the first text block and the second text block. Finally, the text motifs are extracted from the electronic documents.
Publication number: US9483463(B2) Publication date: 2016.11.01
Application number: US201213608312 Filing date: 2012.09.10
Applicant: Xerox Corporation Inventors: Galle Matthias;Renders Jean-Michel
Classification: G10L15/22;G10L15/187;G06Q30/04;G06Q30/02;G06Q40/00;G06F17/27;G06F17/24;G06F19/18;G06F19/22;G06F7/24 Main classification: G10L15/22
Agency: Jones Robb, PLLC Agent: Jones Robb, PLLC
Principal claim: 1. A method for extracting one or more text motifs from one or more electronic documents, the method comprising: receiving, by a processor, a first text block from a user, wherein the first text block corresponds to at least one of a largest-maximal repeat (LMR) or a super-maximal repeat (SMR); receiving, by the processor via an input device coupled to the processor, an array of repeats, wherein the array of repeats comprises a list of repeats of one or more text blocks occurring at one or more positions in the one or more electronic documents; detecting, by the processor, one or more occurrences of the first text block in the one or more electronic documents based on the array of repeats; identifying, in the one or more electronic documents by the processor, a second text block in vicinity of an occurrence of the first text block based on a pre-defined set of parameters, such that the second text block and the first text block are repeated together at least two times in the one or more electronic documents, wherein the pre-defined set of parameters comprises a maximum number of text blocks in the one or more text motifs; determining, by the processor, the one or more text motifs in the one or more electronic documents, wherein each of the one or more text motifs is a combination of the first text block and the second text block; checking, by the processor, for extension of the one or more text motifs if a number of text blocks in the one or more text motifs is less than the maximum number of text blocks in the one or more text motifs, wherein the checking for extension further comprises using the one or more text motifs as a new text block and repeating the identifying and determining steps for the new text block until the maximum number of text blocks is reached; extracting, by the processor, the one or more text motifs from each of the one or more electronic documents; and creating, by the processor, a template by collating the one or more extracted text motifs.
Address: Norwalk CT US
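
Illustrative sketch (not part of the patent record): the steps recited in the principal claim can be approximated in a few lines of Python. This is a simplified reading under stated assumptions, not the patented implementation. It assumes whitespace tokenization, finds occurrences of the first text block by a linear scan instead of a precomputed array of repeats, and treats each "second text block" as a single token at a fixed offset from the first block. The names `occurrences`, `extract_motif`, `render_template`, `window`, and `min_support` are hypothetical and do not appear in the source.

```python
from collections import Counter


def occurrences(tokens, block):
    """Return the start indices at which the token tuple `block` occurs."""
    n = len(block)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == tuple(block)]


def extract_motif(tokens, first_block, max_blocks=3, window=2, min_support=2):
    """Grow a motif around `first_block`, one co-occurring token at a time.

    The motif is a dict {relative offset: token}, anchored at the start of
    the first block. `window` is an assumed stand-in for "vicinity", and
    `min_support=2` mirrors "repeated together at least two times".
    """
    motif = {k: tok for k, tok in enumerate(first_block)}
    starts = occurrences(tokens, first_block)
    blocks = 1
    while blocks < max_blocks:
        lo, hi = min(motif) - window, max(motif) + window
        counts = Counter()
        for s in starts:
            for off in range(lo, hi + 1):
                if off not in motif and 0 <= s + off < len(tokens):
                    counts[(off, tokens[s + off])] += 1
        if not counts:
            break
        # Prefer the highest support, then the nearest offset (deterministic).
        best = min(counts, key=lambda c: (-counts[c], abs(c[0]), c))
        if counts[best] < min_support:
            break
        off, tok = best
        motif[off] = tok
        # Keep only the occurrences that also contain the new block.
        starts = [s for s in starts
                  if 0 <= s + off < len(tokens) and tokens[s + off] == tok]
        blocks += 1
    return motif, starts


def render_template(motif):
    """Collate the motif into a template string; '*' marks a variable slot."""
    lo, hi = min(motif), max(motif)
    return " ".join(motif.get(k, "*") for k in range(lo, hi + 1))


tokens = "Invoice No 123 Date 2012 Invoice No 456 Date 2013 Total 9".split()
motif, starts = extract_motif(tokens, ("Invoice", "No"))
print(starts)                  # [0, 5]
print(render_template(motif))  # Invoice No * Date
```

In this toy input the first block "Invoice No" occurs twice, and the token "Date" appears at the same relative offset in both occurrences, so it is merged into the motif. No further token is shared by both occurrences, so extension stops before `max_blocks` is reached.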