Title of invention: EXTENDED-CONTEXT-DIVERSE REPEATS
Abstract: A method for identifying repeat subsequences based on the diversity of their extended contexts includes identifying repeat subsequences of symbols in a sequence that are left- and/or right-maximal and that have at least a threshold number of different left and/or right contexts. The different right contexts are all right-maximal repeats with respect to the subsequences of symbols that immediately follow an occurrence of the respective repeat subsequence. Similarly, the different left contexts are all left-maximal repeats with respect to the subsequences of symbols that immediately precede an occurrence of the respective repeat subsequence. Repeat subsequences of this class are referred to as extended-context-diverse repeats because their contexts are not limited to a single symbol. They can be output or used for characterizing the sequence or a collection of sequences, such as a document or collection of documents.
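Publication number: US2015370781(A1) Publication date: 2015.12.24
Application number: US201414311993 Filing date: 2014.06.23
Applicant: Xerox Corporation Inventor: GALLÉ Matthias
Classification number: G06F17/27 Main classification number: G06F17/27
Agency: Agent: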
Principal claim: 1. A method comprising: receiving a sequence of symbols, the symbols being drawn from an alphabet; with a processor, providing for identifying repeat subsequences of the symbols in the sequence, each of the identified repeat subsequences being a repeat subsequence which is at least one of left-maximal and right-maximal in the sequence, each identified repeat subsequence having at least one of: at least one different right context in the sequence, each of the at least one different right contexts comprising a respective different subsequence of the symbols in the sequence which immediately follows an occurrence of the repeat subsequence in the sequence, each of the different right contexts being a right-maximal repeat with respect only to subsequences of the symbols that immediately follow an occurrence of the respective repeat subsequence, and at least one different left context in the sequence, each of the at least one different left contexts comprising a respective different subsequence of the symbols in the sequence which immediately precedes an occurrence of the repeat subsequence in the sequence, each of the different left contexts being a left-maximal repeat with respect only to subsequences of the symbols that immediately precede an occurrence of the respective repeat subsequence; and outputting at least one of: at least one of the identified repeat subsequences as an extended-context-diverse repeat subsequence, and information based on the identified extended-context-diverse repeat subsequences of symbols.
Address: Norwalk, CT, US
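
The definitions in the abstract and principal claim can be illustrated with a small brute-force sketch. This is one possible reading of the record, not the patented implementation. It assumes the sequence is a Python string whose characters are the symbols, and it treats a right context as a non-empty string that begins at least two of the subsequences following the repeat and is followed there by at least two different continuations. Left contexts are obtained by applying the same test to the reversed sequence. The function names and the `threshold` parameter are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def occurrences(seq, sub):
    """Start indices of every (possibly overlapping) occurrence of sub in seq."""
    return [i for i in range(len(seq) - len(sub) + 1) if seq.startswith(sub, i)]


def is_right_maximal(seq, sub):
    """True if occurrences of sub are followed by at least two different symbols.
    The end of the sequence (None) counts as its own symbol."""
    nxt = set()
    for i in occurrences(seq, sub):
        j = i + len(sub)
        nxt.add(seq[j] if j < len(seq) else None)
    return len(nxt) > 1


def right_contexts(seq, sub):
    """Assumed reading: non-empty strings w that are right-maximal repeats
    among the subsequences immediately following occurrences of sub."""
    followers = [seq[i + len(sub):] for i in occurrences(seq, sub)]
    next_symbols = defaultdict(set)
    for f in followers:
        for n in range(1, len(f) + 1):
            # Record which symbol (or end of sequence) follows the prefix f[:n].
            next_symbols[f[:n]].add(f[n] if n < len(f) else None)
    # Two different continuations imply w starts at least two followers.
    return {w for w, nxt in next_symbols.items() if len(nxt) > 1}


def extended_context_diverse_repeats(seq, threshold=2):
    """Map each qualifying repeat to (left_contexts, right_contexts)."""
    rev = seq[::-1]
    results = {}
    seen = set()
    for i in range(len(seq)):
        for j in range(i + 1, len(seq) + 1):
            sub = seq[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            if len(occurrences(seq, sub)) < 2:
                continue  # not a repeat
            # Left-maximality is right-maximality in the reversed sequence.
            if not (is_right_maximal(seq, sub) or is_right_maximal(rev, sub[::-1])):
                continue
            rc = right_contexts(seq, sub)
            lc = {w[::-1] for w in right_contexts(rev, sub[::-1])}
            if len(rc) >= threshold or len(lc) >= threshold:
                results[sub] = (lc, rc)
    return results
```

Under these assumptions, calling `extended_context_diverse_repeats("abcdxabcdyabcez", threshold=2)` returns the repeats `"a"` and `"ab"`. For `"ab"`, the right contexts are `"c"` (continued by `d` or `e`) and `"cd"` (continued by `x` or `y`). The sketch enumerates every substring and is meant only to make the definitions concrete, so it is not suitable for long sequences.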