Title of invention: NOISE REMOVAL SYSTEM FOR DOCUMENT DATA
Abstract: PROBLEM TO BE SOLVED: To provide a technique for automatically deleting unnecessary character strings from various kinds of document data as preprocessing for automatic keyword extraction. SOLUTION: A noise removal system 40 for document data includes a typical character string noise removing part 44 that reads the document data from a source document DB 42, matches the document data against each other row by row, extracts patterns that share the same character string, calculates the appearance frequency of each pattern, multiplies the length of each pattern by its appearance frequency to obtain a noise score, calculates a deviation value for each pattern from the noise scores, determines a row to be a noise row when its deviation value is 50 or more, and then deletes that row from the document data. COPYRIGHT: (C)2010,JPO&INPIT
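The scoring steps named in the abstract (pattern frequency, length times frequency, deviation value, threshold of 50) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that a "pattern" is a whole row after trimming whitespace, that the appearance frequency is the number of rows across all documents carrying that exact string, and that the deviation value is the conventional standardized score 50 + 10 × (score − mean) / standard deviation. The function names and the handling of the zero-variance case are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def noise_scores(documents):
    """Noise score per pattern: pattern length multiplied by appearance frequency.

    Assumption: a pattern is a whole non-empty row, compared after trimming.
    """
    counts = Counter(
        line.strip()
        for doc in documents
        for line in doc.splitlines()
        if line.strip()
    )
    return {pattern: len(pattern) * freq for pattern, freq in counts.items()}


def deviation_values(scores):
    """Deviation value per pattern: 50 + 10 * (score - mean) / std (assumed formula)."""
    values = list(scores.values())
    if not values:
        return {}
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: with identical scores nothing stands out, so flag nothing.
        return {}
    return {p: 50 + 10 * (s - mu) / sigma for p, s in scores.items()}


def remove_noise(documents, threshold=50.0):
    """Delete every row whose pattern has a deviation value >= threshold."""
    deviations = deviation_values(noise_scores(documents))
    noise = {p for p, d in deviations.items() if d >= threshold}
    return [
        "\n".join(line for line in doc.splitlines() if line.strip() not in noise)
        for doc in documents
    ]
```

Because a deviation value of 50 corresponds to the mean score, this threshold flags every pattern whose noise score is at or above the average. Long rows that repeat across many documents, such as a shared copyright footer, therefore score far above rows that occur only once.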
Publication number: JP2009271796(A)  Publication date: 2009.11.19
Application number: JP20080122781  Filing date: 2008.05.08
Applicant: NOMURA RESEARCH INSTITUTE LTD  Inventor: TAKEHARA GASUAKI
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim:
Address: