Abstract |
PROBLEM TO BE SOLVED: To provide a technique for automatically deleting unnecessary character strings from various kinds of document data as preprocessing for automatic keyword extraction. SOLUTION: A noise removal system 40 for document data includes a typical character string noise removing part 44, which reads the document data from a source document DB 42, matches the documents against one another row by row, extracts patterns that share the same character string, and calculates the appearance frequency of each pattern. It multiplies the length of each pattern by its appearance frequency to obtain a noise score, calculates a deviation value for each pattern from the noise scores, determines a row to be a noise row when its deviation value is 50 or more, and deletes that row from the document data. COPYRIGHT: (C)2010,JPO&INPIT
|
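The scoring steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `find_noise_rows` and `remove_noise` are hypothetical, a "pattern" is assumed to be any non-empty row that occurs at least twice across the documents, and the deviation value is assumed to be the usual standard score, 50 + 10 × (score − mean) / standard deviation, computed with the population standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev


def find_noise_rows(documents, threshold=50.0):
    """Return the set of rows judged to be noise (hypothetical helper)."""
    # Count how often each non-empty row appears across all documents.
    counts = Counter(
        row.strip()
        for doc in documents
        for row in doc.splitlines()
        if row.strip()
    )
    # Assumption: a pattern is a row that occurs at least twice.
    patterns = {row: n for row, n in counts.items() if n >= 2}
    if not patterns:
        return set()

    # Noise score = pattern length x appearance frequency.
    scores = {row: len(row) * n for row, n in patterns.items()}
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: with identical scores nothing stands out, so
        # no row is flagged. The abstract does not cover this case.
        return set()

    # Deviation value (assumed form): 50 + 10 * (score - mean) / std.
    return {
        row
        for row, score in scores.items()
        if 50 + 10 * (score - mu) / sigma >= threshold
    }


def remove_noise(documents, threshold=50.0):
    """Delete every noise row from each document (hypothetical helper)."""
    noise = find_noise_rows(documents, threshold)
    return [
        "\n".join(r for r in doc.splitlines() if r.strip() not in noise)
        for doc in documents
    ]
```

Calling `remove_noise(docs)` on a list of document strings returns the same documents with the flagged rows dropped. Because the threshold of 50 corresponds to the mean score, long or frequently repeated rows are removed while short, rarely repeated ones are kept.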