Title of invention: Method and system for normalizing dirty text in a document
Abstract: A method and system for normalizing dirty text in a document. The invention creates a thesaurus that evolves over time as new document collections are analyzed. This thesaurus, which is used by an editor, contains standard terms and phrases together with their known variations. Documents are run through this editor, and misspelled words or phrases, joined words, and ad hoc abbreviations are replaced with standard terms from the thesaurus. The invention also enables normalization of documents in cases where the list of standard terms must be inferred from the document corpus. The normalizer facilitates data mining applications that cannot function properly with dirty text, resulting in more accurate analysis of documents. Over time, as the thesaurus evolves and collects more words and phrases, the process of generating the thesaurus becomes more automated.
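The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes the thesaurus is a plain dictionary from each standard term to its known variations, and the entries, the `THESAURUS` name, and the `build_lookup` and `normalize` helpers are all hypothetical. It covers only lookup-based replacement, not thesaurus generation or inference of standard terms from the corpus.

```python
import re

# Hypothetical thesaurus: standard term -> known dirty variations
# (misspellings, joined words, ad hoc abbreviations).
THESAURUS = {
    "printer": ["prntr", "pritner"],
    "hard disk": ["harddisk", "hd"],
    "operating system": ["o/s", "opsys"],
}


def build_lookup(thesaurus):
    """Invert the thesaurus into a variation -> standard term mapping."""
    lookup = {}
    for standard, variations in thesaurus.items():
        for variation in variations:
            lookup[variation.lower()] = standard
    return lookup


def normalize(text, thesaurus):
    """Replace every known variation in text with its standard term."""
    lookup = build_lookup(thesaurus)
    if not lookup:
        return text
    # Try longer variations first so "harddisk" wins over "hd".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Match case-insensitively, then look up the lowercased variation.
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


if __name__ == "__main__":
    print(normalize("The prntr and HardDisk fail under the o/s.", THESAURUS))
```

The lookarounds keep a variation from matching inside a longer word, so a short abbreviation such as "hd" is replaced only when it stands alone. A fuller system would also need fuzzy matching for variations not yet recorded in the thesaurus.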
Publication number: US2003014448(A1) Publication date: 2003.01.16
Application number: US20010905610 Filing date: 2001.07.13
Applicant: CASTELLANOS MARIA; STINGER JAMES R. Inventor: CASTELLANOS MARIA; STINGER JAMES R.
Classification: G06F17/27; (IPC1-7): G06F15/00 Main classification: G06F17/27
Agency: Agent:
Main claim:
Address: