Title of invention: Data cleansing system and method
Abstract: An automated system and method are provided for debugging training data used to train an automated language identifier. The system and method collect texts written in a particular language, generate an occurrence count for each word in each text by counting the number of times the word is found within that text, and generate an occurrence ratio (OR) for each word by dividing its occurrence count by the total number of words in that text. A word is then filtered out of any text in which its occurrence ratio is substantially higher than its occurrence ratio in at least one of the other texts, producing a clean text.
Publication number: US7729899(B2)  Publication date: 2010.06.01
Application number: US20070702811  Filing date: 2007.02.06
Applicant: BASIS TECHNOLOGY CORPORATION  Inventor: OTSUKA NOBUO
Classification: G06F17/28; G06F17/27  Main classification: G06F17/28
Agency:  Agent:
Main claim:
Address:
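
The filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the record does not define "substantially higher", so the `factor` multiplier and the `floor` value (used when a word is rare or absent in the comparison text) are assumptions, as are the function names and the whitespace tokenization.

```python
from collections import Counter


def occurrence_ratios(words):
    """Map each word to its occurrence count divided by the text's word total."""
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def clean_texts(texts, factor=5.0, floor=0.05):
    """Drop words that are over-represented in one text relative to another.

    Assumption: "substantially higher" means the word's ratio in this text
    exceeds `factor` times its ratio in at least one other text, where the
    other ratio is never taken to be below `floor`. Both are hypothetical.
    """
    # Assumption: lowercase and split on whitespace; real tokenization may differ.
    tokenized = [text.lower().split() for text in texts]
    ratios = [occurrence_ratios(words) for words in tokenized]

    cleaned = []
    for i, words in enumerate(tokenized):
        others = [r for j, r in enumerate(ratios) if j != i]

        def is_overrepresented(word, own=ratios[i], others=others):
            return any(
                own[word] > factor * max(other.get(word, 0.0), floor)
                for other in others
            )

        cleaned.append([w for w in words if not is_overrepresented(w)])
    return cleaned


if __name__ == "__main__":
    sample = [
        "spam spam spam spam the cat sat",
        "the dog sat on the mat",
        "the cat ran to the dog",
    ]
    for words in clean_texts(sample):
        print(" ".join(words))
```

In this toy input, "spam" dominates the first text and is absent from the others, so it is removed there, while ordinary words such as "the" are kept because their ratios are comparable across texts.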