Title: Automated data cleanup by substitution of words of the same pronunciation and different spelling in speech recognition
Abstract: The described implementations relate to automated data cleanup. One system includes a language model generated from language model seed text and a dictionary of possible data substitutions. This system also includes a transducer configured to cleanse a corpus utilizing the language model and the dictionary. The transducer can process speech recognition data in some cases by substituting a second word for a first word which shares pronunciation with the first word but is spelled differently. In some cases, this can be accomplished by establishing corresponding probabilities of the first word and second word based on a third word that appears in sequence with the first word.
Publication number: US9460708(B2)
Publication date: 2016.10.04
Application number: US200912561521
Filing date: 2009.09.17
Applicant: Microsoft Technology Licensing, LLC
Inventors: Zweig Geoffrey; Ju Yun-Cheng
Classification: G06F17/20; G06F17/27; G10L15/06; G10L15/187
Main classification: G06F17/20
Agency: (none listed)
Agents: Corie Alin; Swain Sandy; Minhas Micky
Main claim: 1. A system, comprising:
  a language model generated from language model seed text, the language model seed text comprising first entries that correctly utilize a first word and second entries that correctly utilize a second word, wherein the second word shares a pronunciation with the first word and has a different spelling than the first word;
  a dictionary of available data substitutions, the available data substitutions including a substitution of the second word for the first word;
  a transducer configured to process speech recognition data utilizing the language model and the dictionary, wherein, to process the speech recognition data, the transducer is further configured to:
    establish probabilities including a first probability of a first alternative that replaces an occurrence of the first word in the speech recognition data with the second word, and a second probability of a second alternative that leaves the occurrence of the first word in the speech recognition data without modification, the probabilities being established based on a third word that appears in sequence with the occurrence of the first word in the speech recognition data; and
    when the first probability exceeds the second probability, applying the first alternative by replacing the occurrence of the first word in the speech recognition data with the second word that shares the pronunciation with the first word and has a different spelling than the first word; and
  a computing device configured to execute at least the transducer.
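Illustrative sketch (not part of the patent record): the decision rule in claim 1 can be approximated with a small bigram model. The code below is a minimal sketch under stated assumptions, not the patented implementation. It assumes that the "third word" is the word immediately preceding the candidate, that the language model is an add-one smoothed bigram model, and that the dictionary is a plain word-to-word mapping. The names `train_bigram_counts`, `bigram_prob`, and `cleanse` are hypothetical, as is the toy seed text.

```python
from collections import Counter


def train_bigram_counts(seed_sentences):
    """Count unigrams and bigrams in the language model seed text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in seed_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev) (an assumed model choice)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def cleanse(words, substitutions, unigrams, bigrams):
    """Replace a word with its same-sounding alternative when that is more probable.

    Assumption: the context ("third word") is the preceding word in the
    already-cleansed output. A word with no preceding context is left as-is.
    """
    out = []
    for word in words:
        candidate = substitutions.get(word)
        if candidate is not None and out:
            prev = out[-1]
            p_replace = bigram_prob(prev, candidate, unigrams, bigrams)  # first alternative
            p_keep = bigram_prob(prev, word, unigrams, bigrams)          # second alternative
            if p_replace > p_keep:
                word = candidate
        out.append(word)
    return out


# Toy seed text: one entry that correctly uses "two", one that correctly uses "to".
seed = ["i have two dogs", "i want to go"]
unigrams, bigrams = train_bigram_counts(seed)
substitutions = {"to": "two", "two": "to"}

print(cleanse("i have to dogs".split(), substitutions, unigrams, bigrams))
# ['i', 'have', 'two', 'dogs']
print(cleanse("i want to go".split(), substitutions, unigrams, bigrams))
# ['i', 'want', 'to', 'go']
```

In the first call the misrecognized "to" follows "have", which the seed text pairs with "two", so the replacement probability (2/8) exceeds the keep probability (1/8) and the substitution is applied. In the second call the context word "want" favors "to", so the word is left unmodified.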
Address: Redmond WA US