Title of invention |
Bootstrapping language models for spoken dialog systems using the world wide web |
Abstract |
A system, method, and computer-readable medium that generate a language model from web domain data are disclosed. The method may include filtering the web domain data to remove unwanted data, extracting predicate/argument pairs from the filtered web data, generating conversational utterances by merging the extracted predicate/argument pairs into conversational templates, and generating a web data language model using the generated conversational utterances. |
Application publication number |
US9299345(B1) |
Application publication date |
2016.03.29 |
Application number |
US200611425243 |
Filing date |
2006.06.20 |
Applicant |
AT&T Intellectual Property II, L.P. |
Inventors |
Gilbert Mazin;Hakkani-Tur Dilek Z. |
Classification numbers |
G10L15/00;G10L15/14;G10L15/22;G10L15/30 |
Main classification number |
G10L15/00 |
Agency |
|
Agent |
|
Principal claim |
1. A method comprising:
identifying, via a processor communicating with Internet resources, common task independent web-sentences based on frequently occurring phrases across multiple websites from a web domain stored in a data store;
selectively removing the common task independent web-sentences from the web domain data, to yield filtered web domain data comprising domain-specific data;
identifying, via the processor, predicate/argument pairs from the filtered web domain data;
replacing, via the processor, the predicate/argument pairs with predicate/argument tokens;
generating, via the processor, conversational utterances by merging the predicate/argument tokens with manually written conversational templates while preserving a relative frequency of the manually written conversational templates, to yield generated conversational utterances; and
generating, via the processor, a web data language model using the generated conversational utterances, and providing it as an initial language model for deployment of an automated speech recognition system. |
Address |
Atlanta GA US |
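
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it is hypothetical: `filter_common_sentences`, `extract_pairs`, `generate_utterances`, and `bigram_model` are illustrative helpers, not functions from the patent. Several simplifications are assumed. A sentence is treated as "common" if it appears verbatim on at least `min_sites` websites. Predicate/argument extraction is approximated by matching a leading verb from a supplied list, where a real system would likely use a semantic parser. The token replacement step is folded into filling `{pred}` and `{arg}` slots in the templates. The language model is an unsmoothed maximum-likelihood bigram model.

```python
from collections import Counter


def filter_common_sentences(site_sentences, min_sites=2):
    """Drop sentences that appear on at least `min_sites` websites.

    site_sentences maps a site name to its list of sentences.
    Assumption: exact string match stands in for 'frequently occurring phrases'.
    """
    site_counts = Counter()
    for sentences in site_sentences.values():
        for sentence in set(sentences):  # count each site at most once
            site_counts[sentence] += 1
    common = {s for s, n in site_counts.items() if n >= min_sites}
    return [s for sentences in site_sentences.values()
            for s in sentences if s not in common]


def extract_pairs(sentences, verbs):
    """Toy predicate/argument extraction: leading verb plus the remaining words.

    Assumption: a stand-in for a real semantic parser.
    """
    pairs = []
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) >= 2 and words[0] in verbs:
            pairs.append((words[0], " ".join(words[1:])))
    return pairs


def generate_utterances(pairs, templates):
    """Fill each template with each pair, repeated by the template's count.

    templates maps a template string with {pred} and {arg} slots to an
    integer count, so the templates' relative frequencies are preserved.
    """
    utterances = []
    for pred, arg in pairs:
        for template, count in templates.items():
            utterances.extend([template.format(pred=pred, arg=arg)] * count)
    return utterances


def bigram_model(utterances):
    """Unsmoothed maximum-likelihood bigram probabilities P(next | previous)."""
    bigrams, contexts = Counter(), Counter()
    for utterance in utterances:
        tokens = ["<s>"] + utterance.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}
```

As a usage example under the same assumptions: given two sites that both contain "contact us", that sentence is filtered out, the remaining sentences yield pairs such as `("pay", "your bill online")`, and templates such as `{"i want to {pred} {arg}": 2, "can i {pred} {arg}": 1}` produce utterances in a 2:1 ratio before the bigram counts are taken. A deployed system would add smoothing and a larger template set.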