Title: Bootstrapping language models for spoken dialog systems using the world wide web
Abstract: A system, method, and computer-readable medium that generate a language model from web domain data are disclosed. The method may include filtering the web domain data to remove unwanted data, extracting predicate/argument pairs from the filtered web data, generating conversational utterances by merging the extracted predicate/argument pairs into conversational templates, and generating a web data language model using the generated conversational utterances.
Publication No.: US9299345(B1) Publication date: 2016.03.29
Application No.: US200611425243 Filing date: 2006.06.20
Applicant: AT&T Intellectual Property II, L.P. Inventors: Gilbert Mazin; Hakkani-Tur Dilek Z.
Classification: G10L15/00; G10L15/14; G10L15/22; G10L15/30 Main classification: G10L15/00
Agency: Agent:
Main claim: 1. A method comprising: identifying, via a processor communicating with Internet resources, common task independent web-sentences based on frequently occurring phrases across multiple websites from a web domain stored in a data store; selectively removing the common task independent web-sentences from the web domain data, to yield filtered web domain data comprising domain-specific data; identifying, via the processor, predicate/argument pairs from the filtered web domain data; replacing, via the processor, the predicate/argument pairs with predicate/argument tokens; generating, via the processor, conversational utterances by merging the predicate/argument tokens with manually written conversational templates while preserving a relative frequency of the manually written conversational templates, to yield generated conversational utterances; and generating, via the processor, a web data language model using the generated conversational utterances, and providing it as an initial language model for deployment of an automated speech recognition system.
Address: Atlanta GA US
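
The steps recited in the main claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it is hypothetical: `filter_common_sentences` treats a sentence as task independent when it appears verbatim on at least `min_sites` websites, `extract_pairs` stands in for a real semantic parser with a toy verb list (`VERBS`), the `<PRED>` and `<ARG>` placeholders play the role of the predicate/argument tokens, template counts are used to preserve the templates' relative frequency, and the language model is assumed to be an unsmoothed bigram model.

```python
from collections import Counter, defaultdict

def filter_common_sentences(sites, min_sites=2):
    """Drop sentences found on at least `min_sites` sites (assumed task independent)."""
    site_counts = Counter()
    for sentences in sites.values():
        site_counts.update(set(sentences))  # count each sentence once per site
    return [s for sentences in sites.values() for s in sentences
            if site_counts[s] < min_sites]

# Toy stand-in for a semantic parser: a small, hypothetical list of predicates.
VERBS = {"pay", "check", "cancel"}

def extract_pairs(sentences):
    """Return (predicate, argument) pairs: first known verb plus the words after it."""
    pairs = []
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for i, word in enumerate(words):
            if word in VERBS and i + 1 < len(words):
                pairs.append((word, " ".join(words[i + 1:])))
                break
    return pairs

def generate_utterances(pairs, templates):
    """Fill each template with each pair, weighting by the template's count."""
    utterances = Counter()
    for pred, arg in pairs:
        for template, count in templates.items():
            text = template.replace("<PRED>", pred).replace("<ARG>", arg)
            utterances[text] += count  # keeps the templates' relative frequency
    return utterances

def train_bigram_lm(utterances):
    """Estimate unsmoothed bigram probabilities P(word | previous word)."""
    counts = defaultdict(Counter)
    for text, weight in utterances.items():
        tokens = ["<s>"] + text.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += weight
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

# Illustrative data only; the site names, sentences, and templates are made up.
sites = {
    "a.example": ["Contact us.", "Pay bills online.", "All rights reserved."],
    "b.example": ["Contact us.", "All rights reserved.", "Check account balance."],
}
templates = {"i would like to <PRED> <ARG>": 3, "can i <PRED> <ARG>": 1}

filtered = filter_common_sentences(sites)
pairs = extract_pairs(filtered)
utterances = generate_utterances(pairs, templates)
lm = train_bigram_lm(utterances)
print(lm["<s>"])
```

In this sketch the template used three times as often contributes three times the bigram mass, which is one simple way to read "preserving a relative frequency" of the templates. A deployed recognizer would presumably use a smoothed n-gram model and a real predicate/argument extractor in place of these stand-ins.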