Title of invention: Multi-domain machine translation model adaptation
Abstract: A method adapted to multiple corpora includes training a statistical machine translation model which outputs a score for a candidate translation, in a target language, of a text string in a source language. The training includes learning a weight for each of a set of lexical coverage features that are aggregated in the statistical machine translation model. The lexical coverage features include a lexical coverage feature for each of a plurality of parallel corpora. Each of the lexical coverage features represents a relative number of words of the text string for which the respective parallel corpus contributed a biphrase to the candidate translation. The method may also include learning a weight for each of a plurality of language model features, the language model features comprising one language model feature for each of the domains.
Publication number: US9235567(B2) Publication date: 2016.01.12
Application number: US201313740508 Filing date: 2013.01.14
Applicant: XEROX CORPORATION Inventors: Mylonakis Markos; Cancedda Nicola
Classification: G06F17/27; G06F17/28 Main classification: G06F17/27
Agency: Fay Sharpe LLP Agent: Fay Sharpe LLP
Main claim: 1. A method comprising: training a statistical machine translation model which outputs a score for a candidate translation, in a target language, of a text string in a source language, the training comprising: learning a weight for each of a set of lexical coverage features that are aggregated in the statistical machine translation model, the lexical coverage features comprising a lexical coverage feature for each of a plurality of parallel corpora, each of the lexical coverage features representing a relative number of words contributed by a respective one of the parallel corpora to the translation of the text string, the lexical coverage features being computed based on membership statistics which represent the membership, in each of the plurality of parallel corpora, of each biphrase used in generating the candidate translation, each parallel corpus corresponding to a respective domain from a set of domains and comprising pairs of text strings, each pair comprising a source text string in the source language and a target text string in the target language; and using the trained model in a statistical machine translation system for translation of a new source text string in the source language, wherein the training is performed with a computer processor.
Address: Norwalk, CT, US
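
The lexical coverage features described in the main claim can be illustrated with a minimal sketch. It assumes binary membership statistics (a biphrase either is or is not found in a given parallel corpus) and counts source-side words; the patent's exact definition and normalization may differ. The names `Biphrase`, `lexical_coverage`, and `coverage_score`, the corpus labels, and the weight values are all hypothetical. In the claimed method the weights are learned during training, not set by hand.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Biphrase:
    """A source/target phrase pair used in a candidate translation (hypothetical type)."""
    source: tuple[str, ...]
    target: tuple[str, ...]
    # Assumed binary membership statistics: the corpora that contain this biphrase.
    corpora: frozenset[str]


def lexical_coverage(biphrases: list[Biphrase], corpora: list[str]) -> dict[str, float]:
    """For each corpus, the fraction of source words covered by biphrases it contributed."""
    total_words = sum(len(b.source) for b in biphrases)
    if total_words == 0:
        return {c: 0.0 for c in corpora}
    return {
        c: sum(len(b.source) for b in biphrases if c in b.corpora) / total_words
        for c in corpora
    }


def coverage_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the per-corpus features, one weight per corpus."""
    return sum(weights[c] * features[c] for c in features)


# Illustrative candidate translation built from two biphrases (invented data).
candidate = [
    Biphrase(("the", "printer"), ("l'", "imprimante"), frozenset({"tech", "news"})),
    Biphrase(("is", "jammed"), ("est", "bloquée"), frozenset({"tech"})),
]
features = lexical_coverage(candidate, ["tech", "news"])
print(features)  # {'tech': 1.0, 'news': 0.5}

# Hand-picked weights for illustration only; the claim has them learned in training.
score = coverage_score(features, {"tech": 0.8, "news": 0.2})
```

Here the "tech" corpus contributed biphrases covering all four source words, while "news" covered two of them, so a model with a larger weight on "tech" would favor this candidate. In a full system these terms would be aggregated with the model's other features (for example, the per-domain language model features mentioned in the abstract) when scoring a candidate.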