Title: Word breaker from cross-lingual phrase table
Abstract: Automatically creating word breakers which segment words into morphemes is described, for example, to improve information retrieval, machine translation or speech systems. In embodiments, a cross-lingual phrase table is available, comprising source language (such as Turkish) phrases and potential translations in a target language (such as English) with associated probabilities. In various examples, blocks of source language phrases which have similar target language translations are created from the phrase table. In various examples, inference using the target language translations in a block enables stem and affix combinations to be found for source language words without the need for input from human judges or prior knowledge of source language linguistic rules or a source language lexicon.
Publication number: US9330087(B2) Publication date: 2016.05.03
Application number: US201313861146 Filing date: 2013.04.11
Applicant: Microsoft Technology Licensing, LLC Inventors: El-Sharqwi Mohamed Ahmed;Chalabi Achraf Abdel-Moneim Tawfik Mahmoud
Classification: G06F17/27;G06F17/28;G06F17/20;G10L21/00;G10L25/00 Main classification: G06F17/27
Agency: Agents: Corie Alin;Swain Cassandra T.;Minhas Micky
Main claim: 1. A computer-implemented process, comprising: receiving a parallel corpus of a source language and a target language; applying a machine translation training process to the parallel corpus to generate a cross-lingual phrase table comprising a plurality of source language phrases, each source language phrase having at least one target language translation; applying a blocking operation to the cross-lingual phrase table to group phrases of the source language into blocks by searching the cross-lingual phrase table to find blocks of two or more source language phrases that share similar translations in the target language; searching each of the different source language phrases in each block to identify a stem of a word of the source language, the stem in each block comprising a same sequence of characters occurring in each of the different source language phrases of that block; searching each of the different source language phrases in each block to find a plurality of affixes of the stem of that block, each affix in each block comprising a sequence of characters preceding or following the characters comprising the stem in any of the different source language phrases in that block; generating a set of morphemes comprising the stems and affixes of words of the source language; in response to receipt of a user query in the source language, applying the set of morphemes to automatically create one or more different forms of one or more words of the user query; and performing an expanded query search using the automatically created different forms of the words of the user query.
Address: Redmond WA US
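
The steps recited in the main claim (blocking, stem search, affix search, and query expansion) can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy phrase table, its probabilities, the `FUNCTION_WORDS` list, and all function names are assumptions introduced here. In particular, "similar translations" is approximated by comparing the content words of the most probable translation, and the stem is taken to be the longest substring common to every phrase in a block; the record itself does not specify either choice.

```python
from collections import defaultdict

# Toy phrase table: source phrase -> [(target translation, probability)].
# Entries and probabilities are invented for illustration only.
PHRASE_TABLE = {
    "ev": [("house", 0.8), ("home", 0.2)],
    "evde": [("in the house", 0.7), ("at home", 0.3)],
    "evden": [("from the house", 0.9)],
    "evim": [("my house", 0.9)],
    "okul": [("school", 0.95)],
    "okulda": [("in the school", 0.6), ("at school", 0.4)],
    "okuldan": [("from the school", 0.9)],
}

# Assumption: a tiny function-word list stands in for the unspecified
# similarity measure between target language translations.
FUNCTION_WORDS = {"in", "the", "from", "my", "at", "to", "of"}


def block_key(translations):
    """Return the content words of the most probable translation."""
    best, _ = max(translations, key=lambda pair: pair[1])
    return tuple(w for w in best.lower().split() if w not in FUNCTION_WORDS)


def build_blocks(table):
    """Group source phrases whose translations share the same content words."""
    blocks = defaultdict(list)
    for source, translations in table.items():
        blocks[block_key(translations)].append(source)
    # A block needs two or more source phrases, as in the claim.
    return {key: sorted(words) for key, words in blocks.items() if len(words) >= 2}


def longest_common_substring(words):
    """Brute-force search for the longest substring present in every word."""
    shortest = min(words, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in word for word in words):
                return candidate
    return ""


def split_block(words):
    """Return (stem, prefixes, suffixes) for one block of source phrases."""
    stem = longest_common_substring(words)
    prefixes, suffixes = set(), set()
    if not stem:
        return stem, prefixes, suffixes
    for word in words:
        start = word.find(stem)
        if word[:start]:
            prefixes.add(word[:start])
        if word[start + len(stem):]:
            suffixes.add(word[start + len(stem):])
    return stem, prefixes, suffixes


def learn_morphemes(table):
    """Map each discovered stem to its sets of prefixes and suffixes."""
    morphemes = {}
    for words in build_blocks(table).values():
        stem, prefixes, suffixes = split_block(words)
        if stem:
            morphemes[stem] = (prefixes, suffixes)
    return morphemes


def expand_query(query, morphemes):
    """For each query word, list the forms generated from a matching stem."""
    expanded = {}
    for word in query.split():
        forms = {word}
        for stem, (prefixes, suffixes) in morphemes.items():
            pre, suf = prefixes | {""}, suffixes | {""}
            if any(word == p + stem + s for p in pre for s in suf):
                forms |= {p + stem + s for p in pre for s in suf}
        expanded[word] = sorted(forms)
    return expanded
```

Calling `expand_query("evde", learn_morphemes(PHRASE_TABLE))` on this toy data yields the forms `ev`, `evde`, `evden`, and `evim` for the query word, which could then be passed to a search backend. Combining every prefix with every suffix overgenerates on real data, so a practical system would need to filter candidate forms, for example against forms actually observed in the phrase table.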