Title of invention: Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
Abstract: Calculates word n-gram probabilities with high accuracy when two corpora are given as training data, that is, as stored collections of a large number of sample sentences: a first corpus, which is relatively small and contains manually segmented word information, and a second corpus, which is relatively large. The vocabulary, including contextual information, is expanded from the words occurring in the relatively small first corpus to the words occurring in the relatively large second corpus by using word n-gram probabilities estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used to calculate n-grams and the probability that the position between two adjacent characters is a boundary between two words (the segmentation probability). The second corpus (word-unsegmented), to which probabilistic word boundaries are assigned based on information in the first corpus, is used to calculate word n-grams.
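The two-corpus idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes, for illustration only, that the segmentation probability is estimated per pair of adjacent characters, that unseen pairs fall back to a hypothetical default of 0.5, and that only unigram (n = 1) expected counts are needed. The function names `boundary_probs` and `expected_count` are invented for this example.

```python
from collections import Counter


def boundary_probs(segmented):
    """Estimate P(word boundary | left char, right char) from a
    word-segmented corpus given as a list of word lists.
    Conditioning on the raw character pair is an assumption of this sketch."""
    seen = Counter()  # how often each adjacent character pair occurs
    cut = Counter()   # how often a word boundary falls inside that pair
    for words in segmented:
        chars = "".join(words)
        ends = set()
        pos = 0
        for w in words[:-1]:  # the sentence end is not an internal boundary
            pos += len(w)
            ends.add(pos)
        for i in range(1, len(chars)):
            pair = (chars[i - 1], chars[i])
            seen[pair] += 1
            if i in ends:
                cut[pair] += 1
    return {pair: cut[pair] / seen[pair] for pair in seen}


def expected_count(word, raw, probs, default=0.5):
    """Expected frequency of `word` in the unsegmented string `raw` when
    every character gap is a boundary with its estimated probability.
    `default` is a hypothetical fallback for unseen character pairs."""
    def p(i):
        # Probability of a boundary just before raw[i]; both sentence
        # edges are treated as certain boundaries.
        if i == 0 or i == len(raw):
            return 1.0
        return probs.get((raw[i - 1], raw[i]), default)

    total = 0.0
    start = raw.find(word)
    while start != -1:
        end = start + len(word)
        weight = p(start) * p(end)        # boundaries on both sides
        for j in range(start + 1, end):   # no boundary inside the word
            weight *= 1.0 - p(j)
        total += weight
        start = raw.find(word, start + 1)
    return total


# Toy data: the same character string segmented in two different ways.
segmented_corpus = [["ab", "c"], ["a", "bc"]]
probs = boundary_probs(segmented_corpus)
count_ab = expected_count("ab", "abc", probs)
```

In this toy run both character gaps receive a boundary probability of 0.5, so the occurrence of "ab" in the raw string "abc" contributes a fractional count rather than a count of 0 or 1. Summing such fractional counts over a large unsegmented corpus is one way to obtain word frequencies for words that never appear in the small segmented corpus.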
Publication number: US2006015326(A1)  Publication date: 2006.01.19
Application number: US20050180153  Filing date: 2005.07.13
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION  Inventors: MORI SHINSUKE; TAKUMA DAISUKE
Classification: G06F17/27  Main classification: G06F17/27
Agency:  Agent:
Main claim:
Address: