Title: Weakly supervised part-of-speech tagging with coupled token and type constraints
Abstract: A method and system are provided for a part-of-speech tagger that may be particularly useful for resource-poor languages. Manually constructed tag dictionaries from dictionaries via bitext can be used as type constraints to overcome the scarcity of annotated data in some instances. Additional token constraints can be projected from a resource-rich source language via word-aligned bitext. Several example models are provided to demonstrate this, such as a partially observed conditional random field model, where coupled token and type constraints may provide a partial signal for training. The disclosed method achieves a significant relative error reduction over the prior state of the art.
Publication number: US9311299(B1) Publication date: 2016.04.12
Application number: US201313955491 Filing date: 2013.07.31
Applicant: Google Inc. Inventors: Petrov Slav;Das Dipanjan;McDonald Ryan;Nivre Joakim;Tackstrom Oscar
Classification: G06F17/28;G06F17/27 Main classification: G06F17/28
Agency: Fish & Richardson P.C. Agent: Fish & Richardson P.C.
Main claim: 1. A computer-implemented method comprising: obtaining a word in a first language; selecting a first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising: identifying a translation of the word in a second language, and selecting, as the first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, a set of one or more parts-of-speech tags that are associated with the translation of the word in the second language; selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising: when the word in the first language has no associated part-of-speech tag indicated for the word in the first language in a tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech tags, all of one or more of the parts-of-speech tags that (i) are in the first, token-level set of one or more parts-of-speech tags, and (ii) are associated as parts-of-speech tags with words in the tag dictionary, or when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech tags, the one or more parts-of-speech tags that (I) are in the first, token-level set of one or more parts-of-speech tags, and (II) are indicated in the tag dictionary as associated with the word in the first language; and providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger.
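Illustrative sketch (not part of the patent text): the selection of the second tag set recited in claim 1 can be read as a set intersection, and a minimal Python rendering under that reading might look as follows. The function name coupled_tag_set, the representation of the tag dictionary as a mapping from words to tag sets, and the sample Swedish words and tags are all assumptions made for illustration. The claim does not say what happens when the intersection is empty, so this sketch simply returns an empty set in that case.

```python
from typing import Dict, Set


def coupled_tag_set(
    word: str,
    projected_tags: Set[str],
    tag_dictionary: Dict[str, Set[str]],
) -> Set[str]:
    """Combine projected (first-set) tags with tag-dictionary constraints.

    projected_tags: tags associated with the word's translation in the
        second language (the "first set" in the claim).
    tag_dictionary: hypothetical mapping from first-language words to the
        tags the dictionary lists for them.
    """
    dictionary_tags = tag_dictionary.get(word)
    if dictionary_tags:
        # Word is in the dictionary: keep tags that are both projected
        # and listed for this word.
        return projected_tags & dictionary_tags

    # Word has no dictionary entry: keep projected tags that occur as a
    # tag for at least one word in the dictionary.
    all_dictionary_tags: Set[str] = set().union(*tag_dictionary.values())
    return projected_tags & all_dictionary_tags


# Hypothetical data for illustration only.
tag_dictionary = {"bok": {"NOUN"}, "springer": {"VERB", "NOUN"}}

in_dictionary = coupled_tag_set("springer", {"VERB"}, tag_dictionary)
not_in_dictionary = coupled_tag_set("okänd", {"ADJ", "NOUN"}, tag_dictionary)
no_overlap = coupled_tag_set("bok", {"VERB"}, tag_dictionary)
```

In this sketch, the word paired with its resulting tag set would be what is passed on as training data for the tagger; how an empty result should be handled is left open here because the claim text does not address it.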
Address: Mountain View CA US