Independent claim |
1. A computer-implemented method comprising:
obtaining a word in a first language;
selecting a first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:
identifying a translation of the word in a second language, and
selecting, as the first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, a set of one or more parts-of-speech tags that are associated with the translation of the word in the second language;
selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:
when the word in the first language has no associated part-of-speech tag indicated for the word in the first language in a tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech tags, all of one or more of the parts-of-speech tags that (i) are in the first, token-level set of one or more parts-of-speech tags, and (ii) are associated as parts-of-speech tags with words in the tag dictionary, or
when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech tags, the one or more parts-of-speech tags that (I) are in the first, token-level set of one or more parts-of-speech tags, and (II) are indicated in the tag dictionary as associated with the word in the first language; and
providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger. |
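
The selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is only one possible reading of the claim, not the patented implementation. The function names (`first_tag_set`, `second_tag_set`), the plain-dictionary data structures, and the German/English sample data are all hypothetical and were chosen for illustration. The claim does not say how a translation is identified or how tags come to be associated with it, so both are modeled here as simple lookups.

```python
from typing import Dict, Set


def first_tag_set(word: str,
                  translations: Dict[str, str],
                  translation_tags: Dict[str, Set[str]]) -> Set[str]:
    """First token-level set: tags associated with the word's translation.

    Assumption: translation and tag association are plain lookups.
    """
    translation = translations.get(word)
    if translation is None:
        return set()
    return set(translation_tags.get(translation, set()))


def second_tag_set(word: str,
                   first_set: Set[str],
                   tag_dictionary: Dict[str, Set[str]]) -> Set[str]:
    """Second token-level set: the first set filtered by the tag dictionary."""
    entry = tag_dictionary.get(word)
    if not entry:
        # No tag listed for this word: keep the tags from the first set
        # that are associated with any word in the tag dictionary.
        known_tags = set().union(*tag_dictionary.values())
        return first_set & known_tags
    # Word is listed: keep the tags from the first set that the
    # dictionary indicates for this particular word.
    return first_set & set(entry)


# Hypothetical sample data (German as first language, English as second).
translations = {"Hund": "dog", "Haus": "house"}
translation_tags = {"dog": {"NOUN", "VERB"}, "house": {"NOUN", "VERB", "X"}}
tag_dictionary = {"Hund": {"NOUN"}, "laufen": {"VERB"}}

training_data = []
for w in ["Hund", "Haus"]:
    first = first_tag_set(w, translations, translation_tags)
    second = second_tag_set(w, first, tag_dictionary)
    # The (word, tag set) pair is what would be provided as training data.
    training_data.append((w, second))
```

In this sample, "Hund" is listed in the tag dictionary, so its set is narrowed to the tags the dictionary gives for that word. "Haus" is not listed, so its set keeps only those projected tags that occur somewhere in the dictionary.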