摘要 |
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying non-compositional compounds. In one aspect, a method includes the actions of receiving a collection of phrases, each phrase including two or more words; for each phrase, determining if the phrase is a non-compositional compound, a non-compositional compound being a phrase of two or more words where the words composing the phrase have different meanings in a compound than their conventional meanings individual, the determining including: identifying a similar term for a term of the phrase, substituting the similar term for the term of the phrase to generate a substitute phrase, calculating a similarity between the phrase and the substitute phrase, and identifying the phrase as a non-compositional compound when the calculated similarity is less than a specified threshold value.
|