Abstract |
Computer-program products and methods for automatically annotating terms, such as ambiguous terms, in an electronic text document are disclosed. In one embodiment, a method of annotating a text document includes determining, by a computing device, a term of interest within the text document. The method further includes searching a data structure including incongruous term pairs (tx, tt) determined from a controlled vocabulary for the term of interest appearing as a term tt, wherein the term tt is a linguistic head of a term tx of the incongruous term pairs (tx, tt). The method further includes annotating the term of interest with a meaning provided by the controlled vocabulary only if a term tx of the incongruous term pairs (tx, tt) associated with the term of interest in the data structure is not present within a predetermined textual distance of the term of interest in the text document. |
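The annotation condition described in the abstract can be illustrated with a minimal sketch. It assumes that "textual distance" is measured in tokens and that the incongruous pairs are already available in a lookup keyed by the head term tt. The names `INCONGRUOUS`, `MEANINGS`, `annotate`, and the sample vocabulary entries are hypothetical and do not come from the source.

```python
import re

# Hypothetical lookup: head term tt -> token tuples of each term tx that
# forms an incongruous pair with it (illustrative values only).
INCONGRUOUS = {"cell": [("fuel", "cell"), ("solar", "cell")]}

# Hypothetical controlled-vocabulary meanings for terms of interest.
MEANINGS = {"cell": "biological cell"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """Return True if the token tuple `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return any(tuple(tokens[j:j + n]) == phrase
               for j in range(len(tokens) - n + 1))


def annotate(text, window=3):
    """Annotate each term of interest unless a paired tx lies within `window` tokens."""
    tokens = tokenize(text)
    annotations = []
    for i, tok in enumerate(tokens):
        if tok not in MEANINGS:
            continue
        # Assumed reading of "predetermined textual distance": a token window.
        context = tokens[max(0, i - window): i + window + 1]
        blocked = any(contains_phrase(context, tx)
                      for tx in INCONGRUOUS.get(tok, []))
        if not blocked:
            annotations.append((i, tok, MEANINGS[tok]))
    return annotations
```

In this sketch, "cell" in "The fuel cell powers the car." is left unannotated because the term "fuel cell" falls inside the window, while "cell" in "The cell divides rapidly." receives the vocabulary meaning.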
Principal claim |
1. A method of annotating a text document, the method comprising:
determining, by a computing device, a term of interest within the text document;
searching a data structure storing incongruous term pairs (tx, tt) determined from a controlled vocabulary for the term of interest appearing as a term tt, wherein the term tt is a linguistic head of a term tx of the incongruous term pairs (tx, tt) and term tx is a linguistic derivative of term tt, wherein terms tx, tt have a hierarchical relationship corresponding to the controlled vocabulary;
determining a plurality of compound noun phrases within the controlled vocabulary, wherein each compound noun phrase includes terms tx and tt;
determining a semantic distance between the second term and the first term, and for each compound noun phrase wherein the semantic distance between tx and tt is greater than a predetermined threshold distance, saving the compound noun phrase in a data structure as an incongruous term pair (tx, tt), wherein the incongruous term pair has a linguistic discrepancy and a semantic discrepancy, wherein the second term is term tt and the first term is term tx; and
annotating, by the computing device, the term of interest with a meaning provided by the controlled vocabulary only when each term tx of the incongruous term pairs (tx, tt) including the term of interest as term tt in the data structure is not present within a predetermined textual distance of the term of interest in the text document. |
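The pair-construction step of the claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the controlled vocabulary is modeled as a toy parent-pointer hierarchy, the linguistic head tt is taken to be the last word of a compound noun phrase tx, and semantic distance is taken to be the number of edges between tx and tt through their nearest common ancestor. The names `PARENT`, `semantic_distance`, `incongruous_pairs`, and all vocabulary entries are hypothetical.

```python
# Hypothetical controlled vocabulary as a child -> parent hierarchy.
PARENT = {
    "entity": None,
    "biological structure": "entity",
    "cell": "biological structure",
    "stem cell": "cell",
    "device": "entity",
    "fuel cell": "device",
}


def path_to_root(term):
    """List the term followed by each of its ancestors up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def semantic_distance(a, b):
    """Count edges between a and b via their nearest common ancestor (assumed metric)."""
    depth_from_a = {t: i for i, t in enumerate(path_to_root(a))}
    for j, t in enumerate(path_to_root(b)):
        if t in depth_from_a:
            return depth_from_a[t] + j
    return float("inf")  # no common ancestor in the hierarchy


def incongruous_pairs(threshold=2):
    """Collect (tx, tt) pairs whose semantic distance exceeds the threshold."""
    pairs = []
    for tx in PARENT:
        words = tx.split()
        if len(words) < 2:
            continue  # not a compound noun phrase
        tt = words[-1]  # assumed head: the final word of the phrase
        if tt in PARENT and semantic_distance(tx, tt) > threshold:
            pairs.append((tx, tt))
    return pairs
```

With this toy hierarchy, "stem cell" sits directly under "cell" (distance 1) and is not saved, whereas "fuel cell" is classified under "device" (distance 4 from "cell") and is saved as the incongruous pair ("fuel cell", "cell").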