Title of Invention: IDENTIFYING PRIMARILY MONOSEMOUS KEYWORDS TO INCLUDE IN KEYWORD LISTS FOR DETECTION OF DOMAIN-SPECIFIC LANGUAGE
Abstract: Techniques are described for generating a monosemous (i.e., single-sense) keyword list associated with a particular domain (e.g., a medical or financial domain) for document classification. An input term frequency dictionary, a candidate keyword list, and a document corpus may be used to generate the keyword list. A collection of documents is divided into two sets, one related to a target domain and one not. A statistical approach may be used to evaluate each term in the candidate list to determine a measure of how monosemous each remaining candidate term is, i.e., how strongly the term (or short phrase) identifies with a single sense. Terms with a primarily single sense related to the target domain are added to the monosemous keyword list. The keyword list may be used to identify documents associated with the domain, allowing the appropriate protections to be applied to those documents (e.g., blocking transmission outside an enterprise boundary or preventing copying).
Publication No.: US2014181983(A1) Publication Date: 2014.06.26
Application No.: US201213722682 Filing Date: 2012.12.20
Applicant: SYMANTEC Inventor: HART Michael
Classification: G06F21/60 Main Classification: G06F21/60
Agency: Agent:
Main Claim: 1. A method for generating a monosemous keyword list, the method comprising: receiving a first document corpus, wherein each document in the first document corpus is associated with a target domain; receiving a second document corpus, wherein each document in the second document corpus is unrelated to the target domain; for each of a plurality of candidate terms: determining a first frequency of usage of the candidate term within the first document corpus, determining a second frequency of usage of the candidate term within the second document corpus, based on the first and second frequency of usage, determining whether the candidate term has a substantially single sense associated with the target domain, and if so, adding the candidate term to the monosemous keyword list.
Address: Mountain View CA US
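
Illustrative sketch (not part of the patent record): the steps recited in the main claim can be sketched in Python as follows. The record does not say which statistical measure is used, so the decision rule here is an assumption: a term is kept when its share of normalized usage falling in the target-domain corpus is at least `min_share`. The function names, the single-word tokenizer, the 0.9 threshold, and the sample documents are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplification: short phrases are not handled)."""
    return re.findall(r"[a-z]+", text.lower())


def term_rates(corpus):
    """Map each term to its occurrences per token across all documents in the corpus."""
    counts = Counter()
    total = 0
    for doc in corpus:
        tokens = tokenize(doc)
        counts.update(tokens)
        total += len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def monosemous_keywords(domain_corpus, other_corpus, candidates, min_share=0.9):
    """Keep candidates whose usage is concentrated in the target-domain corpus.

    Assumed rule (not taken from the patent): keep a term when
    first / (first + second) >= min_share, where first and second are the
    term's per-token rates in the domain corpus and the unrelated corpus.
    """
    domain_rates = term_rates(domain_corpus)
    other_rates = term_rates(other_corpus)
    keywords = []
    for term in candidates:
        first = domain_rates.get(term.lower(), 0.0)   # first frequency of usage
        second = other_rates.get(term.lower(), 0.0)   # second frequency of usage
        if first == 0.0:
            continue  # never seen in the target domain, so no evidence for it
        if first / (first + second) >= min_share:
            keywords.append(term)
    return keywords


# Hypothetical sample data for a medical target domain.
domain_docs = [
    "The patient received chemotherapy after the diagnosis.",
    "Chemotherapy dosage and discharge instructions for the patient.",
]
other_docs = [
    "The battery discharge rate was high.",
    "Patient investors waited for the market to recover.",
]
candidate_terms = ["chemotherapy", "discharge", "patient", "market"]

keyword_list = monosemous_keywords(domain_docs, other_docs, candidate_terms)
print(keyword_list)
```

In this toy data, "discharge" and "patient" also appear outside the medical corpus in other senses, so they fall below the assumed threshold, while "chemotherapy" appears only in the target domain. A real corpus would need far more documents, and a more careful statistic, before such a ratio is meaningful.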