摘要 |
One embodiment of the present invention provides a system that detects sensitive content in a document. In doing so, the system receives a document, identifies a set of terms in the document that are candidate sensitive terms, and generates a combination of terms based on the identified terms that is associated with a semantic meaning. Next, the system performs searches through a corpus based on the combination of terms and determines hit counts returned for each term in the combination and for the combination. The system then determines whether the combination of terms is sensitive based on the hit count for the combination and the hit counts for the individual terms in the combination, and generates a result that indicates portions of the document which contain sensitive combinations. |