Abstract |
The subject invention comprises a system for data mining, preferably comprising a sample generator component (110), a filtering system component (130), and a buffering component. The sample generator component is preferably configured to communicate with a plurality of search engines (120) and to generate queries based on a sample repository of positive and negative sample documents, and comprises a feature extraction algorithm. The subject invention also comprises a method for data mining, comprising the steps of (a) identifying candidate sample documents based on a category (125); (b) filtering the candidate documents by applying a categorization model (135); (c) buffering the filtered documents (145); (d) labeling the buffered documents as positive or negative examples of the category (155); (e) retraining the categorization model based on the labeled set of positive and negative example documents (165); (f) repeating steps (b) through (e) until all candidate documents are processed; and (g) storing all labeled documents in a database.
|
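The method steps above can be illustrated with a minimal Python sketch. It is not the implementation described by the invention: `KeywordModel`, `mine`, `label_fn`, and `buffer_size` are hypothetical names introduced here, the term-weight scorer is a toy stand-in for the categorization model, `label_fn` stands in for whoever labels the buffered documents, and a plain list stands in for the database. Query generation against search engines (step (a)) is assumed to have already produced the candidate list.

```python
from collections import Counter


class KeywordModel:
    """Toy stand-in for the categorization model (assumption, not the patent's model)."""

    def __init__(self):
        self.weights = Counter()

    def score(self, doc):
        # Sum of per-term weights; unseen terms contribute 0.
        return sum(self.weights[t] for t in doc.lower().split())

    def accepts(self, doc):
        # An untrained model (no weights yet) lets every candidate through.
        return not self.weights or self.score(doc) > 0

    def retrain(self, labeled):
        # Rebuild weights: +1 per positive document containing a term, -1 per negative.
        self.weights = Counter()
        for doc, is_positive in labeled:
            for term in set(doc.lower().split()):
                self.weights[term] += 1 if is_positive else -1


def mine(candidates, label_fn, buffer_size=2):
    """Run steps (b) through (g) over candidates already identified in step (a)."""
    model = KeywordModel()
    labeled = []  # stands in for the database of step (g)
    buffer = []
    for doc in candidates:
        if not model.accepts(doc):       # (b) filter with the current model
            continue
        buffer.append(doc)               # (c) buffer the filtered document
        if len(buffer) >= buffer_size:
            labeled.extend((d, label_fn(d)) for d in buffer)  # (d) label
            buffer.clear()
            model.retrain(labeled)       # (e) retrain on all labels so far
    if buffer:                           # label whatever is left in the buffer
        labeled.extend((d, label_fn(d)) for d in buffer)
        model.retrain(labeled)
    return labeled, model                # (g) labeled documents are kept
```

Because the model is retrained after each buffered batch, later candidates are filtered by a model that reflects the labels collected so far, which is the feedback loop that step (f) describes.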