Principal claim |
1. A method comprising:
receiving a plurality of data files from a plurality of data sources that comprise textual content;
categorizing the plurality of data files using a taxonomy of categories in which each category has associated sample textual content defining a concept for the category, the categorizing comprising, for each category:
    comparing, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
    calculating, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
    associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
providing at least a portion of the data file and/or the associated file score. |
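The steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the similarity is computed, so the word-count cosine similarity used here is an assumption chosen only for illustration, and the names `term_counts`, `cosine_similarity`, and `categorize`, the sample texts, and the minimum scores are all hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts; an assumed stand-in for a 'concept'."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(files, taxonomy):
    """Associate files with categories whose minimum score they meet.

    files:    {file_name: textual_content}
    taxonomy: {category: (sample_textual_content, minimum_score)}
    Returns   {category: [(file_name, file_score), ...]}, highest score first.
    """
    results = {}
    for category, (sample_text, min_score) in taxonomy.items():
        category_concept = term_counts(sample_text)
        matches = []
        for name, text in files.items():
            # Compare the file's text with the category's sample text.
            score = cosine_similarity(category_concept, term_counts(text))
            # Associate only if the score is equal to or greater than the minimum.
            if score >= min_score:
                matches.append((name, score))
        results[category] = sorted(matches, key=lambda m: m[1], reverse=True)
    return results


# Hypothetical input data, for illustration only.
files = {
    "a.txt": "interest rates and bank loans",
    "b.txt": "football match score",
}
taxonomy = {
    "finance": ("bank loans interest rates credit", 0.5),
    "sports": ("football match goal team", 0.5),
}
result = categorize(files, taxonomy)
for category, matches in result.items():
    print(category, matches)
```

Returning the file name together with its score corresponds to the "providing" step; a fuller system would likely return a portion of the file content as well.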