Abstract |
A method enables identification of a level of similarity between a user-provided data item and a data item within a set of data documents. A representation generator determines occurrence information for each term in an enumeration of terms and uses that information to generate a sparse distributed representation (SDR) for each term. A filtering module receives a filtering criterion, for which the representation generator generates at least one SDR. The representation generator also generates a compound SDR for a first of a plurality of streamed documents received from a data source. A similarity engine executing on a second computing device determines a distance between the filtering criterion SDR and the generated compound SDR. The first streamed document is then acted on based upon the determined distance. |
Main claim |
1. A computer-implemented method for identifying a level of similarity between a user-provided data item and a data item within a set of data documents, the method comprising:
clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map;
associating, by the semantic map, a coordinate pair with each of the set of data documents;
generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents;
determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs;
generating, by the representation generator, for each term in the enumeration, a sparse distributed representation (SDR) using the occurrence information;
storing, in an SDR database, each of the generated SDRs;
receiving, by a filtering module executing on a second computing device, from a third computing device, a filtering criterion;
generating, by the representation generator, for the filtering criterion, at least one SDR;
receiving, by the filtering module, a plurality of streamed documents from a data source;
generating, by the representation generator, for a first of the plurality of streamed documents, a compound SDR for a first of the plurality of streamed documents;
determining, by a similarity engine executing on the second computing device, a distance between the filtering criterion SDR and the generated compound SDR for the first of the plurality of streamed documents; and
acting, by the filtering module, on the first streamed document, based upon the determined distance. |
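The steps of the claim can be illustrated with a minimal Python sketch. It is not the claimed implementation, and everything named in it is an assumption made for illustration: the function names (`build_term_sdrs`, `compound_sdr`, `jaccard_distance`, `filter_stream`), the parameters `min_count`, `max_bits` and `threshold`, the choice of Jaccard distance as the distance measure, and the reading of "acting on" a document as forwarding it. The clustering step is skipped, so the semantic-map coordinates are supplied by hand, and only the per-document occurrence counts and coordinate pairs are used to build each term's SDR.

```python
from collections import Counter


def build_term_sdrs(docs, coords, min_count=1):
    """Build one SDR per term: the set of semantic-map cells (coordinate
    pairs) of the documents in which the term occurs at least min_count times.

    docs:   {doc_id: text}
    coords: {doc_id: (x, y)}, assumed to come from a prior clustering step
    """
    term_sdrs = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())  # occurrences per document
        for term, n in counts.items():
            if n >= min_count:
                term_sdrs.setdefault(term, set()).add(coords[doc_id])
    return term_sdrs


def compound_sdr(text, term_sdrs, max_bits=8):
    """Aggregate the SDRs of all known terms in text, then keep only the
    max_bits cells shared by the most terms so the result stays sparse."""
    votes = Counter()
    for term in text.lower().split():
        votes.update(term_sdrs.get(term, ()))
    # Sort by vote count (descending), then by coordinate for determinism.
    ranked = sorted(votes, key=lambda cell: (-votes[cell], cell))
    return set(ranked[:max_bits])


def jaccard_distance(a, b):
    """Distance in [0, 1]; two empty SDRs are treated as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def filter_stream(criterion, stream, term_sdrs, threshold=0.5):
    """Yield (document, distance) for streamed documents whose compound SDR
    lies within threshold of the filtering criterion's SDR."""
    criterion_sdr = compound_sdr(criterion, term_sdrs)
    for doc in stream:
        distance = jaccard_distance(criterion_sdr, compound_sdr(doc, term_sdrs))
        if distance <= threshold:
            yield doc, distance


# Toy reference corpus with hand-assigned map coordinates (illustrative only).
docs = {
    "d1": "jaguar cat jungle",
    "d2": "jaguar car engine",
    "d3": "car engine road",
    "d4": "cat jungle tree",
}
coords = {"d1": (0, 0), "d2": (3, 3), "d3": (3, 2), "d4": (0, 1)}

term_sdrs = build_term_sdrs(docs, coords)
stream = ["road car", "jungle tree", "jaguar engine"]
for doc, distance in filter_stream("car engine", stream, term_sdrs):
    print(f"{doc!r} passes the filter at distance {distance:.2f}")
```

In this toy setup, documents about cars share map cells with the criterion "car engine" and pass the filter, while "jungle tree" shares none and is dropped. A production system would use a much larger map and would also have to decide how to treat terms missing from the SDR database.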