Title of Invention |
Determining similarity of linguistic objects |
Abstract |
A computer-implemented system for searching includes a data store accessible via a network for storing a data set; an indexing system coupled to the network and indexing the data set, the indexing system configured to generate content vectors for terms in the data set; generate index vectors for terms in the data set; and generate a bitset signature from the index vector. The system further includes a search module coupled to the network and configured to receive a search query and perform a search on one or more terms in the search query by accessing a bitset signature and content vector corresponding to the term; retrieving bitset signatures that are within a predetermined closeness to the bitset signature; selecting content vectors corresponding to retrieved bitset signatures; and selecting content vectors that are within a predetermined similarity to the term content vector; and return the terms corresponding to the content vectors. |
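The two-stage lookup described in the abstract can be illustrated with a minimal sketch. It assumes details the record does not state: signatures are stored as Python integers, "closeness" is measured as Hamming distance, "similarity" is cosine similarity, and the query term itself is left out of the results. The names `hamming`, `cosine`, `search`, `max_bit_distance`, and `min_similarity` are hypothetical.

```python
import math


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two signatures stored as ints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(term, signatures, content_vectors, max_bit_distance, min_similarity):
    """Return terms similar to `term` using a cheap filter, then an exact check.

    signatures:      dict mapping term -> int bitset signature (assumed layout)
    content_vectors: dict mapping term -> list of floats
    """
    query_sig = signatures[term]
    query_vec = content_vectors[term]

    # Stage 1: keep only terms whose signature is close to the query signature.
    # Hamming distance is an assumption; the record only says "closeness".
    candidates = [
        t for t, sig in signatures.items()
        if t != term and hamming(query_sig, sig) <= max_bit_distance
    ]

    # Stage 2: compare full content vectors for the surviving candidates only.
    return [
        t for t in candidates
        if cosine(query_vec, content_vectors[t]) >= min_similarity
    ]
```

The point of the split is that the signature comparison is a few bit operations per term, so the more expensive vector comparison runs only on the small candidate set.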
Application Publication Number |
US9298757(B1) |
Application Publication Date |
2016.03.29 |
Application Number |
US201313801278 |
Filing Date |
2013.03.13 |
Applicant |
INTERNATIONAL BUSINESS MACHINES CORPORATION |
Inventors |
Ponvert Elias; Tran Michael Tuyen |
Classification Number |
G06F17/30 |
Main Classification Number |
G06F17/30 |
Agency |
Edell, Shapiro & Finnan, LLC |
Agent |
Tham Yeen; Edell, Shapiro & Finnan, LLC |
Principal Claim |
1. A computer-implemented system for searching, comprising:
a data store accessible via a network for storing a data set;
an indexing system coupled to the network and indexing the data set, the indexing system including a processor configured to:
generate content vectors for terms in the data set, wherein the content vectors define a similarity metric;
generate index vectors for the terms in the data set from the content vectors to access the terms in the data set; and
generate bitset signatures from the index vectors to determine similarity with the terms in the data set, wherein the bitset signatures include a first section for positive magnitude values and a second section for negative magnitude values, and generating the bitset signatures from the index vectors comprises:
for each of a predetermined number of highest magnitude positive values in the corresponding index vector, setting a corresponding bitset signature value at a corresponding dimension in the first section to a predetermined value; and
for each of a predetermined number of highest magnitude negative values in the corresponding index vector, setting a corresponding bitset signature value at a corresponding dimension in the second section to the predetermined value; and
a search module coupled to the network and including a processor configured to receive a search query and perform a search on each of one or more terms in the search query by:
accessing a bitset signature and content vector corresponding to a term in the search query;
retrieving bitset signatures that are within a predetermined closeness to the accessed bitset signature;
selecting content vectors corresponding to the retrieved bitset signatures;
identifying the selected content vectors that are within a predetermined similarity to the accessed content vector corresponding to the term in the search query; and
returning the terms of the data set corresponding to the identified content vectors. |
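The signature-generation step recited in the claim can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the "predetermined number" is taken to be the same value `k` for both sections, the "predetermined value" is taken to be 1, and the two sections are simply concatenated into one list. The function name `bitset_signature` is hypothetical, and how index vectors are derived from content vectors is not shown because the claim does not specify it.

```python
def bitset_signature(index_vector, k):
    """Build a two-section bitset signature from one index vector.

    Assumptions (not stated in the claim): the same k is used for both
    sections, a set bit is 1, and the result is the positive section
    followed by the negative section.
    """
    n = len(index_vector)
    positive = [0] * n  # first section: marks the largest positive values
    negative = [0] * n  # second section: marks the largest-magnitude negative values

    # Dimensions holding positive values, largest value first, capped at k.
    pos_dims = sorted(
        (i for i, v in enumerate(index_vector) if v > 0),
        key=lambda i: index_vector[i],
        reverse=True,
    )[:k]

    # Dimensions holding negative values, most negative first, capped at k.
    neg_dims = sorted(
        (i for i, v in enumerate(index_vector) if v < 0),
        key=lambda i: index_vector[i],
    )[:k]

    for i in pos_dims:
        positive[i] = 1
    for i in neg_dims:
        negative[i] = 1

    return positive + negative
```

For the vector `[0.5, -2.0, 3.0, 0.0, -0.1, 1.0]` with `k = 2`, dimensions 2 and 5 are set in the first section and dimensions 1 and 4 in the second, so the signature has twice as many bits as the vector has dimensions.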
Address |
Armonk NY US |