Title of invention: Determining similarity of linguistic objects
Abstract: A computer-implemented system for searching includes a data store accessible via a network for storing a data set; an indexing system coupled to the network and indexing the data set, the indexing system configured to generate content vectors for terms in the data set; generate index vectors for terms in the data set; and generate a bitset signature from the index vector. The system further includes a search module coupled to the network and configured to receive a search query and perform a search on one or more terms in the search query by accessing a bitset signature and content vector corresponding to the term; retrieving bitset signatures that are within a predetermined closeness to the bitset signature; selecting content vectors corresponding to retrieved bitset signatures; and selecting content vectors that are within a predetermined similarity to the term content vector; and to return the terms corresponding to the content vectors.
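The two-stage lookup described in the abstract (a coarse filter on bitset signatures, then a finer check on content vectors) might be sketched as follows. This is only an illustration under stated assumptions: the record does not say how "closeness" or "similarity" is measured, so Hamming distance and cosine similarity are stand-ins, and the names `search`, `max_distance`, and `min_similarity` are hypothetical. Signatures are modeled as Python integers and the stores as plain dictionaries.

```python
import math


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded bitsets."""
    return bin(a ^ b).count("1")


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(term, signatures, content_vectors, max_distance, min_similarity):
    """Return terms similar to `term` using a two-stage filter.

    Assumptions (not taken from the record): Hamming distance stands in for
    the "predetermined closeness" and cosine similarity for the
    "predetermined similarity". The query term itself is skipped, which is
    an illustrative choice.
    """
    query_sig = signatures[term]
    query_vec = content_vectors[term]

    # Stage 1: cheap filter on bitset signatures.
    candidates = [
        t for t, sig in signatures.items()
        if t != term and hamming(query_sig, sig) <= max_distance
    ]

    # Stage 2: more expensive comparison on the candidates' content vectors.
    return [
        t for t in candidates
        if cosine(query_vec, content_vectors[t]) >= min_similarity
    ]


# Toy data, invented purely for illustration.
signatures = {"car": 0b1011, "auto": 0b1001, "truck": 0b0011, "banana": 0b110100}
content_vectors = {
    "car": [1.0, 0.2, 0.0],
    "auto": [0.9, 0.3, 0.1],
    "truck": [0.1, 1.0, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
result = search("car", signatures, content_vectors, max_distance=1, min_similarity=0.9)
```

In this toy data, "truck" passes the signature filter but is rejected by the content-vector check, which shows why the second stage is needed: the signature comparison only narrows the candidate set.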
Publication number: US9298757(B1)  Publication date: 2016.03.29
Application number: US201313801278  Filing date: 2013.03.13
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION  Inventors: Ponvert Elias; Tran Michael Tuyen
Classification: G06F17/30  Main classification: G06F17/30
Agency: Edell, Shapiro & Finnan, LLC  Agent: Tham Yeen; Edell, Shapiro & Finnan, LLC
Principal claim: 1. A computer-implemented system for searching, comprising: a data store accessible via a network for storing a data set; an indexing system coupled to the network and indexing the data set, the indexing system including a processor configured to: generate content vectors for terms in the data set, wherein the content vectors define a similarity metric; generate index vectors for the terms in the data set from the content vectors to access the terms in the data set; and generate bitset signatures from the index vectors to determine similarity with the terms in the data set, wherein the bitset signatures include a first section for positive magnitude values and a second section for negative magnitude values, and generating the bitset signatures from the index vectors comprises: for each of a predetermined number of highest magnitude positive values in the corresponding index vector, setting a corresponding bitset signature value at a corresponding dimension in the first section to a predetermined value; and for each of a predetermined number of highest magnitude negative values in the corresponding index vector, setting a corresponding bitset signature value at a corresponding dimension in the second section to the predetermined value; and a search module coupled to the network and including a processor configured to receive a search query and perform a search on each of one or more terms in the search query by: accessing a bitset signature and content vector corresponding to a term in the search query; retrieving bitset signatures that are within a predetermined closeness to the accessed bitset signature; selecting content vectors corresponding to the retrieved bitset signatures; identifying the selected content vectors that are within a predetermined similarity to the accessed content vector corresponding to the term in the search query; and returning the terms of the data set corresponding to the identified content vectors.
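The signature-generation step recited in the claim (one section for the largest positive values, one for the largest-magnitude negative values) can be illustrated with a short sketch. It rests on several assumptions not stated in the record: the signature is packed into a single Python integer, the first section occupies bits 0 to d-1 and the second bits d to 2d-1 for a d-dimensional index vector, the "predetermined value" is 1, and the same count `k` is used for both sections. The function name `bitset_signature` is hypothetical.

```python
def bitset_signature(index_vector, k):
    """Build a 2*d-bit signature (as an int) from a d-dimensional index vector.

    Assumed layout: bit i (0 <= i < d) is the first section (positive values),
    bit d + i is the second section (negative values).
    """
    d = len(index_vector)

    # Dimensions holding the k largest positive values.
    positive = sorted(
        (i for i, v in enumerate(index_vector) if v > 0),
        key=lambda i: index_vector[i],
        reverse=True,
    )[:k]

    # Dimensions holding the k largest-magnitude negative values
    # (i.e. the most negative entries).
    negative = sorted(
        (i for i, v in enumerate(index_vector) if v < 0),
        key=lambda i: index_vector[i],
    )[:k]

    signature = 0
    for i in positive:
        signature |= 1 << i        # set the bit in the first section
    for i in negative:
        signature |= 1 << (d + i)  # set the bit in the second section
    return signature


# Example vector invented for illustration: d = 6, k = 2.
sig = bitset_signature([0.9, -0.1, 0.3, -0.8, 0.05, -0.4], k=2)
# Positive section sets dimensions 0 and 2; negative section sets dimensions 3 and 5.
```

Keeping the two sections separate means a dimension with a strongly positive value in one vector and a strongly negative value in another sets different bits, so the two signatures do not appear to agree on that dimension.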
Address: Armonk, NY, US