Title of Invention |
Method and Apparatus for Assessing Similarity Between Online Job Listings |
Abstract |
Job listings retrieved from external sources are pre-processed, and duplicate records are identified, before the listings are stored in the production database for the search engine. Inter-source and intra-source hash values are calculated for each job listing and the values are compared. Job listings having the same intra-source hash value are judged to be duplicates of each other. Descriptions whose intra-source hash values do not match, but whose inter-source hash values match, are judged to be duplicate candidates and are subject to further processing. Suffixes for each such record are stored in a data structure such as a suffix array, and the records are searched and compared based on the suffix arrays. Records having a pre-determined number of contiguous words in common are judged to be duplicates.
|
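The two-stage comparison described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method as claimed: the record does not say which fields feed each hash, so the field names (`title`, `company`, `location`, `description`), the use of SHA-1, the helper names, and the `min_run` threshold are all hypothetical. The second stage builds a word-level suffix array over two descriptions and reports the longest run of contiguous words they share.

```python
import hashlib


def _digest(*parts):
    # Normalize case and whitespace so trivial formatting differences
    # do not change the hash (an assumption, not stated in the abstract).
    norm = "|".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.sha1(norm.encode("utf-8")).hexdigest()


def intra_source_hash(job):
    # Assumption: the intra-source hash covers the whole listing.
    return _digest(job["title"], job["company"], job["location"], job["description"])


def inter_source_hash(job):
    # Assumption: the inter-source hash covers a coarser key only.
    return _digest(job["title"], job["company"], job["location"])


def _lcp(x, y):
    # Length of the common prefix of two token lists.
    n = 0
    for u, v in zip(x, y):
        if u != v:
            break
        n += 1
    return n


def longest_common_run(words_a, words_b):
    # Word-level suffix array over both token lists, joined by a sentinel
    # token that is assumed not to occur in either list.
    tokens = words_a + ["\x00"] + words_b
    sep = len(words_a)
    order = sorted((i for i in range(len(tokens)) if i != sep),
                   key=lambda i: tokens[i:])
    best = 0
    # The longest shared run appears between two neighbouring suffixes
    # that come from different listings.
    for i, j in zip(order, order[1:]):
        if (i < sep) != (j < sep):
            best = max(best, _lcp(tokens[i:], tokens[j:]))
    return best


def classify(job_a, job_b, min_run):
    if intra_source_hash(job_a) == intra_source_hash(job_b):
        return "duplicate"
    if inter_source_hash(job_a) == inter_source_hash(job_b):
        # Duplicate candidates: compare the descriptions word by word.
        run = longest_common_run(job_a["description"].lower().split(),
                                 job_b["description"].lower().split())
        return "duplicate" if run >= min_run else "distinct"
    return "distinct"
```

Sorting full suffix slices keeps the sketch short but costs quadratic memory in the worst case; a production system would presumably use a dedicated suffix-array construction and would tune `min_run` against real listings.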
Application Publication Number |
US2008065630(A1) |
Application Publication Date |
2008.03.13 |
Application Number |
US20060530432 |
Filing Date |
2006.09.08 |
Applicant |
LUO TONG;WECK PETER MICHAEL;SEQUEIRA ANTONY;TENDULKAR NEELESH;BENTOV SHAI;LEVINE JAMES DOUGLAS |
Inventor |
LUO TONG;WECK PETER MICHAEL;SEQUEIRA ANTONY;TENDULKAR NEELESH;BENTOV SHAI;LEVINE JAMES DOUGLAS |
Classification Number |
G06F17/30 |
Main Classification Number |
G06F17/30 |
Agency |
|
Agent |
|
Principal Claim |
|
Address |
|