Title Method and Apparatus for Assessing Similarity Between Online Job Listings
Abstract Job listings retrieved from external sources are pre-processed, and duplicate records are identified, before the listings are stored in the production database of a search engine. Inter-source and intra-source hash values are calculated for each job listing and the values are compared. Job listings having the same intra-source hash value are judged to be duplicates of each other. Listings whose intra-source hash values do not match but whose inter-source hash values match are judged to be duplicate candidates and are subject to further processing. Suffixes for each such record are stored in a data structure such as a suffix array, and the records are searched and compared based on the suffix arrays. Records having a predetermined number of contiguous words in common are judged to be duplicates. Duplicate records are thus identified before the data set is stored in the production database.
Publication number US2008065630(A1) Publication date 2008.03.13
Application number US20060530432 Filing date 2006.09.08
Applicant LUO TONG;WECK PETER MICHAEL;SEQUEIRA ANTONY;TENDULKAR NEELESH;BENTOV SHAI;LEVINE JAMES DOUGLAS Inventor LUO TONG;WECK PETER MICHAEL;SEQUEIRA ANTONY;TENDULKAR NEELESH;BENTOV SHAI;LEVINE JAMES DOUGLAS
Classification G06F17/30 Main classification G06F17/30
Agency Agent
Principal claim
Address
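
The two-stage comparison outlined in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The record does not say which fields feed each hash, so the field choices here are assumptions: the intra-source hash covers title, company, location, and description, while the inter-source hash covers only title, company, and location. The function names, the SHA-1 digest, the naive sorted-suffix array, and the `min_run` threshold of 8 words are likewise hypothetical stand-ins for the "pre-determined number of contiguous words".

```python
import hashlib
from bisect import bisect_left


def _digest(*fields):
    # Normalize case and whitespace, then hash the joined fields.
    normalized = "\x1f".join(" ".join(f.lower().split()) for f in fields)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def intra_source_hash(job):
    # Assumption: the strict hash covers the full listing text.
    return _digest(job["title"], job["company"], job["location"], job["description"])


def inter_source_hash(job):
    # Assumption: the loose hash ignores the description.
    return _digest(job["title"], job["company"], job["location"])


def build_suffix_array(words):
    # Naive word-level suffix array: every suffix, sorted lexicographically.
    return sorted(words[i:] for i in range(len(words)))


def _common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_common_run(words_a, words_b):
    # Longest run of contiguous words shared by both word lists.
    suffixes = build_suffix_array(words_a)
    best = 0
    for i in range(len(words_b)):
        probe = words_b[i:]
        pos = bisect_left(suffixes, probe)
        # The best-matching suffix is adjacent to the insertion point.
        for j in (pos - 1, pos):
            if 0 <= j < len(suffixes):
                best = max(best, _common_prefix_len(suffixes[j], probe))
    return best


def classify(a, b, min_run=8):
    # min_run is a hypothetical threshold, not a value from the record.
    if intra_source_hash(a) == intra_source_hash(b):
        return "duplicate"
    if inter_source_hash(a) != inter_source_hash(b):
        return "distinct"
    # Duplicate candidates: compare descriptions word by word.
    run = longest_common_run(
        a["description"].lower().split(), b["description"].lower().split()
    )
    return "duplicate" if run >= min_run else "distinct"
```

In this sketch, two listings with matching title, company, and location but slightly different descriptions become duplicate candidates, and they are judged duplicates only if their descriptions share at least `min_run` contiguous words. A production system would likely replace the naive suffix sort with a more efficient suffix-array construction.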