Title of invention Generating similarity scores for matching non-identical data strings
Abstract A system and method for determining the likelihood that two documents describe substantially similar subject matter is presented. A set of tokens is obtained for each of the two documents, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising one token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
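The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace tokenizer, the use of `difflib.SequenceMatcher` as the per-pair metric, the default threshold of 0.8, and the aggregation rule (sum of matched scores divided by the larger token count, capped at 1) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def tokenize(text):
    """Split a document into lowercase whitespace-delimited tokens (assumed tokenizer)."""
    return text.lower().split()


def token_similarity(a, b):
    """Score one token pair in [0, 1]; SequenceMatcher is a stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(doc_a, doc_b, threshold=0.8):
    """Return (document score, matched token pairs) for two documents."""
    tokens_a = tokenize(doc_a)
    tokens_b = tokenize(doc_b)
    if not tokens_a or not tokens_b:
        return 0.0, []

    # Matrix of token pairs: every pair takes one token from each document.
    scored = [(token_similarity(a, b), a, b) for a, b in product(tokens_a, tokens_b)]

    # Set of matched tokens: only pairs scoring above the threshold are kept.
    matched = [(s, a, b) for s, a, b in scored if s > threshold]

    # Assumed aggregation: sum the matched scores, normalise by the larger
    # token count, and cap at 1.0 so repeated tokens cannot exceed certainty.
    total = sum(s for s, _, _ in matched)
    score = min(1.0, total / max(len(tokens_a), len(tokens_b)))
    return score, matched
```

Calling `document_similarity("colour", "color")` matches the single token pair despite the spelling difference, whereas two documents with no tokens scoring above the threshold yield a score of 0.0 and an empty match list.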
Publication number US7814107(B1) Publication date 2010.10.12
Application number US20070754241 Filing date 2007.05.25
Applicant AMAZON TECHNOLOGIES, INC. Inventors THIRUMALAI SRIKANTH;TERRA EGIDIO;MOHAN VIJAI;TOMKO MARK J.;EMERY GRANT M.;MANOHARAN ASWATH
Classification G06F7/00;G06F17/00 Main classification G06F7/00
Agency Agent
Principal claim
Address