Title of invention System and method for identification of near duplicate user-generated content
Abstract A computer-implemented system and method for identification of near duplicate user-generated content in a networked system are disclosed. The apparatus in an example embodiment includes a data receiver to receive a first instance of user-generated content; a tokenizer to tokenize the first instance into a set of words, create a set of portions from the tokenized first instance, and assign weight to each portion of the set of portions; a magnitude calculator to calculate a magnitude for the first instance based on the weight of each portion; a resemblance score calculator to search a data store for a second instance with at least one portion in common with the first instance and calculate a resemblance score between the first instance and the second instance; and an account linker to link accounts associated with each of the first instance and the second instance.
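The pipeline named in the abstract (tokenize, build portions, weight them, compute a magnitude, score resemblance) could be sketched roughly as follows. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: portions are taken to be overlapping three-word sequences, the weight is a placeholder based on character count, the magnitude is a Euclidean norm, and the resemblance score is a cosine-style ratio. All function names are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def portions(words, size=3):
    """Build the set of overlapping word sequences (assumed shape of a 'portion')."""
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def weigh(portion_set):
    """Assign a weight to each portion; character count is only a placeholder."""
    return {p: float(sum(len(w) for w in p)) for p in portion_set}


def magnitude(weights):
    """Euclidean norm of the weights (an assumed definition of 'magnitude')."""
    return math.sqrt(sum(w * w for w in weights.values()))


def resemblance(weights_a, weights_b):
    """Cosine-style score computed over the portions both instances share."""
    denom = magnitude(weights_a) * magnitude(weights_b)
    if denom == 0:
        return 0.0
    shared = weights_a.keys() & weights_b.keys()
    return sum(weights_a[p] * weights_b[p] for p in shared) / denom


def profile(text):
    """Run tokenizing, portion building, and weighting for one instance."""
    return weigh(portions(tokenize(text)))


first = profile("Brand new red bicycle for sale in great condition")
second = profile("Brand new red bicycle for sale in good condition")
unrelated = profile("Vintage oak dining table with six chairs")

near_score = resemblance(first, second)
far_score = resemblance(first, unrelated)
```

Under these assumptions, two listings that differ by a single word share most of their portions and score well above an unrelated listing, which shares none and scores zero. A real system would also need the data-store lookup the abstract mentions, for example an index from portion to the instances containing it, which this sketch leaves out.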
Publication No. US9454610(B2) Publication date 2016.09.27
Application No. US201514727622 Filing date 2015.06.01
Applicant eBay Inc. Inventor Schuil Robin Johan
Classification G06F17/30;G06Q30/08 Main classification G06F17/30
Agency Schwegman, Lundberg & Woessner, P.A. Agent Schwegman, Lundberg & Woessner, P.A.
Main claim 1. A method comprising: using a computer processor, automatically identifying, among a plurality of existing instances of user-generated electronic content, one or more instances that are near duplicates of a new instance of user-generated electronic content, as determined based on a measured degree of similarity between the existing instances and the new instance; and for each of the identified near-duplicate existing instances of user-generated electronic content, determining whether a single account holder is responsible for submitting the new instance of user-generated electronic content and the near-duplicate existing instance of user-generated electronic content, wherein, for at least one of the near-duplicate existing instances of user-generated electronic content, the determination that a single account holder is responsible for submitting the new instance and the near-duplicate existing instance is based on finding an intersection between user data associated with a first account associated with the new instance and a second account associated with the near-duplicate existing instance, the user data identifying the single account holder and being distinct from the user-generated electronic content.
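The account-holder determination in the claim rests on an intersection between the user data of two accounts. A minimal sketch of that step is given below. The claim does not say which user data is compared, so the field names (`email`, `phone`, `payment_id`) and both function names are hypothetical, and exact equality of field values is an assumed matching rule.

```python
def shared_identifiers(account_a, account_b, fields=("email", "phone", "payment_id")):
    """Return the hypothetical user-data fields whose values match in both accounts."""
    return {
        field
        for field in fields
        if account_a.get(field) and account_a.get(field) == account_b.get(field)
    }


def same_account_holder(account_a, account_b):
    """Treat any non-empty intersection of user data as evidence of one holder."""
    return bool(shared_identifiers(account_a, account_b))


# Illustrative records only; the values are made up for this sketch.
new_poster = {"email": "a@example.com", "phone": "555-0100"}
old_poster = {"email": "b@example.com", "phone": "555-0100"}
stranger = {"email": "c@example.com", "phone": "555-0199"}
```

Here `new_poster` and `old_poster` would be linked through the shared phone number, while `stranger` would not. The comparison uses account data only, in line with the claim's requirement that the user data be distinct from the user-generated content.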
Address San Jose CA US