Title of invention |
System and method for identification of near duplicate user-generated content |
Abstract |
A computer-implemented system and method for identification of near duplicate user-generated content in a networked system are disclosed. The apparatus in an example embodiment includes a data receiver to receive a first instance of user-generated content; a tokenizer to tokenize the first instance into a set of words, create a set of portions from the tokenized first instance, and assign weight to each portion of the set of portions; a magnitude calculator to calculate a magnitude for the first instance based on the weight of each portion; a resemblance score calculator to search a data store for a second instance with at least one portion in common with the first instance and calculate a resemblance score between the first instance and the second instance; and an account linker to link accounts associated with each of the first instance and the second instance. |
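The pipeline named in the abstract (tokenize, form portions, weight them, compute a magnitude, score resemblance) can be illustrated with a minimal sketch. The abstract does not say how portions are formed, how weights are assigned, or which formula yields the resemblance score, so everything concrete below is an assumption made for illustration: portions are taken to be overlapping three-word runs, a portion's weight is its occurrence count, the magnitude is the Euclidean norm of the weights, and the score is a cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_portions(words, size=3):
    """Assumed portioning: overlapping runs of `size` consecutive words."""
    if len(words) < size:
        return [tuple(words)] if words else []
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def weigh(portions):
    """Assumed weighting: a portion's weight is its occurrence count."""
    return dict(Counter(portions))


def magnitude(weights):
    """Assumed magnitude: Euclidean norm of the portion weights."""
    return math.sqrt(sum(w * w for w in weights.values()))


def resemblance(a, b):
    """Assumed score: cosine similarity over the portions both instances share."""
    mag_a, mag_b = magnitude(a), magnitude(b)
    if mag_a == 0 or mag_b == 0:
        return 0.0
    shared = a.keys() & b.keys()
    return sum(a[p] * b[p] for p in shared) / (mag_a * mag_b)


def profile(text):
    """Tokenize, portion, and weight one instance of user-generated content."""
    return weigh(make_portions(tokenize(text)))


first = profile("brand new red bike for sale in great condition")
second = profile("brand new red bike for sale in good condition")
third = profile("looking for a used piano")
print(resemblance(first, second), resemblance(first, third))
```

Under these assumptions, two listings that differ by one word share most of their portions and score high, while unrelated listings share none and score zero. Only instances with at least one shared portion can score above zero, which is why the abstract's search for a second instance "with at least one portion in common" is a sensible pre-filter.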
Publication number |
US9454610(B2) |
Publication date |
2016.09.27 |
Application number |
US201514727622 |
Filing date |
2015.06.01 |
Applicant |
eBay Inc. |
Inventor |
Schuil Robin Johan |
Classification codes |
G06F17/30;G06Q30/08 |
Main classification code |
G06F17/30 |
Agency |
Schwegman, Lundberg & Woessner, P.A. |
Agent |
Schwegman, Lundberg & Woessner, P.A. |
Principal claim |
1. A method comprising:
using a computer processor,
automatically identifying, among a plurality of existing instances of user-generated electronic content, one or more instances that are near duplicates of a new instance of user-generated electronic content, as determined based on a measured degree of similarity between the existing instances and the new instance; and
for each of the identified near-duplicate existing instances of user-generated electronic content, determining whether a single account holder is responsible for submitting the new instance of user-generated electronic content and the near-duplicate existing instance of user-generated electronic content,
wherein, for at least one of the near-duplicate existing instances of user-generated electronic content, the determination that a single account holder is responsible for submitting the new instance and the near-duplicate existing instance is based on finding an intersection between user data associated with a first account associated with the new instance and a second account associated with the near-duplicate existing instance, the user data identifying the single account holder and being distinct from the user-generated electronic content. |
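The account-holder determination in the claim rests on finding an intersection between the user data of two accounts. A minimal sketch of that check follows. The claim does not enumerate which user data is compared, so the `Account` type, the `(kind, value)` representation, and the example kinds (`email`, `phone`) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    account_id: str
    # Hypothetical identifying user data as (kind, value) pairs. It is kept
    # separate from the user-generated content itself, as the claim requires.
    user_data: set = field(default_factory=set)


def shared_user_data(first, second):
    """Return the user-data items that both accounts have in common."""
    return first.user_data & second.user_data


def same_account_holder(first, second):
    """Assumed rule: any non-empty intersection points to one account holder."""
    return bool(shared_user_data(first, second))


new_account = Account("A1", {("email", "seller@example.com"), ("phone", "555-0100")})
old_account = Account("A2", {("email", "other@example.com"), ("phone", "555-0100")})
unrelated = Account("A3", {("email", "third@example.com")})

print(same_account_holder(new_account, old_account))
print(same_account_holder(new_account, unrelated))
```

Here the first two accounts share a phone number and are treated as belonging to one holder, while the third shares nothing. A real system would likely normalize values before comparing them and might require stronger evidence than a single shared item; the claim leaves both choices open.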
Address |
San Jose CA US |