Title of invention: System and method for identification of near duplicate user-generated content
Abstract: A computer-implemented system and method relates to identifying near duplicate content. An example embodiment includes a data receiver to receive a first instance of user-generated content and a tokenizer to tokenize the first instance into a set of words, create a set of portions from the tokenized first instance, and assign weight to each portion of the set of portions. The example embodiment also includes a magnitude calculator to calculate a magnitude for the first instance based on the weight of each portion and a resemblance score calculator to search a data store for a second instance with at least one portion in common with the first instance and calculate a resemblance score between the first instance and the second instance.
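The tokenizer and magnitude calculator described in the abstract can be illustrated with a minimal Python sketch. The record does not give the actual formulas, so everything specific here is an assumption made for illustration: portions are taken to be runs of two consecutive words, the weight is an inverse of the portion's occurrence count in the repository, and the magnitude is the Euclidean norm of the weights. The function names and the sample repository counts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_portions(tokens, size=2):
    """Return the set of contiguous token runs ("portions") of the given size.

    Assumption: a portion is a run of `size` consecutive tokens.
    """
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def portion_weight(portion, repository_counts):
    """Assumed weighting: portions that are rarer in the repository weigh more."""
    return 1.0 / (1.0 + repository_counts.get(portion, 0))


def magnitude(weights):
    """Assumed magnitude: Euclidean norm of the portion weights."""
    return math.sqrt(sum(w * w for w in weights.values()))


# Hypothetical repository statistics: ("brand", "new") already occurs 3 times.
repository_counts = Counter({("brand", "new"): 3})

tokens = tokenize("Brand new red bicycle")
portions = make_portions(tokens)
weights = {p: portion_weight(p, repository_counts) for p in portions}
mag = magnitude(weights)
```

Under these assumptions, a common phrase such as "brand new" contributes little to the magnitude, while the rarer portions dominate it.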
Publication number: US9058378(B2)  Publication date: 2015.06.16
Application number: US200812101561  Filing date: 2008.04.11
Applicant: eBay Inc.  Inventor: Schuil Robin Johan
Classification: G06F17/00;G06F17/30;G06Q30/08  Main classification: G06F17/00
Agency: Schwegman Lundberg & Woessner, P.A.  Agent: Schwegman Lundberg & Woessner, P.A.
Main claim: 1. A method comprising: receiving a first instance of user-generated content, the content being any part of a content repository related to product offerings in a network-based marketplace; tokenizing, by use of a processor, the first instance into a set of words parsed from the first instance content; creating a set of portions from the tokenized first instance, each portion of the set of portions comprising a plurality of tokens parsed from the tokenized first instance; assigning weight to each portion of the set of portions, the weight being based on a quantity of occurrences of that corresponding portion in the content repository; calculating a magnitude for the first instance based on the weight of each portion; searching the content repository for a second instance with at least one portion in common with the first instance, the second instance including content that is any part of the content repository; calculating a resemblance score between the first instance and the second instance; and in response to the resemblance score being equal to or greater than a pre-defined threshold, testing whether accounts associated with each of the first instance and the second instance belong to the same user by comparing user data associated with the accounts, the user data identifying a user or an account and being distinct from the user-generated content; and in response to finding an intersection between the user data associated with the accounts, linking the accounts.
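The later steps of the claim (searching for a second instance that shares a portion, scoring the resemblance, and linking accounts on a user-data intersection) might be sketched as follows. This is only an illustration under stated assumptions: the claim does not define the resemblance score, so a cosine-style ratio of shared portion weights to the product of the two magnitudes is assumed; the inverted index, the threshold of 0.7, the listing identifier, and the sample user data are all hypothetical.

```python
import math


def resemblance(weights_a, weights_b):
    """Assumed score: cosine similarity over weighted portions."""
    shared = weights_a.keys() & weights_b.keys()
    dot = sum(weights_a[p] * weights_b[p] for p in shared)
    mag_a = math.sqrt(sum(w * w for w in weights_a.values()))
    mag_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)


def find_candidates(weights, index):
    """Return ids of stored instances sharing at least one portion.

    `index` is a hypothetical inverted index: portion -> set of instance ids.
    """
    found = set()
    for portion in weights:
        found |= index.get(portion, set())
    return found


def should_link(score, threshold, user_data_a, user_data_b):
    """Link accounts only if the score meets the threshold AND the
    user data of the two accounts intersect."""
    if score < threshold:
        return False
    return bool(set(user_data_a) & set(user_data_b))


# Hypothetical weighted portions for two listings.
first = {("brand", "new"): 0.25, ("new", "red"): 1.0, ("red", "bicycle"): 1.0}
second = {("new", "red"): 1.0, ("red", "bicycle"): 1.0, ("bicycle", "sale"): 1.0}
index = {portion: {"listing-2"} for portion in second}

candidates = find_candidates(first, index)
score = resemblance(first, second)

# Hypothetical user data (contact details), distinct from the listing text.
linked = should_link(
    score, 0.7,
    {"alice@example.com", "555-0100"},
    {"bob@example.com", "555-0100"},
)
```

In this sketch the two accounts are linked because the listings score above the assumed threshold and the accounts share a phone number. A high score alone, or shared user data alone, would not be enough.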
Address: San Jose, CA, US