Title of invention: CONTENT BASED SIMILARITY DETECTION
Abstract: A computer implemented method includes computing a hash of each word in a collection of books to produce a numerical integer token using a reduced representation, and computing an Inverse Document Frequency (IDF) vector comprising the number of books the token appears in, for every token in the collection of books. The method also includes creating a token occurrence count vector for each book in the collection and normalizing the token occurrence count vector using the IDF vector to create a Term Frequency-Inverse Document Frequency (TF-IDF) vector. Further, the method includes reducing each TF-IDF vector by using random projections to obtain a final signature representing each book in the collection, creating at least two similarity scores between a target book and a list of candidate books, and, using a trained machine learning algorithm, determining whether each of the list of candidate books is similar to the target book.
Application publication number: US2015154497(A1)  Publication date: 2015.06.04
Application number: US201414219613  Filing date: 2014.03.19
Applicant: Kobo Incorporated  Inventors: BRAZIUNAS Darius;CHRISTENSEN Jordan;GIVONI Inmar Ella;ISAAC Neil
Classification: G06N5/04;G06F17/30;G06N99/00  Main classification: G06N5/04
Agency:  Agent:
Main claim: 1. A computer implemented method comprising: computing a hash of each word in a collection of books to produce a numerical integer token using a reduced representation; computing an Inverse Document Frequency (IDF) vector comprising the number of books said token appears in, for every token in said collection of books; creating a token occurrence count vector for each said book in said collection; normalizing said token occurrence count vector using said IDF vector to create a Term Frequency-Inverse Document Frequency (TF-IDF) vector; reducing each said TF-IDF vector by using random projections to obtain a final signature representing each said book in said collection; creating at least two similarity scores between a target book and a list of candidate books; and using a trained machine learning algorithm, determining whether each of said list of candidate books is similar to said target book.
Address: Toronto CA
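
Illustrative note: the steps recited in the main claim can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the applicant's implementation. The record does not specify the hash function, the size of the reduced token range, the IDF formula, the projection distribution, the signature length, which two similarity scores are used, or the learning algorithm. The sketch therefore assumes MD5 folded into 4096 buckets, a smoothed logarithmic IDF, Gaussian random projections to 32 dimensions, cosine similarity on signatures plus Jaccard overlap on token sets as the two scores, and a logistic decision function whose weights are hypothetical placeholders standing in for trained parameters. All function and constant names are invented for illustration.

```python
import hashlib
import math
import random
from collections import Counter

NUM_BUCKETS = 2 ** 12    # assumed size of the reduced token space
SIGNATURE_DIM = 32       # assumed length of the final signature


def token_of(word):
    """Hash a word to an integer token in a reduced range (MD5 is an assumption)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def count_vector(text):
    """Sparse token occurrence counts for one book."""
    return Counter(token_of(w) for w in text.split())


def document_frequency(count_vectors):
    """For every token, the number of books it appears in."""
    df = Counter()
    for counts in count_vectors:
        df.update(counts.keys())   # each book contributes at most 1 per token
    return df


def tfidf(counts, df, num_books):
    """Weight counts by a smoothed log IDF (the exact formula is an assumption)."""
    return {t: c * (math.log((1 + num_books) / (1 + df[t])) + 1.0)
            for t, c in counts.items()}


def signature(vec, seed=0):
    """Random projection of a sparse TF-IDF vector onto SIGNATURE_DIM floats."""
    sig = [0.0] * SIGNATURE_DIM
    for token, weight in vec.items():
        # Seeding by token reproduces the same projection row for every book.
        rng = random.Random(f"{seed}:{token}")
        for i in range(SIGNATURE_DIM):
            sig[i] += weight * rng.gauss(0.0, 1.0)
    return sig


def cosine(a, b):
    """First similarity score: cosine similarity of two signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Second similarity score: overlap of the two books' token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def is_similar(scores, weights=(4.0, 4.0), bias=-4.0):
    """Logistic decision on the scores; weights are hypothetical placeholders
    for parameters that would come from training on labeled book pairs."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z)) > 0.5


if __name__ == "__main__":
    books = {
        "target": "the white whale swam away from the ship",
        "candidate_1": "the white whale swam off from the ship",
        "candidate_2": "tax law for small firms and sole traders",
    }
    counts = {name: count_vector(text) for name, text in books.items()}
    df = document_frequency(counts.values())
    sigs = {name: signature(tfidf(c, df, len(books))) for name, c in counts.items()}
    for name in ("candidate_1", "candidate_2"):
        scores = (cosine(sigs["target"], sigs[name]),
                  jaccard(counts["target"], counts[name]))
        print(name, scores, is_similar(scores))
```

Seeding the generator per token avoids storing a full projection matrix, which is one possible design choice when the token range is large. A production system would more likely precompute the matrix and fit the classifier on labeled pairs.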