Title of invention DOCUMENT SIMILARITY CALCULATION METHOD, AND METHOD AND DEVICE FOR DETECTING APPROXIMATELY DUPLICATE DOCUMENTS
Abstract The present invention relates to a document similarity calculation method, and to a method and device for detecting approximately duplicate documents. The calculation method comprises: performing word segmentation on each of the two documents to be compared to obtain a token set for each document; calculating the edit similarity of every token pair across the two token sets, where the two tokens of each pair come from different token sets; creating an edge for each token pair whose edit similarity meets the requirement, with the edit similarity as the weight of that edge, thereby obtaining a weighted bipartite graph; calculating the maximum weighted matching value of the weighted bipartite graph; and using the maximum weighted matching value to calculate the similarity between the two documents. The method and device provided by the invention achieve a high accuracy rate and can effectively identify approximately duplicate documents whose tokens contain editing errors, thereby improving the detection accuracy for approximately duplicate documents, reducing computational complexity and improving computational efficiency.
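The steps named in the abstract can be illustrated with a minimal Python sketch. Several details are assumptions, since the abstract does not specify them: whitespace splitting stands in for word segmentation, edit similarity is taken as one minus the Levenshtein distance divided by the longer token's length, the threshold of 0.8 is arbitrary, and the final score is a Jaccard-style ratio built from the matching value. The function names are hypothetical, and the exhaustive matching search is only suitable for a handful of tokens; a practical version would more likely use a polynomial-time assignment algorithm such as the Hungarian method.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def max_weight_matching(weights) -> float:
    """Maximum total weight of a matching in a weighted bipartite graph.

    weights[i][j] is the edge weight between left token i and right token j,
    or 0.0 when there is no edge. Memoized exhaustive search: exponential in
    the number of right-side tokens, so for small inputs only.
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == n_left:
            return 0.0
        result = best(i + 1, used)  # option: leave left token i unmatched
        for j in range(n_right):
            if weights[i][j] > 0.0 and not used & (1 << j):
                result = max(result, weights[i][j] + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)


def document_similarity(doc_a: str, doc_b: str, threshold: float = 0.8) -> float:
    """Assumed final score: matching / (|A| + |B| - matching)."""
    # Whitespace splitting is a stand-in for a real word segmenter.
    tokens_a = sorted(set(doc_a.lower().split()))
    tokens_b = sorted(set(doc_b.lower().split()))
    if not tokens_a and not tokens_b:
        return 1.0
    # Keep an edge only when the edit similarity meets the threshold.
    weights = []
    for a in tokens_a:
        row = []
        for b in tokens_b:
            s = edit_similarity(a, b)
            row.append(s if s >= threshold else 0.0)
        weights.append(row)
    matched = max_weight_matching(weights)
    return matched / (len(tokens_a) + len(tokens_b) - matched)
```

Under these assumptions, `document_similarity("similar document", "similar documant")` pairs "document" with the misspelled "documant" at weight 0.875 instead of treating them as unrelated tokens, which is the behaviour the abstract attributes to the method when documents contain editing errors.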
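Publication number WO2014206241(A1) Publication date 2014.12.31
Application number WO2014CN80318 Filing date 2014.06.19
Applicant HUAWEI TECHNOLOGIES CO., LTD. Inventors LI, GUOLIANG;FENG, JIANHUA;WEI, JIANSHENG
Classification G06F17/27 Main classification G06F17/27
Agency Agent
Principal claim
Address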