Title of invention: PHRASE-BASED DETECTION OF DUPLICATE DOCUMENTS IN AN INFORMATION RETRIEVAL SYSTEM
Abstract: An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
Publication number: US2015248415(A1) Publication date: 2015.09.03
Application number: US201514713374 Filing date: 2015.05.15
Applicant: GOOGLE INC. Inventor: Patterson Anna L.
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Principal claim: 1. A computer-implemented method of selecting documents in a document collection in response to a query, the method comprising: receiving a query including a first phrase and a second phrase; retrieving, by at least one processor of a computing system, a posting list of documents containing the first phrase; for each document in the posting list: accessing, by at least one processor of the computing system, a list of related phrases of the first phrase, wherein the list indicates whether a related phrase is present in the document, the first phrase predicting the occurrence of each of the related phrases in the document collection, wherein the first phrase predicts an occurrence of a related phrase based on a measure of an actual co-occurrence rate of the related phrase and the first phrase in the document collection exceeding an expected co-occurrence rate of the related phrase and the first phrase in the document collection; comparing, by at least one processor of the computing system, the second phrase to the list of related phrases that are present in the document; and when the comparison indicates that the second phrase is a related phrase of the first phrase that is present in the document, then selecting the document to include in a result to the query, without retrieving a posting list of documents containing the second phrase.
Address: Mountain View CA US
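
The selection step recited in claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Posting`, `predicts`, `select_documents` and `INDEX`, the in-memory dictionary index, and the sample phrases and documents are all hypothetical, and the plain "actual rate greater than expected rate" comparison is a simplifying assumption about how the claimed measure might be applied.

```python
from typing import Dict, List, NamedTuple


class Posting(NamedTuple):
    """One posting-list entry for an indexed phrase (hypothetical layout)."""
    doc_id: str
    # Maps each related phrase of the indexed phrase to True if that
    # related phrase is present in this document, False otherwise.
    related_present: Dict[str, bool]


def predicts(actual_rate: float, expected_rate: float) -> bool:
    """Assumed test: a phrase predicts a related phrase when their actual
    co-occurrence rate exceeds the expected co-occurrence rate."""
    return actual_rate > expected_rate


def select_documents(first: str, second: str,
                     index: Dict[str, List[Posting]]) -> List[str]:
    """Select documents for a two-phrase query using only the posting
    list of the first phrase; the second phrase's list is never read."""
    selected = []
    for posting in index.get(first, []):
        # Compare the second phrase to the related phrases recorded as
        # present in this document.
        if posting.related_present.get(second, False):
            selected.append(posting.doc_id)
    return selected


# Made-up sample data for illustration only.
INDEX = {
    "australian shepherd": [
        Posting("doc1", {"blue merle": True, "agility training": False}),
        Posting("doc2", {"blue merle": False, "agility training": True}),
        Posting("doc3", {"blue merle": True, "agility training": True}),
    ]
}

matches = select_documents("australian shepherd", "blue merle", INDEX)
```

In this sketch the per-document presence flags stand in for the claimed "list of related phrases", so answering the query needs a single posting-list retrieval for the first phrase followed by one lookup per document.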