Title of invention: PHRASE-BASED DETECTION OF DUPLICATE DOCUMENTS IN AN INFORMATION RETRIEVAL SYSTEM
Abstract: An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
Publication number: US2014156647(A1)  Publication date: 2014.06.05
Application number: US201313919830  Filing date: 2013.06.17
Applicant: Google Inc.  Inventor: Patterson Anna L.
Classification: G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim: 1. A method of detecting duplicate documents in search results, the method comprising: receiving a query comprising at least one phrase; retrieving a plurality of documents responsive to the query to form a search result; for each of the retrieved documents, generating a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of phrases in each sentence; responsive to the document descriptions of at least two documents matching, discarding at least one of the two documents from the search result.
Address: Mountain View CA US
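
The steps recited in the principal claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `describe` and `drop_duplicates`, the period-based sentence splitting, the substring test for phrase occurrence, the `max_sentences` cutoff, and the use of exact string equality to decide that two descriptions "match" are all assumptions made for illustration. The record does not specify any of these details.

```python
from typing import Dict, List, Set


def describe(document: str, phrases: Set[str], max_sentences: int = 2) -> str:
    """Build a description from the sentences that contain the most known phrases.

    Assumption: sentences end at '.', and a phrase "occurs" in a sentence
    when it appears as a lowercase substring.
    """
    sentences = [s.strip() for s in document.split(".") if s.strip()]

    def phrase_count(sentence: str) -> int:
        lowered = sentence.lower()
        return sum(1 for phrase in phrases if phrase in lowered)

    # Order sentences by phrase count, highest first. Python's sort is
    # stable, so sentences with equal counts keep their document order.
    ranked = sorted(sentences, key=phrase_count, reverse=True)
    return ". ".join(ranked[:max_sentences])


def drop_duplicates(results: Dict[str, str], phrases: Set[str]) -> List[str]:
    """Return document ids, keeping the first document for each description.

    Assumption: two descriptions "match" when they are identical strings.
    """
    seen: Set[str] = set()
    kept: List[str] = []
    for doc_id, text in results.items():
        description = describe(text, phrases)
        if description in seen:
            continue  # description matches an earlier document: discard
        seen.add(description)
        kept.append(doc_id)
    return kept


if __name__ == "__main__":
    # Hypothetical phrase list and search result, for illustration only.
    phrases = {"search engine", "duplicate documents"}
    results = {
        "a": "A search engine removes duplicate documents. It is fast. Users like it.",
        "b": "A search engine removes duplicate documents. It is fast. Mirrors differ.",
        "c": "Cooking pasta takes ten minutes. Salt the water.",
    }
    print(drop_duplicates(results, phrases))
```

In this example, documents `a` and `b` differ only in a trailing sentence that falls outside the two-sentence description, so their descriptions coincide and `b` is discarded. A production system would presumably use a more tolerant matching criterion than exact equality.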