Title SEARCH AND RETRIEVAL OF ELECTRONIC DOCUMENTS USING KEY-VALUE BASED PARTITION-BY-QUERY INDICES
Abstract Methods and systems for providing a search engine capability for large datasets are disclosed. These methods and systems employ a Partition-by-Query index containing key-value pairs, with keys reflecting concept-ordered search phrases and values reflecting ordered lists of document references that are responsive to the concept-ordered search phrase in the corresponding key. A large Partition-by-Query index may be partitioned across multiple servers depending on the size of the index, or the size of the index may be reduced by compressing query-reference pairs into clusters. The methods and systems described herein may provide suggestions and spelling corrections to the user, thereby improving the user's search engine experience while meeting user expectations for search quality and responsiveness.
Publication No. US2015356106(A1) Publication Date 2015.12.10
Application No. US201514831017 Filing Date 2015.08.20
Applicant Uber Technologies, Inc. Inventor Hendrey Geoffrey Rummens
Classification G06F17/30 Main Classification G06F17/30
Agency Agent
Principal Claim 1. A method for configuring a search engine to provide suggested search queries in response to input search queries for searching a corpus of documents, wherein each document contains a plurality of tokens, the method comprising: generating, by a computing device, tokens from the documents in the corpus; generating, by the computing device, for each of the tokens a plurality of residual strings, wherein each residual string for a token comprises a one-character or multi-character variation of the token; generating, by the computing device, for each token a direct producer list, the direct producer list for a token comprising the plurality of residual strings for the token, and an associated weight for each residual string based upon the number of character variations between the token and the residual string; forming, by the computing device, for each residual string at least one indirect producer list by propagating to the residual string the direct producer lists of the tokens from which the residual string was generated; propagating, by the computing device, each token with a corresponding weight to other tokens having one or more common residual strings, wherein the corresponding weight is based upon the weights of residual strings associated with both the token and the other tokens; storing, by the computing device, the tokens, the associated weights propagated to each of the tokens, and the indirect producer list for each residual string associated with each of the tokens, as a confusion set, wherein the residual strings and the tokens in the indirect producer list are the suggested search queries for the residual string.
Address San Francisco CA US
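
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: residual strings are formed by deleting one or two characters from a token, a residual's weight is 1 / (1 + number of deleted characters), and the weight propagated between two tokens is the largest product of their weights over a shared residual. The function names (`tokenize`, `direct_producer_list`, `build_confusion_set`, `suggest`) and the dictionary layout of the confusion set are likewise hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(documents):
    """Collect the distinct lowercase word tokens found in the corpus."""
    tokens = set()
    for doc in documents:
        tokens.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return tokens


def direct_producer_list(token, max_deletions=2):
    """Map each residual string of `token` to a weight.

    Assumption: residuals are formed by deleting 1..max_deletions characters,
    and the weight 1 / (1 + deletions) shrinks as the variation grows.
    """
    direct = {}
    n = len(token)
    for k in range(1, min(max_deletions, n - 1) + 1):
        for keep in combinations(range(n), n - k):
            residual = "".join(token[i] for i in keep)
            direct[residual] = 1.0 / (1 + k)
    return direct


def build_confusion_set(documents, max_deletions=2):
    """Build direct lists, indirect lists, and token-to-token weights."""
    tokens = tokenize(documents)
    direct = {t: direct_producer_list(t, max_deletions) for t in tokens}

    # Indirect producer lists: each residual receives the direct producer
    # list of every token from which it was generated.
    indirect = defaultdict(dict)
    for token, residuals in direct.items():
        for residual in residuals:
            indirect[residual][token] = residuals

    # Propagate each token to the other tokens that share a residual with it.
    # Assumption: keep the largest product of the two residual weights.
    neighbors = defaultdict(dict)
    for residual, producers in indirect.items():
        for a, b in combinations(sorted(producers), 2):
            weight = direct[a][residual] * direct[b][residual]
            if weight > neighbors[a].get(b, 0.0):
                neighbors[a][b] = weight
                neighbors[b][a] = weight

    return {
        "direct": direct,
        "indirect": dict(indirect),
        "neighbors": dict(neighbors),
    }


def suggest(confusion_set, query):
    """Rank the tokens that produce `query` as a residual, best weight first."""
    producers = confusion_set["indirect"].get(query, {})
    return sorted(producers, key=lambda t: (-producers[t][query], t))


confusion_set = build_confusion_set(["search engine", "cat cart"])
print(suggest(confusion_set, "serch"))
print(confusion_set["neighbors"]["cat"])
```

In this sketch a misspelled query is matched only when it equals a stored residual, so it covers dropped characters but not inserted or substituted ones; handling those would require also generating residuals of the query itself, which is left out here.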