Title: Method for improving search engine efficiency
Abstract: In a method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, the search engine comprises an array of search nodes hosted on one or more servers. An index of the stored documents is created. The search engine processes a user search query and returns a result set of query-matching documents. The index of the search engine is configured on the basis of one or more document properties and partitioned, replicated and distributed over the array of the search nodes. The search queries are processed on the basis of the distributed index. The method realizes a framework for distributing the index of a search engine across several hosts in a computer cluster, relying on three orthogonal mechanisms for index distribution, namely index partitioning, index replication, and assignment of replicas to hosts. In this manner, different ways of configuring the index of a search engine are obtained and provide a much improved resource usage and performance, combined with any desired level of fault tolerance.
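The three mechanisms named in the abstract (index partitioning, index replication, and assignment of replicas to hosts) can be illustrated with a minimal sketch. This is not the patented method: the hash-based partitioning, the round-robin assignment, and every function name and host label below are assumptions chosen only to show that the three steps can be varied independently.

```python
from collections import defaultdict
import zlib


def partition_index(doc_ids, num_partitions):
    """Partitioning (assumed scheme): hash each document ID into a partition."""
    partitions = defaultdict(list)
    for doc_id in doc_ids:
        p = zlib.crc32(doc_id.encode("utf-8")) % num_partitions
        partitions[p].append(doc_id)
    # Return every partition number, including empty partitions.
    return {p: partitions.get(p, []) for p in range(num_partitions)}


def replicate(partitions, num_replicas):
    """Replication: each partition gets num_replicas identical copies,
    identified here as (partition, replica_number) pairs."""
    return [(p, r) for p in sorted(partitions) for r in range(num_replicas)]


def assign_to_hosts(replicas, hosts):
    """Assignment (assumed scheme): round-robin over the hosts.

    Replicas of one partition are adjacent in the list, so they land on
    distinct hosts as long as num_replicas <= len(hosts).
    """
    assignment = defaultdict(list)
    for i, replica in enumerate(replicas):
        assignment[hosts[i % len(hosts)]].append(replica)
    return dict(assignment)


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(20)]           # hypothetical document IDs
    parts = partition_index(docs, num_partitions=3)
    layout = assign_to_hosts(replicate(parts, num_replicas=2),
                             ["host0", "host1", "host2"])
    for host, replicas in sorted(layout.items()):
        print(host, replicas)
```

Because each step takes only the output of the previous one, the partition count, the replication factor, and the placement policy can each be changed without touching the other two, which is the sense in which the abstract calls the mechanisms orthogonal.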
Publication number: US8799264(B2)    Publication date: 2014.08.05
Application number: US200812332979    Filing date: 2008.12.11
Applicant: Microsoft Corporation    Inventors: Gehrke Johannes;Van Renesse Robbert;Schneider Fred
Classification: G06F7/08;G06F17/30    Main classification: G06F7/08
Agency:     Agents: Wong Tom;Ross Jim;Minhas Micky
Main claim: 1. A method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, comprising: using an indexing subsystem of the search engine to crawl the stored documents and generate an index, wherein applying a user search query to the index returns a result set of at least some query-matching documents, wherein the search engine comprises an array of search nodes hosted on one or more servers, wherein the array of search nodes comprises r rows and c columns characterized by classifying a query keyword in two dimensions, a first dimension being a posting list size and a second dimension being an arrival rate that is determined using an arrival time of each query keyword, and wherein the index of the search engine is configured on a basis of one or more document properties, at least one of a fault-tolerance level, a required search performance, document meta-properties, and an optimal resource utilization; partitioning the index; replicating the index to create replicas that each comprise a same content as each of the other replicas; distributing the partitioned and replicated index over the array of search nodes such that index partitions and replicas thereof are assigned to the servers hosting the array of search nodes, wherein distributing the index takes into account at least one of: posting lists size differences and posting lists popularity differences; and processing a search query on the basis of the distributed index.
Address: Redmond WA US
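
The main claim classifies each query keyword in two dimensions: posting list size, and an arrival rate determined from the arrival times of the keyword. The record does not say how the rate is computed or where the class boundaries lie, so the sketch below makes assumptions throughout: the rate is estimated as intervals per unit time over the observed span, and the thresholds, class labels, and function names are hypothetical.

```python
def arrival_rate(arrival_times):
    """Estimate arrivals per unit time from observed arrival timestamps.

    Assumed estimator: (number of inter-arrival intervals) / (observed span).
    Returns 0.0 when fewer than two arrivals were seen or the span is zero.
    """
    if len(arrival_times) < 2:
        return 0.0
    span = max(arrival_times) - min(arrival_times)
    if span <= 0:
        return 0.0
    return (len(arrival_times) - 1) / span


def classify_keyword(posting_list_size, arrival_times,
                     size_threshold=1000, rate_threshold=1.0):
    """Place a keyword in a 2-D grid: (size class, popularity class).

    Both thresholds are hypothetical tuning values, not taken from the claim.
    """
    size_class = "long" if posting_list_size >= size_threshold else "short"
    rate = arrival_rate(arrival_times)
    rate_class = "popular" if rate >= rate_threshold else "rare"
    return size_class, rate_class


if __name__ == "__main__":
    # A keyword with a long posting list, queried every 0.5 time units.
    print(classify_keyword(5000, [0.0, 0.5, 1.0]))
    # A keyword with a short posting list, queried twice in 100 time units.
    print(classify_keyword(10, [0.0, 100.0]))
```

A classification of this kind could feed the distribution step, since the claim says distribution takes posting list size differences or popularity differences into account; how the classes map to rows and columns of the node array is not specified in this record.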