Title of invention: Selection of atoms for search engine retrieval
Abstract: Methods are provided for populating search indexes with atoms identified in documents. Documents that are to be indexed are identified, and for each document, atoms are identified and are categorized as unigrams, n-grams, and n-tuples. A list of atom/document pairs is generated such that an information metric can be computed for each pair. An information metric represents a ranking of the atom in relation to the particular document. Based on the information metric, some atom/document pairs are discarded and others are indexed.
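The index-population steps summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the functions `extract_atoms` and `build_index`, the `keep_per_doc` cutoff, the restriction to bigrams and unordered word pairs, and above all the use of a smoothed TF-IDF value as a stand-in for the information metric, which the record does not define.

```python
import math
from collections import Counter
from itertools import combinations


def extract_atoms(text):
    """Count the atoms of one document (hypothetical helper).

    Assumed atom kinds: single words (unigrams), adjacent word pairs
    (n-grams with n=2), and unordered co-occurring word pairs (n-tuples).
    """
    words = text.lower().split()
    atoms = Counter()
    for w in words:
        atoms[("unigram", w)] += 1
    for a, b in zip(words, words[1:]):
        atoms[("ngram", f"{a} {b}")] += 1
    for a, b in combinations(sorted(set(words)), 2):
        atoms[("ntuple", f"{a}+{b}")] += 1
    return atoms


def build_index(docs, keep_per_doc=3):
    """Map each kept atom to a list of (doc_id, metric) postings.

    docs: dict of doc_id -> text. The metric is smoothed TF-IDF, used
    only as a placeholder for the information metric.
    """
    per_doc = {doc_id: extract_atoms(text) for doc_id, text in docs.items()}

    # Document frequency of every atom across the whole set.
    doc_freq = Counter()
    for atoms in per_doc.values():
        doc_freq.update(atoms.keys())

    n_docs = len(docs)
    index = {}
    for doc_id, atoms in per_doc.items():
        # Score every atom/document pair.
        scored = []
        for atom, tf in atoms.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[atom])) + 1.0
            scored.append((tf * idf, atom))
        # Keep the highest-scoring pairs for this document, discard the rest.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for metric, atom in scored[:keep_per_doc]:
            index.setdefault(atom, []).append((doc_id, metric))
    return index
```

Calling `build_index({"d1": "red apple red apple", "d2": "green apple"})` keeps three atom/document pairs per document and drops the rest. A fixed per-document cutoff is only one possible selection rule; a metric threshold would fit the description equally well.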
Publication number: US9342582(B2) Publication date: 2016.05.17
Application number: US201113045278 Filing date: 2011.03.10
Applicant: Microsoft Technology Licensing, LLC Inventors: Risvik Knut Magne; Hopcroft Mike; Bennett John G.; Kalyanaraman Karthik; Chilimbi Trishul
Classification: G06F17/30; G06F7/00; G06F17/00 Main classification: G06F17/30
Agency: Agents: Meyers Jessica; Ross Jim; Minhas Micky
Independent claim: 1. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising: identifying a set of documents to be indexed in a search index; for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples; based on the identified set of documents and the plurality of atoms, generating a list of atom/document pairs; computing an information metric for each atom/document pair, wherein the information metric represents a pre-computed ranking of the atom used during a search query in relation to the particular document; based on the information metric for each atom/document pair, selecting a subset of the atom/document pairs that are most relevant to the particular document from which the atoms were identified; populating the search index using the subset of the atom/document pairs for the particular document, wherein identifying relevant documents for the search query from the search index is based on a pruning algorithm that computes a preliminary score for each of the documents to select a subset of the set of documents based on the preliminary score, wherein the preliminary score is computed using the information metric pre-computed for each atom/document pair and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the relevant documents.
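The query-time pruning step recited at the end of the claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the index layout (atom mapped to a dict of doc_id to pre-computed metric), the function names `preliminary_score` and `prune_candidates`, the `keep` parameter, and the choice of a plain sum as the simplified scoring function are all hypothetical.

```python
def preliminary_score(query_atoms, doc_id, index):
    """Simplified scoring function (assumed): sum the pre-computed
    metrics of the query atoms that are indexed for this document."""
    return sum(index.get(atom, {}).get(doc_id, 0.0) for atom in query_atoms)


def prune_candidates(query_atoms, index, keep=2):
    """Return the ids of the `keep` documents with the highest preliminary
    score; only these would be passed on to a final, costlier ranker."""
    # Any document indexed under at least one query atom is a candidate.
    candidates = set()
    for atom in query_atoms:
        candidates.update(index.get(atom, {}))

    scored = [(preliminary_score(query_atoms, d, index), d) for d in candidates]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:keep]]
```

With a toy index such as `{"apple": {"d1": 2.0, "d2": 1.0, "d3": 0.5}, "pie": {"d3": 1.2, "d2": 0.1}}`, the query atoms `["apple", "pie"]` give preliminary scores of 2.0 for d1, 1.7 for d3, and 1.1 for d2, so `prune_candidates(..., keep=2)` returns `["d1", "d3"]`. Because the metrics are read from the index rather than recomputed, the pruning pass stays cheap relative to the final ranking.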
Address: Redmond, WA, US