Title: Deriving document similarity indices
Abstract: Methods, systems, and computer program products are provided for deriving and updating document similarity indices for a plurality of documents. The number of maintained similarities can be controlled to conserve CPU and storage resources.
Publication number: US8793242(B2)
Publication date: 2014.07.29
Application number: US201313922168
Filing date: 2013.06.19
Applicant: Microsoft Corporation
Inventors: Gherman Sorin; Mukerjee Kunal; Prout Adam
Classification: G06F17/30
Main classification: G06F17/30
Agency:
Agents: Chen Nicholas; Haslam Brian; Minhas Micky
Main claim: 1. A computing system comprising: at least one processor; and one or more storage devices having stored computer-executable instructions which, when executed by the at least one processor, implement a method for deriving a document similarity index for a plurality of documents, the method comprising:
an act of accessing a document;
an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
an act of identifying a specified number of most significant keywords in the document based on weights in the tag index;
for at least one keyword in the specified number of the most significant keywords, an act of determining the corresponding weight of the at least one keyword in each document in the plurality of documents;
an act of identifying a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;
for each candidate document in the plurality of candidate documents, an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document; and
an act of selecting full similarities for one or more candidate documents for inclusion in the document similarity index to indicate documents that are similar to the document, selection of the full similarities for the one or more candidate documents being based on at least the full similarity calculations.
Address: Redmond WA US
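
The acts recited in the main claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: tag indices are given as precomputed keyword-to-weight dictionaries, a keyword counts as "significant" in another document when its weight meets a hypothetical `min_weight` threshold, the full similarity is taken to be a dot product of weights, and the function names, parameters, and sample data are all invented here.

```python
from heapq import nlargest
from typing import Dict, List, Tuple

TagIndex = Dict[str, float]  # keyword -> weight within one document


def top_keywords(tags: TagIndex, k: int) -> List[str]:
    """Return the k highest-weighted keywords of a tag index."""
    return nlargest(k, tags, key=lambda kw: tags[kw])


def find_candidates(doc_tags: TagIndex, corpus: Dict[str, TagIndex],
                    k: int, min_weight: float) -> List[str]:
    """Cheap pre-filter: keep documents in which at least one of the
    document's top-k keywords has weight >= min_weight (assumed rule)."""
    top = top_keywords(doc_tags, k)
    return [doc_id for doc_id, tags in corpus.items()
            if any(tags.get(kw, 0.0) >= min_weight for kw in top)]


def full_similarity(doc_tags: TagIndex, other_tags: TagIndex) -> float:
    """Assumed measure: dot product over ALL keywords of the document,
    not only the top-k ones used for candidate selection."""
    return sum(w * other_tags.get(kw, 0.0) for kw, w in doc_tags.items())


def similarity_index(doc_tags: TagIndex, corpus: Dict[str, TagIndex],
                     k: int = 2, min_weight: float = 0.1,
                     max_results: int = 2) -> List[Tuple[str, float]]:
    """Score only the candidates and keep at most max_results entries,
    which bounds the number of maintained similarities."""
    candidates = find_candidates(doc_tags, corpus, k, min_weight)
    scored = [(c, full_similarity(doc_tags, corpus[c])) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]


# Hypothetical sample data, invented for this sketch.
DOC: TagIndex = {"index": 0.5, "similarity": 0.3, "document": 0.2}
CORPUS: Dict[str, TagIndex] = {
    "a": {"index": 0.4, "similarity": 0.4, "cache": 0.2},
    "b": {"index": 0.05, "storage": 0.95},
    "c": {"similarity": 0.6, "document": 0.4},
    "d": {"storage": 1.0},
}

ranked = similarity_index(DOC, CORPUS)
print([doc_id for doc_id, _ in ranked])  # ['a', 'c']
```

In this sketch the expensive full comparison runs only on documents that pass the top-keyword filter, and `max_results` caps how many similarities are kept per document, which corresponds to the abstract's point about conserving CPU and storage resources.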