Title of invention IDENTIFYING SALIENT ITEMS IN DOCUMENTS
Abstract A set of representations of item-page pairs of items and respective web pages that include the respective items is obtained, each representation including feature function values indicating weights associated with features of associated web pages, the features including page classification features. An annotated set of labeled training data that is annotated with salience annotation values of items for respective web pages that include the items is obtained. The salience annotation values are determined based on a soft function, by determining a first count of a total number of user queries associated with corresponding visits to the respective web pages, and determining a ratio of a second count to the first count, the second count determined as a cardinality of a subset of the corresponding visits that are associated with user queries that include the item, the subset included in the corresponding visits. Models are trained using the annotated set.
Publication number US2014279730(A1) Publication date 2014.09.18
Application number US201313798198 Filing date 2013.03.13
Applicant MICROSOFT CORPORATION Inventors Gamon Michael;Pantel Patrick;Song Xinying;Yano Tae;Apacible Johnson Tan
Classification G06N99/00 Main classification G06N99/00
Agency Agent
Main claim 1. A system comprising: a device that includes at least one processor, the device including a salient item identification engine comprising instructions tangibly embodied on a computer readable storage medium for execution by the at least one processor, the salient item identification engine including: a log data acquisition component configured to obtain query data and corresponding click data that indicates web pages visited, in association with respectively corresponding user queries, based on information mined from a web search log; and a soft labeling component configured to determine a salience annotation value of an item for respective ones of the web pages, based on determining a first count of a total number of the user queries that are associated with one or more corresponding visits to the respective ones of the web pages, and determining a ratio of a second count to the first count, the second count determined as a cardinality of a subset of the corresponding visits that are associated with a group of the user queries that include the item, the subset included in the one or more corresponding visits.
Address Redmond WA US
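
The soft labeling ratio described in the abstract and main claim can be illustrated with a minimal sketch. This is not the patented implementation. The function name `soft_salience`, the `(query, url)` record layout, the sample URLs, and the use of case-insensitive substring matching to decide whether a query "includes the item" are all assumptions made for illustration only.

```python
from collections import defaultdict


def soft_salience(click_log, item):
    """Return {url: ratio} for one item (hypothetical helper).

    click_log: iterable of (query, url) pairs, one per clicked visit.
    For each url, the first count is the number of query-driven visits,
    and the second count is the number of those visits whose query
    contains the item. The salience value is second / first.
    """
    total = defaultdict(int)      # first count, per page
    with_item = defaultdict(int)  # second count, per page
    needle = item.lower()
    for query, url in click_log:
        total[url] += 1
        # Assumption: "includes the item" means substring match.
        if needle in query.lower():
            with_item[url] += 1
    return {url: with_item[url] / total[url] for url in total}


# Made-up click records, for illustration only.
click_log = [
    ("seattle weather", "example.com/a"),
    ("seattle hotels", "example.com/a"),
    ("rain forecast", "example.com/a"),
    ("best coffee", "example.com/a"),
    ("seattle news", "example.com/b"),
]
scores = soft_salience(click_log, "Seattle")
```

With these sample records, two of the four visits to `example.com/a` come from queries containing the item, so its value is 0.5, while the single visit to `example.com/b` gives 1.0. The resulting values lie in the range 0 to 1, which is what makes them usable as soft (non-binary) training labels.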