Title of invention High precision set expansion for large concepts
Abstract A set expansion system is described herein that improves precision, recall, and performance of prior set expansion methods for large sets of data. The system maintains high precision and recall by 1) identifying the quality of particular lists and applying that quality through a weight, 2) allowing for the specification of negative examples in a set of seeds to reduce the introduction of bad entities into the set, and 3) applying a cutoff to eliminate lists that include a low number of positive matches. The system may perform multiple passes to first generate a good candidate result set and then refine the set to find the set with the highest quality. The system may also apply MapReduce or other distributed processing techniques to allow calculation in parallel. Thus, the system efficiently expands large concept sets from a potentially small set of initial seeds using readily available web data.
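The cutoff described in the abstract can be illustrated with a minimal sketch. The function names, the `min_positive_matches` parameter, and its default value are assumptions made for illustration only; the record does not state a threshold or how candidates are gathered from the surviving lists.

```python
from typing import Dict, Set


def apply_cutoff(
    lists: Dict[str, Set[str]],
    positive_seeds: Set[str],
    min_positive_matches: int = 2,  # hypothetical threshold, not from the source
) -> Dict[str, Set[str]]:
    """Keep only the lists that contain enough positive seeds."""
    return {
        name: items
        for name, items in lists.items()
        if len(items & positive_seeds) >= min_positive_matches
    }


def candidate_items(
    kept_lists: Dict[str, Set[str]],
    positive_seeds: Set[str],
    negative_seeds: Set[str],
) -> Set[str]:
    """Assumed first pass: pool items from surviving lists, excluding all seeds."""
    pool: Set[str] = set().union(*kept_lists.values())
    return pool - positive_seeds - negative_seeds


# Illustrative data only.
example_lists = {
    "colors": {"red", "green", "blue"},
    "mixed": {"red", "apple"},
}
kept = apply_cutoff(example_lists, {"red", "green"})
print(sorted(kept))
print(sorted(candidate_items(kept, {"red", "green"}, {"apple"})))
```

In this sketch the list named "mixed" is dropped because it matches only one positive seed, so the negative seed it carries never reaches the candidate pool.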
Publication number US9547718(B2) Publication date 2017.01.17
Application number US201113325072 Filing date 2011.12.14
Applicant Microsoft Technology Licensing, LLC Inventors Huang Jiewen;Chen Zhimin;Arasu Arvind;Narasayya Vivek
Classification number G06F17/30 Main classification number G06F17/30
Agency Agent Chen Nicholas;Drakos Kate;Minhas Micky
Main claim 1. A computer-implemented method to measure a quality of a candidate result set expanded from a set of seed items, the method comprising: receiving one or more seed items that represent members of a concept set for which a user wants to automatically generate additional members of the concept set, wherein the one or more seed items includes positive seeds known to be members of the concept set and negative seeds that are known not to be members of the concept set and wherein the negative seeds are items found with items that are members of the concept set and are separately identified from the positive seeds; automatically identifying additional seeds based on the one or more seed items, wherein the additional seeds includes both positive and negative seeds, wherein the additional negative seeds are identified based on a similarity to the negative seeds; receiving one or more lists that include some items that are members of the concept set and other items that are not members of the concept set; receiving a candidate result set that expands the one or more seed items to include items suspected of being members of the concept set; determining a weight for each of the one or more lists based on the one or more seed items, wherein each weight corresponds to an initial measure of a quality of a list, and wherein determining each weight comprises calculating a ratio of a number of positive seeds present in the list plus a number of negative seeds not present in the list to a total number of seeds; determining a similarity metric of each item in the candidate result set with the one or more seed items based on which of the one or more lists contain each item and the weight for each of the one or more lists; determining the quality of the candidate result set by combining the similarity metric for each item in the candidate result set; and outputting the quality of the candidate result set, wherein the preceding steps are performed by at least one processor.
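The weighting and quality steps of the claim can be sketched as follows. Only the list weight follows the claim's wording directly (positive seeds present plus negative seeds absent, divided by the total number of seeds). The similarity formula (a weight-normalised sum over the lists that contain the item) and the combination rule (an arithmetic mean) are assumptions chosen for illustration, since the claim does not fix either one, and all function names are hypothetical.

```python
from typing import Dict, Set


def list_weight(items: Set[str], positive: Set[str], negative: Set[str]) -> float:
    """Ratio from the claim: (positives present + negatives absent) / total seeds."""
    total_seeds = len(positive) + len(negative)
    if total_seeds == 0:
        return 0.0
    return (len(positive & items) + len(negative - items)) / total_seeds


def item_similarity(
    item: str, lists: Dict[str, Set[str]], weights: Dict[str, float]
) -> float:
    """Assumed metric: share of total list weight held by lists containing the item."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    matched = sum(w for name, w in weights.items() if item in lists[name])
    return matched / total_weight


def candidate_set_quality(
    candidates: Set[str],
    lists: Dict[str, Set[str]],
    positive: Set[str],
    negative: Set[str],
) -> float:
    """Assumed combination: mean similarity over all candidate items."""
    if not candidates:
        return 0.0
    weights = {
        name: list_weight(items, positive, negative)
        for name, items in lists.items()
    }
    return sum(item_similarity(c, lists, weights) for c in candidates) / len(candidates)


# Illustrative data only.
positive_seeds = {"red", "green"}
negative_seeds = {"apple"}
web_lists = {
    "colors": {"red", "green", "blue"},
    "mixed": {"red", "apple", "banana"},
}
print(candidate_set_quality({"blue", "banana"}, web_lists, positive_seeds, negative_seeds))
```

With this data the "colors" list receives the maximum weight because it holds both positive seeds and omits the negative seed, while "mixed" is penalised twice: it misses one positive seed and contains the negative one.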
Address Redmond WA US