Title of invention: Product synthesis from multiple sources
Abstract: Methods and systems for automatically synthesizing product information from multiple data sources into an on-line catalog are disclosed, and in particular, for automatically synthesizing the product information based on attribute-value pairs. Information for a product may be obtained, via entity extraction, feed ingestion, and other mechanisms, from a plurality of structured and unstructured data sources having different taxonomies and schemas. Product information may additionally or alternatively be obtained or derived based on popularity data. The product information may be cleansed, segmented and normalized. The product information may be clustered so closest products, attribute names and attribute values are associated. A representative value for an attribute name may be determined, and the on-line catalog may be updated so that entries are comprehensive, meaningful and useful to a catalog user. Updates from at least 500 million different data sources may be scheduled to occur as frequently as several times daily.
Publication number: US9384233(B2) Publication date: 2016.07.05
Application number: US201213693040 Filing date: 2012.12.04
Applicant: Microsoft Technology Licensing, LLC Inventors: Fuxman Ariel;Nguyen Hoa;Freire de Lima e Silva Juliana;Paparizos Stelios;Agrawal Rakesh;Chen Zhimin;Colagiovanni Lawrence William;Sikchi Prakash
Classification: G06F17/30;G06Q30/02;G06Q30/06 Main classification: G06F17/30
Agency: Agent: Ream Dave;Wong Tom;Minhas Micky
Main claim: 1. A computer implemented method of automatically synthesizing product information from multiple data sources into an on-line catalog, comprising, as implemented on a computer: obtaining historical information corresponding to an existing product represented in the on-line catalog from a plurality of historical data sources, the historical information comprising a plurality of historical attribute-value pairs, and each historical attribute-value pair comprising a historical attribute name and a corresponding historical attribute value; determining a correspondence between a first historical attribute name included in a first historical product schema of a first historical data source and a first catalog attribute name of the existing product included in a catalog schema of the on-line catalog, the catalog schema comprising a plurality of catalog attribute-value pairs, and each catalog attribute-value pair comprising a different catalog attribute name and a corresponding catalog attribute value, wherein the first historical attribute name and the first catalog attribute name are not the same name; wherein determining the correspondence between the first historical attribute name included in the first historical product schema of the first historical data source and the first catalog attribute name of the existing product included in the catalog schema of the on-line catalog comprises: obtaining a range of words for each of the first historical attribute name and the first catalog attribute name; determining a value distance between the range of words for each of the first historical attribute name and the first catalog attribute name; and identifying pairs of words between words of the range of words for each of the first historical attribute name and the first catalog attribute name whose value distance falls below a given threshold; wherein the identified pairs comprise the correspondence between the first historical attribute name included in the first historical product schema of the first historical data source and the first catalog attribute name of the existing product included in the catalog schema of the on-line catalog; determining an association between the first catalog attribute name and at least part of a first historical attribute value corresponding to the first historical attribute name of the first historical data source; and storing the existing association between the first catalog attribute name and the at least part of the first historical attribute value corresponding to the first historical attribute name of the first historical data source in the catalog schema; wherein obtaining the incoming and the historical information from the plurality of incoming and historical data sources comprises obtaining unstructured data and structured data in a plurality of different schemas from the plurality of incoming and historical data sources.
Address: Redmond WA US
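
Illustrative note on claim 1: the correspondence step (obtain a range of words for each attribute name, compute a value distance, keep the word pairs whose distance falls below a threshold) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The claim does not define the distance measure; here it is assumed to be one minus the similarity ratio from Python's standard difflib module. The function names, the threshold of 0.3, and the sample word ranges are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def word_distance(a: str, b: str) -> float:
    """Assumed distance measure: 1 - difflib similarity ratio, in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_word_pairs(source_words, catalog_words, threshold=0.3):
    """Return (source_word, catalog_word, distance) triples whose
    distance falls below the threshold (hypothetical default)."""
    pairs = []
    for s, c in product(source_words, catalog_words):
        d = word_distance(s, c)
        if d < threshold:
            pairs.append((s, c, d))
    return pairs


# Hypothetical word ranges observed for two differently named attributes,
# e.g. "Storage" in a source schema and "Capacity" in the catalog schema.
source_range = ["16GB", "32 GB", "64gb"]
catalog_range = ["16 GB", "32 GB", "128 GB"]

pairs = matching_word_pairs(source_range, catalog_range)
for s, c, d in pairs:
    print(f"{s!r} ~ {c!r} (distance {d:.2f})")
```

Under this reading, a non-empty list of pairs is taken as evidence that the two attribute names correspond even though the names themselves differ. A real system would likely use a distance tuned to the value type (numeric, unit-bearing, free text) rather than a generic string ratio.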