Title: Obtaining data from electronic documents
Abstract: Techniques for obtaining information from an electronic document include accessing a set of related electronic documents; identifying a product page associated with the set of related electronic documents using a page recognition model, the product page comprising a plurality of terms; filtering the plurality of terms into a first set of terms and a second set of terms, the first set of terms and the second set of terms including different terms of the plurality of terms, each term in the first set of terms identified as potentially being associated with a product name, and each term in the second set of terms identified as not being associated with a product name; and identifying each term in the first set of terms as being associated with a product name or not being associated with a product name with a name recognition model.
Publication number: US9348811(B2) Publication date: 2016.05.24
Application number: US201213452791 Filing date: 2012.04.20
Applicant: SAP SE Inventors: Hartl Florian;Miao Yingjie
Classification: G06F17/27;G06F17/21 Main classification: G06F17/27
Agency: Fish & Richardson P.C. Agent: Fish & Richardson P.C.
Principal claim: 1. A method performed with a computing system for obtaining information from a set of related electronic documents, the method comprising: accessing the set of related electronic documents that are each hosted on one or more respective web servers that are accessible through a network, the accessing including retrieving data associated with the set of related electronic documents through the network; analyzing markup language of an electronic document of the set of related electronic documents to identify markup language tags of the electronic document; analyzing, using a page recognition model, the markup language tags to identify the electronic document as a product page, the page recognition model generated based on a first machine learning algorithm, and the product page comprising a plurality of terms; filtering the plurality of terms into a first set of terms and a second set of terms, the first set of terms and the second set of terms including different terms of the plurality of terms, each term in the first set of terms identified as potentially being associated with a product name, and each term in the second set of terms identified as not being associated with a product name; for each term of the first set of terms, identifying a noun phrase that includes the term and determining one or more features of each of the noun phrase and the term; for each feature of the one or more features: determining, for each term of the first set of terms, a first feature value of the noun phrase and a second feature value of the term, and determining, for each term of the first set of terms, an overall feature value for the term based on the first feature value and the second feature value; identifying each term in the first set of terms as being associated with a product name or not being associated with a product name with a name recognition model, the name recognition model generated based on the overall feature value for each feature of the term; and providing for display on a graphical user interface, one or more of the first set of terms that are identified as being associated with a product name.
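The steps recited in the claim can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the names `tag_features`, `filter_terms`, `capitalized_fraction`, and `overall_feature_value` are hypothetical, the stop-word list stands in for whatever filter the real system uses, and the weighted average is only one possible way to base an overall feature value on the noun-phrase value and the term value. The trained page recognition and name recognition models themselves are not shown.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening markup tags; the counts could feed a page classifier."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def tag_features(html):
    """Return a {tag name: count} mapping for one document (hypothetical helper)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.tags)


# Assumed stop-word list; the patent does not specify the filter criteria.
STOPWORDS = {"the", "a", "an", "and", "of", "for", "buy", "price"}


def filter_terms(terms):
    """Split terms into (possible product-name terms, rejected terms)."""
    candidates, rejected = [], []
    for term in terms:
        has_letter = any(ch.isalpha() for ch in term)
        if term.lower() in STOPWORDS or not has_letter:
            rejected.append(term)
        else:
            candidates.append(term)
    return candidates, rejected


def capitalized_fraction(text):
    """Example feature: share of whitespace-separated tokens that start uppercase."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok[0].isupper()) / len(tokens)


def overall_feature_value(phrase_value, term_value, phrase_weight=0.5):
    """Combine noun-phrase and term values; the weighted average is an assumption."""
    return phrase_weight * phrase_value + (1.0 - phrase_weight) * term_value
```

In use, a term such as "Acme" and a noun phrase containing it, such as "the Acme Phone", would each be scored with the same feature, and the combined value would be one input to a name recognition model trained separately.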
Address: Walldorf DE