Title: Methods and systems to train models to extract and integrate information from data sources
Abstract: Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.
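The mutation and inverse-transformation idea in the abstract can be illustrated with a minimal sketch. This is not the patented method: the `Mutation` class, the `rename` and `merge` operations, and the field names are all hypothetical, and the "grammar" is reduced to a flat list of field names. The sketch only shows how recording each schema mutation together with its inverse lets data parsed in the source's layout be mapped back to the original schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Mutation:
    """One schema mutation plus the inverse that undoes it on parsed data (hypothetical)."""
    name: str
    forward: Callable[[List[str]], List[str]]  # rewrites the list of grammar fields
    inverse: Callable[[Record], Record]        # maps a parsed record back


def rename(old: str, new: str) -> Mutation:
    def forward(fields: List[str]) -> List[str]:
        return [new if f == old else f for f in fields]

    def inverse(rec: Record) -> Record:
        return {(old if k == new else k): v for k, v in rec.items()}

    return Mutation(f"rename {old}->{new}", forward, inverse)


def merge(a: str, b: str, merged: str, sep: str = " ") -> Mutation:
    def forward(fields: List[str]) -> List[str]:
        out = []
        for f in fields:
            if f == a:
                out.append(merged)   # merged field takes the place of `a`
            elif f != b:
                out.append(f)        # `b` is dropped; it now lives inside `merged`
        return out

    def inverse(rec: Record) -> Record:
        rec = dict(rec)
        left, right = rec.pop(merged).split(sep, 1)
        rec[a], rec[b] = left, right
        return rec

    return Mutation(f"merge {a}+{b}->{merged}", forward, inverse)


def mutate(fields: List[str], mutations: List[Mutation]) -> List[str]:
    """Apply mutations in order to get a grammar matching the source layout."""
    for m in mutations:
        fields = m.forward(fields)
    return fields


def restore(record: Record, mutations: List[Mutation]) -> Record:
    """Apply the inverses in reverse order to fit parsed data back to the schema."""
    for m in reversed(mutations):
        record = m.inverse(record)
    return record


# Assumed example: the domain schema has three fields, but the source page
# shows a single full name and labels the e-mail address "contact".
schema = ["first_name", "last_name", "email"]
mutations = [merge("first_name", "last_name", "full_name"), rename("email", "contact")]
page_fields = mutate(schema, mutations)
parsed = {"full_name": "Ada Lovelace", "contact": "ada@example.org"}
restored = restore(parsed, mutations)
```

Here `page_fields` describes the layout of the source, while `restored` carries the same keys as `schema`. A real system would operate on grammar productions rather than flat field lists, and would have to handle merged values that cannot be split unambiguously.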
Publication number: US8805861(B2) Publication date: 2014.08.12
Application number: US200912467235 Filing date: 2009.05.15
Applicant: Google Inc. Inventors: Boyan Justin; McDonald Glenn; Benthall Margaret; Molnar Ray
Classification: G06F7/00 Main classification: G06F7/00
Agency: Morgan, Lewis & Bockius LLP Agent: Morgan, Lewis & Bockius LLP
Principal claim: 1. A non-transitory computer readable storage medium storing at least one program configured for execution by at least one processor of a computer system, the at least one program comprising instructions to: obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar; receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising: (i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and (ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document; select a page grammar in a plurality of page grammars for the first source document in accordance with the plurality of user-provided navigational tags; extract information from a third source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
Address: Mountain View CA US