Title of Invention METHOD AND APPARATUS FOR FORMING A STRUCTURED DOCUMENT FROM UNSTRUCTURED INFORMATION
Abstract Illustrative embodiments improve upon prior machine learning techniques by introducing an additional classification layer that mimics human visual pattern recognition. Building upon classification passes that extract contextual information, illustrative embodiments look for hints of high-level semantic categorization that manifest as visual artifacts in the document, such as font family, font weight, text color, text justification, white space, or CSS class name. An improved lightweight markup language enables display of machine-categorized tokens on a screen for human correction, thereby providing ground truths for further machine classification.
Publication Number US2016117295(A1) Publication Date 2016.04.28
Application Number US201514980998 Filing Date 2015.12.28
Applicant Locu, Inc. Inventors Olszewski Marek;Sidiroglou Stylianos;Ansel Jason;Piette Marc;Reinsberg Rene
Classification G06F17/22;G06F17/21 Main Classification G06F17/22
Agency Agent
Principal Claim 1. A method, comprising: receiving, by a computer, an unstructured input document; extracting, by the computer, a plurality of tokens from the input document, each token of the plurality of tokens having a corresponding visual style of a plurality of visual styles; producing, by the computer for a first token of the plurality of tokens, a first probability distribution of the first token, the first probability distribution comprising a plurality of first probabilities each indicating a probability that the first token belongs to a corresponding class of a plurality of classes that are each: related to information conveyed by the plurality of tokens; and specific to a type of unstructured data items of the input document; determining, by the computer from the plurality of tokens, a plurality of surrounding tokens that occur near the first token within the input document; determining, by the computer, a first classification probability of the plurality of surrounding tokens, the first classification probability identifying the class in which the plurality of surrounding tokens are most likely to be classified; modifying, by the computer based on the class identified by the first classification probability, each of the plurality of first probabilities to produce a corresponding second probability of a plurality of second probabilities in a second probability distribution; producing, by the computer based on the visual style of the first token and the second probability distribution, a third probability distribution comprising a plurality of third probabilities each associated with a corresponding second probability of the plurality of second probabilities; determining, by the computer based at least on the third probability distribution, a classification of the first token into one of the plurality of classes; and forming, by the computer, a structured document from the first token and the classification.
Address Cambridge MA US
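
The three probability distributions recited in claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the record: the class names (a restaurant-menu example), the two weight tables, the window size, the visual style keys, and the choice of multiplicative reweighting followed by renormalization. The claim only states that the first probabilities are modified based on the surrounding tokens' most likely class, and then combined with the token's visual style.

```python
from dataclasses import dataclass

# Hypothetical classes for a menu-like document (an assumption, not from the record).
CLASSES = ("section", "item", "price")

# Hypothetical weights: how plausible each class is for a token, given the
# most likely class of its surrounding tokens.
CONTEXT_WEIGHTS = {
    "section": {"section": 0.5, "item": 1.5, "price": 1.0},
    "item": {"section": 0.8, "item": 1.0, "price": 1.5},
    "price": {"section": 1.0, "item": 1.5, "price": 0.5},
}

# Hypothetical weights: how strongly each visual style hints at each class.
STYLE_WEIGHTS = {
    "bold": {"section": 2.0, "item": 1.0, "price": 0.5},
    "plain": {"section": 0.5, "item": 1.0, "price": 1.0},
}


@dataclass
class Token:
    text: str
    style: str   # visual style key, e.g. "bold" or "plain"
    first: dict  # first probability distribution over CLASSES


def normalize(dist):
    """Rescale the values so they sum to 1."""
    total = sum(dist.values())
    return {c: p / total for c, p in dist.items()}


def context_class(neighbors):
    """Class in which the surrounding tokens are most likely classified."""
    totals = {c: sum(t.first[c] for t in neighbors) for c in CLASSES}
    return max(totals, key=totals.get)


def second_distribution(token, neighbors):
    """Modify the first probabilities using the neighbors' most likely class."""
    if not neighbors:  # no context available: leave the distribution unchanged
        return dict(token.first)
    weights = CONTEXT_WEIGHTS[context_class(neighbors)]
    return normalize({c: token.first[c] * weights[c] for c in CLASSES})


def third_distribution(token, second):
    """Combine the second distribution with the token's visual style."""
    weights = STYLE_WEIGHTS[token.style]
    return normalize({c: second[c] * weights[c] for c in CLASSES})


def form_structured_document(tokens, window=1):
    """Classify every token and return a list of {text, class} records."""
    records = []
    for i, tok in enumerate(tokens):
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        third = third_distribution(tok, second_distribution(tok, neighbors))
        records.append({"text": tok.text, "class": max(third, key=third.get)})
    return records


if __name__ == "__main__":
    menu = [
        Token("Appetizers", "bold", {"section": 0.4, "item": 0.5, "price": 0.1}),
        Token("Bruschetta", "plain", {"section": 0.3, "item": 0.6, "price": 0.1}),
        Token("$8", "plain", {"section": 0.05, "item": 0.15, "price": 0.8}),
    ]
    for record in form_structured_document(menu):
        print(record)
```

In this toy input the first distribution alone would label "Appetizers" as an item; the bold style weight is what moves it to the section class, which is the kind of visual hint the abstract describes. A real system would presumably learn such weights from labeled data rather than hard-code them.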