Title of Invention: Document Division Method and System
Abstract: Computer-readable media store instructions that perform operations including receiving a first electronic document; determining a first information gain value associated with a first line that divides the first electronic document into a first portion and a second portion; determining a second information gain value associated with a second line that divides the first electronic document into a third portion and a fourth portion; and determining which of the first information gain value and second information gain value is greater. Information gain values are determined by calculating a difference between an entropy value associated with a line and an entropy value associated with an electronic document. Entropy values associated with lines or electronic documents are determined based at least in part on document objects in the portions created by a line or an electronic document.
Publication No.: US2015193407(A1) Publication Date: 2015.07.09
Application No.: US201213370981 Filing Date: 2012.02.10
Applicant: Baluja Shumeet Inventor: Baluja Shumeet
Classification: G06F17/24;G06F17/30;G06F17/21 Main Classification: G06F17/24
Agency: Agent:
Principal Claim: 1. One or more tangible computer-readable media storing instructions that, when executed by a processor, perform operations comprising: receiving a first electronic document; determining a first information gain value associated with a first line that divides the first electronic document into a first portion and a second portion, including determining a difference between an entropy value associated with the first line and an entropy value associated with the first electronic document; determining a second information gain value associated with a second line that divides the first electronic document into a third portion and a fourth portion, including determining a difference between an entropy associated with the second line and the entropy value associated with the first electronic document, wherein each of the entropy value associated with the first line, the entropy value associated with the second line, and the entropy value associated with the first electronic document is determined based at least in part on document objects in the portions created by the first line, the second line, and the first electronic document respectively; determining which of the first information gain value and second information gain value is greater; in response to determining that the first information gain value is greater, generating a second electronic document that includes at least a portion defined by the first line and using the first information gain value to recursively divide the portions defined by the first line; in response to determining that the second information gain value is greater, generating a third electronic document that includes at least a portion defined by the second line and using the second information gain value to recursively divide the portions defined by the second line, wherein the information gain values are calculated as a difference between an information value of the first document before the first document is divided, and an information value for parts of the first document after the first document is divided, and wherein the information values are determined as a function of an amount of visual objects in a zone of the first document compared to a total visual amount of the zone of the first document.
Address: Leesburg VA US
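Illustrative sketch: the claim compares candidate dividing lines by information gain, computed as the entropy of the whole document minus the entropy of the portions a line creates. The Python below is one possible reading, not the patented method. It makes several assumptions that the record does not state: the page is modelled as a grid of 0/1 cells (1 meaning a visual object covers the cell), a zone's information value is the binary entropy of its filled fraction, the post-division value is the area-weighted average of the two portions, and only horizontal lines are considered. The names `zone_entropy`, `information_gain`, and `best_cut` are hypothetical.

```python
import math


def zone_entropy(rows):
    """Binary entropy of the filled fraction of a zone.

    Assumption: a zone is a list of equal-length rows of 0/1 cells, and the
    'amount of visual objects compared to the total' is the fraction of 1s.
    """
    total = sum(len(r) for r in rows)
    if total == 0:
        return 0.0
    p = sum(sum(r) for r in rows) / total
    if p in (0.0, 1.0):
        return 0.0  # a uniformly empty or uniformly full zone carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def information_gain(rows, cut):
    """Gain for a horizontal line placed just above row index `cut`.

    Assumption: the 'after' value is the area-weighted mean of the two
    portions' entropies, as in standard decision-tree information gain.
    """
    top, bottom = rows[:cut], rows[cut:]
    n = len(rows)
    after = (len(top) / n) * zone_entropy(top) + (len(bottom) / n) * zone_entropy(bottom)
    return zone_entropy(rows) - after


def best_cut(rows):
    """Return the candidate line with the greatest information gain."""
    return max(range(1, len(rows)), key=lambda c: information_gain(rows, c))


# Toy page: two fully covered rows above two empty rows.
page = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

cut = best_cut(page)
print(cut, information_gain(page, cut))
```

Under these assumptions, the recursive division the claim describes would correspond to calling `best_cut` again on `page[:cut]` and `page[cut:]` until no candidate line yields a meaningful gain.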