Abstract |
Embodiments of the present invention relate to classifying pages of an electronic document, such as a scanned book page. An algorithm, such as a constrained conditional random fields algorithm, is applied to the contents of the electronic document to determine its page type. Page types may include table of contents (TOC), index, table of figures (TOF), bibliography, epilogue, prologue, foreword, glossary, or other types of pages typically found in a book, magazine, or other publication. Once the page type is determined, the contents of the page are extracted and labeled using the same algorithm.
|