Abstract |
Systems and methods are described that facilitate determining the original document format of a scanned document by analyzing its bitmap. Text objects are extracted from the document, binarized, and segmented to identify text. Page orientation and text size are used to distinguish a slideshow-type document from a word processing or spreadsheet-type document. To further distinguish between the word processing and spreadsheet types, the text column structure and column count are analyzed.
|
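The two-stage decision described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the `PageFeatures` fields, the function name, and both thresholds (`large_text_pt`, `max_prose_columns`) are hypothetical, and the mapping of landscape orientation plus large text to the slideshow type is an assumption the abstract does not spell out. Feature extraction (binarization and text segmentation) is assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical features measured from a segmented page bitmap."""
    width_px: int
    height_px: int
    median_text_height_pt: float
    column_count: int
    columns_aligned: bool  # True if text falls into a regular grid of columns


def classify_original_format(
    page: PageFeatures,
    large_text_pt: float = 18.0,   # assumed threshold, not from the source
    max_prose_columns: int = 3,    # assumed threshold, not from the source
) -> str:
    """Guess the original document type from page-level text features."""
    # Stage 1: orientation and text size separate slideshows from the rest.
    # Assumption: slides are landscape pages with large text.
    is_landscape = page.width_px > page.height_px
    if is_landscape and page.median_text_height_pt >= large_text_pt:
        return "slideshow"

    # Stage 2: column structure and count separate spreadsheets from
    # word processing documents.
    # Assumption: many grid-aligned columns indicate a spreadsheet.
    if page.columns_aligned and page.column_count > max_prose_columns:
        return "spreadsheet"
    return "word_processing"
```

For example, under these assumed thresholds a landscape page with 24 pt text is labeled a slideshow, while a portrait page with eight aligned columns of 10 pt text is labeled a spreadsheet. In practice the thresholds would need to be tuned against scanned samples of each document type.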