Abstract |
Described herein are systems and methods for processing documents of unknown or unspecified format. Embodiments include methods (such as computer-implemented methods), computer programs configured to perform such methods, carrier media embodying code for allowing a computer system to perform such methods, and computer systems configured to perform such methods. According to one embodiment, the method includes extracting raw encoded text from a document and applying a process that identifies markers or delimiters (for example, the beginnings and ends of sections), applies decompression where necessary, and identifies a most likely character encoding protocol. This allows the raw encoded text to be converted into meaningful text. [Figure: flow from Document Stream Input through Chunk Identification Phase, Decompression Phase, and Encoding Determination Phase to Output Phase] |
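The three phases named in the abstract (chunk identification, decompression, encoding determination) can be illustrated with a minimal Python sketch. It is not the claimed method. The marker strings `<<CHUNK>>` and `<</CHUNK>>`, the use of zlib as the compression scheme, the candidate encoding list, and the printable-character scoring heuristic are all assumptions chosen for illustration, as are the function names.

```python
import zlib

# Hypothetical section markers; a real document format defines its own.
START_MARKER = b"<<CHUNK>>"
END_MARKER = b"<</CHUNK>>"

# Assumed candidate encodings, tried in order; earlier entries win ties.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def split_chunks(raw: bytes) -> list[bytes]:
    """Chunk identification: return the bytes between each marker pair."""
    chunks = []
    pos = 0
    while True:
        start = raw.find(START_MARKER, pos)
        if start == -1:
            break
        start += len(START_MARKER)
        end = raw.find(END_MARKER, start)
        if end == -1:
            break  # unterminated chunk: stop rather than guess its extent
        chunks.append(raw[start:end])
        pos = end + len(END_MARKER)
    return chunks


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompression: inflate zlib data, or pass other data through."""
    try:
        return zlib.decompress(chunk)
    except zlib.error:
        return chunk  # assumed to be stored uncompressed


def readability(text: str) -> float:
    """Fraction of characters that are printable or common whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isprintable() or ch in "\n\r\t" for ch in text)
    return good / len(text)


def decode_best(data: bytes) -> tuple[str, str]:
    """Encoding determination: keep the best-scoring strict decode."""
    best_name, best_text, best_score = "", "", -1.0
    for name in CANDIDATE_ENCODINGS:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the byte sequence
        score = readability(text)
        if score > best_score:
            best_name, best_text, best_score = name, text, score
    return best_name, best_text


def extract_text(raw: bytes) -> list[tuple[str, str]]:
    """Run all three phases; return (encoding, text) for each chunk."""
    return [decode_best(maybe_decompress(c)) for c in split_chunks(raw)]
```

Calling `extract_text` on a byte stream returns one `(encoding, text)` pair per chunk found. The scoring heuristic is deliberately crude; a statistical model of character frequencies would be a more robust way to rank candidate encodings, at the cost of extra complexity.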