Title of Invention |
Targeted optical character recognition (OCR) for medical terminology |
Abstract |
Embodiments of the present invention provide concepts for correcting optical character recognition (OCR) errors from an OCR scan result by sequentially applying an anagram hash (AH) and Levenshtein-Distance (LD) measurement for concurrent character identity-based (machine code) and character shape-based (OCR-Key) corrections. The OCR-Key classifies characters by shape into one or more disjoint and overlapping classes. Similar shape-based classes appearing in consecutive characters are appended to a cardinality term, a repetition count of the class. The LD measurement groups OCR-Keys and differentiates on both class and cardinality to arrive at a shape-based mismatch error between competing candidate words from an associated dictionary and a target word from the OCR scan. The shape-based LD measurement errors are then functionally merged with the character identity-based deletion, substitution, and insertion errors to find a minimum error for the set of candidate words, corresponding to the preferred candidate word match to the target word. |
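The two encodings named in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the `SHAPE_CLASS` table is hypothetical (this record does not list the patent's actual shape classes, and only disjoint classes are shown), the cardinality is taken to be a plain run length, and `anagram_hash` assumes a sum of fifth powers of code points, one form seen in the spelling-correction literature, since the record gives no formula.

```python
from itertools import groupby

# Hypothetical shape classes for illustration only; not the patent's table.
SHAPE_CLASS = {
    "i": "I", "l": "I", "1": "I",
    "o": "O", "0": "O", "c": "O", "e": "O",
    "n": "N", "m": "N", "h": "N",
}

def ocr_key(word):
    """Map each character to a shape class, then collapse consecutive
    characters of the same class into (class, cardinality) pairs."""
    classes = [SHAPE_CLASS.get(ch, ch) for ch in word.lower()]
    return [(cls, len(list(run))) for cls, run in groupby(classes)]

def anagram_hash(word):
    """Order-independent hash: words with the same multiset of characters
    collide. Assumed form: sum of code points raised to the fifth power."""
    return sum(ord(ch) ** 5 for ch in word)
```

Under these assumptions, the misread `c1inic` and the dictionary word `clinic` produce the same shape key, because `1` and `l` fall in the same hypothetical class, while the anagram hash groups words such as `listen` and `silent` regardless of character order.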
Publication Number |
US9633271(B2) |
Publication Date |
2017.04.25 |
Application Number |
US201615140849 |
Filing Date |
2016.04.28 |
Applicant |
OPTUM, INC. |
Inventor |
Stella Casey |
Classification Numbers |
G06K9/18;G06K9/03;G06K9/00;G06K9/72;G06K9/62;G06T7/00;G06K9/20;G06F3/0488 |
Main Classification Number |
G06K9/18 |
Agency |
Alston & Bird LLP |
Agent |
Alston & Bird LLP |
Main Claim |
1. A method for correcting optical character recognition (OCR) errors from an OCR scan result, comprising:
registering an OCR machine code from the OCR scan result; mapping the registered OCR machine code to one or more of an OCR-Key according to one or more character shape functions, wherein (a) the mapping comprises at least mapping to a set of character shape classes comprising disjoint classes, overlapping classes, and combinations thereof, and (b) the character shape classes are classified based at least on character cardinality and character orientation; selecting, from a dictionary, a set of candidate words as possible matches to the OCR-Key; calculating a set of errors between the set of candidate words and the OCR machine code and the OCR-Key; and selecting a preferred candidate word from the set of candidate words with the smallest set of errors. |
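The error calculation and selection steps of the claim might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the shape table `SHAPE`, the substitution weights in `key_cost`, and the plain sum used to merge the character-level and shape-level errors are all hypothetical, since the record does not specify them. The candidate set is passed in directly rather than retrieved from a dictionary.

```python
from itertools import groupby

# Hypothetical shape classes (illustrative only; not the patent's table).
SHAPE = {"i": "I", "l": "I", "1": "I", "o": "O", "0": "O", "c": "O", "e": "O"}

def ocr_key(word):
    """Run-length encode shape classes into (class, cardinality) pairs."""
    classes = [SHAPE.get(ch, ch) for ch in word.lower()]
    return [(cls, len(list(run))) for cls, run in groupby(classes)]

def levenshtein(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Edit distance over two sequences with a pluggable substitution cost."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + sub_cost(x, y)  # substitution or match
            ))
        prev = cur
    return prev[-1]

def key_cost(x, y):
    """Assumed weights: 1 for a class mismatch, 0.5 when only the
    cardinality differs, 0 for identical key elements."""
    if x[0] != y[0]:
        return 1
    return 0 if x[1] == y[1] else 0.5

def preferred_candidate(target, candidates):
    """Return the candidate with the smallest merged error."""
    def total_error(cand):
        char_err = levenshtein(target, cand)  # identity-based error
        shape_err = levenshtein(ocr_key(target), ocr_key(cand), key_cost)
        return char_err + shape_err           # assumed merge: plain sum
    return min(candidates, key=total_error)
```

For example, `preferred_candidate("c1inic", ["clinic", "cynic", "clinical"])` selects `clinic`: the character-level distance is 1 (the digit `1` replaced by `l`), and the shape-level distance is 0 because both characters share a hypothetical class.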
Address |
Minnetonka MN US |