Title of invention |
Targeted optical character recognition (OCR) for medical terminology |
Abstract |
Embodiments of the present invention provide concepts for correcting optical character recognition (OCR) errors from an OCR scan result by sequentially applying an anagram hash (AH) and a Levenshtein-Distance (LD) measurement for concurrent character identity-based (machine code) and character shape-based (OCR-Key) corrections. The OCR-Key classifies characters by shape into one or more disjoint and overlapping classes. Similar shape-based classes appearing in consecutive characters are appended to a cardinality term, a repetition count of the class. The LD measurement groups OCR-Keys and differentiates on both class and cardinality to arrive at a shape-based mismatch error between competing candidate words from an associated dictionary and a target word from the OCR scan. The shape-based LD measurement errors are then functionally merged with the character identity-based deletion, substitution, and insertion errors to find a minimum error for the set of candidate words, corresponding to the preferred candidate word match to the target word. |
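The OCR-Key construction described in the abstract can be illustrated with a minimal Python sketch. The shape table below is hypothetical and chosen only for illustration (the record does not list the actual classes); the name `ocr_key` and the fallback for unlisted characters are likewise assumptions. Each character maps to a shape class with a cardinality, and consecutive characters in the same class are merged by summing their cardinalities.

```python
from itertools import groupby

# Hypothetical shape table (not taken from the patent):
# character -> (shape class, cardinality).
# Class "i" stands for vertical strokes, class "o" for round shapes.
SHAPE_TABLE = {
    "i": ("i", 1), "l": ("i", 1), "1": ("i", 1), "r": ("i", 1),
    "n": ("i", 2), "u": ("i", 2), "m": ("i", 3),
    "o": ("o", 1), "c": ("o", 1), "e": ("o", 1), "0": ("o", 1),
}


def ocr_key(word):
    """Return a list of (shape class, cardinality) pairs for a word.

    Characters missing from the table fall back to a class of their own
    with cardinality 1 (an assumption made for this sketch).
    """
    pairs = [SHAPE_TABLE.get(ch, (ch, 1)) for ch in word.lower()]
    key = []
    # Merge runs of the same class, summing the cardinalities.
    for shape_class, run in groupby(pairs, key=lambda p: p[0]):
        key.append((shape_class, sum(card for _, card in run)))
    return key
```

Under this table, the common OCR confusion of "rn" with "m" yields the same key, so "corn" and "com" are indistinguishable by shape and can only be separated by the character identity-based comparison.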
Publication number |
US9361531(B2) |
Publication date |
2016.06.07 |
Application number |
US201414336416 |
Filing date |
2014.07.21 |
Applicant |
Optum, Inc. |
Inventor |
Stella Casey |
Classification |
G06K9/03;G06K9/18;G06K9/00;G06K9/72;G06K9/62;G06K9/20;G06F3/0488 |
Main classification |
G06K9/03 |
Agency |
Alston & Bird LLP |
Agent |
Alston & Bird LLP |
Principal claim |
1. A method for correcting optical character recognition (OCR) errors from an OCR scan result, comprising:
registering an OCR machine code from the OCR scan result;
mapping the registered OCR machine code to one or more of an OCR-Key according to one or more character shape functions by:
mapping from one or more segments of an individual character represented by the OCR machine code to one or more character shape classes;
mapping from a combination of one or more characters represented by the OCR machine code to the one or more character shape classes;
mapping according to a character shape function comprising a repetition encoding of a characteristic shape, wherein the repetition encoding is designated as a cardinality of the one or more character shape classes;
mapping according to a character shape function, wherein a set of consecutive characters in a common character shape class are grouped into a single class with a cardinality equal to the sum of the cardinalities of each of the consecutive characters; and
mapping to the set of character shape classes comprising disjoint classes, overlapping classes, and combinations thereof;
selecting, from a dictionary, a first set of candidate words as possible matches to the registered OCR machine code;
selecting, from the dictionary, a second set of candidate words as possible matches to the OCR-Key;
calculating a set of errors between the second set of candidate words and the OCR machine code and the OCR-Key; and
selecting a preferred candidate word from the second set of candidate words with the smallest set of errors. |
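The error calculation and candidate selection steps of the claim can be sketched as follows. This is an illustrative reading, not the claimed implementation: the shape table, the 0.5 cost for a cardinality mismatch within one class, and the plain sum used to merge the two errors are all assumptions, as are the names `key_sub_cost`, `combined_error`, and `best_candidate`. Dictionary lookup (for example by anagram hash) is left out, so the candidate list is passed in directly.

```python
from itertools import groupby

# Hypothetical shape table (illustrative only): char -> (class, cardinality).
SHAPES = {"i": ("i", 1), "l": ("i", 1), "r": ("i", 1), "n": ("i", 2),
          "m": ("i", 3), "o": ("o", 1), "c": ("o", 1), "e": ("o", 1)}


def ocr_key(word):
    """Map a word to (class, cardinality) pairs, merging consecutive classes."""
    pairs = [SHAPES.get(ch, (ch, 1)) for ch in word.lower()]
    return [(cls, sum(n for _, n in run))
            for cls, run in groupby(pairs, key=lambda p: p[0])]


def levenshtein(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Edit distance over two sequences; insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + sub_cost(x, y)))  # substitution
        prev = cur
    return prev[-1]


def key_sub_cost(x, y):
    """Assumed cost model: same class with different cardinality is a half error."""
    if x == y:
        return 0.0
    if x[0] == y[0]:
        return 0.5
    return 1.0


def combined_error(target, candidate):
    """Sum of the identity-based and shape-based distances (assumed merge rule)."""
    char_error = levenshtein(target, candidate)
    shape_error = levenshtein(ocr_key(target), ocr_key(candidate), key_sub_cost)
    return char_error + shape_error


def best_candidate(target, candidates):
    """Pick the candidate word with the smallest combined error."""
    return min(candidates, key=lambda c: combined_error(target, c))
```

For example, with the OCR output "stornach" and the candidates "stomach", "starch", and "stork", `best_candidate` returns "stomach": its OCR-Key is identical to that of the target under this table, so only the two character-level edits contribute to its error.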
Address |
Minnetonka MN US |