Title of invention: Targeted optical character recognition (OCR) for medical terminology
Abstract: Embodiments of the present invention provide concepts for correcting optical character recognition (OCR) errors from an OCR scan result by sequentially applying an anagram hash (AH) and Levenshtein-Distance (LD) measurement for concurrent character identity-based (machine code) and character shape-based (OCR-Key) corrections. The OCR-Key classifies characters by shape into one or more disjoint and overlapping classes. Similar shape-based classes appearing in consecutive characters are appended to a cardinality term, a repetition count of the class. The LD measurement groups OCR-Keys and differentiates on both class and cardinality to arrive at a shape-based mismatch error between competing candidate words from an associated dictionary and a target word from the OCR scan. The shape-based LD measurement errors are then functionally merged with the character identity-based deletion, substitution, and insertion errors to find a minimum error for the set of candidate words, corresponding to the preferred candidate word match to the target word.
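As a rough illustration of the character identity-based half of the approach summarized in the abstract, the following Python sketch retrieves dictionary candidates with an anagram hash and ranks them by Levenshtein distance. The hash formula (sum of code points raised to the fifth power), the single-substitution lookup, the function names, and the dictionary entries are all assumptions made for this example; the record does not state how the patent defines them.

```python
from collections import defaultdict
from string import ascii_lowercase

POWER = 5  # assumed exponent; not taken from the patent text


def anagram_hash(word: str) -> int:
    """Order-independent hash: anagrams of a word share the same value."""
    return sum(ord(ch) ** POWER for ch in word)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance counting deletions, insertions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_index(dictionary):
    """Group dictionary words by their anagram hash."""
    index = defaultdict(set)
    for word in dictionary:
        index[anagram_hash(word)].add(word)
    return index


def candidates(target, index, alphabet=ascii_lowercase):
    """Words whose hash equals the target's, or differs by one substitution."""
    base = anagram_hash(target)
    found = set(index.get(base, ()))
    for old in set(target):
        for new in alphabet:
            shifted = base - ord(old) ** POWER + ord(new) ** POWER
            found |= index.get(shifted, set())
    return found


def best_match(target, index):
    """Candidate with the smallest edit distance (ties broken alphabetically)."""
    pool = candidates(target, index)
    if not pool:
        return None
    return min(pool, key=lambda w: (levenshtein(target, w), w))


DICTIONARY = ["diabetes", "dialysis", "diuretic"]  # hypothetical entries
INDEX = build_index(DICTIONARY)
```

With this hypothetical dictionary, `best_match("diabctes", INDEX)` selects "diabetes", which is one substitution away from the misread word. The hash lookup narrows the dictionary to a small pool before the more expensive distance computation runs.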
Publication number: US9361531(B2) Publication date: 2016.06.07
Application number: US201414336416 Filing date: 2014.07.21
Applicant: Optum, Inc. Inventor: Stella Casey
Classification: G06K9/03;G06K9/18;G06K9/00;G06K9/72;G06K9/62;G06K9/20;G06F3/0488 Main classification: G06K9/03
Agency: Alston & Bird LLP Agent: Alston & Bird LLP
Main claim: 1. A method for correcting optical character recognition (OCR) errors from an OCR scan result, comprising: registering an OCR machine code from the OCR scan result; mapping the registered OCR machine code to one or more of an OCR-Key according to one or more character shape functions by: mapping from one or more segments of an individual character represented by the OCR machine code to one or more character shape classes; mapping from a combination of one or more characters represented by the OCR machine code to the one or more character shape classes; mapping according to a character shape function comprising a repetition encoding of a characteristic shape, wherein the repetition encoding is designated as a cardinality of the one or more character shape classes; mapping according to a character shape function, wherein a set of consecutive characters in a common character shape class are grouped into a single class with a cardinality equal to the sum of the cardinalities of each of the consecutive characters; and mapping to the set of character shape classes comprising disjoint classes, overlapping classes, and combinations thereof; selecting, from a dictionary, a first set of candidate words as possible matches to the registered OCR machine code; selecting, from the dictionary, a second set of candidate words as possible matches to the OCR-Key; calculating a set of errors between the second set of candidate words and the OCR machine code and the OCR-Key; and selecting a preferred candidate word from the second set of candidate words with the smallest set of errors.
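The shape-mapping steps of the claim can be sketched in Python as follows. This is only a minimal illustration: the shape table, the class names "I" (vertical stroke) and "O" (round shape), the cardinalities, and the cost weights in `key_distance` are hypothetical and are not taken from the patent. The sketch maps each character to one or more (class, cardinality) segments, merges consecutive segments of the same class by summing their cardinalities, and compares two keys with a weighted edit distance that looks at both class and cardinality.

```python
# Hypothetical shape table: character -> list of (shape class, cardinality)
# segments. A character such as "d" contributes two segments.
SHAPE_TABLE = {
    "i": [("I", 1)], "l": [("I", 1)], "r": [("I", 1)], "t": [("I", 1)],
    "1": [("I", 1)],
    "n": [("I", 2)], "u": [("I", 2)], "h": [("I", 2)],
    "m": [("I", 3)],
    "o": [("O", 1)], "c": [("O", 1)], "e": [("O", 1)], "a": [("O", 1)],
    "0": [("O", 1)],
    "d": [("O", 1), ("I", 1)],
    "b": [("I", 1), ("O", 1)],
}


def ocr_key(word):
    """Map a word to a list of (class, cardinality) runs.

    Consecutive segments in the same class are merged into one run whose
    cardinality is the sum of the individual cardinalities.
    """
    key = []
    for ch in word.lower():
        # Characters missing from the table fall back to a class of their own.
        for cls, card in SHAPE_TABLE.get(ch, [(ch, 1)]):
            if key and key[-1][0] == cls:
                key[-1] = (cls, key[-1][1] + card)
            else:
                key.append((cls, card))
    return key


def key_distance(a, b):
    """Weighted edit distance over runs (cost weights are an assumption).

    Inserting or deleting a run costs its cardinality. Replacing a run costs
    the cardinality difference when the classes agree, and the larger of the
    two cardinalities when they differ.
    """
    prev = [0]
    for _, card_b in b:
        prev.append(prev[-1] + card_b)
    for cls_a, card_a in a:
        curr = [prev[0] + card_a]
        for j, (cls_b, card_b) in enumerate(b, start=1):
            if cls_a == cls_b:
                sub = abs(card_a - card_b)
            else:
                sub = max(card_a, card_b)
            curr.append(min(prev[j] + card_a,      # delete run from a
                            curr[j - 1] + card_b,  # insert run from b
                            prev[j - 1] + sub))    # replace run
        prev = curr
    return prev[-1]
```

Under this hypothetical table, "rn" and "m" both reduce to a single "I" run of cardinality 3, so `ocr_key("rnodern")` equals `ocr_key("modern")` and their key distance is 0. The key for "modem" is also identical, which suggests why a shape-based error alone may not separate candidates and is combined with character identity-based errors when choosing the preferred word.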
Address: Minnetonka MN US