Title of invention: Parsing author name groups in non-standardized format
Abstract: The present invention is directed to a method and corresponding system for parsing author name text strings in documents. The method and system may electronically scan a document that contains an author name text string comprising a set of initials, one or more author surnames, and punctuation. The author name text string may be in non-standardized format. The method and system may identify a character sequence in the document as potentially being the author name text string based on (i) a sequence of title-case words, capital letters, and punctuation, and (ii) the character sequence ending with a recognized indicator. The method and system may parse the identified character sequence by converting any punctuation and whitespace between terms in the character sequence to a single space character, identifying a pattern of surname and set of initials comprising each author name contained in the character sequence, and marking up the components of surname and set of initials comprising each author name. The method and system may use the marked-up character sequence to identify and correct errors in punctuation and capitalization in the character sequence, and output an updated character sequence in standardized format.
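The identification step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The record does not define what a "recognized indicator" is, so the sketch assumes it is a publication year such as "(2015)"; the names INDICATOR, TOKEN, CANDIDATE, and find_candidate are hypothetical.

```python
import re

# Assumption: the "recognized indicator" is a four-digit year, optionally
# in parentheses. The patent record itself does not say what it is.
INDICATOR = r"\(?\b(?:19|20)\d{2}\)?"

# A term is a title-case word (optionally hyphenated) or one capital letter.
TOKEN = r"(?:[A-Z][a-z]+(?:-[A-Z][a-z]+)*|[A-Z])"

# Terms separated by punctuation/whitespace, ending right before the indicator.
CANDIDATE = re.compile(
    rf"^\s*({TOKEN}(?:[\s.,;&]+{TOKEN})*[\s.,;]*)(?={INDICATOR})"
)


def find_candidate(line):
    """Return the leading character sequence that may be an author name
    string, or None if the line does not start with such a sequence."""
    match = CANDIDATE.match(line)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(find_candidate("Smith J.A., Jones, B. (2015) Parsing names."))
    print(find_candidate("Parsing names (2015)"))
```

A match here only marks the sequence as a candidate; whether it really is an author name string is decided by the later parsing step.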
Publication number: US9430451(B1) Publication date: 2016.08.30
Application number: US201514676664 Filing date: 2015.04.01
Applicant: Inera, Inc. Inventors: Kleshchevich Igor; Rosenblum Bruce D.; Golfman Irina
Classification: G06F17/20; G06F17/21; G06F17/22; G06F17/27 Main classification: G06F17/20
Agency: Hamilton, Brook, Smith & Reynolds, P.C. Agent: Hamilton, Brook, Smith & Reynolds, P.C.
Main claim: 1. A computer-implemented method for parsing author names in a document, the method comprising: electronically scanning a document that contains an author name text string, the author name text string comprising a set of initials, one or more surnames, and punctuation, and the author name text string comprising at least one author name in non-standardized format; identifying a character sequence in the document as potentially being the author name text string, wherein the identifying is based on: (i) a sequence of title-case words, capital letters, and punctuation in the character sequence, and (ii) the character sequence ending with a recognized indicator; and parsing the identified character sequence and determining whether the identified character sequence is the author name text string, wherein the parsing and determining comprises: updating the identified character sequence by converting any punctuation or whitespace between terms in the character sequence to a single space character, and determining whether one or more author names are contained in the updated character sequence by identifying a pattern of surname and set of initials in the updated character sequence, such that an author name in non-standardized format in the document is identified and output in standardized format.
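The parsing and determining steps of the claim can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: it assumes each author is written surname first, followed by initials; it leaves hyphens inside surnames untouched; and it assumes a standardized format of "Surname INITIALS" joined by commas, which the record does not specify. The function names normalize, parse_authors, and standardize are hypothetical.

```python
import re


def normalize(seq):
    """Convert each run of punctuation or whitespace between terms to a
    single space (hyphens are kept: an assumption for compound surnames)."""
    return re.sub(r"[\s.,;&]+", " ", seq).strip()


def parse_authors(seq):
    """Return a list of (surname, initials) pairs, or None when the
    surname-plus-initials pattern is not found in the sequence."""
    tokens = normalize(seq).split()
    authors = []
    i = 0
    while i < len(tokens):
        surname = []
        # Surname terms: start with a capital letter, not all upper case.
        while i < len(tokens) and tokens[i][0].isupper() and not tokens[i].isupper():
            surname.append(tokens[i])
            i += 1
        initials = []
        # Initials: short all-capital terms such as "J", "A", or "JA".
        while i < len(tokens) and tokens[i].isupper() and len(tokens[i]) <= 3:
            initials.append(tokens[i])
            i += 1
        if not surname or not initials:
            return None  # pattern broken: not an author name string
        authors.append((" ".join(surname), "".join(initials)))
    return authors or None


def standardize(seq):
    """Output in an assumed standardized format: 'Surname INITIALS, ...'."""
    authors = parse_authors(seq)
    if authors is None:
        return None
    return ", ".join(f"{surname} {initials}" for surname, initials in authors)


if __name__ == "__main__":
    print(standardize("Smith J.A., Jones, B."))
```

Returning None when the pattern breaks mirrors the claim's "determining whether" step: a candidate sequence that does not resolve into surname and initials pairs is rejected rather than forced into an author list.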
Address: Belmont MA US