Title of invention: Adaptive parser-centric text normalization
Abstract: Embodiments of the present invention relate to a customizable text normalization framework providing for domain adaptability through modular replacement generators. In one embodiment, a method of and computer program product for text normalization are provided. An input sequence comprising a plurality of tokens is received. A plurality of generators is applied to the input sequence to generate a set of candidate replacements of the tokens of the sequence. A plurality of subsets of the set of candidate replacements is determined such that the candidate replacements of each subset are syntactically consistent. A probability is determined for each of the subsets. A subset of the plurality of subsets having the highest probability is selected. Each candidate replacement of the selected subset is applied to the input sequence to generate an output sequence. The output sequence is outputted.
Publication number: US9471561(B2) Publication date: 2016.10.18
Application number: US201314141036 Filing date: 2013.12.26
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION Inventors: Baldwin Tyler S.; Ho Ching-Tien (Howard); Kimelfeld Benny; Li Yunyao; Zhang Congle
Classification: G06F17/27; G06F17/30 Main classification: G06F17/27
Agency: Foley Hoag LLP Agents: Huestis Erik; Kenny Stephen; Foley Hoag LLP
Principal claim: 1. A method comprising: receiving at a computing node an input sequence comprising a plurality of tokens; applying by a processor of the computing node a plurality of domain-specific generators to the input sequence to generate a set of candidate replacements of the tokens of the input sequence; creating in a memory of the computing node a directed graph comprising a plurality of nodes and a plurality of edges, each node having an associated candidate replacement of the set of candidate replacements, and each edge connecting a first node to a second node, the second node being associated with a consistent follower of the candidate replacement associated with the first node, and creating the plurality of edges comprising determining syntactic consistency between each pair of the set of candidate replacements; determining by the processor a plurality of paths in the directed graph, each of the plurality of paths comprising at least one of the plurality of edges; determining by the processor a score for each of the paths; selecting by the processor a path of the plurality of paths having the highest score; applying by the processor each candidate replacement of the selected path to the input sequence to generate a normalized output sequence; and evaluating a correctness of the normalized output sequence by parsing the normalized output sequence to obtain a parse result and comparing the parse result with a gold standard that is obtained by parsing a manually normalized sequence.
Address: Armonk, NY, US
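
Illustrative sketch (not part of the patent record): the graph-based steps recited in claim 1 can be pictured roughly as follows. Everything named here is an assumption made for illustration: the `Candidate` type, the two toy generators, the additive per-candidate score, and the reading of "consistent follower" as "a candidate that starts at the token where the previous candidate ends". The final evaluation step of the claim (parsing the output and comparing against a gold standard) is omitted because it needs an external parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class Candidate:
    start: int    # index of the first token covered
    end: int      # index one past the last token covered
    text: str     # replacement text
    score: float  # assumed additive score, e.g. a log-probability


Generator = Callable[[List[str]], List[Candidate]]


def keep_generator(tokens: List[str]) -> List[Candidate]:
    """Hypothetical generator: propose leaving every token unchanged."""
    return [Candidate(i, i + 1, tok, -1.0) for i, tok in enumerate(tokens)]


def slang_generator(tokens: List[str]) -> List[Candidate]:
    """Hypothetical domain-specific generator backed by a toy lexicon."""
    lexicon = {"u": "you", "r": "are", "2morrow": "tomorrow"}
    return [Candidate(i, i + 1, lexicon[tok.lower()], -0.5)
            for i, tok in enumerate(tokens) if tok.lower() in lexicon]


def build_graph(cands: List[Candidate]) -> Dict[Candidate, List[Candidate]]:
    """Add an edge a -> b when b is a consistent follower of a.

    Assumption: b is consistent with a when b starts where a ends.
    """
    return {a: [b for b in cands if b.start == a.end] for a in cands}


def enumerate_paths(graph: Dict[Candidate, List[Candidate]],
                    n_tokens: int) -> Iterator[List[Candidate]]:
    """Yield every path that covers the input from token 0 to n_tokens."""
    def walk(path: List[Candidate]) -> Iterator[List[Candidate]]:
        last = path[-1]
        if last.end == n_tokens:
            yield list(path)
            return
        for nxt in graph[last]:
            path.append(nxt)
            yield from walk(path)
            path.pop()

    for node in graph:
        if node.start == 0:
            yield from walk([node])


def normalize(tokens: List[str], generators: List[Generator]) -> List[str]:
    """Score every path, pick the highest-scoring one, apply its replacements."""
    if not tokens:
        return []
    cands = [c for gen in generators for c in gen(tokens)]
    graph = build_graph(cands)
    best = max(enumerate_paths(graph, len(tokens)),
               key=lambda path: sum(c.score for c in path))
    return [c.text for c in best]


print(normalize(["c", "u", "2morrow"], [keep_generator, slang_generator]))
```

Enumerating every path mirrors the claim wording ("a score for each of the paths") but grows exponentially with input length; with additive scores, a dynamic-programming pass over the graph would select the same path more cheaply.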