Title of invention: DOCUMENT-SPECIFIC GAZETTEERS FOR NAMED ENTITY RECOGNITION
Abstract: A method for entity recognition employs document-level entity tags which correspond to mentions appearing in the document, without specifying their locations. A named entity recognition model is trained on features extracted from text samples tagged with document-level entity tags. A text document to be labeled is received, the text document being tagged with at least one document-level entity tag. A document-specific gazetteer is generated, based on the at least one document-level entity tag. The gazetteer includes a set of entries, one entry for each of a set of entity names. For a text sequence of the document, features for tokens of the text sequence are extracted. The features include document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries. Entity labels are predicted for the tokens in the text sequence with the named entity recognition model, based on the extracted features.
Publication number: US2017060835(A1)  Publication date: 2017.03.02
Application number: US201514837687  Filing date: 2015.08.27
Applicant: Xerox Corporation  Inventors: Radford William; Carreras Xavier; Henderson James Brinton
Classification: G06F17/27; G06K9/00; G06F17/30  Main classification: G06F17/27
Agency:  Agent:
Main claim: 1. An entity recognition method comprising: providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence; receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag; generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names; for a text sequence of the document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries; predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and wherein at least one of the generating, extracting, and predicting is performed with a processor.
Address: Norwalk CT US
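
Illustrative sketch (not part of the patent record): the gazetteer-generation and feature-extraction steps of the main claim might look roughly like the Python below. The record does not say how tags, gazetteer entries, or features are represented, so everything here is an assumption made for illustration: the `(name, type)` tag format, the function names `build_gazetteer` and `extract_features`, whitespace tokenization, case-insensitive matching, and the `B-`/`I-` position encoding of the document-specific features. The prediction step is omitted; the resulting per-token feature dictionaries would be passed to whatever trained sequence model is used.

```python
from typing import Dict, List, Tuple

# Assumed format: a document-level tag is (entity name, entity type).
# It says the entity is mentioned somewhere in the document, not where.
DocTag = Tuple[str, str]
# Assumed gazetteer layout: tokenized, lower-cased entity name -> entity type.
Gazetteer = Dict[Tuple[str, ...], str]


def build_gazetteer(doc_tags: List[DocTag]) -> Gazetteer:
    """Create one gazetteer entry per entity name in the document-level tags."""
    return {tuple(name.lower().split()): etype for name, etype in doc_tags}


def extract_features(tokens: List[str], gazetteer: Gazetteer) -> List[Dict[str, object]]:
    """Return one feature dict per token.

    Tokens that match any part of a gazetteer entry name receive extra
    document-specific features encoding the entry type and whether the
    token is the first word of the name (B) or a later word (I).
    """
    # Index every word of every entry name by its position-and-type label.
    part_labels: Dict[str, set] = {}
    for name, etype in gazetteer.items():
        for i, part in enumerate(name):
            position = "B" if i == 0 else "I"
            part_labels.setdefault(part, set()).add(f"{position}-{etype}")

    features = []
    for tok in tokens:
        # Generic, document-independent features (a deliberately tiny set).
        feats: Dict[str, object] = {"lower": tok.lower(), "is_title": tok.istitle()}
        # Document-specific features from partial matches against the gazetteer.
        for label in sorted(part_labels.get(tok.lower(), ())):
            feats[f"docgaz={label}"] = True
        features.append(feats)
    return features


# Hypothetical document: tagged with two entities, locations unknown.
tags = [("Jane Smith", "PER"), ("Acme", "ORG")]
gaz = build_gazetteer(tags)
tokens = "Smith joined Acme yesterday".split()
feats = extract_features(tokens, gaz)
```

In this sketch, "Smith" matches only the second word of the "Jane Smith" entry, so it receives a `docgaz=I-PER` feature even though the full name never appears in the sentence. That partial-match behaviour is the point of matching "at least a part of the entity name".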