Title of invention: METHOD AND SYSTEM FOR SIMPLIFYING IMPLICIT RHETORICAL RELATION PREDICTION IN LARGE SCALE ANNOTATED CORPUS
Abstract: The present invention provides a method and system for predicting implicit rhetorical relations between two spans of text, e.g., in a large annotated corpus such as the Penn Discourse Treebank (“PDTB”), the Rhetorical Structure Theory corpus, or the Discourse Graph Bank, and is particularly directed to determining a rhetorical relation in the absence of an explicit discourse marker. Surface level features may be used to capture pragmatic information encoded in the absent marker. In one approach, a simplified feature set based only on raw text and semantic dependencies is used to improve performance for all relations. By using surface level features to predict implicit rhetorical relations for the large annotated corpus, the invention approaches a theoretical maximum performance, suggesting that more data will not necessarily improve performance based on these and similarly situated features.
Publication number: US2015039294(A1) Publication date: 2015.02.05
Application number: US201414323653 Filing date: 2014.07.03
Applicant: Howald Blake;Nystrom Andrew Inventor: Howald Blake;Nystrom Andrew
Classification: G06F17/28;G06F17/30;G06N99/00;G06F17/27 Main classification: G06F17/28
Agency: Agent:
Principal claim: 1. A computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs, the method comprising: a. identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; and b. processing one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity.
Address: Northfield, MN, US
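Illustrative note on claim 1: features (2) and (4) of the claim can be pictured with a minimal Python sketch. It is not the applicants' implementation. The function names, the feature-key format, the whitespace tokenizer, and the uppercase entity labels are all assumptions made for illustration. Named-entity types are assumed to come from an external tagger and are passed in as plain sets. Dependency n-grams (feature 3) and the sequence feature (feature 1) are omitted because they require a parser and corpus position data.

```python
from collections import Counter

# Entity types listed in feature (4) of the claim. The uppercase labels are an
# assumption about how an external tagger would name them.
ENTITY_TYPES = ("DATE", "TIME", "LOCATION", "PERSON", "MONEY", "PERCENT", "ORGANIZATION")


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_ngram_features(arg1, arg2, max_n=3):
    """Count unigrams, bigrams and trigrams separately for Arg1 and Arg2."""
    features = Counter()
    for name, span in (("Arg1", arg1), ("Arg2", arg2)):
        tokens = span.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                features[f"{name}_{n}gram={gram}"] += 1
    return dict(features)


def entity_features(name, entity_types_found):
    """Binary indicator for each named-entity type observed in a span."""
    return {f"{name}_has_{t}": int(t in entity_types_found) for t in ENTITY_TYPES}


# Hypothetical argument pair; the entity sets stand in for tagger output.
features = {
    **text_ngram_features("Prices fell sharply", "Investors sold shares"),
    **entity_features("Arg1", set()),
    **entity_features("Arg2", {"ORGANIZATION"}),
}
```

A sparse dictionary of this kind could be handed to any classifier that accepts named features. Keeping the Arg1 and Arg2 prefixes separate preserves which span each n-gram came from.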