Title: Extracting information from unstructured text using generalized extraction patterns
Abstract: Methods, systems, and apparatus, including computer program products, for extracting information from unstructured text. Fact pairs are used to extract basic patterns from a body of text. Patterns are generalized by replacing words with classes of similar words. Generalized patterns are used to extract further fact pairs from the body of text. The process can begin with fact pairs, basic patterns, or generalized patterns.
Publication number: US9043197(B1) Publication date: 2015.05.26
Application number: US200711774428 Filing date: 2007.07.06
Applicant: Google Inc. Inventors: Pasca Alexandru Marius; Lin Dekang
Classification: G06F17/27; G06F17/21; G06F17/28 Main classification: G06F17/27
Agency: Fish & Richardson P.C. Agent: Fish & Richardson P.C.
Independent claim: 1. A method, implemented by a computing system, for deriving facts, comprising: obtaining data identifying a set of seed fact pairs, wherein each seed fact pair associates a fact subject phrase with an information phrase; determining that a first sentence matches a first seed fact pair in the set of seed fact pairs, wherein determining that the first sentence matches the first seed fact pair comprises determining that the fact subject phrase and the information phrase from the first seed fact pair both occur in the first sentence separated by one or more terms; extracting, by the system, from the first sentence, a basic infix-only pattern that includes the one or more terms that separate the fact subject phrase from the information phrase in the first sentence; generating, by the system, a generalized infix-only extraction pattern from the basic infix-only pattern, wherein generating the generalized infix-only extraction pattern from the basic infix-only pattern comprises: determining that a first term of the one or more terms that separate the fact subject phrase from the information phrase in the first sentence belongs to a first distributionally similar class, and substituting the first distributionally similar class for the first term in the basic infix-only pattern; determining that a second sentence matches the generalized infix-only pattern, comprising determining that the second sentence includes a term that belongs to the first distributionally similar class; and applying, by the system, the generalized infix-only pattern to the second sentence to extract a candidate fact pair from the second sentence.
Address: Mountain View, CA, US
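
Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be approximated in a few lines of Python. Everything named here is an assumption made for illustration: the helper names (`extract_basic_pattern`, `generalize`, `apply_pattern`), the toy class table `SIMILARITY_CLASSES`, the example sentences, and the simplifications that the subject phrase precedes the information phrase and that a candidate pair is simply everything before and after the matched infix. The record does not say how distributionally similar classes are computed, so they are hard-coded.

```python
import re

# Hypothetical, hand-written stand-in for distributionally similar classes.
SIMILARITY_CLASSES = {
    "CL1": {"born", "raised"},
}


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)


def find_phrase(tokens, phrase, start=0):
    """Return the index where `phrase` begins in `tokens`, or -1."""
    words = phrase.split()
    for i in range(start, len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i
    return -1


def extract_basic_pattern(sentence, subject, info):
    """Return the infix terms between subject and info, or None if no match."""
    tokens = tokenize(sentence)
    s = find_phrase(tokens, subject)
    if s < 0:
        return None
    s_end = s + len(subject.split())
    # Start one token later so at least one term separates the two phrases.
    i = find_phrase(tokens, info, s_end + 1)
    if i < 0:
        return None
    return tokens[s_end:i]


def generalize(pattern):
    """Replace each term that belongs to a class with that class's name."""
    generalized = []
    for term in pattern:
        cls = next((name for name, members in SIMILARITY_CLASSES.items()
                    if term.lower() in members), None)
        generalized.append(cls or term)
    return generalized


def slot_matches(slot, token):
    """A class slot matches any member of the class; a literal matches itself."""
    if slot in SIMILARITY_CLASSES:
        return token.lower() in SIMILARITY_CLASSES[slot]
    return slot == token


def apply_pattern(generalized, sentence):
    """Return a candidate (subject, info) pair, or None if no match.

    Simplification: the text before the infix is taken as the subject and
    the text after it as the information phrase.
    """
    tokens = tokenize(sentence)
    n = len(generalized)
    for start in range(1, len(tokens) - n):
        window = tokens[start:start + n]
        if all(slot_matches(s, t) for s, t in zip(generalized, window)):
            return " ".join(tokens[:start]), " ".join(tokens[start + n:])
    return None


seed = ("Mozart", "Salzburg")
basic = extract_basic_pattern("Mozart was born in Salzburg.", *seed)
pattern = generalize(basic)
candidate = apply_pattern(pattern, "Chopin was raised in Warsaw.")
```

In this toy run the seed pair yields the basic infix `was born in`, the class substitution turns it into `was CL1 in`, and the second sentence matches because `raised` belongs to the same class. A realistic system would need phrase boundary detection and learned similarity classes, neither of which is shown here.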