Title: Document key phrase extraction method
Abstract: A computer-implemented method of extracting key phrases from a document is disclosed comprising the steps of accessing a repository comprising linked subjects, the repository comprising first and second data structures representing the relationship between said subjects using different representation criteria; pruning the first data structure by removing links between subjects based on a further relationship between said subjects in the second data structure; matching phrases in said document to subjects in the pruned first data structure; further pruning the pruned first data structure by removing unmatched subjects that are not linked to matched subjects; determining a ranking for each matched subject; and selecting key phrases using the determined subject rankings. A computer program for implementing the steps of this method when executed on a computer is also disclosed.
Publication number: US8935260(B2) Publication date: 2015.01.13
Application number: US200913264806 Filing date: 2009.05.12
Applicant: Hewlett-Packard Development Company, L.P. Inventors: Zhou Bao-Yao; Luo Ping; Yang Sheng-Wen; Xiong Yuhong; Liu Wei
Classification: G06F17/30; G06F17/27 Main classification: G06F17/30
Agency: Elkington & Fife Agent: Elkington & Fife; Erte Nicholas
Main claim: 1. A computer-implemented method of extracting key phrases from a document comprising: accessing a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria; pruning the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure; matching phrases in said document to said subjects in the pruned first data structure; further pruning the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects; determining a ranking for each matched subject; and selecting key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes; the second data structure is a directional graph comprising organized subject categories; and the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
Address: Houston TX US
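Illustrative sketch: the steps recited in the main claim can be made concrete with a small Python example. This is only one possible reading under stated assumptions, not the patented implementation. All function names, the toy data, and the threshold value are hypothetical. The record does not say how category distance is measured, how phrases are matched, or how subjects are ranked, so the sketch assumes category edges are traversed in both directions, matching is a case-insensitive substring test, and ranking is a simple count of retained hyperlinks touching each matched subject.

```python
from collections import deque


def category_distance(cat_graph, start, goal):
    """Breadth-first shortest path length between two categories.

    Assumption: category edges are traversed in both directions.
    Returns None when no path exists.
    """
    neighbours = {}
    for parent, children in cat_graph.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def prune_by_category(links, subject_cats, cat_graph, threshold):
    """Remove hyperlink u -> v when the shortest category distance exceeds threshold."""
    pruned = {}
    for u, targets in links.items():
        pruned[u] = set()
        for v in targets:
            dists = []
            for cu in subject_cats.get(u, ()):
                for cv in subject_cats.get(v, ()):
                    d = category_distance(cat_graph, cu, cv)
                    if d is not None:
                        dists.append(d)
            if dists and min(dists) <= threshold:
                pruned[u].add(v)
    return pruned


def match_subjects(text, subjects):
    """Assumed matcher: case-insensitive substring test."""
    lowered = text.lower()
    return {s for s in subjects if s.lower() in lowered}


def prune_unmatched(links, matched):
    """Keep matched subjects and subjects directly hyperlinked to or from them."""
    keep = set(matched)
    for u, targets in links.items():
        for v in targets:
            if u in matched:
                keep.add(v)
            if v in matched:
                keep.add(u)
    return {u: {v for v in t if v in keep} for u, t in links.items() if u in keep}


def rank_subjects(links, matched):
    """Assumed ranking: number of retained hyperlinks touching each matched subject."""
    scores = {s: 0 for s in matched}
    for u, targets in links.items():
        for v in targets:
            if u in scores:
                scores[u] += 1
            if v in scores:
                scores[v] += 1
    return scores


def extract_key_phrases(text, links, subject_cats, cat_graph, threshold, top_k):
    nodes = set(links) | {v for t in links.values() for v in t}
    pruned = prune_by_category(links, subject_cats, cat_graph, threshold)
    matched = match_subjects(text, nodes)
    final = prune_unmatched(pruned, matched)
    scores = rank_subjects(final, matched)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(matched, key=lambda s: (-scores[s], s))[:top_k]


# Toy repository (hypothetical data).
links = {
    "machine learning": {"neural network", "banana"},
    "neural network": {"machine learning", "deep learning"},
    "deep learning": {"neural network"},
    "banana": {"fruit"},
    "apple": {"fruit"},
}
subject_cats = {
    "machine learning": {"AI"}, "neural network": {"AI"}, "deep learning": {"AI"},
    "banana": {"Food"}, "fruit": {"Food"}, "apple": {"Food"},
}
cat_graph = {"Root": {"Science", "Culture"}, "Science": {"AI"}, "Culture": {"Food"}}
text = "Deep learning trains a neural network with many layers; bananas are unrelated."

print(extract_key_phrases(text, links, subject_cats, cat_graph, threshold=2, top_k=2))
```

In this toy data the categories AI and Food are four steps apart, so the hyperlink from "machine learning" to "banana" is removed at the assumed threshold of 2, while hyperlinks between subjects in the same category are kept.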