Title of invention: METHOD AND PLATFORM FOR TERM EXTRACTION FROM LARGE COLLECTION OF DOCUMENTS
Abstract: A method and platform for statistically extracting terms from large sets of documents is described. An importance vector is determined for each document in the set of documents based on importance values for words in each document. A binary document classification tree is formed by clustering the documents into clusters of similar documents based on the importance vector for each document. An infrastructure is built for the set of documents by generalizing the binary document classification tree. The document clusters are determined by dividing the generalized tree of the infrastructure into two parts and cutting away the upper part. Statistically significant individual key words are extracted from the clusters of similar documents. Key words are treated as seeds and terms are extracted by starting from the seeds and extending to their left or right contexts.
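The first and last steps of the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: the abstract does not say how importance values are computed or when seed extension stops, so TF-IDF is assumed as the importance value, and the frequency-ratio stopping rule with its `min_ratio` and `max_len` parameters is hypothetical. The function names are also illustrative, and the clustering and tree-generalization steps are left out.

```python
from collections import Counter
import math


def importance_vectors(docs):
    """Return one {word: weight} dict per tokenized document.

    TF-IDF is an assumed stand-in for the importance value; the abstract
    does not state how importance is computed.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def count_ngram(docs, ngram):
    """Count occurrences of a token tuple across all documents."""
    n = len(ngram)
    return sum(
        1
        for doc in docs
        for i in range(len(doc) - n + 1)
        if tuple(doc[i:i + n]) == ngram
    )


def extend_seed(docs, seed, min_ratio=0.8, max_len=5):
    """Grow a seed word into a multi-word term (hypothetical stopping rule).

    A neighbouring word is attached on the left or right when the longer
    phrase occurs at least min_ratio times as often as the current phrase.
    """
    term = (seed,)
    while len(term) < max_len:
        base = count_ngram(docs, term)
        if base == 0:
            break
        n = len(term)
        candidates = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                if tuple(doc[i:i + n]) != term:
                    continue
                if i > 0:  # extend into the left context
                    candidates[(doc[i - 1],) + term] += 1
                if i + n < len(doc):  # extend into the right context
                    candidates[term + (doc[i + n],)] += 1
        if not candidates:
            break
        best, freq = candidates.most_common(1)[0]
        if freq / base < min_ratio:
            break
        term = best
    return " ".join(term)
```

Under these assumptions, a seed such as `vector` grows to `support vector machine` when all three words always occur together in the collection, and growth stops as soon as the neighbouring words start to vary.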
Publication number: WO2004114157(A1) Publication date: 2004.12.29
Application number: WO2004SG00179 Filing date: 2004.06.14
Applicant: AGENCY FOR SCIENCE, TECHNOLOGY AND RESEARCH;JI, DONGHONG;YANG, LINGPENG;NIE, YU Inventor: JI, DONGHONG;YANG, LINGPENG;NIE, YU
Classification: G06F17/30;(IPC1-7):G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address: