Title of invention: System and method for extraction of off-topic part from conversation
Abstract: A system and method extract off-topic parts from a conversation. The system includes a first corpus containing documents from a plurality of fields; a second corpus containing only documents from the field to which the conversation belongs; a determination means for determining, as a lower-limit subject word, a word whose IDF value for the first corpus and IDF value for the second corpus are each below a first predetermined threshold value; a score calculation part for calculating, as a score, a TF-IDF value for each word included in the second corpus; a clipping part for sequentially cutting out intervals from text data representing the contents of the conversation; and an extraction part for extracting, as an off-topic part, an interval in which the average score of the words included in the clipped interval is larger than a second predetermined threshold value.
Publication number: US9002843(B2)    Publication date: 2015.04.07
Application number: US201313740473    Filing date: 2013.01.14
Applicant: International Business Machines Corporation    Inventors: Itoh Nobuyasu; Nishimura Masafumi; Yamaguchi Yuto
Classification: G06F7/00; G06F17/30; G06F17/27    Main classification: G06F7/00
Agency: Fleit Gibbons Gutman Bongini & Bianco PL    Agent: Fleit Gibbons Gutman Bongini & Bianco PL; Gutman Jose
Principal claim: 1. A system for extraction of an off-topic part from a conversation, the system comprising: a memory; a processing unit connected to the memory; a first corpus stored in the memory, the first corpus including documents of a plurality of fields; a second corpus stored in the memory, the second corpus including only documents of a field to which said conversation belongs; a determination means stored in the memory, the determination means interoperates with the processing unit for determination of, as a lower limit subject word, a word for which IDF value for said first corpus, wherein the IDF value in each corpus is found according to the following formula: $\mathrm{IDF}(w) = \log\left(\frac{D}{\mathrm{DF}(w)}\right)$, where D indicates the number of documents contained in each corpus, and DF(w) indicates the number of documents that include a word w within the documents contained in each corpus, and IDF value for said second corpus are each below a first certain threshold value for each word included in said second corpus; a score calculation part stored in the memory, the score calculation part interoperates with the processing unit for calculation as a score a TF-IDF value, the TF-IDF value being determined based on the product of the term frequency of appearance in a target document and the log of the inverse of the proportion of document frequency of appearance of the term, for each word included in said second corpus, said score calculation part using a constant setting a lower limit rather than a TF-IDF value for said lower limit subject word; a clipping part stored in the memory, the clipping part interoperates with the processing unit, the clipping part, while displacing a window of a certain length sequentially over text data comprising words of said conversation acquired by speech recognition, for sequential cutting out of clipped intervals subject to processing from text data comprising words that are contents of said conversation; and an extraction part stored in the memory, the extraction part interoperates with the processing unit for extraction of, as an off-topic part of a conversation, a clipped interval where an average value of score of words included in the clipped interval is larger than a second certain threshold value.
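Illustrative sketch (not part of the patent text): the pipeline recited in claim 1 can be approximated in a few lines of Python. All function and parameter names below (`idf`, `lower_limit_words`, `scores`, `off_topic_intervals`, `t1`, `t2`, `floor`, `unknown`) are hypothetical. The claim does not state which corpus supplies the IDF factor of the TF-IDF score, what the target document is, or how to treat conversation words absent from the second corpus; this sketch assumes the second corpus, the conversation transcript itself, and a caller-supplied default score, respectively. Documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter

def idf(word, corpus):
    # IDF(w) = log(D / DF(w)); each document is a list of tokens.
    # Assumption: a word that appears in no document gets infinite IDF.
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else math.inf

def lower_limit_words(general, domain, t1):
    # Words of the domain (second) corpus whose IDF is below t1
    # in BOTH the general (first) corpus and the domain corpus.
    vocab = {w for doc in domain for w in doc}
    return {w for w in vocab
            if idf(w, general) < t1 and idf(w, domain) < t1}

def scores(domain, target_doc, floor_words, floor):
    # TF-IDF per domain-corpus word; lower-limit words get the
    # constant `floor` instead. Assumption: TF is the raw count in
    # target_doc and IDF is taken over the domain corpus.
    tf = Counter(target_doc)
    vocab = {w for doc in domain for w in doc}
    return {w: floor if w in floor_words else tf[w] * idf(w, domain)
            for w in vocab}

def off_topic_intervals(words, score, window, t2, unknown=0.0):
    # Slide a fixed-length window one word at a time and keep the
    # (start, end) intervals whose mean score exceeds t2.
    # Assumption: words missing from `score` count as `unknown`.
    hits = []
    for start in range(len(words) - window + 1):
        chunk = words[start:start + window]
        mean = sum(score.get(w, unknown) for w in chunk) / window
        if mean > t2:
            hits.append((start, start + window))
    return hits
```

In use, one would call `lower_limit_words` once per corpus pair, build the score table with `scores`, and then pass the speech-recognized word sequence to `off_topic_intervals`. The thresholds, the window length, and the floor constant are tuning parameters that the claim leaves unspecified.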
Address: Armonk NY US