Independent claim |
1. A system for extraction of an off-topic part from a conversation, the system comprising:
a memory;
a processing unit connected to the memory;
a first corpus stored in the memory, the first corpus including documents of a plurality of fields;
a second corpus stored in the memory, the second corpus including only documents of a field to which said conversation belongs;
a determination means stored in the memory, the determination means interoperating with the processing unit for determination of, as a lower limit subject word, each word included in said second corpus for which the IDF value for said first corpus and the IDF value for said second corpus are each below a first certain threshold value, wherein the IDF value in each corpus is found according to the following formula: IDF(w) = log(D / DF(w)), where D indicates the number of documents contained in each corpus, and DF(w) indicates the number of documents that include a word w within the documents contained in each corpus;
a score calculation part stored in the memory, the score calculation part interoperating with the processing unit for calculation, as a score, of a TF-IDF value for each word included in said second corpus, the TF-IDF value being determined based on the product of the term frequency of appearance in a target document and the log of the inverse of the proportion of document frequency of appearance of the term, said score calculation part using a constant setting a lower limit rather than a TF-IDF value for said lower limit subject word;
a clipping part stored in the memory, the clipping part interoperating with the processing unit for sequential cutting out of clipped intervals subject to processing from text data comprising words that are contents of said conversation, the text data being acquired by speech recognition, while displacing a window of a certain length sequentially over the text data; and
an extraction part stored in the memory, the extraction part interoperating with the processing unit for extraction of, as an off-topic part of a conversation, a clipped interval where the average value of the scores of the words included in the clipped interval is larger than a second certain threshold value. |
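
The elements recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: every function and parameter name below is hypothetical, corpora are represented as lists of token lists, term frequency is taken as a raw count in a caller-supplied target document, and words of the conversation that do not occur in the second corpus receive a caller-chosen `unseen_score`, a case the claim does not specify.

```python
import math
from collections import Counter


def idf(corpus):
    """IDF(w) = log(D / DF(w)) for each word of a corpus (a list of token lists)."""
    d = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(d / n) for w, n in df.items()}


def lower_limit_words(idf_first, idf_second, idf_threshold):
    """Words of the second corpus whose IDF is below the threshold in both corpora.

    Assumption: a word missing from the first corpus is never a lower limit word.
    """
    return {
        w for w, v in idf_second.items()
        if v < idf_threshold and idf_first.get(w, math.inf) < idf_threshold
    }


def word_scores(target_doc, idf_second, floor_words, floor):
    """TF-IDF score per second-corpus word; lower limit words get the constant floor.

    Assumption: TF is the raw count of the word in target_doc.
    """
    tf = Counter(target_doc)
    return {
        w: floor if w in floor_words else tf[w] * idf_second[w]
        for w in idf_second
    }


def clipped_intervals(words, window, step=1):
    """Yield (start, chunk) while sliding a fixed-length window over the words."""
    if not words:
        return
    for start in range(0, max(len(words) - window, 0) + 1, step):
        yield start, words[start:start + window]


def off_topic_intervals(words, scores, window, score_threshold, unseen_score):
    """Return (start, end) of each clipped interval whose mean score exceeds the threshold."""
    hits = []
    for start, chunk in clipped_intervals(words, window):
        mean = sum(scores.get(w, unseen_score) for w in chunk) / len(chunk)
        if mean > score_threshold:
            hits.append((start, start + len(chunk)))
    return hits
```

In this sketch, overlapping hits are reported separately; merging adjacent intervals into a single off-topic span would be a further design choice that the claim leaves open.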