Title: System and method for automated detection of plagiarized spoken responses
Abstract: Systems and methods are provided for automated detection of plagiarized spoken responses. A spoken response is processed to generate a text that is representative of the spoken response. The text is processed to remove disfluencies in the text and to identify a plurality of sentences in the text. A first numerical measure indicative of a number of words and phrases of the text that are included verbatim in a source text is determined. The source text has been designated as a source of plagiarized content. A second numerical measure indicative of an amount of the text that paraphrases portions of the source text is determined. A third numerical measure indicative of a similarity between sentences of the text and sentences of the source text is determined. A model is applied to the first, second, and third numerical measures to classify the spoken response as being plagiarized or non-plagiarized.
Publication number: US9443513(B2)
Publication date: 2016.09.13
Application number: US201514667101
Filing date: 2015.03.24
Applicant: Educational Testing Service
Inventors: Evanini Keelan; Wang Xinhao
Classification: G10L15/18; G10L25/48; G06F17/27; G10L15/26; G10L15/197
Main classification: G10L15/18
Agency: Jones Day
Agent: Jones Day
Main claim: 1. A computer-implemented method of classifying a spoken response as being plagiarized or non-plagiarized, the method comprising: processing a spoken response with a processing system to generate a first text that is representative of the spoken response; processing the first text with the processing system to remove disfluencies in the first text; processing the first text with the processing system to identify a plurality of n-grams in the first text; processing the first text with the processing system to identify a plurality of sentences in the first text; processing the plurality of n-grams and a source text with the processing system to determine a first numerical measure indicative of a number of words and phrases of the first text that are included verbatim in the source text, each of the n-grams being compared to n-grams of the source text to determine the first numerical measure, the source text having been designated as a source of plagiarized content; processing the first text and the source text with the processing system to determine a second numerical measure indicative of (i) an amount of the first text that paraphrases portions of the source text, or (ii) an amount of the first text that is semantically-similar to portions of the source text, the second numerical measure being determined by comparing units of text of the first text with corresponding units of text of the source text; processing the plurality of sentences and the source text with the processing system to determine a third numerical measure indicative of a similarity between sentences of the first text and sentences of the source text, each sentence of the plurality of sentences being compared to each sentence of the source text to determine the third numerical measure; and applying a model to the first numerical measure, the second numerical measure, and the third numerical measure to classify the spoken response as being plagiarized or non-plagiarized, the model including a first variable and an associated first weighting factor, the first variable receiving a value of the first numerical measure, a second variable and an associated second weighting factor, the second variable receiving a value of the second numerical measure, and a third variable and an associated third weighting factor, the third variable receiving a value of the third numerical measure.
Address: Princeton NJ US
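
The steps recited in the main claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and it assumes the speech has already been transcribed to punctuated text. Everything named here is hypothetical: the `FILLERS` list, the function names, the trigram default, the weights, the bias, and the 0.5 threshold. The second measure uses bag-of-words cosine similarity as a crude stand-in for paraphrase or semantic similarity, and the third uses `difflib.SequenceMatcher` as one possible sentence-similarity function. The model is assumed to be a logistic function over a weighted sum, which is only one way to realize "variables with associated weighting factors".

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical filler list; a real system would likely use a trained detector.
FILLERS = {"um", "uh", "er", "hmm"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_disfluencies(text):
    """Drop filler words and immediate word repetitions such as "the the"."""
    cleaned = []
    for word in text.split():
        key = word.lower().strip(".,!?")
        if key in FILLERS:
            continue
        if cleaned and cleaned[-1].lower().strip(".,!?") == key:
            continue
        cleaned.append(word)
    return " ".join(cleaned)


def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def verbatim_overlap(response, source, n=3):
    """First measure: fraction of response n-grams found verbatim in the source."""
    resp = ngrams(tokenize(response), n)
    if not resp:
        return 0.0
    src = set(ngrams(tokenize(source), n))
    return sum(g in src for g in resp) / len(resp)


def bag_of_words_cosine(response, source):
    """Second measure (stand-in): cosine similarity of word-count vectors."""
    a, b = Counter(tokenize(response)), Counter(tokenize(source))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def sentence_similarity(response, source):
    """Third measure: each response sentence is compared to every source
    sentence; the best match per response sentence is averaged."""
    resp_sents = split_sentences(response)
    src_sents = split_sentences(source)
    if not resp_sents or not src_sents:
        return 0.0
    best = [max(SequenceMatcher(None, r.lower(), s.lower()).ratio()
                for s in src_sents)
            for r in resp_sents]
    return sum(best) / len(best)


def classify(response_text, source_text, weights=(4.0, 2.0, 2.0), bias=-4.0):
    """Apply an assumed logistic model; weights and bias are illustrative."""
    text = remove_disfluencies(response_text)
    features = (verbatim_overlap(text, source_text),
                bag_of_words_cosine(text, source_text),
                sentence_similarity(text, source_text))
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= 0.5, features
```

Calling `classify(transcript, source)` returns a boolean label together with the three feature values. In practice the weighting factors would be estimated from labeled plagiarized and non-plagiarized responses rather than set by hand as they are here.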