Title: DATA SORTING FOR LANGUAGE PROCESSING SUCH AS POS TAGGING
Abstract: Technology is disclosed that improves language coverage by selecting sentences to be used as training data for a language processing engine. The technology selects a number of sentences by obtaining a group of sentences, computing a score for each sentence, sorting the sentences based on their scores, and selecting a number of sentences with the highest scores. The scores can be computed by dividing a sum of frequency values of unseen words (or n-grams) in the sentence by a length of the sentence. The frequency values can be based on posts in one or more particular domains, such as the public domain, the private domain, or other specialized domains.
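The score described in the abstract can be written compactly as follows. The symbols are illustrative assumptions, not notation from the application: $s$ is a sentence, $U(s)$ the words (or n-grams) of $s$ identified as unseen, $f(w)$ the frequency value of $w$, and $|s|$ the length of $s$.

```latex
\[
  \operatorname{score}(s) \;=\; \frac{\sum_{w \in U(s)} f(w)}{|s|}
\]
% Worked example (hypothetical values): a sentence of length 4 containing
% two unseen words with frequency values 6 and 2 has
% score = (6 + 2) / 4 = 2.
```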
Publication number: US2017024376(A1)
Publication date: 2017.01.26
Application number: US201514804802
Filing date: 2015.07.21
Applicant: Facebook, Inc.
Inventor: Eck Matthias Gerhard
Classification: G06F17/28; G06N5/02
Main classification: G06F17/28
Agency:
Agent:
Main claim: 1. A method for obtaining engine training data that has high coverage comprising: receiving a set of potential training data snippets comprising one or more n-grams; for each selected snippet of two or more of the potential training data snippets, computing a snippet score for the selected snippet by: identifying one or more n-grams of the selected snippet as unseen n-grams; obtaining a frequency value for the identified unseen n-grams; computing a sum of the obtained frequency values; computing a length value of the selected snippet; and computing the snippet score for the selected snippet by dividing the sum of the obtained frequency values by the length value of the selected snippet; sorting the set of potential training data snippets, as sorted snippets, based on the computed snippet scores; selecting, based on snippet locations in the sorted snippets, one or more of the potential training data snippets as the engine training data; and storing the engine training data in a memory, wherein the engine training data is used by an engine to perform automated language processing functions.
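The steps recited in the claim can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: n-grams are taken to be whitespace-separated unigrams, each distinct unseen word is counted once per snippet, length is the token count, and the names `snippet_score`, `select_training_snippets`, `seen`, `freq`, and `k` are hypothetical.

```python
def snippet_score(tokens, seen, freq):
    """Sum of frequency values of unseen tokens divided by snippet length.

    Assumptions (not from the source): tokens are unigrams, each distinct
    unseen token is counted once, and tokens missing from `freq` count as 0.
    """
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty snippet
    unseen = {t for t in tokens if t not in seen}
    return sum(freq.get(t, 0) for t in unseen) / len(tokens)


def select_training_snippets(snippets, seen, freq, k):
    """Score every snippet, sort by score (highest first), keep the top k."""
    scored = [(snippet_score(s.split(), seen, freq), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


if __name__ == "__main__":
    # Hypothetical frequency values, e.g. counts from posts in some domain.
    freq = {"the": 100, "cat": 10, "sat": 5, "quantum": 2, "dog": 12}
    seen = {"the"}  # words assumed to be covered already
    candidates = ["the cat sat", "the the the", "quantum dog"]
    print(select_training_snippets(candidates, seen, freq, 2))
```

Dividing by length keeps long snippets from winning merely because they contain more words, so the ranking favors snippets that are dense in frequent unseen words.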
Address: Menlo Park, CA, US