Title of Invention |
DATA SORTING FOR LANGUAGE PROCESSING SUCH AS POS TAGGING |
Abstract |
Technology is disclosed that improves language coverage by selecting sentences to be used as training data for a language processing engine. The technology accomplishes the selection of a number of sentences by obtaining a group of sentences, computing a score for each sentence, sorting the sentences based on their scores, and selecting a number of sentences with the highest scores. The scores can be computed by dividing a sum of frequency values of unseen words (or n-grams) in the sentence by a length of the sentence. The frequency values can be based on posts in one or more particular domains, such as the public domain, the private domain, or other specialized domains. |
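The scoring rule described in the abstract can be written compactly. The symbols below are illustrative assumptions and do not appear in the filing: $s$ is a sentence, $|s|$ its length, $U(s)$ the set of its words (or n-grams) that are not yet covered, and $f(w)$ the frequency value of $w$ in the chosen domain.

```latex
\mathrm{score}(s) \;=\; \frac{1}{|s|} \sum_{w \in U(s)} f(w)
```

Under this reading, a six-word sentence whose unseen words have frequency values 10, 4, and 2 would score $(10 + 4 + 2)/6 \approx 2.67$, while a two-word sentence with a single unseen word of frequency 6 would score $6/2 = 3$ and therefore rank higher.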
Publication Number |
US2017024376(A1) |
Publication Date |
2017.01.26 |
Application Number |
US201514804802 |
Filing Date |
2015.07.21 |
Applicant |
Facebook, Inc. |
Inventor |
Eck Matthias Gerhard |
Classification Number |
G06F17/28;G06N5/02 |
Main Classification Number |
G06F17/28 |
Agency |
|
Agent |
|
Principal Claim |
1. A method for obtaining engine training data that has high coverage comprising:
  receiving a set of potential training data snippets comprising one or more n-grams;
  for each selected snippet of two or more of the potential training data snippets, computing a snippet score for the selected snippet by:
    identifying one or more n-grams of the selected snippet as unseen n-grams;
    obtaining a frequency value for the identified unseen n-grams;
    computing a sum of the obtained frequency values;
    computing a length value of the selected snippet; and
    computing the snippet score for the selected snippet by dividing the sum of the obtained frequency values by the length value of the selected snippet;
  sorting the set of potential training data snippets, as sorted snippets, based on the computed snippet scores;
  selecting, based on snippet locations in the sorted snippets, one or more of the potential training data snippets as the engine training data; and
  storing the engine training data in a memory, wherein the engine training data is used by an engine to perform automated language processing functions. |
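The claimed steps (score, sort, select) can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several choices are assumptions made only for illustration: n-grams are taken to be whitespace-separated unigrams, each distinct unseen word contributes its frequency value once, the length value is the token count, and the names `snippet_score`, `select_training_snippets`, `seen`, and `frequency` are hypothetical.

```python
def snippet_score(tokens, seen, frequency):
    """Sum of frequency values of unseen words divided by snippet length.

    Assumption: each distinct unseen word is counted once, and the
    length value is the number of tokens in the snippet.
    """
    if not tokens:
        return 0.0
    unseen = {t for t in tokens if t not in seen}
    total = sum(frequency.get(t, 0) for t in unseen)
    return total / len(tokens)


def select_training_snippets(snippets, seen, frequency, k):
    """Sort snippets by descending score and keep the first k."""
    ranked = sorted(
        snippets,
        key=lambda s: snippet_score(s.split(), seen, frequency),
        reverse=True,
    )
    return ranked[:k]


# Toy data, invented for this example only.
frequency = {"the": 100, "on": 50, "cat": 10, "dog": 6, "sat": 4, "mat": 2}
seen = {"the", "on"}  # words already covered by existing training data
snippets = ["the cat sat on the mat", "the dog", "on the"]

print(select_training_snippets(snippets, seen, frequency, 2))
# prints ['the dog', 'the cat sat on the mat']
```

Here "the dog" scores 6/2 = 3.0 and outranks the longer sentence at 16/6, which shows how dividing by length favors short snippets that are dense in frequent unseen words. The sketch sorts once with fixed scores, as the claim recites; it does not update the set of seen words after each selection.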
Address |
Menlo Park CA US |