Title of invention |
Language recognition based on vocabulary lists |
Abstract |
A method is implemented at a computer to determine that certain information content is composed or compiled in a specific language selected among two or more similar languages. The computer integrates a first vocabulary list of a first language and a second vocabulary list of a second language into a comprehensive vocabulary list. The integrating includes analyzing the first vocabulary list in view of the second vocabulary list to identify a first vocabulary sub-list that is used in the first language, but not in the second language. The computer then identifies, in the information content, a plurality of expressions that are included in the comprehensive vocabulary list, and a subset of expressions that are included in the first vocabulary sub-list. Upon a determination that a total frequency of occurrence of the subset of expressions meets predetermined occurrence criteria, the computer determines that the information content is composed in the first language. |
Publication number |
US9336197(B2) |
Publication date |
2016.05.10 |
Application number |
US201314108224 |
Filing date |
2013.12.16 |
Applicant |
TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED |
Inventors |
Li Lu;Cheng Qiang;Ma Jianxiong;Rao Feng;Lu Duling;Lu Li;Zhang Xiang;Chen Bo |
Classification codes |
G06F17/28;G06F17/27 |
Main classification code |
G06F17/28 |
Agency |
Morgan, Lewis & Bockius LLP |
Agent |
Morgan, Lewis & Bockius LLP |
Main claim |
1. A computer-implemented method of recognizing a first language used in information content, comprising:
at a computer having one or more processors and memory for storing programs to be executed by the one or more processors:
integrating a first vocabulary list and a second vocabulary list that are built based on a first language and a second language, respectively, into a comprehensive vocabulary list, wherein the integrating includes analyzing the first vocabulary list in view of the second vocabulary list to at least identify a first vocabulary sub-list, in the comprehensive vocabulary list, that is used in the first language, but not in the second language;
identifying, within the information content, a plurality of expressions that are included in the comprehensive vocabulary list;
identifying, within the plurality of expressions, a subset of expressions that are included in the first vocabulary sub-list;
determining that a total frequency of occurrence of the subset of expressions within the information content meets predetermined occurrence criteria; and
in accordance with the determination, determining that the information content is composed in the first language. |
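The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `integrate` and `is_first_language`, the regex-based tokenization, the `min_ratio` parameter, and the reading of "predetermined occurrence criteria" as a simple ratio threshold are all assumptions made for illustration. The word lists are toy data and are not linguistically authoritative.

```python
import re


def integrate(first_vocab, second_vocab):
    """Merge two vocabulary lists and find the first-language-only sub-list.

    Returns (comprehensive, first_only), where first_only holds expressions
    present in the first list but absent from the second.
    """
    first, second = set(first_vocab), set(second_vocab)
    comprehensive = first | second   # comprehensive vocabulary list
    first_only = first - second      # first vocabulary sub-list
    return comprehensive, first_only


def is_first_language(text, comprehensive, first_only, min_ratio=0.2):
    """Decide whether `text` is composed in the first language.

    Assumption: the occurrence criterion is that first-only expressions
    make up at least `min_ratio` of all recognized expressions.
    """
    # Assumption: expressions are single lowercase word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Expressions in the content that appear in the comprehensive list.
    known = [t for t in tokens if t in comprehensive]
    if not known:
        return False
    # Total occurrences of expressions from the first-only sub-list.
    distinctive = sum(1 for t in known if t in first_only)
    return distinctive / len(known) >= min_ratio


# Toy vocabularies for two similar languages (illustrative only).
first_vocab = ["casa", "perro", "gracias", "sol"]
second_vocab = ["casa", "cachorro", "obrigado", "sol"]
comprehensive, first_only = integrate(first_vocab, second_vocab)

result = is_first_language("gracias, el perro esta en casa",
                           comprehensive, first_only)
```

Using a set difference keeps shared expressions out of the decision, so only expressions distinctive to the first language count toward the threshold. A real system would likely need language-appropriate segmentation and a tuned criterion rather than the fixed ratio assumed here.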
地址 |
Shenzhen, Guangdong Province CN |