Title: Language recognition based on vocabulary lists
Abstract: A method is implemented at a computer to determine that certain information content is composed or compiled in a specific language selected among two or more similar languages. The computer integrates a first vocabulary list of a first language and a second vocabulary list of a second language into a comprehensive vocabulary list. The integrating includes analyzing the first vocabulary list in view of the second vocabulary list to identify a first vocabulary sub-list that is used in the first language, but not in the second language. The computer then identifies, in the information content, a plurality of expressions that are included in the comprehensive vocabulary list, and a subset of expressions that are included in the first vocabulary sub-list. Upon a determination that a total frequency of occurrence of the subset of expressions meets predetermined occurrence criteria, the computer determines that the information content is composed in the first language.
Publication No.: US9336197(B2) Publication Date: 2016.05.10
Application No.: US201314108224 Filing Date: 2013.12.16
Applicant: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED Inventors: Li Lu; Cheng Qiang; Ma Jianxiong; Rao Feng; Lu Duling; Lu Li; Zhang Xiang; Chen Bo
Classification: G06F17/28; G06F17/27 Main Classification: G06F17/28
Agency: Morgan, Lewis & Bockius LLP Agent: Morgan, Lewis & Bockius LLP
Principal Claim: 1. A computer-implemented method of recognizing a first language used in information content, comprising: at a computer having one or more processors and memory for storing programs to be executed by the one or more processors: integrating a first vocabulary list and a second vocabulary list that are built based on a first language and a second language, respectively, into a comprehensive vocabulary list, wherein the integrating includes analyzing the first vocabulary list in view of the second vocabulary list to at least identify a first vocabulary sub-list, in the comprehensive vocabulary list, that is used in the first language, but not in the second language; identifying, within the information content, a plurality of expressions that are included in the comprehensive vocabulary list; identifying, within the plurality of expressions, a subset of expressions that are included in the first vocabulary sub-list; determining that a total frequency of occurrence of the subset of expressions within the information content meets predetermined occurrence criteria; and in accordance with the determination, determining that the information content is composed in the first language.
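The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the function names `integrate` and `is_first_language`, the regex word tokenizer, the toy vocabulary lists, and the ratio threshold `min_ratio` (standing in for the unspecified "predetermined occurrence criteria") are all assumptions made for illustration.

```python
import re


def integrate(first_vocab, second_vocab):
    """Merge two vocabulary lists (hypothetical helper).

    Returns the comprehensive vocabulary and the sub-list of expressions
    used in the first language but not in the second.
    """
    first, second = set(first_vocab), set(second_vocab)
    comprehensive = first | second   # comprehensive vocabulary list
    first_only = first - second      # first vocabulary sub-list
    return comprehensive, first_only


def is_first_language(text, comprehensive, first_only, min_ratio=0.2):
    """Decide whether `text` is in the first language (hypothetical helper).

    Assumption: the occurrence criterion is modelled as a minimum share of
    first-language-only expressions among all recognized expressions.
    """
    # Assumption: expressions are single lowercase word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Expressions in the content that appear in the comprehensive list.
    known = [t for t in tokens if t in comprehensive]
    if not known:
        return False
    # Total occurrences of expressions from the first-only sub-list.
    distinctive = sum(1 for t in known if t in first_only)
    return distinctive / len(known) >= min_ratio


# Toy vocabulary lists, invented for this example.
first_vocab = ["the", "is", "colour", "lorry"]
second_vocab = ["the", "is", "color", "truck"]
comprehensive, first_only = integrate(first_vocab, second_vocab)

print(is_first_language("The lorry is a nice colour", comprehensive, first_only))
print(is_first_language("The truck is a nice color", comprehensive, first_only))
```

Using set difference for the sub-list keeps shared words such as "the" from counting as evidence for either language, which is the point of analyzing one list in view of the other. A real system would likely need language-appropriate segmentation and a tuned criterion.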
Address: Shenzhen, Guangdong Province CN