Title: Automatically Creating Training Data For Language Identifiers
Abstract: Example apparatus and methods concern automatically creating labeled training data for automatic language identifiers. One embodiment includes logic to produce a predicted language classification for a post from geographic data associated with the post. The post may be associated with a micro-blog, a social media site, or other electronic communication service that traffics in short messages having frequent colloquialisms, non-standard spelling, emoticons, and unique usages of characters to convey meaning. The embodiment includes logic to produce an actual language classification for the post using a base language classifier. The embodiment includes logic to selectively add the post and a language label for the post to a set of automatically generated labeled training data upon determining that the predicted language classification matches the actual language classification. The automatically generated labeled training data may then be used to build target language models, which may include a target language classifier.
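The agreement filter described in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not taken from the patent: the `COUNTRY_TO_LANGUAGE` table plays the role of the geographic predictor, and the function-word counter in `classify_with_base` is only a toy substitute for a real base language classifier. The sketch shows the one step the abstract specifies: a post is labeled and kept only when the geography-based prediction and the base classifier's output agree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical mapping from a post's country code to the language expected there.
COUNTRY_TO_LANGUAGE = {"FR": "fr", "DE": "de", "ES": "es"}

# Toy stand-in for a base language classifier: a few common function words per language.
FUNCTION_WORDS = {
    "fr": {"le", "la", "et", "est", "je", "pas"},
    "de": {"der", "die", "und", "ist", "ich", "nicht"},
    "es": {"el", "los", "y", "es", "yo", "no"},
}


@dataclass(frozen=True)
class Post:
    text: str
    country: Optional[str]  # geographic attribute, separate from the message text


def predict_from_geo(post: Post) -> Optional[str]:
    """Predicted classification: uses only the geographic attribute."""
    if post.country is None:
        return None
    return COUNTRY_TO_LANGUAGE.get(post.country)


def classify_with_base(text: str) -> Optional[str]:
    """Actual classification: uses only the message text (assumed toy classifier)."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_training_data(posts: List[Post]) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs only when both classifications agree."""
    labeled = []
    for post in posts:
        predicted = predict_from_geo(post)
        if predicted is None:
            continue  # no usable geographic attribute, so no prediction
        actual = classify_with_base(post.text)
        if predicted == actual:
            labeled.append((post.text, actual))
    return labeled


if __name__ == "__main__":
    sample = [
        Post("je ne sais pas lol", "FR"),   # geo and text agree
        Post("ich bin nicht da :)", "FR"),  # geo says fr, text says de
        Post("el gato es mio", None),       # no geographic data
    ]
    print(build_training_data(sample))
```

Requiring agreement between two independent signals trades coverage for label precision: posts with missing geography or with a mismatch are simply dropped rather than labeled, which is why the resulting set can serve as training data without manual review.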
Publication number: US2015006148(A1) Publication date: 2015.01.01
Application number: US201313943788 Filing date: 2013.07.17
Applicant: Microsoft Corporation Inventors: Goldszmit Moises; Najork Marc; Paparizos Stelios
Classification: G06F17/28 Main classification: G06F17/28
Agency: Agent:
Principal claim: 1. A method, comprising: accessing a target corpus of electronic communications associated with an electronic communication service; identifying a member of the target corpus that includes an attribute from which a predicted classification of the member can be made, the attribute being separate from a message portion of the member; accessing the predicted classification of the member, where the predicted classification is a function of the attribute and where the predicted classification is made without reference to a base classifier; accessing an actual classification of the member, where the actual classification is made by the base classifier, the base classifier being configured to classify communications associated with the electronic communication service; and upon determining that the predicted classification matches the actual classification: adding a labeled member to a target training corpus stored in a data store, the labeled member comprising the member and data representing the actual classification.
Address: Redmond, WA, US