Title of invention: Method and apparatus for identifying garbage template article
Abstract: A method and an apparatus for identifying garbage template articles in the network communication field are disclosed. The method includes: extracting a feature from an eligible microblog article to generate an article feature that includes a punctuation feature, a topic feature, a bracket feature, a link feature and an account name feature; acquiring a garbage template list that includes garbage template features, i.e. article features whose frequency reaches a preset threshold and which are extracted in the same way as the article feature; and identifying the microblog article as a garbage template article when its article feature is the same as a garbage template feature in the list. The apparatus includes a feature extracting module, an acquiring module, and an identifying module. Because the decision is made from features extracted from the microblog article itself, garbage template articles on the microblog platform can be identified effectively and search engine resources are saved.
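The acquiring and identifying steps described in the abstract can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names `build_template_list` and `is_garbage_template` are hypothetical, each article feature is assumed to be a hashable value (for example a string or tuple) that has already been extracted, and "reaches a preset threshold" is read as "occurs at least `threshold` times".

```python
from collections import Counter

def build_template_list(features, threshold):
    """Collect article features whose frequency reaches the threshold.

    `features` is an iterable of already-extracted article features
    (assumed hashable); `threshold` is the preset minimum frequency.
    """
    counts = Counter(features)
    return {feature for feature, n in counts.items() if n >= threshold}

def is_garbage_template(feature, template_list):
    """An article is flagged when its feature equals a listed template feature."""
    return feature in template_list

# Hypothetical feature strings standing in for real extracted features.
observed = ["f1", "f1", "f1", "f2", "f3", "f3"]
templates = build_template_list(observed, threshold=3)
print(is_garbage_template("f1", templates))
print(is_garbage_template("f2", templates))
```

Exact equality against a set keeps the lookup cheap; how the features themselves are produced is specified in the main claim.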
Publication number: US9330075(B2)  Publication date: 2016.05.03
Application number: US201314428314  Filing date: 2013.09.17
Applicant: Tencent Technology (Shenzhen) Company Limited  Inventors: Hao Zhixin;He Jianguo;Zhang Guoqiang;He Xiaochen
Classification: G06F17/30;G06F17/22;G06F17/27;H04L29/06;H04L12/58;H04L29/08  Main classification: G06F17/30
Agency: BrainSpark Associates, LLC  Agent: BrainSpark Associates, LLC
Main claim: 1. A method for identifying garbage template article, comprising:
  extracting, by a processor, a feature from an eligible microblog article to generate an article feature, wherein the article feature comprises at least a punctuation feature, a topic feature, a bracket feature, a link feature and an account name feature;
  acquiring, by the processor, a garbage template list which comprises garbage template feature, the garbage template feature being an article feature whose frequency reaches a preset threshold, and the way to extract the garbage template feature being the same as the way to extract the article feature; and
  identifying, by the processor, the microblog article as a garbage template article when the article feature is the same as the garbage template feature in the garbage template list,
  wherein the eligible microblog article is a microblog article which is in an original form and contains link and picture, and before extracting a feature, by the processor, from an eligible microblog article, the method further comprises:
    removing, by the processor, numbers and letters from the eligible microblog article, and removing the contents in various brackets from the microblog article while retaining the brackets;
  wherein extracting, by the processor, a feature from the eligible microblog article comprises:
    segmenting, by the processor, the eligible microblog article with punctuations to generate segment numbers in order;
    extracting, by the processor, the punctuation of each segment, using the extracted punctuations to constitute a string, and generating the punctuation feature;
    extracting, by the processor, the topic and the corresponding segment number of the segment which has a topic for each segment, using the extracted topics and segment numbers to constitute a string, and generating the topic feature;
    extracting, by the processor, the segment number and the corresponding type of brackets of the segment which has brackets for each segment, using the extracted segment numbers and type of brackets to constitute a string, and generating the bracket feature;
    generating, by the processor, a sequence as the link feature according to whether there is a link in each segment; and
    generating, by the processor, a sequence as the account name feature according to whether there is an account name identity in each segment.
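The preprocessing and feature extraction steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim does not specify: the punctuation set `PUNCT`, the bracket pairs `BRACKETS`, the `#topic#` syntax, the `@` sign as the account name identity, ASCII-only removal of letters and digits, and the string formats of the topic and bracket features. Links are replaced with a placeholder character before letters are removed so that link presence can still be detected afterwards; that ordering is also an assumption. The eligibility check (original article containing a link and a picture) is not covered.

```python
import re

# Every constant below is an assumption made for illustration.
LINK = "\uE000"  # placeholder character that survives letter/digit removal
LINK_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_-]+")
PUNCT = set("，。！？；：、,.!?;:")
BRACKETS = [("(", ")"), ("[", "]"), ("（", "）"), ("【", "】"), ("《", "》")]
TOPIC_RE = re.compile(r"#([^#]+)#")

def preprocess(text):
    """Mark links, empty every bracket pair, then drop ASCII letters and digits."""
    text = LINK_RE.sub(LINK, text)
    for o, c in BRACKETS:
        pattern = re.escape(o) + "[^" + re.escape(c) + "]*" + re.escape(c)
        text = re.sub(pattern, lambda m, o=o, c=c: o + c, text)
    return re.sub(r"[A-Za-z0-9]", "", text)

def segment(text):
    """Split at punctuation; return (body, punctuation) pairs in order."""
    segments, buf = [], []
    for ch in text:
        if ch in PUNCT:
            segments.append(("".join(buf), ch))
            buf = []
        else:
            buf.append(ch)
    if buf:  # trailing text with no closing punctuation
        segments.append(("".join(buf), ""))
    return segments

def extract_features(article):
    """Return (punctuation, topic, bracket, link, account name) features."""
    segs = segment(preprocess(article))
    punct = "".join(p for _, p in segs)
    topics, brackets = [], []
    for no, (body, _) in enumerate(segs, start=1):  # segment numbers from 1
        topics.extend(f"{no}:{t}" for t in TOPIC_RE.findall(body))
        kinds = "".join(o + c for o, c in BRACKETS if o in body)
        if kinds:
            brackets.append(f"{no}{kinds}")
    link = "".join("1" if LINK in body else "0" for body, _ in segs)
    account = "".join("1" if "@" in body else "0" for body, _ in segs)
    return (punct, "|".join(topics), "|".join(brackets), link, account)

# Two made-up articles built from the same template with different fillers.
a = extract_features("【限时抢购】#双十一# 全场5折！快来看看 http://t.cn/abc123 @小明，不要错过。")
b = extract_features("【新品上市】#双十一# 全场3折！快来看看 http://t.cn/xyz789 @小红，不要错过。")
print(a == b)
```

Under these assumptions both sample articles reduce to the same feature tuple, because the bracket contents, digits, link targets and account names that differ between them are discarded or reduced to presence flags before comparison.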
Address: CN