Title of invention: SYSTEM, METHOD AND APPARATUS FOR AUTOMATIC TOPIC RELEVANT CONTENT FILTERING FROM SOCIAL MEDIA TEXT STREAMS USING WEAK SUPERVISION
Abstract: Presented are a system, method, and apparatus for automatic topic relevant content filtering from social media text streams using weak supervision. A computing device utilizes heuristic rules allowing topic filtering and a data stream data chunk identifier. A plurality of messages are transmitted as streaming message data from a social media network in real-time. The messages are split into a plurality of data stream data chunks according to the data stream data chunk identifier. A rule-based labeled data set L0 is built from one or more data instances in the first data stream data chunk. An initial classifier is built based upon features of L0. The initial classifier is applied to the next data stream data chunk to build a labeled data set L1. A subset of representative instances S1 is selected from labeled data set L1. A first representative classifier C1 is constructed from the representative instances S1.
Publication number: US2016117400(A1) Publication date: 2016.04.28
Application number: US201514877970 Filing date: 2015.10.08
Applicant: Xerox Corporation Inventors: Agarwal Arvind; Dong Cailing
Classification: G06F17/30; G06N5/04; G06N99/00 Main classification: G06F17/30
Patent agency: Agent:
Main claim: 1. A method to filter relevant content from streaming message data streamed across social media networking software to a computing device utilizing a weak supervision strategy, the method comprising the steps of: Utilizing by the computing device one or more heuristic rules for use with a weakly supervised data stream filter utilizing the weak supervision strategy, the one or more heuristic rules allowing topic filtering; Utilizing by the computing device a data stream data chunk identifier; Receiving continuously by the computing device a plurality of messages as streaming message data in a data stream from the social media networking software in real-time; Splitting the plurality of messages in the data stream into a plurality of stream data chunks Di according to the data stream data chunk identifier and loading one or more of the stream data chunks into memory associated with the computing device; Building automatically by the computing device utilizing at least one of a plurality of heuristic rules a rule-based labeled data set L0 from one or more data instances in the first stream data chunk D0; Constructing an initial classifier C0 based upon one or more features of L0; Applying the initial classifier C0 to stream data chunk D1 to build automatically labeled data set L1; Selecting a subset of representative instances S1 from labeled data set L1; Constructing a first representative classifier C1 from representative instance S1; Applying the first representative classifier C1 in combination with the initial classifier C0 using one or more combination strategies to stream data chunk D2 to build automatically a labeled data set L2; Selecting a subset of representative instances S2 from labeled data set L2; and Constructing a second representative classifier C2 from representative instance S2.
Address: Norwalk, CT, US
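Illustrative sketch: the sequence of steps recited in the main claim (chunk the stream, label D0 with heuristic rules, build C0, then repeatedly label the next chunk, select representatives, and build a further classifier) might be sketched in Python as follows. This is not the applicant's implementation. The record does not specify the rule form, the classifier type, the representative-selection criterion, or the combination strategy, so everything below is an assumption made for illustration: a keyword-match rule, a fixed chunk size standing in for the data chunk identifier, a small naive Bayes classifier, selection of the most confident instances, and probability averaging. All function and class names are hypothetical.

```python
import math
from collections import Counter
from itertools import islice


def chunks(stream, size):
    """Split an iterable of messages into lists of at most `size` messages.
    Assumption: a fixed chunk size plays the role of the chunk identifier."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def rule_label(msg, keywords):
    """Hypothetical heuristic rule: 1 (relevant) if a topic keyword occurs, else 0."""
    return 1 if set(msg.lower().split()) & keywords else 0


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (assumed classifier type)."""

    def __init__(self, labeled):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter()
        for msg, y in labeled:
            self.docs[y] += 1
            self.counts[y].update(msg.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])

    def prob_relevant(self, msg):
        total_docs = sum(self.docs.values())
        logp = {}
        for y in (0, 1):
            n = sum(self.counts[y].values())
            lp = math.log((self.docs[y] + 1) / (total_docs + 2))
            for tok in msg.lower().split():
                lp += math.log((self.counts[y][tok] + 1) / (n + len(self.vocab) + 1))
            logp[y] = lp
        m = max(logp.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(logp[0] - m), math.exp(logp[1] - m)
        return e1 / (e0 + e1)


def combined_prob(classifiers, msg):
    """Assumed combination strategy: average the classifiers' probabilities."""
    return sum(c.prob_relevant(msg) for c in classifiers) / len(classifiers)


def select_representatives(scored, k):
    """Assumed selection criterion: keep the k most confidently labeled instances."""
    ranked = sorted(scored, key=lambda mp: abs(mp[1] - 0.5), reverse=True)
    return [(m, 1 if p >= 0.5 else 0) for m, p in ranked[:k]]


def run_filter(stream, keywords, chunk_size, k):
    """Return the classifiers [C0, C1, C2, ...] built over successive chunks."""
    classifiers = []
    for i, chunk in enumerate(chunks(stream, chunk_size)):
        if i == 0:
            # D0 -> rule-based labeled set L0 -> initial classifier C0
            labeled = [(m, rule_label(m, keywords)) for m in chunk]
        else:
            # Di -> Li via the combined earlier classifiers -> Si -> Ci
            scored = [(m, combined_prob(classifiers, m)) for m in chunk]
            labeled = select_representatives(scored, k)
        classifiers.append(NaiveBayes(labeled))
    return classifiers
```

For example, `run_filter(messages, {"flu"}, chunk_size=1000, k=100)` would return one classifier per chunk, and `combined_prob(classifiers, msg) >= 0.5` could then serve as the relevance filter for a new message. Only the first chunk relies on the hand-written rules; later chunks are labeled by the classifiers themselves, which is the weak-supervision aspect the record describes.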