Title of invention: SELF-LEARNING BASED CRAWLING AND RULE-BASED DATA MINING FOR AUTOMATIC INFORMATION EXTRACTION
Abstract: Methods and systems for automatic information extraction by performing self-learning crawling and rule-based data mining are provided. The method determines whether a crawl policy exists within the input information and performs at least one of front-end crawling, assisted crawling and recursive crawling. The downloaded dataset is pre-processed to remove noisy data and subjected to classification rules and decision-tree-based data mining to extract meaningful information. Performing these crawling techniques yields smaller, relevant datasets pertaining to a specific domain from the multi-dimensional datasets available in online and offline sources.
Publication number: US2016371603(A1) Publication date: 2016.12.22
Application number: US201615077563 Filing date: 2016.03.22
Applicant: Tata Consultancy Services Limited Inventors: A V Arun Kumar; RATH Hemant Kumar; NADAF Shameemraj M.; SIMHA Anantha
Classification: G06N99/00; G06N5/04; G06F17/30 Main classification: G06N99/00
Patent agency: Patent agent:
Main claim: 1. A computer implemented method for automatic information extraction comprising: receiving a request for information extraction and retrieving input information from the request; determining existence of a crawl policy, wherein such determination is performed on the input information retrieved from the request; performing assisted crawling, in case the input information contains the crawl policy; performing recursive crawling in case the input information does not contain the crawl policy and computing valid paths and links for building a new crawl policy, wherein recursive crawling is performed until destination files or web-page is reached; pre-processing dataset obtained after one of the assisted crawling and recursive crawling to remove noisy data to obtain pre-processed relevant dataset; and subjecting the pre-processed relevant dataset to classification rules and decision tree based data mining to extract meaningful information.
Address: Mumbai IN
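
The flow recited in claim 1 can be illustrated with a small sketch. This is not the applicant's implementation: the in-memory `SITE` and `FILES` tables stand in for real web fetching, and every name here (`crawl`, `recursive_crawl`, `assisted_crawl`, `preprocess`, `classify`, the `price=` record format and the thresholds) is a hypothetical chosen for illustration. It assumes a crawl policy is simply a list of link paths that end at destination files, and it uses a hand-written decision tree in place of a learned one.

```python
from collections import deque

# Hypothetical in-memory site: page -> outgoing links (stands in for HTTP fetches).
SITE = {
    "/": ["/reports", "/about"],
    "/reports": ["/reports/2015.csv", "/reports/2016.csv"],
    "/about": [],
}
# Hypothetical destination files and their raw, noisy lines.
FILES = {
    "/reports/2015.csv": ["price=10", "", "price=250", "###"],
    "/reports/2016.csv": ["price=40", "price=n/a"],
}


def assisted_crawl(policy):
    """Follow the paths stored in an existing crawl policy."""
    return [path[-1] for path in policy["paths"] if path[-1] in FILES]


def recursive_crawl(start):
    """Breadth-first crawl until destination files are reached.

    The valid paths found along the way become a new crawl policy.
    """
    paths, seen, queue = [], {start}, deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in FILES:  # destination reached: record the path
            paths.append(path)
            continue
        for link in SITE.get(node, []):
            if link not in seen:
                seen.add(link)
                queue.append(path + [link])
    return {"paths": paths}


def crawl(request):
    """Use assisted crawling if the request carries a policy, else recurse."""
    policy = request.get("crawl_policy")
    if policy is not None:
        return assisted_crawl(policy), policy
    policy = recursive_crawl(request["start"])
    return [path[-1] for path in policy["paths"]], policy


def preprocess(lines):
    """Remove noisy records, keeping only well-formed 'price=<integer>' lines."""
    values = []
    for line in lines:
        key, _, value = line.partition("=")
        if key == "price" and value.isdigit():
            values.append(int(value))
    return values


def classify(price):
    """Tiny hand-written decision tree (thresholds are arbitrary)."""
    if price < 20:
        return "low"
    if price < 100:
        return "medium"
    return "high"


def extract(request):
    """Crawl, pre-process, then classify each surviving record."""
    files, policy = crawl(request)
    labels = {}
    for name in files:
        for price in preprocess(FILES[name]):
            labels[price] = classify(price)
    return labels, policy


if __name__ == "__main__":
    labels, policy = extract({"start": "/"})          # no policy: recursive crawl
    print(labels)
    print(extract({"crawl_policy": policy})[0])       # policy reused: assisted crawl
```

The first call has no crawl policy, so the sketch discovers the paths recursively and returns them as a new policy; the second call passes that policy back in and skips discovery, which is the "self-learning" aspect as read from the claim.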