Title of invention |
SELF-LEARNING BASED CRAWLING AND RULE-BASED DATA MINING FOR AUTOMATIC INFORMATION EXTRACTION |
Abstract |
Methods and systems for automatic information extraction using self-learning crawling and rule-based data mining are provided. The method determines whether a crawl policy exists within the input information and performs at least one of front-end crawling, assisted crawling, and recursive crawling. The downloaded dataset is pre-processed to remove noisy data and then subjected to classification rules and decision-tree-based data mining to extract meaningful information. These crawling techniques reduce the multi-dimensional datasets available in online and offline sources to smaller, relevant datasets pertaining to a specific domain. |
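The pre-processing and rule-based mining stages mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the record fields (`name`, `latency_ms`, `loss_pct`), the thresholds, and the labels are hypothetical, and the decision tree is hand-written rather than learned from data.

```python
# Minimal sketch: noise removal followed by decision-tree-style rules.
# Field names, thresholds, and labels are hypothetical.

def preprocess(records):
    """Drop noisy rows: missing fields, empty names, or non-numeric values."""
    clean = []
    for rec in records:
        try:
            name = rec["name"].strip()
            latency = float(rec["latency_ms"])
            loss = float(rec["loss_pct"])
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # treat the row as noise and skip it
        if not name:
            continue
        clean.append({"name": name, "latency_ms": latency, "loss_pct": loss})
    return clean


def classify(rec):
    """A two-level decision tree written out as explicit rules."""
    if rec["loss_pct"] > 1.0:      # root split (assumed threshold)
        return "poor"
    if rec["latency_ms"] > 100.0:  # second split (assumed threshold)
        return "fair"
    return "good"


def mine(records):
    """Pre-process the raw rows, then label each surviving row."""
    return {rec["name"]: classify(rec) for rec in preprocess(records)}


RAW = [
    {"name": " link-a ", "latency_ms": "20", "loss_pct": "0.1"},
    {"name": "link-b", "latency_ms": "n/a", "loss_pct": "0.2"},  # non-numeric
    {"name": "link-c", "latency_ms": "150", "loss_pct": "0.5"},
    {"name": "link-d", "latency_ms": "30"},                      # missing field
    {"name": "link-e", "latency_ms": "40", "loss_pct": "2.5"},
]

if __name__ == "__main__":
    print(mine(RAW))
```

In a real system the rules would more plausibly be derived from training data; the hand-written tree only shows where the classification step sits relative to noise removal.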
Publication number |
US2016371603(A1) |
Publication date |
2016.12.22 |
Application number |
US201615077563 |
Filing date |
2016.03.22 |
Applicant |
Tata Consultancy Services Limited |
Inventors |
A V Arun Kumar;RATH Hemant Kumar;NADAF Shameemraj M.;SIMHA Anantha |
Classification |
G06N99/00;G06N5/04;G06F17/30 |
Main classification |
G06N99/00 |
Agency |
|
Agent |
|
Main claim |
1. A computer implemented method for automatic information extraction comprising:
receiving a request for information extraction and retrieving input information from the request;
determining existence of a crawl policy wherein such determination is performed on the input information retrieved from the request;
performing assisted crawling, in case the input information contains the crawl policy;
performing recursive crawling in case the input information does not contain the crawl policy and computing valid paths and links for building a new crawl policy, wherein recursive crawling is performed until destination files or web-page is reached;
pre-processing dataset obtained after one of the assisted crawling and recursive crawling to remove noisy data to obtain pre-processed relevant dataset; and
subjecting the pre-processed relevant dataset to classification rules and decision tree based data mining to extract meaningful information. |
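The crawl-selection steps of the claim can be sketched as follows. This is a hedged illustration, not the patented implementation: the in-memory `SITE` link graph, the `.csv` destination test, the request and policy dictionaries, and all function names are hypothetical, and no network access is performed.

```python
# Minimal sketch of choosing between assisted and recursive crawling.
# SITE stands in for a website: page -> outgoing links (hypothetical data).
SITE = {
    "/": ["/about", "/data"],
    "/about": [],
    "/data": ["/data/2015", "/data/2016"],
    "/data/2015": ["/data/2015/report.csv"],
    "/data/2016": ["/data/2016/report.csv"],
}


def is_destination(path):
    """Assumed convention: destination files end in '.csv'."""
    return path.endswith(".csv")


def assisted_crawl(policy):
    """Follow the valid paths already recorded in an existing crawl policy."""
    return [path[-1] for path in policy["valid_paths"]]


def recursive_crawl(start):
    """Follow links depth-first until destination files are reached,
    recording every valid path so that a new crawl policy can be built."""
    valid_paths = []
    seen = set()

    def visit(path):
        node = path[-1]
        if node in seen:
            return
        seen.add(node)
        if is_destination(node):
            valid_paths.append(path)
            return
        for link in SITE.get(node, []):
            visit(path + [link])

    visit([start])
    files = [path[-1] for path in valid_paths]
    return files, {"start": start, "valid_paths": valid_paths}


def extract(request):
    """Check the request for a crawl policy and pick the crawl mode."""
    policy = request.get("crawl_policy")
    if policy is not None:
        return assisted_crawl(policy), policy
    return recursive_crawl(request["start"])


if __name__ == "__main__":
    files, policy = extract({"start": "/"})                # no policy yet
    print(files)
    print(extract({"start": "/", "crawl_policy": policy})[0])  # reuses it
```

The first request carries no policy, so the sketch crawls recursively and returns a newly built policy. Passing that policy back in a later request takes the assisted path and skips exploration, which is one plausible reading of the "self-learning" aspect.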
Address |
Mumbai IN |