Title of invention |
AUTOMATIC EXTRACTION USING MACHINE LEARNING BASED ROBUST STRUCTURAL EXTRACTORS |
Abstract |
A method and apparatus for automatically extracting information from a large number of documents through applying machine learning techniques and exploiting structural similarities among documents. A machine learning model is trained to have at least 50% accuracy. The trained machine learning model is used to identify information attributes in a sample of pages from a cluster of structurally similar documents. A structure-specific model of the cluster is created by compiling a list of top-K locations for each attribute identified by the trained machine learning model in the sample. These top-K lists are used to extract information from the pages of the cluster from which the sample of pages was taken.
|
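The top-K step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the record: each page is modelled as a dict mapping a location string (here, a simplified path) to its text, `toy_classify` is a hypothetical stand-in for the trained machine learning model, and `build_top_k` and `extract` are illustrative names. The record does not state how locations are represented or how candidates within a top-K list are chosen, so the "first listed location present on the page" rule below is only one plausible reading.

```python
from collections import Counter, defaultdict


def toy_classify(text):
    """Hypothetical stand-in for the trained model: returns an attribute or None."""
    if text.startswith("$"):
        return "price"
    if text.istitle():
        return "name"  # deliberately noisy: also fires on "In Stock"
    return None


def build_top_k(sample_pages, classify, k=2):
    """For each attribute, keep the k locations the model labeled most often."""
    counts = defaultdict(Counter)
    for page in sample_pages:
        for location, text in page.items():
            attribute = classify(text)
            if attribute is not None:
                counts[attribute][location] += 1
    return {
        attribute: [location for location, _ in counter.most_common(k)]
        for attribute, counter in counts.items()
    }


def extract(page, top_k):
    """Read each attribute from the first top-K location present on the page."""
    record = {}
    for attribute, locations in top_k.items():
        for location in locations:
            if location in page:
                record[attribute] = page[location]
                break
    return record


# A sample of pages from one cluster of structurally similar documents.
sample = [
    {"/body/h1": "Blue Kettle", "/body/div/span": "$25", "/body/p": "Free shipping"},
    {"/body/h1": "Red Lamp", "/body/div/span": "$40", "/body/p": "In Stock"},
    {"/body/h1": "Green Mug", "/body/div/span": "$8", "/body/p": "Free shipping"},
]
top_k = build_top_k(sample, toy_classify, k=2)

# A page from the same cluster that the classifier alone would not label.
new_page = {"/body/h1": "steel pan", "/body/div/span": "USD 30", "/body/p": "Free shipping"}
print(extract(new_page, top_k))
```

In this sketch the classifier would label neither "steel pan" nor "USD 30" on its own, yet both are recovered because extraction relies on the locations learned from the sample rather than on the model's per-page predictions. This is consistent with the abstract's point that a model of modest accuracy can suffice when the structure of the cluster is exploited.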
Publication number |
US2010223214(A1) |
Publication date |
2010.09.02 |
Application number |
US20090395586 |
Filing date |
2009.02.27 |
Applicant |
KIRPAL ALOK S;SATPAL SANDEEPKUMAR BHURAMAL;KSHIRSAGAR MEGHANA;SENGAMEDU SRINIVASAN H |
Inventor |
KIRPAL ALOK S.;SATPAL SANDEEPKUMAR BHURAMAL;KSHIRSAGAR MEGHANA;SENGAMEDU SRINIVASAN H. |
Classification |
G06F15/18 |
Main classification |
G06F15/18 |
Agency |
|
Agent |
|
Independent claim |
|
Address |
|