Title of invention: Computer method and apparatus for extracting data from web pages
Abstract: A computer method and apparatus for extracting information from a Web page are disclosed. The apparatus is formed of an extractor coupled to receive Web pages from a source. The extractor uses natural language processing to extract desired information from each Web page. A storage subsystem receives the extracted information from the extractor and stores it in a database. The method for extracting data from a Web page includes the computer-implemented steps of (i) using natural language processing to find possible formal names on a given Web page, (ii) using pattern matching to search the given Web page for formal names not found by the natural language processing, and (iii) refining the combined set of found formal names to produce a working set of people and organization names extracted from the given Web page. The refining includes determining aliases of respective people and organization names, so as to reduce duplicate names.
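The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the natural language processing step is replaced by a simple regular-expression stand-in, the pattern rules (honorifics, company suffixes) are assumptions chosen for illustration, and every function name here (find_names_nlp, find_names_patterns, refine, extract_names) is hypothetical. Alias detection is approximated by treating a name as an alias when all of its words appear in a longer name.

```python
import re

# Two or more capitalized words in a row, e.g. "Jonathan Stern".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"


def find_names_nlp(text):
    """Step (i), stand-in only: a real system would use an NLP parser.
    Here a multi-word capitalized phrase before a reporting verb counts."""
    return set(re.findall(rf"({NAME}) (?:said|announced|joined)", text))


def find_names_patterns(text, already_found):
    """Step (ii): pattern matching for names the first step missed.
    The honorific and company-suffix rules are illustrative assumptions."""
    orgs = re.findall(r"((?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd))\.", text)
    people = re.findall(r"(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)", text)
    return (set(orgs) | set(people)) - already_found


def refine(names):
    """Step (iii): fold aliases into their longest matching name.
    A name is treated as an alias if all of its words occur in a
    longer name that has already been accepted as canonical."""
    canonical = {}  # canonical name -> set of aliases
    for name in sorted(names, key=lambda n: (-len(n.split()), n)):
        words = set(name.split())
        for full in canonical:
            if words <= set(full.split()):
                canonical[full].add(name)
                break
        else:
            canonical[name] = set()
    return canonical


def extract_names(text):
    """Run the three steps in order and return the working set."""
    nlp_names = find_names_nlp(text)
    pattern_names = find_names_patterns(text, nlp_names)
    return refine(nlp_names | pattern_names)


if __name__ == "__main__":
    page = ("Jonathan Stern said Acme Corp. will expand. "
            "Dr. Stern joined Acme Corp. in 2001.")
    for full_name, aliases in extract_names(page).items():
        print(full_name, sorted(aliases))
```

In this sketch "Stern" is folded into "Jonathan Stern" as an alias, which is one simple way to reduce duplicate names; a production system would need stronger evidence before merging two names.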
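Publication number: US2007027672(A1)
Publication date: 2007.02.01
Application number: US20060436370
Filing date: 2006.05.18
Applicant:
Inventors: DECARY MICHEL; STERN JONATHAN; KARADIMITRIOU KOSMAS; ROTHMAN-SHORE JEREMY
Classification: G06F17/28
Main classification: G06F17/28
Agency:
Agent:
Principal claim:
Address: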