Abstract |
A method is provided for scraping information from a web page or other page of electronic content. Unlike existing methods, which parse and pattern-match a page's entire HTML (HyperText Markup Language) code or DOM (Document Object Model) tree, the provided method closely examines only specific regions of interest. An image snapshot of the page is created and analyzed using routines that identify regions of interest (e.g., paragraphs of text, faces). Regions comprising text are then converted into machine-readable text using OCR (Optical Character Recognition) technology or a similar tool, and the resulting text can be scanned for symbols, words or phrases of interest.
|
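The flow summarized above (snapshot, region identification, text conversion, scan) can be illustrated with a minimal sketch. It is not the claimed implementation: the snapshot is modeled as a list of character rows rather than real pixel data, regions are found by a simple blank-row split, and `placeholder_ocr` is a hypothetical stand-in for an actual OCR engine. All names here (`Region`, `find_regions`, `placeholder_ocr`, `scrape`) are assumptions introduced for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Pattern

# Toy stand-in for an image snapshot: each string is one row of "pixels",
# and any non-space character counts as ink. A real system would rasterize
# the page and operate on actual pixel data.
Snapshot = List[str]


@dataclass
class Region:
    top: int     # index of the first row in the region (inclusive)
    bottom: int  # index of the last row in the region (inclusive)


def find_regions(snapshot: Snapshot) -> List[Region]:
    """Group consecutive non-blank rows into candidate regions of interest."""
    regions: List[Region] = []
    start = None
    for i, row in enumerate(snapshot):
        has_ink = bool(row.strip())
        if has_ink and start is None:
            start = i                              # a new region begins
        elif not has_ink and start is not None:
            regions.append(Region(start, i - 1))   # a blank row closes it
            start = None
    if start is not None:                          # region runs to the last row
        regions.append(Region(start, len(snapshot) - 1))
    return regions


def placeholder_ocr(snapshot: Snapshot, region: Region) -> str:
    """Hypothetical OCR step. The toy snapshot already holds characters,
    so this only joins the region's rows; a real OCR engine would go here."""
    rows = snapshot[region.top:region.bottom + 1]
    return " ".join(r.strip() for r in rows)


def scrape(snapshot: Snapshot,
           pattern: Pattern[str],
           ocr: Callable[[Snapshot, Region], str] = placeholder_ocr) -> List[str]:
    """Convert each region to text, then scan that text for the pattern."""
    hits: List[str] = []
    for region in find_regions(snapshot):
        text = ocr(snapshot, region)
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits


page = [
    "Acme Widgets",
    "",
    "Price: $19.99 each,",
    "or $49.50 for three.",
    "",
    "",
    "Contact us today",
]
PRICE = re.compile(r"\$\d+\.\d{2}")  # phrases of interest: dollar amounts

print(scrape(page, PRICE))  # ['$19.99', '$49.50']
```

Passing the OCR step as a parameter keeps the region-finding and scanning logic independent of any particular recognition tool, which mirrors the abstract's "OCR technology or a similar tool" wording.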