Abstract |
<p>Automatic approaches to scraping salient content from content sources are provided, allowing the salient content to be presented to the user or subjected to further processing such as clustering or sentiment analysis. Embodiments of the invention provide for: automated scraper induction based on document and/or contextual semantic cues and document structure analysis; identifying salient text and removing boilerplate text, off-topic content, and other non-salient content; deriving reusable descriptive extraction patterns for subsequent documents; applying the descriptive extraction patterns to extract content from subsequent documents from the same source; intelligently determining an extraction-success confidence score using historical success scores; and employing the confidence scores to automatically trigger identification of a new extraction pattern if the extraction confidence is below an acceptable confidence threshold.</p> |
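
The confidence-triggered re-induction loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the disclosure: a document is modeled as a mapping from region names to text, an extraction pattern is just a region name, `induce_pattern` stands in for scraper induction by picking the wordiest region, `raw_score` is a toy quality measure, and the confidence is the current score relative to the mean of the historical scores. The threshold of 0.5 is likewise arbitrary.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical document model: region name -> text found in that region.
Document = Dict[str, str]


@dataclass
class SourceState:
    """Per-source state: the current pattern and past confidence scores."""
    pattern: str
    history: List[float] = field(default_factory=list)


def apply_pattern(pattern: str, doc: Document) -> str:
    # Assumed pattern semantics: look up a region by name.
    return doc.get(pattern, "")


def induce_pattern(doc: Document) -> str:
    # Stand-in for scraper induction: choose the region with the most words.
    return max(doc, key=lambda region: len(doc[region].split()))


def raw_score(text: str) -> float:
    # Toy quality measure: word count, saturating at 10 words.
    return min(1.0, len(text.split()) / 10)


def confidence(text: str, history: List[float]) -> float:
    # Assumed use of history: score relative to the historical mean,
    # so a sharp drop against past extractions yields low confidence.
    score = raw_score(text)
    if not history:
        return score
    baseline = mean(history)
    return min(1.0, score / baseline) if baseline > 0 else score


def extract(doc: Document, state: SourceState,
            threshold: float = 0.5) -> Tuple[str, float]:
    text = apply_pattern(state.pattern, doc)
    conf = confidence(text, state.history)
    if conf < threshold:
        # Confidence too low: derive a new pattern and extract again.
        state.pattern = induce_pattern(doc)
        text = apply_pattern(state.pattern, doc)
        conf = raw_score(text)
    state.history.append(conf)
    return text, conf


state = SourceState(pattern="article")
doc1 = {"nav": "Home About",
        "article": "one two three four five six seven eight nine ten"}
doc2 = {"nav": "Home About",
        "main": "alpha beta gamma delta epsilon zeta eta theta iota kappa"}
text1, conf1 = extract(doc1, state)
text2, conf2 = extract(doc2, state)  # layout changed, so re-induction fires
```

In this sketch the second document no longer has an `article` region, so the stored pattern extracts nothing, the confidence falls below the threshold, and a new pattern is derived from the document itself. A real system would presumably use far richer patterns and scoring than these placeholders.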