Abstract |
A document having a plurality of lines of text is received. The document includes text associated with at least one topic of interest and text not associated with the at least one topic of interest. For each line in the document, a length of the line and a number of off-topic indicators are determined, the off-topic indicators characterizing portions of the document as likely not being associated with the at least one topic of interest. Thereafter, a density for each line can be determined based on the determined line length and the determined number of off-topic indicators. The determined densities are used to identify portions of the document likely associated with the at least one topic of interest so that data characterizing the identified portions of the document can be provided. Related apparatus, systems, techniques and articles are also described. |
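
The per-line density computation summarized above can be illustrated with a minimal sketch. The abstract does not specify what counts as an off-topic indicator, how the density is computed, or how densities are turned into a selection, so everything concrete here is an assumption made for illustration: HTML-like tags stand in for off-topic indicators, density is taken as indicators per character, and a fixed cutoff (`threshold`) separates likely on-topic lines from the rest. The names `TAG_PATTERN`, `line_density`, and `on_topic_lines` are hypothetical.

```python
import re

# Assumption: an HTML-like tag (e.g. <a href="...">, </div>) serves as the
# off-topic indicator. The abstract does not name a specific indicator.
TAG_PATTERN = re.compile(r"<[^>]+>")


def line_density(line):
    """Return off-topic indicators per character for one line.

    Assumed formula: number of indicators divided by line length.
    Empty lines are given a density of 0.0 to avoid division by zero.
    """
    stripped = line.strip()
    if not stripped:
        return 0.0
    indicators = len(TAG_PATTERN.findall(stripped))
    return indicators / len(stripped)


def on_topic_lines(document, threshold=0.02):
    """Return non-empty lines whose density is below the assumed threshold."""
    kept = []
    for line in document.splitlines():
        stripped = line.strip()
        if stripped and line_density(stripped) < threshold:
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    sample = (
        "<a href='/'>Home</a> <a href='/about'>About</a>\n"
        "The quarterly report shows revenue grew across all regions.\n"
        "<div><span>Share</span></div>"
    )
    for text in on_topic_lines(sample):
        print(text)
```

In this sketch, navigation and widget lines carry several tags in few characters and so receive a high density, while a sentence of body text with no tags receives a density of zero. A real system might instead smooth densities across neighboring lines or use other indicators; the abstract leaves those choices open.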