摘要 |
<p>A method and system for cleaning an electronic document are provided. The method comprises: identifying at least one sentence in the electronic document; numerically representing features of the sentence to obtain a numeric feature representation associated with the sentence; inputting the numeric feature representation into a machine learning classifier, the machine learning classifier being configured to determine, based on each numeric feature representation, whether the sentence associated with that numeric feature representation is a bad sentence; and removing sentences determined to be bad sentences from the electronic document to create a cleaned document.</p> |