Title of invention: METHOD FOR COMPARING TEXT FILES WITH DIFFERENTLY ARRANGED TEXT SECTIONS IN DOCUMENTS
Abstract: A method for comparing and analysing digital documents includes searching for unambiguous roots in both documents. These roots are unique units that occur in both documents. The roots can be individual words, word groups or other unambiguous textual formatting functions. There is then a search for identical roots in the other document (Root1 from Content1, and Root2 from Content2, with Root1=Root2). If a pair is found, the area around these roots is compared until there is no longer any agreement. During the area search, both preceding words and subsequent words are analysed. The areas that are found in this way, Area1 around Root1 and Area2 around Root2, are stored in lists, List1 and List2, allocated to Doc1 and Doc2. This procedure is repeated until no more roots can be found. The result is either a remaining area that has no overlaps, or complete identity of the documents.
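The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes documents are already split into word lists, treats only single words as roots (the abstract also allows word groups and formatting functions), and the names `find_root` and `match_areas` are hypothetical.

```python
from collections import Counter


def unmarked_counts(words, marked):
    """Count how often each word occurs among the still-unmarked positions."""
    return Counter(w for w, m in zip(words, marked) if not m)


def find_root(words1, marked1, words2, marked2):
    """Return (i, j) for the first word that is unique in the unmarked
    parts of both documents, or None if no such root exists."""
    c1 = unmarked_counts(words1, marked1)
    c2 = unmarked_counts(words2, marked2)
    for i, w in enumerate(words1):
        if not marked1[i] and c1[w] == 1 and c2[w] == 1:
            j = next(k for k, v in enumerate(words2)
                     if v == w and not marked2[k])
            return i, j
    return None


def match_areas(words1, words2):
    """Repeatedly pick a root and grow the matching area around it.

    Returns (areas, marked1, marked2); each area is a tuple
    (start_in_doc1, start_in_doc2, length_in_words)."""
    marked1 = [False] * len(words1)
    marked2 = [False] * len(words2)
    areas = []
    while (root := find_root(words1, marked1, words2, marked2)) is not None:
        lo1, lo2 = root
        hi1, hi2 = root
        # Grow backwards over preceding words while they still agree.
        while (lo1 > 0 and lo2 > 0
               and not marked1[lo1 - 1] and not marked2[lo2 - 1]
               and words1[lo1 - 1] == words2[lo2 - 1]):
            lo1 -= 1
            lo2 -= 1
        # Grow forwards over subsequent words while they still agree.
        while (hi1 + 1 < len(words1) and hi2 + 1 < len(words2)
               and not marked1[hi1 + 1] and not marked2[hi2 + 1]
               and words1[hi1 + 1] == words2[hi2 + 1]):
            hi1 += 1
            hi2 += 1
        length = hi1 - lo1 + 1
        for k in range(length):
            marked1[lo1 + k] = True
            marked2[lo2 + k] = True
        areas.append((lo1, lo2, length))
    return areas, marked1, marked2


doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "over the lazy dog the quick brown fox leaps".split()
areas, marked1, marked2 = match_areas(doc1, doc2)
print(areas)
```

In this example the two rearranged sections are recovered as separate areas, and the words left unmarked ("jumps" in the first document, "leaps" in the second) form the remaining area without overlaps. The loop terminates because every iteration marks at least the root itself.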
Publication number: US2017060939(A1) Publication date: 2017.03.02
Application number: US201514835025 Filing date: 2015.08.25
Applicant: Schlafender Hase GmbH Software & Communications Inventor: Braun Elmar
Classification: G06F17/30;G06F17/27 Main classification: G06F17/30
Agency: Agent:
Main claim: 1. A computer implemented method for systematically comparing the contents of at least two digitally stored documents (Doc1, Doc2), which are stored on digital medium and which are loaded by a computer to be compared by a computer, wherein the stored documents (Doc1, Doc2) have marked and unmarked areas, wherein at the beginning all the areas are unmarked, wherein the documents have repetitions, comprising the following steps: a) computing a histogram of each document, and comparing the histogram with a reference histogram; searching for an n, wherein n is a natural number, which modifies the frequencies of words in one of the histograms in a way that the comparison of the histograms matches within a predefined range; b) searching for identical roots (Root1, Root2) in the unmarked areas of the documents with n occurrences, of which there are at least two, wherein the roots comprise a string of text symbols, being in particular words, word groups or other unambiguous textual formatting functions, and must occur exactly n times in each of the documents, and wherein if a root is not unambiguous it is discarded, and wherein a search for the root is carried out in the first document in order to determine unambiguity, and then a search for the root is carried out in the second document in order to determine its unambiguity; c) if roots have been found, comparison of the documents, starting with the roots (Root1, Root2), until there is no longer any agreement, wherein the areas (Area1, Area2) found in this way are marked; d) repeating the above steps, starting with b) in a recursion until there are no longer any unique and identical roots or until no more found areas can be marked, wherein the marked areas are at first not taken into account in the search for roots and areas;
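Steps a) and b) of the claim can be read as follows; this is one possible interpretation sketched in Python, not the claimed implementation. It assumes that "modifying the frequencies" means dividing a document's word counts by n, that the "predefined range" is a relative tolerance, and that roots are single words. The claim does not say where the reference histogram comes from, so here it is simply passed in. The names `estimate_repetitions` and `candidate_roots` and the parameters `tolerance` and `max_n` are hypothetical.

```python
from collections import Counter


def estimate_repetitions(hist, reference, tolerance=0.1, max_n=10):
    """Return the smallest natural number n for which hist[w] / n lies
    within a relative tolerance of reference[w] for every reference word,
    or None if no n up to max_n fits (assumed reading of step a)."""
    for n in range(1, max_n + 1):
        if all(abs(hist.get(w, 0) / n - ref) <= tolerance * ref
               for w, ref in reference.items()):
            return n
    return None


def candidate_roots(words1, words2, n):
    """Return the words that occur exactly n times in each document
    (assumed reading of step b, restricted to single-word roots)."""
    h1, h2 = Counter(words1), Counter(words2)
    return sorted(w for w in h1 if h1[w] == n and h2[w] == n)


section = "alpha beta beta gamma".split()
reference = Counter(section)              # histogram of one repetition
doc1 = section * 2                        # the section repeated twice
doc2 = "gamma beta alpha beta".split() * 2 + ["delta"]

n = estimate_repetitions(Counter(doc1), reference)
roots = candidate_roots(doc1, doc2, n)
print(n, roots)
```

With n determined this way, words that occur exactly n times in both documents can still serve as roots even though no word is unique in a document that contains repetitions.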
Address: Frankfurt am Main DE