主权项 |
1. A method for merging taxonomies comprising:
initializing a merged taxonomy, by a computer system, by merging at least one node of a plurality of nodes of a second taxonomy to at least one node of a plurality of nodes of a first taxonomy, the merged taxonomy comprising the first taxonomy and the second taxonomy; merging, by the computer system, the second taxonomy into the merged taxonomy by, traversing the second taxonomy from below at least one top level of the second taxonomy toward a bottom of the second taxonomy and performing:
comparing, by the computer system, one or more identifiers or titles of the at least one node of the plurality of nodes of the second taxonomy to one or more identifiers or titles of the at least one node of the plurality of nodes of the first taxonomy;comparing, by the computer system, one or more lineages of one or more unmerged nodes of the plurality of nodes of the second taxonomy in the merged taxonomy to one or more lineages of one or more nodes of the plurality of nodes of the first taxonomy; andmerging, by the computer system, the at least one node of the plurality of nodes of the second taxonomy and the at least one node of the plurality of nodes of the first taxonomy in the merged taxonomy if the comparison of the one or more identifiers or titles of the at least one node of the plurality of nodes of the second taxonomy to the one or more identifiers or titles of the at least one node of the plurality of nodes of the first taxonomy and the comparison of the one or more lineages of the one or more unmerged nodes of the second taxonomy in the merged taxonomy to the one or more lineages of the one or more nodes of the plurality of nodes of the first taxonomy satisfy a threshold condition; and for each unmerged node of the plurality of nodes of the second taxonomy, hereinafter a current node of the plurality of nodes of the second taxonomy:
filtering the plurality of nodes of the first taxonomy according to a Jaccard distance of titles thereof with respect to a title of the current node of the plurality of nodes of the second taxonomy to define a filtered set of nodes;calculating an edit distance between the current node of the plurality of nodes of the second taxonomy and each node of the filtered set of nodes;calculating a lineage score for each node of the filtered set of nodes according to a comparison of a lineage of the current node of the plurality of nodes of the second taxonomy in the merged taxonomy to lineages of nodes of the filtered set of nodes; andmerging the current node of the plurality of nodes of the second taxonomy with a selected node of the filtered set of nodes if a combined score of the edit distance and the lineage score for the selected node is both:
a greatest combined score for the nodes of the filtered set of nodes; andthe combined score for the selected node of the filtered set of nodes satisfies a combined score threshold condition. |