摘要 |
A system and method for matching data using probabilistic modeling techniques is provided. The system includes a computer system and a data matching model/engine. The present invention precisely and automatically matches and identifies entities from approximately matching short string text (e.g., company names, product names, addresses, etc.) by pre-processing datasets using a near-exact matching model and a fingerprint matching model, and then applying a fuzzy text matching model. More specifically, the fuzzy text matching model applies an Inverse Document Frequency function to a simple data entry model and combines this with one or more unintentional error metrics/measures and/or intentional spelling variation metrics/measures through a probabilistic model. The system can be autonomous and robust, and allow for variations and errors in text, while appropriately penalizing the similarity score, thus allowing dataset linking through text columns. |