Various fuzzy string match algorithms & Excellent video review

Fuzzy String Match

I’ve been studying up on fuzzy string match after controlling for misspellings, typos, dyslexia etc.  and I found a few articles discussing various approaches like:

Levenstein distance
Damerau–Levenshtein distance
n-gram
Soundex
Jaro-Winkler distance
Jaccard index

I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes
What’s in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

What's in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

Below are my notes from watching the fuzzy string match video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn’t rule out too much) Probably still look at 70%
· 26 Character filter- search for character that isn’t shared- This dropped out quite a bit but was slow (300x down to 65 minutes)
o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)
o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x down to 5 minutes – 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula

Learnings from Fuzzy String Match process

· Measure performance and focus on bottleneck
· Order of magnitude doesn’t always tell you about actual performance
· Favor simplicity

Approximate text matching – Wikipedia, the free encyclopedia

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern …

kiyoka/fuzzy-text-match · GitHub

fuzzy string match library for ruby. Contribute to text-match development by creating an account on GitHub.

text-match | RubyGems.org | your community gem host

text-match 0.9.7. calculate Jaro Winkler distance. Versions: 0.9.7 – December 21, 2013 (13.5 KB); 0.9.6 – December 21, 2013 (13.5 KB); 0.9.5 – March 26, …

Leave a Reply