Moving data from one column to many; Parsing a variable using an SPSS macro

one column to many

Older online vendor tools and databases would frequently put multi-select questions into one column having a pipe,tab,semicolon or comma delimiter (what was real fun is when they would use a comma for a delimiter in a CSV file).

This can be very problematic in nearly any tool. In this video I demonstrate how easy it can be to move data from one column to many with an SPSS macro.

Move one column to many

Parsing a variable in SPSS using a macro

Here is the SPSS macro demonstrated in the video:

 

SPSS Macros for replacing missing data

SPSS macro

SPSS Macros for replacing missing data

A lot of online vendor tools provide a data value on a multi-select question however they leave the value missing if the respondent didn’t select it. Most of the analysis I want to do needs to have a zero (or some other value) present in order to calculate the statistics correctly.

I wrote a few SPSS Macros that make it very easy to replace the missing values on your data.  Here are the two macros

 

SPSS Macros- Replacing missing data

Windows autocorrect- Auto-replace text in any Windows program with a Hotstring from AutoHotKey

windows autocorrect Windows Autocorrect– Auto-replace an abbreviation with full text in any windows program

Years ago I realized I was frequently typing the same thing over and over. For a while I used autocorrect in Microsoft Word which was helpful but had limitations (one huge one is that it only worked in Word). AutoHotKey is a free scripting program that allows you to do an amazing amount of things; one of which is how to use Hotstrings to autocorrect text in any windows program via Hotstrings.

I have hundreds of Hotstrings which allow me to type a few letters and have them, automatically, replaced with the full word, sentence, paragraph, etc.   Below is a simple example where I want to type my main website domain.  Instead of typing the entiere thing out I just type tac. which is quickly replaced with  http://the-automator.com

The second example is one where I want to have line-breaks inserted (This is my template when starting a new AutoHotKey script.)  I simply type tst.  and the following “magically” appears in its place via windows autocorrect.

The below video outlines how I use hotstrings to reduce the amount of mundane work that I have to do.  It is amazing how much time you can save once you get used to using them!  Remember your Hotstrings (windows autocorrect) will work in any Windows program!  Did I mention it was free?  What more could you ask for?

Hotstrings (Auto-replace text in any program)

Windows autocorrect is just one of many amazing things you can do with AutoHotKey however hotstrings is a great, easy, way to start down the path of automation and efficiency!

Various fuzzy string match algorithms & Excellent video review

Fuzzy String Match

I’ve been studying up on fuzzy string match after controlling for misspellings, typos, dyslexia etc.  and I found a few articles discussing various approaches like:

Levenstein distance
Damerau–Levenshtein distance
n-gram
Soundex
Jaro-Winkler distance
Jaccard index

I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes
What’s in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

What's in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

Below are my notes from watching the fuzzy string match video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn’t rule out too much) Probably still look at 70%
· 26 Character filter- search for character that isn’t shared- This dropped out quite a bit but was slow (300x down to 65 minutes)
o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)
o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x down to 5 minutes – 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula

Learnings from Fuzzy String Match process

· Measure performance and focus on bottleneck
· Order of magnitude doesn’t always tell you about actual performance
· Favor simplicity

Approximate text matching – Wikipedia, the free encyclopedia

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern …

kiyoka/fuzzy-text-match · GitHub

fuzzy string match library for ruby. Contribute to text-match development by creating an account on GitHub.

text-match | RubyGems.org | your community gem host

text-match 0.9.7. calculate Jaro Winkler distance. Versions: 0.9.7 – December 21, 2013 (13.5 KB); 0.9.6 – December 21, 2013 (13.5 KB); 0.9.5 – March 26, …