Feb. 10, 2015

Levenshtein distance between 10 million usernames and their passwords

Mark Burnett, a security researcher, recently released a collection of 10 million passwords along with their usernames. My question was: how different are those 10 million usernames from their passwords? I spent a little time on a simple analysis of the Levenshtein distance between each username and its password, and composed the graph below.

In other words: if people in this dataset started with their username as a password (e.g. user dino, password dino) but then changed it a little (password dino1), how many insertions, deletions or substitutions did they make to get from the username to the password? See for yourself.

A distance of 0 means the username and password are identical (in the graph below, 213,133 passwords are the same as their usernames). A distance of 1 means one character was added, deleted or changed. And so on...
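
For readers who want to try this themselves, here is a minimal sketch of the calculation in Python. It is not the code used for the graph; the function names and the sample pairs are made up for illustration, and the real dataset would have to be loaded from whatever format you have it in.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def distance_histogram(pairs):
    """Count how many (username, password) pairs fall at each distance."""
    return Counter(levenshtein(user, password) for user, password in pairs)


# Hypothetical sample pairs, not taken from the real dataset.
sample_pairs = [
    ("dino", "dino"),      # identical: distance 0
    ("dino", "dino1"),     # one character added: distance 1
    ("dino", "d1no"),      # one character changed: distance 1
    ("dino", "dinosaur"),  # four characters added: distance 4
]
histogram = distance_histogram(sample_pairs)
```

Running distance_histogram over all 10 million pairs gives the counts per distance that a graph like the one above plots. A pure-Python loop will be slow at that scale, so a compiled implementation of the same algorithm would be a reasonable swap.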


One response to “Levenshtein distance between 10 million usernames and their passwords”

  1. ehsanul says:

    You probably want to normalize by password or username length.


Posts on this blog solely represent my personal opinions and technical experience.

© 2009-2017 Edin (Dino) Beslagic