Feb. 10, 2015

Levenshtein distance between 10 million usernames and their passwords

Mark Burnett, a security researcher, recently released a collection of 10 million passwords along with their usernames. My question was, how different are 10 million usernames from their passwords?  Taking a tiny bit of time, I performed a simple analysis looking at the Levenshtein distance between them and composed the graph below.

What this means is, if people in this dataset used their username as a password (ex: user dino, password dino), but then changed it a little (password dino1), how many insertions, deletions or substitutions did these users have to make from the set?  See for yourself.

Distance of 0 means usernames and passwords are exactly identical (in the graph below, 213,133 passwords are same as their usernames).  Distance of 1 means one character was added, deleted or changed. And so on...

1 comment \ Leave a comment
Jan. 26, 2015

Interactive form builder with physics

Make sure to click the export button when done.

No comments \ Leave a comment
Oct. 07, 2014

Language spoken at home: LANP05, LANP12 differences

When using the Census Bureau's American Community Survey (ACS5) dataset, there are two fields on the `persons` set that stand out for me: LANP05 and LANP12.  They are "Language spoken at home," except they are split into two forms: "Language spoken at home for data years prior to 2012" and "Language spoken at home for data year 2012."

The data dictionary explains that the set contains two data vintages (-2012, 2012+), but not why.

 

What's the actual difference?

LANP05 LANP12 Spoken in / Note
966. American Indian 602. Krio Sierra Leone
675. Sindhi Sindh region, Pakistan
689. Uighur AKA Uyghur tili, Uyghurche: Turkic language spoken in a Western province of China
694. Mongolian
750. Micronesian "The twenty Micronesian languages form a family of Oceanic languages..."
761. Trukese Apparently very rare: AKA Chuukese, Austronesian language family spoken primarily on the islands of Chuuk in the Caroline Islands in Micronesia with some speakers on Pohnpei and Guam.
819. Ojibwa AKA Chippewa, spoken in Canada and USA.

So in 2012, the Census Bureau made changes to the language classification but never normalized the fields. Instead, they kept them separate, and reported -9 to fill the missing gaps. While that's annoying to someone who just wants these two fields to be one, it's important to draw attention to the changes and be aware of them.

 

Combining into one field

Supposing you imported the 15,318,124 rows of persons data into MySQL, here's how to combine the two fields into one:

mysql> alter table persons add LANPX int(11) after LANP05;

This should take about 5 minutes to complete on Amazon-RDS. When ready, set all LANP05 non -9's to LANPX:

mysql> update persons set LANPX=LANP05 where LANP05!=-9;
Query OK, 14774037 rows affected (4 min 55.83 sec)
Rows matched: 14774037  Changed: 14774037  Warnings: 0

And all non -9 LANP12's to LANPX:

mysql> update persons set LANPX=LANP12 where LANP12!=-9;
Query OK, 544087 rows affected (2 min 22.56 sec)
Rows matched: 13276440  Changed: 544087  Warnings: 0

A reality check is that 14774037 + 544087 = 15318124, which is the changed + changed = total number of persons rows.

No comments \ Leave a comment
Apr. 17, 2014

Resizing with maintained aspect ratio applet

Maybe I'm just way too old school, but when dealing with resizing anything, I like to jot down a simple fractional equation. For example, if there's a video that's 530 x 298 pixels and you need to resize that to 505 pixels wide, and maintain the aspect ratio:

First, setup the simple equation like so:

\frac {530}{298}=\frac {505}{y}

Second, cross-multiply the fraction:

530*y = 505 * 298

Then solve for y, divide both sides by 530:

y = \frac{505 * 298}{530}

And you may want to round the answer:

y = 283.9433962264...

y \approx 284

Original Resized (Rounded)
Width:
Height:

 

In general terms, if you have:

\frac {a}{b}=\frac {c}{d}

Formulas for solving c and d are:

c = \frac{(a* d)}{b}\ ; d = \frac{(b* c)}{a}

No comments \ Leave a comment
Nov. 22, 2013

North Korean Central News Agency (KCNA) server logs were protected by client-side JavaScript

Sat on this for a year, and figured it's time to publish. 

After a chance discovery of what appear to be KCNA's webserver logs in late 2012, I was able to perform an analysis and shed some light on the secretive North Korean news agency known to the world as www.kcna.kp.  The media often quotes the source, analysts study their content, but until now -- no one on the outside knew exactly how big their web audience was.

 

Traffic Overview, Apr 2011 - Dec 2012

For the first eleven months of 2011, there was only a bare trickle of web traffic registering on the North Korean Central News Agency's website. Only about 600 people would visit the website daily, and perhaps click on four or five stories apiece. In terms of modern news site traffic, that's akin to cosmic background radiation being picked up by a Geiger counter: insignificant.

Then, in December of 2011, Kim Jong Il died. In the two weeks following his death, everything changed.  The world turned its gaze on North Korea, and by relative standards, the KCNA website traffic skyrocketed. Kim Jong Il's dying alone gave the KCNA a traffic boost it desperately needed. Pageviews in that brief period were nearly half of all their year's traffic.

Language Breakdown:
kor 341,163
chn 68,185
jpn 20,439
eng 38,489
spa 2,896

Media events cause public attention to fade away quickly and this was no exception. However, a peculiar thing happened. After the event, KCNA's baseline traffic levels rose to about 3,000 unique visitors per day - a five-fold increase, and it's been slowly but steadily growing.

After the April 13 test missile launch (second large spike), world again started paying attention to North Korea: KCNA is definitely gaining some traction.

There are five language editions: Korean, Chinese, Japanese, English and Spanish. Of that, 72% of their visitors read in Korean, 14% in Chinese, and the remaining languages comprise the remainder.

Security Hole

Often enough, a content management system's backend has a commonly predicted login URL, such as /admin/. Most such backends are entirely firewalled, or heavily fortified.  That initially appeared to be the case with KCNA: pulling up that URL returned you to where you came from after displaying a browser alert roughly saying "not allow" (at least according to Google Translate.)

However, if you opened the URL up in a new tab, you'd see this overview on the right after dismissing the alert() message - presumably a visitor log, with IP addresses, timestamps and language editions. Clicking into any of the links would tripwire the same security apparatus, and you'd get another alert box.

Not allow!KCNA.KP /admin/

Since the data scrape, the hole seems to have been patched by an actual firewall. However, it's bizarre that for years KCNA's reader privacy was essentially protected by a mechanism that nagged you into not looking around with a history.back() statement.  This was the source view of the /admin/ pages:

kcna_security

Think this part has to be spelled out: KCNA readers' IP addresses were exposed to the public.

 

Success Rate

Goals of any news agency are remarkably simple -- to report the news -- and to reach and grow audience.  How that's actually done is a constant struggle and constant adaptation, but the mark of success is also remarkably simple: growing audience into a loyal following, and with that, increasing outreach.

Despite having no external obstacles to their mission (such as active internet censorship by the west) the North Korean Central News Agency has largely failed to reach a significant web audience. Even with the two event jolts, KCNA web traffic is tiny: a bare trickle over the past years. However much money they're expending in the effort can't be justifiably worth it.

That's just my editorializing, but, at least it's based on on actual numbers. See for yourself.

No comments \ Leave a comment
Nov. 17, 2013

Simple python list picker using curses

Are you writing a workhorse shell script in python?  Why not curses?

This class, using curses, lets a user select from a simple list of options. The return is a simple list of selected options, or False (for cancel). And, for the OCD in all of us, it preserves terminal window contents after exiting.

picker3picker4

First, get picker.py from github: https://github.com/pp19dd/picker

Usage example:

from picker import *
    
opts = Picker(
    title = 'Select files to delete',
    options = [
        ".autofsck", ".autorelabel", "bin/", "boot/", 
        "cgroup/", "dev/", "etc/", "home/", "installimage.conf",
        "installimage.debug", "lib/", "lib64/", "lost+found/",
        "media/", "mnt/", "opt/", "proc/", "root/",
        "sbin/", "selinux/", "srv/", "sys/",
        "tmp/", "usr/", "var/"
    ]
).getSelected()

if opts == False:
    print "Aborted!"
else:
    print opts

If the user hits cancel, the routine returns a False - otherwise, you get a simple list:

['.autofsck', '.autorelabel', 'bin/', 'boot/', 'cgroup/']

The return, being a simple list, allows for a fairly readable logic:

if "cgroup/" in opts:
    print "cgroup is done for!"

Now, mind the possible False return, and put it to good use!

No comments \ Leave a comment
Posts on this blog solely represent my personal opinions and technical experience.

© 2009-2015 Edin (Dino) Beslagic