Oct. 07, 2014

Language spoken at home: LANP05, LANP12 differences

When using the Census Bureau's American Community Survey (ACS5) dataset, there are two fields on the `persons` set that stand out for me: LANP05 and LANP12.  They are "Language spoken at home," except they are split into two forms: "Language spoken at home for data years prior to 2012" and "Language spoken at home for data year 2012."

The data dictionary explains that the set contains two data vintages (-2012, 2012+), but not why.


What's the actual difference?

LANP05 LANP12 Spoken in / Note
966. American Indian 602. Krio Sierra Leone
675. Sindhi Sindh region, Pakistan
689. Uighur AKA Uyghur tili, Uyghurche: Turkic language spoken in a Western province of China
694. Mongolian
750. Micronesian "The twenty Micronesian languages form a family of Oceanic languages..."
761. Trukese Apparently very rare: AKA Chuukese, Austronesian language family spoken primarily on the islands of Chuuk in the Caroline Islands in Micronesia with some speakers on Pohnpei and Guam.
819. Ojibwa AKA Chippewa, spoken in Canada and USA.

So in 2012, the Census Bureau made changes to the language classification but never normalized the fields. Instead, they kept them separate, and reported -9 to fill the missing gaps. While that's annoying to someone who just wants these two fields to be one, it's important to draw attention to the changes and be aware of them.


Combining into one field

Supposing you imported the 15,318,124 rows of persons data into MySQL, here's how to combine the two fields into one:

mysql> alter table persons add LANPX int(11) after LANP05;

This should take about 5 minutes to complete on Amazon-RDS. When ready, set all LANP05 non -9's to LANPX:

mysql> update persons set LANPX=LANP05 where LANP05!=-9;
Query OK, 14774037 rows affected (4 min 55.83 sec)
Rows matched: 14774037  Changed: 14774037  Warnings: 0

And all non -9 LANP12's to LANPX:

mysql> update persons set LANPX=LANP12 where LANP12!=-9;
Query OK, 544087 rows affected (2 min 22.56 sec)
Rows matched: 13276440  Changed: 544087  Warnings: 0

A reality check is that 14774037 + 544087 = 15318124, which is the changed + changed = total number of persons rows.

Find this interesting, or useful? Consider sharing the post.

Leave a Reply

Your email address will not be published. Required fields are marked *

Posts on this blog solely represent my personal opinions and technical experience.

© 2009-2019 Edin (Dino) Beslagic