When using the Census Bureau's American Community Survey (ACS5) dataset, there are two fields on the `persons` set that stand out for me: LANP05 and LANP12. They are "Language spoken at home," except they are split into two forms: "Language spoken at home for data years prior to 2012" and "Language spoken at home for data year 2012."
The data dictionary explains that the set contains two data vintages (-2012, 2012+), but not why.
What's the actual difference?
|LANP05||LANP12||Spoken in / Note|
|966. American Indian||602. Krio||Sierra Leone|
|675. Sindhi||Sindh region, Pakistan|
|689. Uighur||AKA Uyghur tili, Uyghurche: Turkic language spoken in a Western province of China|
|750. Micronesian||"The twenty Micronesian languages form a family of Oceanic languages..."|
|761. Trukese||Apparently very rare: AKA Chuukese, Austronesian language family spoken primarily on the islands of Chuuk in the Caroline Islands in Micronesia with some speakers on Pohnpei and Guam.|
|819. Ojibwa||AKA Chippewa, spoken in Canada and USA.|
So in 2012, the Census Bureau made changes to the language classification but never normalized the fields. Instead, they kept them separate, and reported -9 to fill the missing gaps. While that's annoying to someone who just wants these two fields to be one, it's important to draw attention to the changes and be aware of them.
Combining into one field
Supposing you imported the 15,318,124 rows of persons data into MySQL, here's how to combine the two fields into one:
mysql> alter table persons add LANPX int(11) after LANP05;
This should take about 5 minutes to complete on Amazon-RDS. When ready, set all LANP05 non -9's to LANPX:
mysql> update persons set LANPX=LANP05 where LANP05!=-9; Query OK, 14774037 rows affected (4 min 55.83 sec) Rows matched: 14774037 Changed: 14774037 Warnings: 0
And all non -9 LANP12's to LANPX:
mysql> update persons set LANPX=LANP12 where LANP12!=-9; Query OK, 544087 rows affected (2 min 22.56 sec) Rows matched: 13276440 Changed: 544087 Warnings: 0
A reality check is that 14774037 + 544087 = 15318124, which is the changed + changed = total number of persons rows.