Title: Discrepancy in County Codes and Duplicates between API and Web Downloads of 2008-2012 ACS Tract-Level Data

Hello,

I’m encountering a puzzling discrepancy in the 2008-2012 ACS 5-Year data at the tract level, and I’m hoping someone here can help clarify what’s going on.

I have two versions of the same dataset:

  1. API download: requested through the IPUMS API
  2. Website download: downloaded directly from the IPUMS website

Both downloads include the same data tables at the tract level. However, I’ve noticed some unexpected differences between the two files:

  1. Duplicates in API File: The API version of the ACS 2012 dataset has duplicate records for FIPS codes 35 (New Mexico) and 72 (Puerto Rico), whereas the website download has no duplicates.
  2. County Code Discrepancy for Doña Ana County, NM: In the API file, Doña Ana County is identified with the county code 010, but in the website download, it’s listed as 013, which I believe is the correct code.

I’m confused about why the API dataset has these discrepancies:

  • Could the issue be related to the API call, or is there an underlying difference in how data is processed between the API and the direct website download?
  • Is there a known issue with county codes or duplications specific to FIPS 35 and 72 in API downloads of ACS data?

Any insights or suggestions would be greatly appreciated! Thank you in advance for your help.

Could you provide details about the NHGIS dataset & tables you requested and/or the code you used to make the API request and inspect the data?

I just used the API to request table B01003 (Total Population) from NHGIS dataset 2008_2012_ACS5a at the tract level, and I didn’t find either of the issues you describe. My data file has no duplicate records, and the county code for Doña Ana County is 013.

It occurred to me that a likely cause of the discrepancies you’re seeing is a difference in how your software handles text encoding for characters like the ñ in “Doña Ana County”. To my knowledge, only New Mexico and Puerto Rico have county names that contain special characters, and those are also the only areas where you found issues.

If you use the fixed width files from NHGIS, and your software doesn’t read them with the correct encoding, it may interpret each special character as two characters, which shifts all remaining text in the row one position to the right. This could cause errors in the codes in other columns.
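To make the mechanism concrete, here is a minimal Python sketch (not your Stata workflow, and not the real NHGIS layout: the 20-character name field is an assumption for illustration). It shows that the ñ occupies two bytes in UTF-8, so a reader that treats every byte as one Latin-1 character ends up with a field one character longer than expected.

```python
# Hypothetical 20-character county name field; the real NHGIS width may differ.
NAME_WIDTH = 20
name = "Doña Ana County".ljust(NAME_WIDTH)

raw = name.encode("utf-8")        # the bytes as stored in a UTF-8 file
misread = raw.decode("latin-1")   # what a Latin-1 reader would see

print(len(name))     # 20 characters as intended
print(len(raw))      # 21 bytes, because ñ takes two bytes in UTF-8
print(len(misread))  # 21 characters after the wrong decode
```

Everything after the name field in that row would then sit one column further right than the dictionary expects.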

E.g., when I open the NHGIS CSV file in Excel, it displays “Doña Ana County” as “DoÃ±a Ana County”. This isn’t a major problem for the CSV file because columns are delimited by commas, but a fixed width file relies on consistent text string lengths to keep all columns properly aligned.
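As a rough illustration of why the CSV survives, the Python sketch below misreads a made-up CSV row (the column order and values are assumptions, not an actual NHGIS record) as Latin-1 and then splits it on commas. The name comes out garbled, but the codes stay in their own fields.

```python
import csv
import io

# Made-up row for illustration: state, county, county name, tract.
row = "35,013,Doña Ana County,000102\n"

# Simulate software that reads UTF-8 bytes as Latin-1.
misread = row.encode("utf-8").decode("latin-1")

fields = next(csv.reader(io.StringIO(misread)))
print(fields)  # ['35', '013', 'DoÃ±a Ana County', '000102']
```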

If this is the cause of the problem, there are two ways to avoid it. You could revise your requests to use CSV format instead of fixed width, or you could check whether your software can read the original files with a different encoding (e.g., UTF-8 instead of Latin-1) and see if that fixes the issues.
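For the second option, the general idea is to state the encoding explicitly when reading the file. The Python sketch below is only an analogy for whatever option your software offers; the file name, the 20-character name field, and the column positions are all hypothetical.

```python
import os
import tempfile

# Hypothetical record: 20-character name field followed by a 3-digit county code.
line = "Doña Ana County".ljust(20) + "013" + "\n"

# Write a small UTF-8 file to stand in for a fixed width extract.
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
with open(path, "w", encoding="utf-8") as f:
    f.write(line)

def county_code(path, encoding):
    """Read the first record and return columns 21-23 (1-based)."""
    with open(path, encoding=encoding) as f:
        record = f.readline()
    return record[20:23]

print(repr(county_code(path, "latin-1")))  # ' 01' with the wrong encoding
print(repr(county_code(path, "utf-8")))    # '013' with the matching encoding
```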

Hello,

Thank you for the follow-up and for checking into the issue.

I’ve figured out the source of the discrepancy with Doña Ana County. Initially, I created tractid using:

egen tractid = concat(statea countya tracta)

However, for Doña Ana County, countya is coded as 01 (two digits), and cousuba holds the county subdivision code 3. I initially assumed countya alone should be 010, but when I use:

egen tractid = concat(statea countya cousuba tracta)

I get the correct tractid with a county code of 013, combining countya and cousuba.

As for the duplicates, it turned out to be a coding issue with the dictionary when downloading in fixed-width format. The field tracta was originally defined as:

str tracta 110-115

However, for Doña Ana County, tracta only had 5 digits, missing the final digit from the GEOID. This resulted in three observations where tracta was incorrectly listed as 00010 instead of the correct 000102, 000103, and 000104. Changing the dictionary to:

str tracta 110-116

resolved the issue by including the last digit and properly distinguishing these entries.

Thank you again for your assistance and insights!

I’m glad you identified a workaround, but this still sounds to me like the encoding issue I suggested in my last post. County subdivision codes should not be needed to form full county or tract identifiers. The most likely explanation for why part of the county code appears in the county subdivision field is that the ñ in the county name is being misinterpreted as two characters. In a fixed width data file, that shifts every later character in the row one position to the right, so the last digit of the county code (“3”) lands in the county subdivision column. The same shift would also push the last digit of the tract code out of its column.
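Here is a small Python sketch of that explanation. The column positions are invented for illustration (the real NHGIS dictionary uses different positions), but a one-character shift reproduces the same symptoms you described: a county code of 01, a stray 3 in the county subdivision field, and a tract code cut off at 00010.

```python
# Hypothetical layout as 0-based slices; real NHGIS column positions differ.
COUNTYA = slice(20, 23)   # 3-digit county code
COUSUBA = slice(23, 28)   # 5-character county subdivision code (blank for tracts)
TRACTA = slice(28, 34)    # 6-digit tract code

record = "Doña Ana County".ljust(20) + "013" + " " * 5 + "000102"

# Simulate reading the UTF-8 bytes as Latin-1: ñ becomes two characters.
misread = record.encode("utf-8").decode("latin-1")

def codes(rec):
    """Return (county, county subdivision, tract) with padding stripped."""
    return tuple(rec[s].strip() for s in (COUNTYA, COUSUBA, TRACTA))

print(codes(record))   # ('013', '', '000102')
print(codes(misread))  # ('01', '3', '00010')
```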

The solutions you identified should take care of this problem, but you’d need to be careful to apply them only to rows that contain a misinterpreted special character. I think you’d also need to apply similar corrections to all data columns (not just the geographic codes) to get accurate data. I’d also encourage you to consider the other options from my previous post.
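If you do go the row-by-row route, one way to find the affected rows is to flag any record that contains a non-ASCII character. This is a sketch in Python with made-up records (the 20-character name field is again an assumption); the same test works whether the name was read correctly or came through garbled, since both versions contain non-ASCII characters.

```python
# Made-up fixed width records: 20-character name field plus county code.
records = [
    "Bernalillo County".ljust(20) + "001",
    "Doña Ana County".ljust(20) + "013",
]

# The same records as a Latin-1 reader would see them.
misread = [rec.encode("utf-8").decode("latin-1") for rec in records]

def affected_rows(rows):
    """Return the indexes of rows that contain any non-ASCII character."""
    return [i for i, row in enumerate(rows) if not row.isascii()]

print(affected_rows(records))  # [1]
print(affected_rows(misread))  # [1]
```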