Repeated Census Tracts *Within* County

Hi all,

I’m constructing a gentrification index which requires comparing variable levels between census places and census tracts. My understanding is that census tracts are meant to be unique within counties (i.e., there shouldn’t be repeats).

However, my dataset, which I downloaded from NHGIS and which covers the entire US, has several hundred observations that share the same county FIPS code and tract code but have different values for the economic and demographic variables I’m interested in.

Has anyone else encountered this issue?

I’m, of course, happy to provide more specifics if that would be helpful.


Hi Kelsey,

Census tracts definitely should be unique within counties! Can you provide more details about the data you downloaded (e.g., year and/or dataset) so that we can help answer your question?

Dave Van Riper


Here’s the “Data Summary” info as listed in the codebook:

Time series layout: Time varies by column
Geographic level: Census Tract (by State–County)
Geographic integration: Nominal
Measurement times: 2000, 2006-2010, 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2016-2020, 2017-2021, 2018-2022

!WARNING! In a “Time varies by column” layout, each row provides statistics
from multiple censuses for areas that had a matching code across
time. For the Census Tract geographic level, matching codes may
refer to distinctly different areas in different censuses. We
strongly recommend checking GIS files to determine the geographic
consistency of your areas of interest for your period of interest.

Aside from 2000, each is a five-year ACS estimate. I also looked at how the repeats were distributed. Of the 420 duplicates in total, 234 are in Connecticut, 76 are in New Mexico, and 110 are in Puerto Rico. Perhaps there’s some clue in that?

Dear Kelsey,

I just downloaded a nominally integrated census tract dataset (total population table) for all the same years you did. I grouped the dataset by NHGISCODE and generated a count (to see what duplicates I could find). I didn’t actually find any duplicates when I did that analysis.

Can you provide a bit more detail about how you identified the duplicates (e.g., which fields in your extract you used to identify them)?
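For anyone wanting to reproduce the check described above (group by NHGISCODE and count), here is a minimal sketch in Python using only the standard library. It is an illustration, not the code actually used in this thread: the helper name `duplicate_keys`, the sample rows, and every column name other than NHGISCODE are assumptions made up for the example.

```python
import csv
import io
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up sample standing in for an extract; the values are illustrative only,
# and STATE, COUNTYA and TRACTA are assumed column names.
sample = io.StringIO(
    "NHGISCODE,STATE,COUNTYA,TRACTA\n"
    "G0900010010101,Connecticut,001,010101\n"
    "G0900010010102,Connecticut,001,010102\n"
    "G0900010010102,Connecticut,001,010102\n"
)
rows = list(csv.DictReader(sample))

# Group on the identifier and keep only keys that appear more than once.
dups = duplicate_keys(rows, ["NHGISCODE"])
print(dups)
```

Passing a different list of fields, for example the county and tract code columns, shows whether the duplicates depend on which identifier is used to define a unique tract.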


Hi Dave,

Sure thing:

I’m using the following variables in my analysis:

  • AV0AA - Total Population
  • CV4AA - Persons by Hispanic or Latino Origin [2] by Race (White non-Hispanic)
  • B69AC - Persons 25 Years and Over by Educational Attainment w/ BA or higher
  • B79AA - Median Household Income in Previous Year
  • CL66A - Persons below Poverty Level in Previous Year

Below I’m also including an example of how the data look in Stata. While not shown in this picture, not all observations have missing values for the years between 2000 and the five-year 2022 estimates.

Dear Kelsey,

We just tracked the problem down. Our fixed width files have incorrect county FIPS codes in CT, and character encoding in New Mexico and Puerto Rico (county names with accents or tildes) seems to be causing issues when loading fixed width files into Stata. A side effect is a bad value in CV4AA2000, which then affects the rest of the data for those records. It looks like the problem arises when Stata imports the .dat file using the widths in the .do file.
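To illustrate how an accented county name could throw off a fixed width read, here is a small Python sketch of one plausible mechanism: the file is laid out by character count, but the reader slices by byte count, so a two-byte character such as "ñ" shifts every later field by one position. This mechanism is an assumption and has not been confirmed as the actual cause here; the field widths, the helper names, and the tract and value data are invented for the example.

```python
# Hypothetical column widths; not the widths in the real NHGIS .do file.
NAME_W, TRACT_W, VALUE_W = 12, 6, 8

def make_record(name, tract, value):
    """Build one fixed-width line, padding each field by character count."""
    return f"{name:<{NAME_W}}{tract:>{TRACT_W}}{value:>{VALUE_W}}"

def split_fields(data):
    """Slice str or bytes at the fixed offsets and return stripped text fields."""
    cuts = [
        (0, NAME_W),
        (NAME_W, NAME_W + TRACT_W),
        (NAME_W + TRACT_W, NAME_W + TRACT_W + VALUE_W),
    ]
    parts = [data[a:b] for a, b in cuts]
    if isinstance(data, bytes):
        # Byte slices must be decoded back to text before comparison.
        parts = [p.decode("utf-8", errors="replace") for p in parts]
    return tuple(p.strip() for p in parts)

record = make_record("Doña Ana", "000100", "12345")

by_chars = split_fields(record)                  # offsets counted in characters
by_bytes = split_fields(record.encode("utf-8"))  # offsets counted in UTF-8 bytes

print(by_chars)  # tract and value come out as written
print(by_bytes)  # tract and value are each shifted by one byte
```

With an all-ASCII name the two reads agree, which would be consistent with the problem showing up only in places whose county names contain accents or tildes.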

We’re going to look into this further, but if you re-submit your extract as a CSV file, you’ll get a dataset that can be read in using import delimited in Stata. It will read the variables in correctly and eliminate the duplicates that are introduced when reading in a fixed width file using the .do file.
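As a rough illustration of why a delimited file sidesteps this kind of problem, the Python sketch below parses a tiny CSV in which one county name contains "ñ". Fields are split on commas rather than at fixed offsets, so the extra byte in the name does not move the neighboring values. The sample rows, the COUNTY and TRACTA column names, and the UTF-8 encoding are all assumptions for the example; the encoding of an actual extract should be checked rather than assumed.

```python
import csv
import io

# Made-up CSV content; CV4AA2000 follows the column name mentioned above,
# but the values are illustrative only.
raw = (
    "COUNTY,TRACTA,CV4AA2000\n"
    "Doña Ana,000100,12345\n"
    "Adams,000200,678\n"
).encode("utf-8")

# Decode with an explicit encoding, then let the csv module split on commas.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="") as fh:
    rows = list(csv.DictReader(fh))

for row in rows:
    print(row["COUNTY"], row["TRACTA"], row["CV4AA2000"])
```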


Quick work!

Thanks, Dave. I’ll re-download as CSV. That should be sufficient for this project; however, would you mind letting me know when the issue with the fixed width .do file is fixed? Just for future reference.


Yes, I will update this thread when we have the fixed width do file corrected.