Joining longitudinal census data with spatial data

I am working with a dataset containing locations at different spatial levels (county, tract, block group). Each location is associated with a FIPS code and a specific year (between 1910 and 2010).

For each location, I would like to join several longitudinal (1910 - present) variables (e.g. population) from NHGIS. So if a record is associated with the year 1950, but data for that variable is available for the entire period (1910 - present), I would like to join data from all available years (e.g. population for that location in 1910, 1920, 1930, etc.)

I am uncertain how best to do this given the inconsistency of boundaries over time.

I understand that I can join NHGIS census data with NHGIS spatial boundary files that have been standardized to specific years (2000 or 2008) using the GIS_JOIN code. What I am unclear about is how the GIS_JOIN codes function over time. For instance, does a GIS_JOIN code associated with 1950 data refer to the same area as the same GIS_JOIN code in a 1990 dataset, or are the GIS_JOIN codes not consistent over time? It appears that the latter may be the case, but I am not certain.

Additionally, can you explain the difference between the “NHGISCODE” and “GISJOIN” or “GJOIN1XXX”?

Thank you!

First, to clarify… you say, “I understand that I can join NHGIS census data with NHGIS spatial boundary files that have been standardized to specific years (2000 or 2008) using the GIS_JOIN code.”

NHGIS boundary files are not “standardized to specific years.” Rather, they correspond to different versions of the Census Bureau’s TIGER/Line files. TIGER/Line files include representations of many types of features, including census reporting areas like tracts and blocks, but also roads, railroads, water features, and administrative boundaries. The Bureau has continually improved the TIGER/Line files, so different “vintages” of the TIGER/Line files will represent the same feature differently. In particular, there was a major accuracy improvement program between 2000 and 2008, so the 2008 TIGER/Line representations are much more accurate. E.g., a river that appears in both the 2000 and 2008 TIGER/Line files is likely to have a much more accurate representation in the 2008 version.

Accordingly, our boundary files based on 2008 TIGER/Line files generally have greater positional accuracy than those based on 2000 TIGER/Line files. For example, NHGIS has two versions of 1970 census tract boundaries, one based on features in the 2000 TIGER/Line file and another based on features in the 2008 TIGER/Line file. If part of a 1970 tract’s boundary follows a river, the 2000-based version will follow the 2000 TIGER/Line version of the river, and the 2008-based version will follow the more accurate 2008 TIGER/Line version of the river. The 2008-based version will also correspond better with later boundaries based on 2010 TIGER/Line files.

There’s more information on how we derive historical shapefiles from TIGER/Line files in our GIS Files documentation with separate sections for the files based on 2000 TIGER/Line and 2008 TIGER/Line.

Both NHGIS versions of the 1970 tract boundaries (2000- and 2008-based) represent 1970 census tract boundaries, corresponding to original 1970 tract summary data, and using 1970 tract identifiers. You can join 1970 tract data to the 1970 boundaries using the GISJOIN IDs, but these boundaries won’t correspond consistently with any other years’ tract data or GISJOINs.
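To make the same-year rule concrete, here is a minimal sketch in plain Python. The GISJOIN values, field names, and the `join_same_year` helper are all made up for illustration; they are not real NHGIS identifiers or part of any NHGIS tool. The helper refuses to join data and boundaries from different census years, since a matching ID across years does not guarantee a matching area.

```python
def join_same_year(data_rows, data_year, boundary_rows, boundary_year):
    """Join data rows to boundary rows on GISJOIN, but only within one census year.

    Hypothetical helper: a GISJOIN is only meaningful for the year it was issued.
    """
    if data_year != boundary_year:
        raise ValueError(
            f"{data_year} data cannot be joined to {boundary_year} boundaries by GISJOIN"
        )
    boundaries = {b["GISJOIN"]: b for b in boundary_rows}
    joined, unmatched = [], []
    for row in data_rows:
        match = boundaries.get(row["GISJOIN"])
        if match is None:
            unmatched.append(row["GISJOIN"])  # keep track of IDs with no boundary
        else:
            joined.append({**match, **row})
    return joined, unmatched


# Made-up records standing in for a 1970 tract table and 1970 tract boundaries.
tract_data_1970 = [
    {"GISJOIN": "G_1970_A", "population": 4100},
    {"GISJOIN": "G_1970_B", "population": 2750},
]
tract_shapes_1970 = [
    {"GISJOIN": "G_1970_A", "shape": "polygon A"},
    {"GISJOIN": "G_1970_B", "shape": "polygon B"},
]

joined, unmatched = join_same_year(tract_data_1970, 1970, tract_shapes_1970, 1970)
```

It makes no difference here whether the 1970 boundaries are the 2000-based or 2008-based version, because both use the same 1970 identifiers.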

We provide geographically standardized time series tables of 1990-2020 data for 2010 census areas, and we provide geographic crosswalks that enable users to standardize some other types of data in the 1990-2020 range. We have no resources specifically designed to facilitate standardization over a longer range. In my doctoral research, I developed and assessed methods to standardize 1950-2000 census tract data. (See Chapter 3 in my dissertation.) Jeffrey Lin has used simpler methods to produce a standardized dataset of 1880-2010 data here. Scott Markley & others have developed a dataset of housing and urbanization estimates for 1940-2019, described here.
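As a rough illustration of how a geographic crosswalk is typically applied, the sketch below apportions source-year counts to target-year areas using interpolation weights. The field names (`source_id`, `target_id`, `weight`), the IDs, and the numbers are all hypothetical; actual crosswalk files use their own column names and should be checked against their documentation.

```python
from collections import defaultdict


def apportion(source_counts, crosswalk):
    """Allocate each source area's count to target areas in proportion to its weights."""
    target_counts = defaultdict(float)
    for row in crosswalk:
        count = source_counts.get(row["source_id"], 0)
        target_counts[row["target_id"]] += count * row["weight"]
    return dict(target_counts)


# Hypothetical 1990 counts for two source areas.
counts_1990 = {"A": 1000, "B": 400}

# Hypothetical crosswalk: area A is split between X and Y; area B falls entirely in Y.
crosswalk = [
    {"source_id": "A", "target_id": "X", "weight": 0.75},
    {"source_id": "A", "target_id": "Y", "weight": 0.25},
    {"source_id": "B", "target_id": "Y", "weight": 1.0},
]

estimates_2010_areas = apportion(counts_1990, crosswalk)
```

When each source area's weights sum to 1, the total count is preserved, which is a useful sanity check on any standardized result.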

To answer your last question, if I recall correctly, the NHGISCODE and GJOINXXXX fields appear only in nominally integrated time series tables for census tracts, places, and county subdivisions. The GJOIN fields identify the GISJOIN associated with each individual year’s areas (which corresponds to the GISJOIN in the same year’s boundary files), and NHGISCODE is an “integrated” code, consistent for all entities we have linked together across time. For example, a place may have a different GISJOIN in 1980 and 1990, but if its name remained the same, we would link its 1980 and 1990 data together in a nominally integrated table with a single NHGISCODE identifying that place across time.
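The relationship between these fields can be sketched with a small, made-up table. The `GJOIN1980`/`GJOIN1990` columns follow the GJOINXXXX pattern described above, but the ID values and the `POP1980`/`POP1990` column names are hypothetical stand-ins, and `to_long` is an illustrative helper rather than part of any NHGIS tool. It reshapes one row per NHGISCODE into one row per place-year, carrying the year-specific GISJOIN that would be used to join to that year's boundary file.

```python
# Made-up nominally integrated table: one row per linked place.
table = [
    {"NHGISCODE": "P001", "GJOIN1980": "G1980A", "GJOIN1990": "G1990Q",
     "POP1980": 5200, "POP1990": 6100},
    # This place has no 1980 record, so its 1980 fields are empty.
    {"NHGISCODE": "P002", "GJOIN1980": None, "GJOIN1990": "G1990R",
     "POP1980": None, "POP1990": 800},
]


def to_long(rows, years):
    """Return one record per place-year, skipping years where the place has no GISJOIN."""
    long_rows = []
    for row in rows:
        for year in years:
            gisjoin = row.get(f"GJOIN{year}")
            if gisjoin is None:
                continue
            long_rows.append({
                "NHGISCODE": row["NHGISCODE"],  # stable across years
                "year": year,
                "GISJOIN": gisjoin,             # specific to this year's boundary file
                "pop": row[f"POP{year}"],
            })
    return long_rows


long_table = to_long(table, [1980, 1990])
```

Note that a shared NHGISCODE only means the records were linked by name; it does not mean the place covered the same area in both years.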

Thank you!