Discrepancy between manually crosswalked and time series data

I am crosswalking several long-form Census datasets between 1990, 2000, and 2010. To make sure I am doing this properly, I tested my approach by applying the exact same methodology to a short-form dataset for which a time series is already available. Please check my logic here, but if I am doing the crosswalk correctly, then my values for 1990 and 2000 should match those in the time series, right? While the 1990 data match perfectly, the 2000 data do not (see below). I’m wondering if anyone has some insight into why this discrepancy is emerging.

Here’s my general approach:

  1. Download time series data for Total Population, Standardized to 2010 by Block Group (“nhgisXXXX_ts_geog2010_blck_grp.csv”).

  2. Download 1990 population data for block group parts, or BGPs (“nhgisXXXX_ds120_1990_blck_grp_598.csv”), and 2000 population data for BGPs (“nhgisXXXX_ds152_2000_blck_grp_090.csv”).

  3. Perform crosswalk analysis using the appropriate crosswalk lookup tables (“nhgis_bgp1990_bg2010_49.csv” and “nhgis_bgp2000_bg2010_49.csv”). You’ll notice from the “49” that I’m just working on Utah data. This is done in R (happy to share code if it will help…). Basically, I just (1) perform a join between lookup tables and block group part data, (2) multiply population totals by population weights, and (3) sum up the resulting weighted population values per block group.

  4. Compare crosswalked values to time series values.
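
The join, weight, and sum logic of step 3 and the comparison in step 4 can be sketched roughly as follows. This is a minimal Python illustration, not the R code used for the analysis. The function names, zone IDs, counts, and weights are hypothetical toy values and are not taken from the NHGIS files.

```python
from collections import defaultdict


def crosswalk_totals(source_counts, crosswalk_rows):
    """Allocate source-zone counts to target zones using interpolation weights.

    source_counts: dict mapping a source zone ID (e.g. a block group part)
        to its population count.
    crosswalk_rows: iterable of (source_id, target_id, weight) tuples, where
        weight is the share of the source zone's population assigned to the
        target zone. These field names are assumptions for illustration.
    """
    totals = defaultdict(float)
    for source_id, target_id, weight in crosswalk_rows:
        # Join: look up the count for this source zone (absent zones add 0).
        count = source_counts.get(source_id, 0)
        # Weight the count, then accumulate it per target zone.
        totals[target_id] += count * weight
    return dict(totals)


def compare(estimates, reference, tolerance=0.5):
    """Return zones whose estimate differs from the reference by more than tolerance."""
    mismatches = {}
    for zone, ref_value in reference.items():
        est_value = estimates.get(zone, 0.0)
        if abs(est_value - ref_value) > tolerance:
            mismatches[zone] = (est_value, ref_value)
    return mismatches


# Toy data: two source zones, two target zones (all values made up).
bgp_pop = {"A": 100, "B": 200}
xwalk = [("A", "X", 1.0), ("B", "X", 0.25), ("B", "Y", 0.75)]

estimates = crosswalk_totals(bgp_pop, xwalk)
mismatches = compare(estimates, {"X": 150, "Y": 160})
print(estimates)
print(mismatches)
```

In this toy case the weights for each source zone sum to 1, so the total population is preserved by the allocation. Checking that property on the real lookup table is a cheap sanity test before comparing against the time series.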

Any help or insight would be great! Thanks in advance.

Hi Michael,

I’d guess that the root cause of the discrepancies you’ve discovered is that published census counts differ somewhat between short- and long-form sources. NHGIS standardized time series are based entirely on short-form block data. Your 1990 estimates agree with NHGIS estimates because you obtained your source 1990 BGP data from 1990 STF1, which is based on the 1990 short form. But your source 2000 BGP data come from 2000 SF3, which is based on the 2000 long form. The long form was sent to only a sample of households (as the American Community Survey is now), so summary data based on the long form are estimates, even for Total Population.

It’s good you were able to complete your check with 1990 short-form data. That indicates to me that your process is sound, as are our crosswalks! Unfortunately, there’s not a straightforward way to complete the same check for 2000 data because there are two 2000 BGP levels: one used in 2000 SF1 with short-form data (blck_grp_091, which doesn’t include urban area boundaries) and another used in 2000 SF3 with long-form data (blck_grp_090, which does include urban area boundaries). Our 2000 BGP crosswalks are designed for the latter, so there’s not an easy way to use them with 2000 short-form data for BGPs.

I’ll try to add some text to the crosswalks page at some point to explain this issue. It had occurred to me that the discrepancies could cause some trouble for us (and/or our users) when we begin creating time series for long-form data, and I haven’t yet decided what we’ll do about that!

Thanks, Jonathan, for the quick and very helpful response!