1980 Block Group Tabs != Blocks/Tracts?


I’m wondering why 1980 block group tabulations (aggregating the rural and urban records) don’t always match the counts I get by summing blocks up to block groups, and why summing the block groups doesn’t always reproduce the tract counts. I would expect this for suppressed fields, but it is happening for fields that are never suppressed, such as total population and occupied units.

I am only looking at tabulations from STF1. Are the block group numbers provisional or from an earlier release? The block group counts are fairly consistently lower than those derived from blocks or tracts.

Here is an example for Philadelphia Tract 0326. value_b is the sum of the block groups up to the tract, and value_t is the tract value. I only show columns that are never suppressed. Block and tract counts agree.



You’ve encountered an unfortunate consequence of a known bug that we have not yet corrected.

The original problem is that the 1980 STF1 Technical Documentation misidentifies which levels are components of the compound block group level in that file (blck_grp_01598 in NHGIS, or summary level 015 in the original Summary File). It indicates that the level represents the intersecting parts of places, county subdivisions (i.e., “MCDs” and “CCDs”), and block groups, as indicated in the scan below. Notice that the congressional district level isn’t included in the illustrated hierarchy, but in fact, the summary file does split block group records by congressional district.

The “GISJOIN” identifiers that NHGIS uses to uniquely identify data records are based on our metadata about which component levels are included in the level hierarchy. Because our level definition in this case omits congressional districts, the GISJOIN identifiers also don’t include a code for congressional districts. This means that for block groups that are split by congressional district boundaries, there are duplicate GISJOIN identifiers in NHGIS’s 1980 STF1 data. (Including a congressional district code would make the GISJOINs unique.)

When you request to have data for both the urban and rural breakdowns combined in one file, the NHGIS system uses the GISJOIN identifiers to join the data together. The join process unfortunately discards duplicate GISJOINs. In your specific example, the discarded records account for the difference between your block group totals and your tract totals.
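To make the mechanism concrete, here is a minimal Python sketch of how a join keyed on a non-unique identifier can silently lose records. It is not NHGIS's actual join code: the `join_on_key` helper, the keep-the-first-record behavior, the `pop` field, and the IDs and counts are all made up for illustration.

```python
def join_on_key(urban_rows, rural_rows, key="GISJOIN"):
    """Toy join: index each table by key, then pair up matching keys.

    A dict keyed on a non-unique ID holds only one row per ID (here the
    first one seen), so any further rows with the same ID are lost.
    """
    urban = {}
    for row in urban_rows:
        urban.setdefault(row[key], row)  # later duplicates are ignored
    rural = {}
    for row in rural_rows:
        rural.setdefault(row[key], row)
    joined = []
    for k in sorted(set(urban) | set(rural)):
        joined.append({
            key: k,
            "urban_pop": urban.get(k, {}).get("pop", 0),
            "rural_pop": rural.get(k, {}).get("pop", 0),
        })
    return joined


# Hypothetical data: block group "G_A" is split into two parts that
# share one ID (as happens when a splitting level is left out of the ID).
urban_rows = [
    {"GISJOIN": "G_A", "pop": 500},
    {"GISJOIN": "G_A", "pop": 40},
    {"GISJOIN": "G_B", "pop": 300},
]
rural_rows = [
    {"GISJOIN": "G_A", "pop": 0},
    {"GISJOIN": "G_A", "pop": 10},
    {"GISJOIN": "G_B", "pop": 25},
]

true_total = sum(r["pop"] for r in urban_rows + rural_rows)
joined = join_on_key(urban_rows, rural_rows)
joined_total = sum(r["urban_pop"] + r["rural_pop"] for r in joined)
print(true_total, joined_total)
```

In this toy case the joined total comes out lower than the true total, which is the same direction as the discrepancy reported in the question.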

The good news: The missing records will not be dropped if you choose to get the urban and rural breakdowns in separate files, which you can do by selecting this option on the Review and Submit page before submitting your data request:

Sorry for this inconvenience. Thanks for bringing this issue to our attention!


Fantastic! I’m very glad there is a workaround.

Thanks, Dan

@JonathanSchroeder, it appears that the block group records may be split on more than just congressional district. In fact, I don’t seem to be able to make a primary key out of any combination of geographic codes, although I can get very close.

If I combine gisjoin||cda||aianhha||indsubr||edinda||urb_areaa I’m still left with a small number of duplicates. For example, check out G1201010020999903034. Every single geographic id is equal, and yet there are two records, with different values for the non-geo fields.
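A uniqueness check of this kind can be sketched in Python as follows. The `duplicate_keys` helper is hypothetical, and the sample rows and the uppercase column names (`GISJOIN`, `CDA`) are assumptions, so substitute whatever your extract actually contains.

```python
from collections import Counter


def duplicate_keys(rows, columns):
    """Return {key_tuple: count} for key tuples that occur more than once."""
    counts = Counter(
        tuple(row.get(col, "") for col in columns) for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: "G_A" is distinguished by CDA, "G_B" is not.
rows = [
    {"GISJOIN": "G_A", "CDA": "01"},
    {"GISJOIN": "G_A", "CDA": "02"},
    {"GISJOIN": "G_B", "CDA": "01"},
    {"GISJOIN": "G_B", "CDA": "01"},
]

by_gisjoin = duplicate_keys(rows, ["GISJOIN"])
by_gisjoin_and_cd = duplicate_keys(rows, ["GISJOIN", "CDA"])
print(by_gisjoin)
print(by_gisjoin_and_cd)
```

An empty result means the chosen columns form a unique key. Anything left over, like `G_B` here, must be distinguished by some field that is not among the chosen columns. To run this on a real extract, the rows can be loaded with `csv.DictReader`.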




Thanks, @Daniel_Moulton. Good catch!

We hadn’t ever inspected these cases thoroughly to determine whether congressional districts were the only “missing level” in the hierarchy. You’re right that other levels must be included, too.

Looking at the example case you found, I also wasn’t able to find any other geographic codes in the NHGIS extract file that would distinguish the duplicate IDs. However, NHGIS extracts don’t include all of the original geographic codes, and our internal source files do contain all of the original codes. To avoid making extract files too noisy, we show only the fields that we supposed would be useful to users, but that’s not an exact science, and we’ve occasionally hidden some relevant fields. In this case, I found that one of the hidden fields, which identifies wards, is needed to distinguish these duplicate cases.

We’ll plan to add the ward field to NHGIS extracts for this dataset at some point in the future. Until then, here’s a listing of the cases that have a duplicate GISJOIN and distinct ward codes, along with total populations:

│       GISJOIN        │ WARD       │ POPULATION    │
│ G1201010020999903034 │            │ 4905          │
│ G1201010020999903034 │ 02         │ 94            │
│ G1201010020999903082 │            │ 15            │
│ G1201010020999903082 │ 02         │ 9             │
│ G1201010020999903148 │            │ 100           │
│ G1201010020999903148 │ 02         │ 0             │
│ G2601390100294002291 │ 01         │ 1045          │
│ G2601390100294002291 │ 02         │ 1282          │
│ G2601390100294002291 │ 03         │ 1460          │
│ G2601390100294002291 │ 04         │ 977           │
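If you only need one total per GISJOIN, the parts listed above can be summed. This is a minimal sketch that hardcodes the rows from the listing. It assumes the rows shown are the complete set of records for each of these GISJOINs, and that adding the parts is appropriate for the field in question (it is for counts such as total population).

```python
from collections import defaultdict

# (GISJOIN, ward code, total population), copied from the listing above.
# An empty string stands for a blank ward code.
records = [
    ("G1201010020999903034", "", 4905),
    ("G1201010020999903034", "02", 94),
    ("G1201010020999903082", "", 15),
    ("G1201010020999903082", "02", 9),
    ("G1201010020999903148", "", 100),
    ("G1201010020999903148", "02", 0),
    ("G2601390100294002291", "01", 1045),
    ("G2601390100294002291", "02", 1282),
    ("G2601390100294002291", "03", 1460),
    ("G2601390100294002291", "04", 977),
]

# Add up the ward-level parts that share a GISJOIN.
totals = defaultdict(int)
for gisjoin, ward, population in records:
    totals[gisjoin] += population

for gisjoin, total in totals.items():
    print(gisjoin, total)
```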

Great, thanks Jonathan