Hello,
I am using the NHGIS to do an analysis of both housing units and population at the block level, including breaking out by race and ethnicity. I’m using the 2020_DHCa data for both, getting population by race and Hispanic or Latino Origin from table U7L and occupied housing units by tenure and race or ethnicity from tables U96 (Hispanic or Latino), U97 (non-Hispanic white), U98 (non-Hispanic Black), U99, and VAA-VAD.
When I merge these at the block level, I’m seeing 90,131 blocks out of 8,174,955 (so 1.1%) where the population is 0 but there is at least 1 occupied housing unit. If I break out by race, the discrepancies are much larger: I see 403,718 blocks with at least 1 Black householder but 0 Black population, out of 1,664,721 total blocks with at least 1 Black householder (so about 24%).
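A check of this kind can be sketched as below. This is only an illustration under stated assumptions, not the actual script behind the numbers above: the field names (`pop_total`, `occ_hu_total`, `pop_black`, `hh_black`) are hypothetical stand-ins for the NHGIS table columns, the three block records are made up, and the denominator here is the set of blocks with at least one unit of the given type.

```python
# Hypothetical merged block records. Field names are illustrative,
# not actual NHGIS column codes, and the values are made up.
blocks = [
    {"gisjoin": "A", "pop_total": 0, "occ_hu_total": 2, "pop_black": 0, "hh_black": 1},
    {"gisjoin": "B", "pop_total": 35, "occ_hu_total": 12, "pop_black": 0, "hh_black": 3},
    {"gisjoin": "C", "pop_total": 50, "occ_hu_total": 20, "pop_black": 8, "hh_black": 4},
]


def count_discrepancies(rows, unit_field, pop_field):
    """Return (flagged, eligible).

    eligible: blocks with at least one unit in unit_field.
    flagged:  eligible blocks that report zero people in pop_field.
    """
    eligible = [r for r in rows if r[unit_field] >= 1]
    flagged = [r for r in eligible if r[pop_field] == 0]
    return len(flagged), len(eligible)


# Occupied housing units present, but zero total population.
total_flagged, total_eligible = count_discrepancies(blocks, "occ_hu_total", "pop_total")

# At least one Black householder, but zero Black population.
black_flagged, black_eligible = count_discrepancies(blocks, "hh_black", "pop_black")

print(total_flagged, total_eligible)
print(black_flagged, black_eligible)
```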
I am assuming that this is a consequence of the differential privacy algorithm used in the 2020 Census. Is that likely to be correct? I was confused because I thought the synthetic data were supposed to be internally consistent (i.e. if a Black householder lives in a given block, a Black person should also live there).
Looking at the Census DHC documentation (https://assets.nhgis.org/original-data/modern-census/2020Census_DHC_TechDoc.pdf, page 4-2), it appears that the housing unit counts in each block are held invariant. Given that, is the housing unit data more likely to be accurate in cases where there are discrepancies? Or is it too hard to tell?
Thank you!
Your assessment of the source of the problem is correct: these are the types of discrepancies that I would expect given the 2020 Census’s Disclosure Avoidance System (DAS), based on differential privacy.
The system preserves only certain types of “internal consistency”: sums of geographic subtotals (e.g., block-level populations) should match totals for the encompassing areas (e.g., tract-level populations), and sums for population subgroups (e.g., population by race) should match totals for the encompassing population groups (e.g., total population). There is no enforced consistency, however, between household- or housing-level counts and person-level counts. The Bureau was unable to develop an algorithm that would maintain this consistency while also maintaining differential privacy for the main DHC. The last release of 2020 Census data, the “Supplemental Demographic and Housing Characteristics File”, includes more accurate data involving associations between households and the people within them, but it’s limited to nation- and state-level tables.
Unfortunately, I generally wouldn’t trust any 2020 block-level statistics for specific race groups. The final disclosure avoidance algorithm allocated extremely little of the “privacy-loss budget” (PLB) to block-level statistics, meaning that these statistics are subject to high levels of noise. I believe the Bureau generally recommends using block-level 2020 statistics only by aggregating them to larger areas. I think they may be useful in some other circumstances, but it’s difficult to say, outside of one specific setting: the total housing counts are invariant, as you noted, so block-level counts of total housing units are accurate. But the counts for any subgroup of housing units are noisy, and at the block level, these may be as unreliable as any other counts.
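For the aggregation approach mentioned above, the mechanics might look something like the following. It is a minimal sketch that assumes each block record already carries a code for the larger target area; the `place` field, the other field names, and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical block records tagged with the larger area they fall in.
blocks = [
    {"place": "P1", "pop_black": 0, "hh_black": 1},
    {"place": "P1", "pop_black": 7, "hh_black": 2},
    {"place": "P2", "pop_black": 0, "hh_black": 0},
    {"place": "P2", "pop_black": 3, "hh_black": 1},
]


def aggregate(rows, key, fields):
    """Sum the given count fields over all rows that share the same key."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for r in rows:
        for f in fields:
            totals[r[key]][f] += r[f]
    return dict(totals)


by_place = aggregate(blocks, "place", ["pop_black", "hh_black"])
print(by_place)
```

The first block in this made-up example has a householder but no matching population, and that mismatch is no longer visible once the counts are summed to the place level. This shows only the bookkeeping; it says nothing about how accurate the resulting totals are.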
So to answer your last question: I think it’s “too hard to tell”. The Bureau provides this page with a range of info about the DAS, which might get you to a more definitive answer.
Got it–thank you for confirming and giving this info!
My ultimate analysis is going to be at the level of census places, county subdivisions, and school districts. Looking at the privacy budget allocation tables at the second link you gave, it seems like a much larger portion of the privacy budget went to Tract Subsets and Population Estimates Primitive Geographies, which nest into the geographies at which official population estimates are produced.
My understanding based on this website is that places and county subdivisions are mostly or entirely included in the Population Estimates Program. If that is correct, are statistics by race at the county subdivision level more likely to be usable? Would that be less true for county subdivisions that are smaller in population?
Thank you again!
I haven’t been able to find any more detailed info about the “Population Estimates Primitive Geographies” than what you’ve found. One additional clue is on this page, which indicates that the “city and town” estimates correspond to incorporated places and minor civil divisions (MCDs). Incorporated places are a subset of all census places, which also include unincorporated “census designated places.” And MCDs are a subset of all county subdivisions, including only the subdivisions that have a governmental or administrative function. (This site provides a nice overview of the different types of county subdivisions.)
In short, you can expect greater accuracy for certain types of places and county subdivisions–namely, incorporated places and MCDs–but it’s difficult to say how much greater!
And yes, in relative terms, the data will generally be more accurate for large populations than for small ones. That is, the distribution of noise added at a given geographic level does not vary by population size, so the errors will be greater relative to small populations than they are relative to large populations. This is evident in the “summary metrics” provided by the Bureau.
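A small simulation can illustrate the point about relative error. This is a rough sketch under loud assumptions: the noise here is a rounded Gaussian clamped at zero with a made-up scale (`SIGMA`), which is a simplified stand-in and not the Bureau’s actual DAS or its post-processing; the true counts of 20 and 2,000 are likewise arbitrary.

```python
import random


def mean_relative_error(true_count, sigma, draws, rng):
    """Average of |noisy - true| / true over repeated noisy draws.

    The same noise scale (sigma) is used regardless of true_count.
    """
    total = 0.0
    for _ in range(draws):
        # Stand-in noise: rounded Gaussian, clamped so counts stay >= 0.
        noisy = max(0, round(true_count + rng.gauss(0, sigma)))
        total += abs(noisy - true_count) / true_count
    return total / draws


rng = random.Random(0)  # fixed seed so the run is repeatable
SIGMA = 5.0             # hypothetical noise scale, same for both areas

rel_small = mean_relative_error(20, SIGMA, 10_000, rng)
rel_large = mean_relative_error(2_000, SIGMA, 10_000, rng)

print(f"small area: {rel_small:.3f}  large area: {rel_large:.4f}")
```

Because the absolute size of the noise is the same in both cases, dividing it by a true count of 2,000 rather than 20 makes the relative error far smaller for the larger area.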
Got it–thank you for looking into this! I feel much better about my understanding of what’s going on, and moderately better about the possibilities for my analysis…