Demographic Estimates Across Geo-Units

I’m curious if anyone has quick thoughts/guides on how best to test whether demographic estimates across geographic units are statistically significant. If this question is a bit too broad in scope I totally understand. :slight_smile: Thank you!

As an example, let’s assume I pull ACS data with county-level indicators and information on respondent race. Imagine I use data from the 2019 ACS subset to Virginia. I find that in County A, 25.7% of individuals identify as Black or African American (using the RACBLK variable), while County B is at 28%. This is obviously a made-up example, but I’m particularly interested in cases where estimates like these are close.

If I wanted to test whether the difference in demographics between these two counties is significant, what method(s) would be best suited given the structure of ACS data (specifically the IPUMS extracts)? My thoughts:

T-test: I am using the survey package (Survey Data Analysis with R). I know there is a svyttest() function, and that it would be a two-sample unpaired test, but I can’t quite wrap my head around the best way to set up the call for comparing sub-populations of the same larger dataset.

Bootstrap-type approach: Another thought was to bootstrap it: take a large number of random samples from each county (accounting for person weight) and then compare the distributions of means from those samples manually via a t-test.
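For what it’s worth, the manual bootstrap idea might look roughly like the sketch below. All names (`df`, `RACBLK_IND`, the county codes) are hypothetical, and note that this resampling ignores the complex sample design (strata and clustering), which is one reason design-aware variance methods are generally preferred.

```r
# Rough sketch of the manual weighted-bootstrap idea (all names hypothetical).
# Assumes a data frame `df` with a 0/1 indicator RACBLK_IND, a county
# identifier COUNTYFIP, and person weights PERWT.
boot_means <- function(d, n_boot = 1000) {
  replicate(n_boot, {
    # resample rows with probability proportional to person weight
    idx <- sample(nrow(d), size = nrow(d), replace = TRUE,
                  prob = d$PERWT / sum(d$PERWT))
    mean(d$RACBLK_IND[idx])
  })
}

a <- boot_means(df[df$COUNTYFIP == 1, ])  # county codes are placeholders
b <- boot_means(df[df$COUNTYFIP == 3, ])
t.test(a, b)  # compare the two bootstrap distributions of means
```

Because the resampling treats each county as a simple weighted random sample, the resulting spread can understate the true sampling variability.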

Z-score: I was also referred to the following link - Statistical Testing Tool - which substantively seems to fit what I’m looking for perfectly. However, it requires that you pull data from Census tables and doesn’t work as easily for ACS extracts. Furthermore, for cases where we’re creating our own geographic units, I’d still want to be able to test across units.
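For reference, the Z test that the tool performs is simple enough to compute by hand once you have each estimate and its standard error. A minimal sketch with made-up numbers matching the example above (the SEs are hypothetical):

```r
# Two-sample Z test on two survey estimates (all values hypothetical,
# matching the made-up County A / County B example).
p_a <- 0.257; se_a <- 0.012   # County A proportion and its SE
p_b <- 0.280; se_b <- 0.015   # County B proportion and its SE

z <- (p_a - p_b) / sqrt(se_a^2 + se_b^2)  # Z statistic
p_value <- 2 * pnorm(-abs(z))             # two-sided p-value
# here |z| is about 1.2, p about 0.23: not significant at the 5% level
```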

Any/all thoughts would be appreciated.

After more digging, it seems my best bet is to use the Statistical Testing Tool (linked above). You can rework the tables to use standard errors instead of margins of error (directions are included in the Excel document/tool itself).
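The conversion itself is straightforward: published ACS margins of error are at the 90% confidence level, so dividing by 1.645 recovers the standard error. A one-line sketch (the MOE value is hypothetical):

```r
# ACS MOEs are published at the 90% confidence level;
# divide by 1.645 to recover the standard error.
moe <- 0.031          # hypothetical published margin of error
se  <- moe / 1.645    # standard error, approximately 0.0188
```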

So for instance, you can do something like:

svyby(~percentRACBLK, ~county, svy, svymean, na.rm=T)
(where these are recoded variables from the ACS extract)

This will output a series of means for your variable of interest by county, along with standard errors. You can then input this information into the Excel document to run Z-score tests.

If anyone has experience with the survey package/etc. and has a more streamlined solution - or critiques of this workaround - I would appreciate it.

I’m surprised at how small the SEs are, but I suppose that is to be expected when just looking at respondent demographics (like race, gender, etc.); one should expect larger SEs for estimates of things like income.

I have a couple of suggestions:

First, to get correct standard errors with either the svyttest() or the svyby() followed by the Census tool, you should be using replicate weights. The way to incorporate these in R using the survey package is as follows (based on this thread):

svy <- svrepdesign(data = usa_00002, weights = ~PERWT, repweights = "REPWTP[0-9]+",
    type = "JK1", scale = 4/80, rscales = rep(1, 80), mse = TRUE)

I believe the correct syntax for svyttest() would then be:

svyttest(percentRACBLK~COUNTYFIP, svy)

I’m not sure what this will look like with lots of counties. You might need to make a subset of your data with just the two counties. Also note that not all counties are identified in the ACS microdata.
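One possible way to handle the many-counties issue, sketched under the assumption that COUNTYFIP codes 1 and 3 are the two counties of interest (the survey package’s subset() method keeps the design information intact):

```r
# Restrict the replicate-weight design to the two counties being compared,
# since svyttest() needs a two-level grouping variable.
# The FIPS codes and the svy object are placeholders from the thread above.
svy2 <- subset(svy, COUNTYFIP %in% c(1, 3))
svyttest(percentRACBLK ~ factor(COUNTYFIP), svy2)
```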

Hi Matthew,

First off - thank you! I appreciate the sample code and additional reading.

Does IPUMS have suggestions on when to use person weights vs. replicate weights (or is that not even a valid comparison)?

For instance, I was using code from this IPUMS thread provided by an IPUMS Staff Member - Does anyone have sample code for using svydesign function in R? - #2 by gfellis


svy <- svydesign(ids = ~CLUSTER, weights = ~PERWT, strata = ~STRATA, data = data, nest = TRUE, check.strata = FALSE)
svymean(~HISPAN, svy)

This also gives you standard errors, but it’s not clear to me when to use each method.
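One way to see how the two approaches differ in practice is to build both designs on the same extract and compare the SEs they produce. A hedged sketch, with object and variable names mirroring the snippets earlier in this thread:

```r
library(survey)

# Taylor-series linearization design using CLUSTER/STRATA
svy_tsl <- svydesign(ids = ~CLUSTER, weights = ~PERWT, strata = ~STRATA,
                     data = data, nest = TRUE, check.strata = FALSE)

# Replicate-weight design using the 80 REPWTP variables
svy_rep <- svrepdesign(data = data, weights = ~PERWT,
                       repweights = "REPWTP[0-9]+", type = "JK1",
                       scale = 4/80, rscales = rep(1, 80), mse = TRUE)

svymean(~HISPAN, svy_tsl)  # SEs via linearization
svymean(~HISPAN, svy_rep)  # SEs via replicate weights
```

Both designs use PERWT as the estimation weight, so the point estimates should match; they differ only in how the variance (and hence the SE) is computed.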