Calculating group medians in R (ACS 5yr 2018-2022)

Hi everyone,

I am attempting to calculate a table of median household income by units in structure and year built. You can find the table here:

My values do not match this individual’s calculations, so I am worried that my workflow for ACS microdata is incorrect (I use R).

I have the microdata CSV with the necessary variables. My question comes down to: is grouping the data and calculating a weighted median within each group sufficient? Perhaps I am second-guessing myself, but the workflow seems quite simple…
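For context, here is a minimal base R sketch of that group-then-weighted-median workflow. The data are made up, the column names (HHINCOME, HHWT, UNITSSTR) follow IPUMS conventions but are assumptions here, and the helper uses one simple "lower" weighted-median definition (smallest value whose cumulative weight reaches half the total); as discussed below, other definitions exist.

```r
# Toy stand-in for an IPUMS household extract (column names are assumptions)
dat <- data.frame(
  HHINCOME = c(50000, 60000, 70000, 40000, 90000, 80000),
  HHWT     = c(10, 20, 10, 15, 5, 10),
  UNITSSTR = c("sfd", "sfd", "sfd", "sfa", "sfa", "sfa")
)

# A simple "lower" weighted median: the smallest x whose cumulative
# weight reaches half the total weight (one of several definitions)
wmed <- function(x, w) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

# Weighted median of household income within each structure-type group
res <- tapply(seq_len(nrow(dat)), dat$UNITSSTR,
              function(i) wmed(dat$HHINCOME[i], dat$HHWT[i]))
res
```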

Here is an image of my R code:
[image: R code]

Have you limited your sample to California (STATEFIP == 06)? PUMA codes are unique only within each state. There are several PUMAs with code 00101 in different states.
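To illustrate why the state filter matters, here is a toy base R example; the data are made up, and the column names follow IPUMS conventions (STATEFIP 6 is California, 36 is New York).

```r
# Made-up rows: two California households and one New York household,
# all sharing PUMA code 101 (PUMA codes repeat across states)
dat <- data.frame(
  STATEFIP = c(6, 6, 36),
  PUMA     = c(101, 101, 101),
  HHINCOME = c(55000, 72000, 61000)
)

# Filtering on PUMA alone would keep all three rows;
# filtering on state *and* PUMA keeps only the California households
ca <- subset(dat, STATEFIP == 6 & PUMA == 101)
nrow(ca)  # 2
```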

Apologies, I should have mentioned that my extract request was already filtered to California data with the “select cases” option.

I found some additional instructions for working with household-level data in an earlier thread:

As they suggest, I changed my filter to exclude group quarters:
[image: R code]

However, I noticed that the variable PERNUM is missing despite my specifying it in the extract. Are there other variables one should consider for household-level analysis?

It’s possible that your respective software packages use different formulas for calculating the weighted median. Even without weights, there are multiple approaches to calculating medians and other quantiles.
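You can see this directly in base R, where quantile() implements nine different quantile definitions via its type argument, and even the unweighted median can differ between them on the same data:

```r
x <- c(1, 2, 3, 4)

median(x)                   # 2.5 (averages the two middle values)
quantile(x, 0.5, type = 1)  # 2   (inverse of the empirical CDF, no averaging)
quantile(x, 0.5, type = 7)  # 2.5 (R's default: linear interpolation)
```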


David’s suggestion regarding multiple approaches to calculating weighted medians is helpful and important to keep in mind. For example, spatstat offers three different algorithms for calculating a median in cases where the standard median is not represented in the data. When running this analysis with the different median types, however, I was unable to perfectly replicate the findings from the blog that you shared. Running this analysis with the 2022 5-year file, I find a median household income of $184,187 (spatstat type = 2) for pre-1950 single-family detached units and $134,465 for pre-1950 single-family attached units in California PUMA 00101 (the PUMA had identical boundaries across the 2012-2021 and 2022-2031 vintages). You may try contacting the author of the post to determine how they arrived at their estimates.
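For anyone reading along, here is a rough base R sketch of the "average at the jump" idea behind a type-2 weighted median. This is an illustration in the spirit of spatstat's weighted.median(x, w, type = 2), not the package's exact implementation:

```r
# Sketch of a "type 2" weighted median: invert the weighted CDF, and
# average the two adjacent values when the 0.5 point lands exactly on a
# cumulative weight (an assumption-laden illustration, not spatstat's code)
weighted_median2 <- function(x, w) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  cw <- cumsum(w) / sum(w)          # weighted empirical CDF at each x
  i <- which(cw >= 0.5)[1]          # first point where the CDF reaches 0.5
  if (isTRUE(all.equal(cw[i], 0.5))) {
    (x[i] + x[i + 1]) / 2           # exactly at a jump: average neighbors
  } else {
    x[i]
  }
}

# With equal weights this reproduces the ordinary median
weighted_median2(c(1, 2, 3, 4), c(1, 1, 1, 1))  # 2.5
weighted_median2(c(1, 2, 3),    c(1, 1, 1))     # 2
```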

Household income is not provided for group quarters (see the universe statement for HHINCOME), so there is no need to filter these observations from your data. You are correct, however, that you should only include a single observation per household in your analysis (often implemented by filtering to PERNUM = 1). You can also do this by requesting a household-only extract from the extract options menu (available after submitting your data cart, but before submitting your extract). An additional detail to note about HHINCOME is that the values refer to dollars in the year the survey was conducted (i.e., the dollar amount reported by respondents in that year), except in multi-year files, where the original amounts have been standardized to dollars as valued in the final year of data included in the file (e.g., 2022 dollars for the 2018-2022 5-year file).
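A minimal sketch of the PERNUM = 1 filter on a made-up person-level extract (base R so it is self-contained; in a dplyr workflow the equivalent would be a filter() call):

```r
# Toy person-level extract: one row per person; PERNUM numbers persons
# within a household, so PERNUM = 1 marks each household's first record
dat <- data.frame(
  SERIAL   = c(1, 1, 2, 3, 3, 3),
  PERNUM   = c(1, 2, 1, 1, 2, 3),
  HHINCOME = c(50000, 50000, 62000, 71000, 71000, 71000)
)

# Keep one record per household so household income is counted once
hh <- subset(dat, PERNUM == 1)
nrow(hh)  # 3
```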

Ivan, thank you for the detailed response! I really appreciate it.

I went ahead and found the same result using the type 2 weighted median from spatstat. I’ve spent some time reading more about the different formulas; David must be correct.

I will mark this as the solution (yay!), but before I do, I was wondering what place the SAMPLE and SERIAL variables have in household-level analysis. My extract was household-level only, so the PERNUM = 1 filter wasn’t necessary.

However, am I correct that SAMPLE and SERIAL do something similar, insofar as they identify unique households?

[image: R code]

I only got the same estimates as you once I included this line of code. Thanks again!

You are correct that the combination of SAMPLE and SERIAL uniquely identifies each household in the data, but there should not be any household records with duplicate combinations of SAMPLE and SERIAL that would make this line of code necessary. Duplicate combinations of SAMPLE and SERIAL would come only from a person-level extract, while duplicate values of SERIAL alone could come from a household-level extract that pools together multiple samples.

With a household-records-only extract of the 2022 5-year ACS data for California PUMA 00101, I find 3,490 households with HHINCOME != 9999999. Running distinct(data, SAMPLE, SERIAL) afterwards does not change my data. If you are seeing duplicate combinations of SAMPLE and SERIAL, then you likely have person-level records in your data (see RECTYPE).
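As a quick sanity check, the same duplicate test can be run in base R on made-up data (a dplyr-free equivalent of distinct()):

```r
# Toy household rows pooled across samples: SERIAL repeats across
# samples, but the SAMPLE + SERIAL pair is unique per household
dat <- data.frame(
  SAMPLE = c(202201, 202201, 202101),
  SERIAL = c(1, 2, 1)
)

# anyDuplicated() returns 0 when no SAMPLE + SERIAL pair repeats;
# SERIAL alone, however, does have a duplicate here
anyDuplicated(dat[c("SAMPLE", "SERIAL")])  # 0
anyDuplicated(dat$SERIAL)                  # 3 (row 3 repeats SERIAL 1)
```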