I’m pretty new to analyzing IPUMS data and want to make sure I’m using the data correctly. I’m trying to estimate the number of people who report weekly earnings above and below a certain threshold (in this case, $500), broken down by sex. For this, I’m using monthly CPS data from 2019 onward. My extract includes the following variables: SEX, LABFORCE, EARNWT and EARNWEEK. My R code is below.
cps_data %>%
  mutate(EARNWEEK_2 = as.numeric(as.character(EARNWEEK))) %>% #convert the labelled EARNWEEK column to plain numeric
  mutate(earn_bins = ifelse(EARNWEEK_2 <= 500, 'under 500', 'over 500')) %>% #bin weekly earnings: at or below $500 vs. above $500
  filter(LABFORCE == 2 & EARNWEEK != 9999.99) %>% #exclude those not in the labor force and EARNWEEK NIUs
  group_by(YEAM = paste(YEAR, MONTH, sep = '-'), SEX_factor = as_factor(SEX), earn_bins) %>% #group by year/month, sex and earnings bin
  summarize(n = sum(EARNWT), EARNWEEK_avg = weighted.mean(EARNWEEK_2, EARNWT)) #weighted count and weighted mean earnings using EARNWT
First, does my methodology seem sound? I’m surprised at the monthly variability of the data. It’s not out of the realm of possibility, but it’s choppier than I would have guessed.
Second, I’m wondering why the monthly totals don’t match other sources. For example, the BLS has some 75 million women aged 16+ in the labor force as of February, and my analysis has only 65 million. Is this discrepancy likely due to the way CPS data categorizes workers?
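To make the comparison concrete, here is a minimal sketch of the check I have in mind: the weighted count of people in the labor force before and after dropping the EARNWEEK NIU rows, by sex. The rows in `toy` are made up for illustration and are not real CPS data, and the names `toy`, `totals`, `in_labor_force`, `with_earnings` and `dropped_as_niu` are my own placeholders. I am also assuming that summing EARNWT over everyone with LABFORCE == 2 is a reasonable stand-in for a labor force total, which I have not confirmed.

```r
library(dplyr)

# Toy rows, NOT real CPS data: the values only illustrate the comparison.
toy <- tibble(
  SEX      = c(2, 2, 2, 1, 1),
  LABFORCE = c(2, 2, 2, 2, 1),
  EARNWEEK = c(450, 9999.99, 800, 9999.99, 9999.99), # 9999.99 = NIU
  EARNWT   = c(100, 150, 200, 120, 90)
)

totals <- toy %>%
  filter(LABFORCE == 2) %>%                             # keep labor force members only
  group_by(SEX) %>%
  summarize(
    in_labor_force = sum(EARNWT),                       # weighted count before the NIU filter
    with_earnings  = sum(EARNWT[EARNWEEK != 9999.99]),  # weighted count after the NIU filter
    .groups = "drop"
  ) %>%
  mutate(dropped_as_niu = in_labor_force - with_earnings) # weight lost to the NIU filter

print(totals)
```

If `dropped_as_niu` is large on the real extract, the gap would come from my EARNWEEK filter rather than from the labor force definition itself.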