I am trying to obtain estimates of employment and population in order to calculate employment-to-population ratios. I am worried that I am carrying out important steps in the wrong order. Could you tell me whether I am planning this analysis correctly?
I need to calculate an employment-to-population ratio for individuals 18+ by month, year, state, and industry group:
- Obtain the following dataset at the CPSIDP level:
CPSIDP
MONTH
YEAR
STATEFIP
IND
AGE
EMPSTAT
WTFINL
- Left join industry group category to each IND code.
- Apply weights to data at CPSIDP level, using WTFINL.
- Filter result to keep estimates for ages >= 18 and EMPSTAT == 10 or 12.
- Obtain industry group employment estimate by MONTH, STATE, YEAR, and industry group.
- Then obtain population data by first applying weights to data at CPSIDP level, using WTFINL.
- Filter result to keep estimates for ages >= 18.
- Obtain population estimate by MONTH, STATE, YEAR.
- Calculate emp-to-pop ratio: [industry estimate for given MONTH, YEAR, STATE] / [population estimate for given MONTH, YEAR, STATE]
Below is the R code I’m using in steps 3, 4, and 5:
ind_emp_est <- as_survey_design(.data = data, weight = WTFINL) |>
  filter(AGE >= 18, EMPSTAT %in% c(10, 12)) |>
  survey_count(
    MONTH,
    YEAR,
    STATEFIP,
    Industry_Group_Variable,
    name = "Ind_Emp",
    vartype = c("se", "ci"))
And below is the R code I’m using in steps 6, 7, and 8:
pop <- as_survey_design(.data = data, weight = WTFINL) |>
  filter(AGE >= 18) |>
  survey_count(
    MONTH,
    YEAR,
    STATEFIP,
    name = "Pop",
    vartype = c("se", "ci"))
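For the final step, the ratio itself, a minimal sketch might look like the following. It uses plain weighted sums in base R on made-up toy data instead of srvyr, so it yields point estimates only, with no standard errors for the ratio. The column name ind_group is a hypothetical stand-in for Industry_Group_Variable, and all values are invented for illustration.

```r
# Toy person-level data for one state and month (all values are invented).
cps <- data.frame(
  YEAR = 2023, MONTH = 1,
  STATEFIP = c(1, 1, 1, 1),
  AGE = c(25, 40, 17, 60),
  EMPSTAT = c(10, 12, 10, 34),          # 10/12 = employed, 34 = not in labor force
  ind_group = c("A", "B", "A", NA),     # hypothetical industry group labels
  WTFINL = c(1000, 2000, 1500, 3000)
)

# Denominator universe: everyone aged 18+.
adults <- cps[cps$AGE >= 18, ]

# Numerator universe: employed adults only.
emp <- adults[adults$EMPSTAT %in% c(10, 12), ]

# Weighted employment by month, year, state, and industry group.
ind_emp <- aggregate(WTFINL ~ YEAR + MONTH + STATEFIP + ind_group,
                     data = emp, FUN = sum)
names(ind_emp)[names(ind_emp) == "WTFINL"] <- "Ind_Emp"

# Weighted adult population by month, year, and state.
pop <- aggregate(WTFINL ~ YEAR + MONTH + STATEFIP, data = adults, FUN = sum)
names(pop)[names(pop) == "WTFINL"] <- "Pop"

# Attach the state-level population to each industry row and divide.
ratio <- merge(ind_emp, pop, by = c("YEAR", "MONTH", "STATEFIP"))
ratio$emp_pop <- ratio$Ind_Emp / ratio$Pop
print(ratio)
```

Because the denominator is the whole adult population of the state, the industry-level ratios within a state and month sum to the overall employment-to-population ratio for that state and month.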
Thank you in advance!
While I cannot review the code for your specific analysis, your approach to calculating industry group shares by year/month/state seems reasonable to me. The only thing that sticks out is that you note that you obtain the dataset at the CPSIDP level. CPSIDP links persons across their appearances in the CPS panel and is used for longitudinal/panel analysis. As a result, the same CPSIDP values will appear in up to eight monthly samples. Based on your question, I recommend approaching your data as a repeated cross-section; each individual is then identified by a unique combination of YEAR, MONTH, SERIAL, and PERNUM.
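As a quick check of this point, the sketch below uses a made-up two-month extract to show that the combination of YEAR, MONTH, SERIAL, and PERNUM is unique per record while CPSIDP repeats across months. The data values are invented; only the column names come from IPUMS CPS.

```r
# Toy extract: the same two persons observed in two consecutive months.
cps <- data.frame(
  YEAR   = c(2023, 2023, 2023, 2023),
  MONTH  = c(1, 1, 2, 2),
  SERIAL = c(1, 1, 1, 1),
  PERNUM = c(1, 2, 1, 2),
  CPSIDP = c(101, 102, 101, 102)   # invented identifiers
)

# Cross-sectional key: should never repeat within an extract.
key <- c("YEAR", "MONTH", "SERIAL", "PERNUM")
n_dup_key <- sum(duplicated(cps[, key]))

# Panel identifier: repeats whenever a person appears in more than one month.
n_dup_cpsidp <- sum(duplicated(cps$CPSIDP))

cat("duplicate cross-sectional keys:", n_dup_key, "\n")
cat("repeated CPSIDP values:", n_dup_cpsidp, "\n")
```

Running the same two counts on a real extract is a cheap way to confirm that records are being treated as person-months and not as persons.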
You are correct to restrict to persons with EMPSTAT = 10 or 12. Industry (IND) is reported not only for currently employed persons but also for unemployed persons looking for work and persons not in the labor force who had worked in the preceding 12 months (see the universe tab for IND), so without this restriction your industry counts would include people who are not currently employed.
I can share a few edits that you might consider in your approach:
- WTFINL is the weight typically used for analyses of CPS Basic Monthly Survey (BMS) samples. However, you might consider using the composite weight COMPWT instead. COMPWT is used for replicating BLS labor force estimates since it increases the reliability of estimates of month-to-month change.
- You will want to be aware of changes in Census Bureau industry classifications across your time period. IPUMS CPS provides the harmonized industry variable IND1990, which adjusts for changes to codes and uses a modal assignment protocol to assign industry codes that combine or split over time (see our chapter on harmonizing occupations and industries for more information).
- You may find that the small number of observations in your year/month/state/industry cells results in estimates with large margins of error. In addition to aggregating your industry groups to increase sample sizes, you might also consider using the larger American Community Survey samples on IPUMS USA or the aggregated ACS summary data on IPUMS NHGIS.
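To gauge how thin the cells are before estimating anything, one option is to count unweighted observations per year/month/state/industry cell, as in the sketch below. The data are invented, ind_group is a hypothetical name for your industry grouping, and the threshold min_n is an arbitrary illustration and not a recommended cutoff.

```r
# Toy person-level records (invented values).
cps <- data.frame(
  YEAR = 2023, MONTH = 1,
  STATEFIP  = c(1, 1, 1, 2, 2),
  ind_group = c("A", "A", "B", "A", "B")   # hypothetical industry groups
)

# Unweighted number of sample persons in each cell.
cps$n <- 1
cells <- aggregate(n ~ YEAR + MONTH + STATEFIP + ind_group,
                   data = cps, FUN = sum)

# Flag cells below an arbitrary illustrative threshold.
min_n <- 2
cells$small <- cells$n < min_n
print(cells)
```

Cells flagged here are candidates for merging into broader industry groups or for pooling across months.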
I have a similar question. When I use WTFINL or HWTFINL, I get really large estimates that I know are incorrect. I’ve limited the data to one record per household, and I used the IPUMS CPS Stata read-in code that adjusts the weight variables by dividing by 10,000. Is a further adjustment necessary? I am using monthly CPS data from 2023 to 2025. Here is a simple version of my code, which leads to estimates of billions of people in 2023 and 2024, and 546 million in 2025. Do I need to do this monthly, and then average?
local filedt 20250608
use "./output/cps_extract_00003_`filedt'.dta", clear
gen counter=1
bysort year month hseq hrhhid hrhhid2 (pernum): keep if _n == 1
bysort year: tab counter [iw=hwtfinl]
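One thing that may be worth checking, offered as a sketch and not a confirmed diagnosis: each monthly CPS sample is weighted to represent the full population on its own, so summing weights over every month in a year multiplies the total by the number of months pooled. The toy example below, written in R with invented household weights, compares the pooled sum with the average of the monthly totals.

```r
# Toy household records: two households per month for three months (invented weights).
hh <- data.frame(
  year    = 2023,
  month   = rep(1:3, each = 2),
  hwtfinl = c(1500, 2500, 1400, 2600, 1600, 2400)
)

# Summing weights across all months in the year counts each month's population again.
pooled_total <- sum(hh$hwtfinl)

# Total the weights within each month, then average the monthly totals.
monthly <- aggregate(hwtfinl ~ year + month, data = hh, FUN = sum)
annual_avg <- mean(monthly$hwtfinl)

cat("pooled total:", pooled_total, "\n")
cat("average of monthly totals:", annual_avg, "\n")
```

In this toy case the pooled total is three times the monthly average, one multiple per month pooled; a yearly tabulation over twelve monthly samples would be inflated in the same way.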