Creating Variables from CPS Basic Monthly Data - When to apply weights, filter, group, sum up

Jamie_Jelly_Murtha · June 5, 2025, 3:04pm

I am trying to obtain estimates on employment and population to calculate employment-to-population ratios. I am worried I am following important steps out of order. Is it possible for you to share if I am planning this analysis correctly?

I need to calculate an employment-to-population ratio for individuals 18+ by month, year, state, and industry group:

1. Obtain the following dataset at the CPSIDP level:
   - CPSIDP
   - MONTH
   - YEAR
   - STATEFIP
   - IND
   - AGE
   - EMPSTAT
   - WTFINL
2. Left join industry group category to each IND code.
3. Apply weights to data at CPSIDP level, using WTFINL.
4. Filter result to keep estimates for ages >= 18 and EMPSTAT == 10 or 12.
5. Obtain industry group employment estimate by MONTH, STATE, YEAR, and industry group.
6. Then obtain population data by first applying weights to data at CPSIDP level, using WTFINL.
7. Filter result to keep estimates for ages >= 18.
8. Obtain population estimate by MONTH, STATE, YEAR.
9. Calculate emp-to-pop ratio: [industry estimate for given MONTH, YEAR, STATE] / [population estimate for given MONTH, YEAR, STATE]

Below is the R code I’m using in steps 3, 4, and 5:

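library(dplyr)
library(srvyr)

# Weighted count of employed adults by month, year, state, and industry group
ind_emp_est <- as_survey_design(.data = data, weights = WTFINL) |>
  filter(AGE >= 18, EMPSTAT %in% c(10, 12)) |>
  survey_count(
    MONTH,
    YEAR,
    STATEFIP,
    Industry_Group_Variable,
    name = "Ind_Emp",
    vartype = c("se", "ci"))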

And below is the R code I’m using in steps 6, 7, and 8:

# Weighted count of all adults by month, year, and state
pop <- as_survey_design(.data = data, weights = WTFINL) |>
  filter(AGE >= 18) |>
  survey_count(
    MONTH,
    YEAR,
    STATEFIP,
    name = "Pop",
    vartype = c("se", "ci"))

Thank you in advance!

Ivan_Strahof · June 6, 2025, 4:45pm

While I cannot review the code for your specific analysis, your approach to calculating industry group shares by year/month/state seems reasonable to me. The one thing that stands out is that you describe obtaining the dataset at the CPSIDP level. CPSIDP links persons across their appearances in the CPS panel and is used for longitudinal/panel analysis. As a result, the same CPSIDP value will appear in up to eight monthly samples. Based on your question, I recommend approaching your data as a repeated cross-section; each individual is then identified by a unique combination of YEAR, MONTH, SERIAL, and PERNUM.
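To make the cross-sectional key concrete, here is a minimal base R sketch. The data frame `cps` and every value in it are made up for illustration; only the variable names come from the thread. It checks that CPSIDP repeats across months while the combination of YEAR, MONTH, SERIAL, and PERNUM does not.

```r
# Toy data: two people observed in two monthly samples (all values are made up).
# SERIAL values repeat across months on purpose: SERIAL is only unique within
# a given monthly sample, so YEAR and MONTH belong in the key.
cps <- data.frame(
  YEAR   = c(2024, 2024, 2024, 2024),
  MONTH  = c(1, 1, 2, 2),
  SERIAL = c(101, 102, 101, 102),
  PERNUM = c(1, 1, 1, 1),
  CPSIDP = c(9001, 9002, 9001, 9002)
)

# CPSIDP repeats because the same person is interviewed in several months.
cpsidp_repeats <- anyDuplicated(cps$CPSIDP) > 0

# The cross-sectional key should identify each record exactly once.
key <- c("YEAR", "MONTH", "SERIAL", "PERNUM")
key_is_unique <- anyDuplicated(cps[key]) == 0

print(c(cpsidp_repeats = cpsidp_repeats, key_is_unique = key_is_unique))
```

Running the same two checks on a real extract is a quick way to confirm the data are being treated as a repeated cross-section before any weights are applied.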

You are correct to restrict to only persons with EMPSTAT = 10 or 12 since industry (IND) is provided for currently employed persons as well as for unemployed persons looking for work and persons not in the labor force who had worked in the preceding 12 months (see the universe tab for IND).
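The numerator and denominator restrictions, and the final ratio from step 9, can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy values, the column `ind_group`, and the output names are hypothetical, and the sketch produces point estimates only, so standard errors would still need a survey design object like the one in the original post.

```r
# Toy person-level records for one state and month (all values are made up).
cps <- data.frame(
  YEAR = 2024, MONTH = 1, STATEFIP = 27,
  AGE       = c(17, 25, 40, 52, 33, 61),
  EMPSTAT   = c(10, 10, 12, 21, 10, 34),
  ind_group = c("A", "A", "B", NA, "B", NA),  # hypothetical industry grouping
  WTFINL    = c(1000, 1500, 2000, 1200, 1800, 2500)
)

# Denominator universe: everyone aged 18+. Numerator universe: employed adults.
adults   <- cps[cps$AGE >= 18, ]
employed <- adults[adults$EMPSTAT %in% c(10, 12), ]

# Numerator: weighted employment by year, month, state, and industry group.
emp <- aggregate(WTFINL ~ YEAR + MONTH + STATEFIP + ind_group,
                 data = employed, FUN = sum)
names(emp)[names(emp) == "WTFINL"] <- "Ind_Emp"

# Denominator: weighted adult population by year, month, and state.
pop <- aggregate(WTFINL ~ YEAR + MONTH + STATEFIP, data = adults, FUN = sum)
names(pop)[names(pop) == "WTFINL"] <- "Pop"

# Attach the state-month population to every industry row and divide.
ratio <- merge(emp, pop, by = c("YEAR", "MONTH", "STATEFIP"))
ratio$emp_pop_ratio <- ratio$Ind_Emp / ratio$Pop
print(ratio)
```

Note that the weights are summed after filtering within each month, and the population total is computed without the industry grouping, so each industry row is divided by the same state-month denominator.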

I can share a few edits that you might consider in your approach:

WTFINL is the weight typically used for analyses of CPS Basic Monthly Survey (BMS) samples. However, you might consider using the composite weight COMPWT instead. COMPWT is used for replicating BLS labor force estimates since it increases the reliability of estimates of month-to-month change.
You will want to be aware of changes in Census Bureau industry classifications across your time period. IPUMS CPS provides the harmonized industry variable IND1990, which adjusts for changes to codes and uses a modal assignment protocol to code industry codes that combine or split over time (see our chapter on harmonizing occupations and industries for more information).
You may find that the small number of observations in your year/month/state/industry cells results in estimates with large margins of error. In addition to aggregating your industry groups to increase sample sizes, you might also consider using the larger American Community Survey samples on IPUMS USA or the aggregated ACS summary data on IPUMS NHGIS.
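One way to see how thin the cells are is to count unweighted sample persons per year/month/state/industry cell before estimating anything. The sketch below uses made-up records, a hypothetical `ind_group` column, and an arbitrary cutoff `min_n` chosen purely for illustration; it is not an IPUMS or BLS threshold.

```r
# Toy employed-person records (made-up values); ind_group is a hypothetical grouping.
employed <- data.frame(
  YEAR = 2024, MONTH = 1,
  STATEFIP  = c(27, 27, 27, 27, 56, 56),
  ind_group = c("A", "A", "A", "B", "A", "B")
)

# Unweighted number of sample persons in each cell.
cells <- aggregate(n ~ YEAR + MONTH + STATEFIP + ind_group,
                   data = transform(employed, n = 1), FUN = sum)

# min_n is an arbitrary illustrative cutoff, not an official rule.
min_n <- 3
cells$small_cell <- cells$n < min_n
print(cells)
```

Cells flagged here are candidates for coarser industry groups or for pooling across months.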

Randy_Rosso · June 16, 2025, 8:54pm

I have a similar question. When I use WTFINL or HWTFINL, I get really large estimates that I know are incorrect. I’ve limited the data to one record per household, and I used the IPUMS CPS Stata read-in code that adjusts the weight variables by dividing by 10,000. Is a further adjustment necessary? I am using monthly CPS data from 2023 to 2025. Here is a simple version of my code, which leads to estimates of billions of people in 2023 and 2024, and 546 million in 2025. Do I need to do this monthly, and then average?

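local filedt 20250608
use "./output/cps_extract_00003_`filedt'.dta", clear
gen counter = 1
* keep one record per household in each monthly sample
bysort year month hseq hrhhid hrhhid2 (pernum): keep if _n == 1
* weighted household count, pooled over all months within each year
bysort year: tab counter [iw=hwtfinl]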

Ivan_Strahof · June 20, 2025, 4:56pm

The sampling weights are designed such that each monthly sample represents the total sample universe. The sum of WTFINL in January 2023 therefore approximates the total US noninstitutional resident population in that month; this is the case for each month in your sample. When summing WTFINL across the 36 months in your sample, your estimates will be inflated 36 times. To obtain aggregate estimates for the 2023-2025 period, you will need to first divide WTFINL/HWTFINL by the total number of months in your sample. Alternatively, if you want to obtain monthly estimates, you will need to generate your counter for each month and year separately.
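The arithmetic can be illustrated with a small base R sketch. The data frame `hh` and its weights are made up; the point is only that summing across months stacks several full-population samples, while averaging monthly totals and dividing the weight by the number of pooled months give the same answer.

```r
# Toy household records from three monthly samples (weights are made up).
hh <- data.frame(
  YEAR    = 2023,
  MONTH   = c(1, 1, 2, 2, 3, 3),
  HWTFINL = c(600, 400, 550, 450, 500, 530)
)

# Summing across months stacks three full-population samples on top of each other.
naive_total <- sum(hh$HWTFINL)

# Option 1: estimate each month separately, then average the monthly totals.
monthly <- aggregate(HWTFINL ~ YEAR + MONTH, data = hh, FUN = sum)
avg_of_months <- mean(monthly$HWTFINL)

# Option 2: divide the weight by the number of months pooled, then sum once.
n_months <- nrow(unique(hh[c("YEAR", "MONTH")]))
pooled_total <- sum(hh$HWTFINL / n_months)

print(c(naive = naive_total, avg_of_months = avg_of_months, pooled = pooled_total))
```

When tabulating by year rather than over the whole period, the divisor would be the number of months present within each year, which matters for a partial year.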
