Hi everyone, I’m starting my first project on unemployment rates and I’m trying to create a plot that matches what the Bureau of Labor Statistics (BLS) shows. I would love some guidance on where to find reliable data for this, specifically what variables I should focus on, like the number of unemployed and the labor force. Any tips on data cleaning or preparation would also be really helpful. Thanks so much for your help!
The BLS publishes a number of different unemployment metrics, and the specific procedure for replicating each one depends on which metric you choose. I am sharing some general guidelines as well as a procedure to replicate the values in the BLS table "Employment status of the civilian noninstitutional population". This table provides estimates of the number of persons employed, unemployed, and not in the labor force, averaged over each year. Unlike the monthly data, to which the BLS applies seasonal adjustment, these annual averages do not require seasonal adjustment and are therefore more straightforward to generate using IPUMS CPS data. If you encounter any difficulties with the procedure below, please take a look at this short video tutorial of the IPUMS CPS extract system or our detailed FAQ page. We also provide data training exercises with questions and sample code that are helpful for confirming your understanding of how to correctly analyze IPUMS data.
Since the table reports annual averages, you will want to add all of the Basic Monthly Survey (BMS) samples for the years that you are interested in. From looking at the table, you’ll want the variables EMPSTAT (employment status), LABFORCE (labor force participation), and IND (respondent’s industry). Analysis of CPS data must use weights to make the estimates nationally representative and account for the complex sampling design. You should use COMPWT with analyses that are intended to replicate BLS estimates, so add COMPWT to your data cart.
Once you’ve added these samples and variables, you should request your data extract, download and decompress the file, and load it into your preferred statistical package (we offer syntax files for Stata, R, SAS, and SPSS and formatted files for Stata, SAS, SPSS, and Excel). Each monthly sample is weighted to represent the sample universe (the US civilian noninstitutional population 16 years of age and older) in that particular month, so summing COMPWT across all twelve months of a year would count that population twelve times. You therefore need to divide COMPWT by 12 to obtain annual averages. Now you can tabulate your data by year to obtain counts of the sample universe, the labor force, the employed and unemployed populations, and the population not in the labor force, making sure to apply the adjusted COMPWT as the survey sampling weight. Below I reproduce my estimates for 2020-2023 in Stata next to the estimates in the linked BLS table. While EMPSTAT provides more detailed employment categories, these can be easily summed to obtain the values from the table.
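For readers who are not using Stata, here is a minimal Python sketch of the weighting step described above. It is an illustration, not the code behind my estimates. The function name annual_averages and the toy records are hypothetical, the weights are made up and are not real CPS data, and the EMPSTAT code groupings reflect my reading of the IPUMS CPS codebook, so please check them against the variable documentation before relying on them.

```python
from collections import defaultdict

# EMPSTAT code groupings (assumed from the IPUMS CPS codebook; verify before use).
EMPLOYED = {10, 12}                              # at work; has job, not at work
UNEMPLOYED = {20, 21, 22}                        # unemployed categories
NOT_IN_LABOR_FORCE = {30, 31, 32, 33, 34, 35, 36}


def annual_averages(records):
    """Return annual-average weighted counts per year.

    `records` is an iterable of dicts with YEAR, EMPSTAT and COMPWT,
    one per person-month, pooled across the monthly samples.
    """
    totals = defaultdict(lambda: {"employed": 0.0, "unemployed": 0.0, "nilf": 0.0})
    for rec in records:
        weight = rec["COMPWT"] / 12  # each month contributes 1/12 of the annual average
        status = rec["EMPSTAT"]
        if status in EMPLOYED:
            totals[rec["YEAR"]]["employed"] += weight
        elif status in UNEMPLOYED:
            totals[rec["YEAR"]]["unemployed"] += weight
        elif status in NOT_IN_LABOR_FORCE:
            totals[rec["YEAR"]]["nilf"] += weight
        # Other codes (e.g. not in universe, Armed Forces) are outside the table's universe.

    results = {}
    for year, t in sorted(totals.items()):
        labor_force = t["employed"] + t["unemployed"]
        results[year] = {
            "population": labor_force + t["nilf"],
            "labor_force": labor_force,
            "employed": t["employed"],
            "unemployed": t["unemployed"],
            "not_in_labor_force": t["nilf"],
            # Unemployment rate = unemployed as a percentage of the labor force.
            "unemployment_rate": 100 * t["unemployed"] / labor_force if labor_force else None,
        }
    return results


# Toy input: four made-up person records repeated for each month of one year.
toy_records = [
    {"YEAR": 2023, "MONTH": month, "EMPSTAT": status, "COMPWT": weight}
    for month in range(1, 13)
    for status, weight in [(10, 1080.0), (21, 120.0), (36, 600.0), (0, 300.0)]
]

if __name__ == "__main__":
    for year, row in annual_averages(toy_records).items():
        print(year, row)
```

With a real extract you would build the records from your downloaded file instead of the toy list. The unemployment rate for the plot is then the unemployed count divided by the labor force count for each year.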
Note that IPUMS CPS harmonizes and integrates the CPS PUMS (Public Use Microdata Sample) file, which differs slightly from the internal CPS microdata file used by the BLS for estimates. Specifically, respondents in the PUMS have their ages perturbed and some geographic identifiers are additionally synthesized to further protect confidentiality. As a result, while topside labor force estimates (such as the ones in this example) will continue to match published data, estimates below the topside level (e.g., employment status by age, sex, race, and ethnicity) are expected to differ slightly from the published data. All such differences should fall within the sampling variability associated with CPS estimates.
You can also refer to this IPUMS blog post for more details on calculating alternative measures of unemployment published by the BLS using IPUMS CPS data.