Hi!
I’m looking to calculate state-year racial shares of the workforce. For example, I would calculate the total employed population in state X in year Y, then calculate the number of employed black individuals in state X in year Y (both calculations would use wtfinl), and then divide these two estimates (see code below). However, I notice that the sums I calculate are VERY low. For example, the total number of employed people in Alabama in 1980 comes out to 1139.107, which I know can’t be right.
Question 1: Am I calculating these estimates correctly with the weights?
Question 2: Even if the estimates are low, can the ratios/fractions (i.e. black share of workforce) be correct?
- The variables in the collapse commands are dummies for each subgroup. For example, employed is a dummy for whether the individual is employed.
gen employed = . // missing for people not in the labor force and people in the military
replace employed = 0 if labforce == 2 & empstat >= 20 & empstat <= 22
replace employed = 1 if labforce == 2 & empstat >= 10 & empstat <= 12
replace employed = . if classwkr == 29 // do not include unpaid family workers
replace employed = . if classwkr == 0  // do not include people NIU for classwkr
********************* Creating State-Month-Year Totals *************************
- Weights adjust for within-state differences in CPS sampling decisions (i.e., did we sample too few rural people?)
collapse (sum) employed lfp emp_black emp_white emp_hisp emp_black_wm ///
emp_black_mn emp_white_wm emp_white_mn emp_hispanic_wm emp_hispanic_mn ///
emp_yng_black_wm emp_yng_black_mn emp_yng_white_wm emp_yng_white_mn ///
emp_yng_hispanic_wm emp_yng_hispanic_mn emp_black_wm_nocollege ///
emp_black_mn_nocollege emp_white_wm_nocollege emp_white_mn_nocollege ///
emp_hispanic_wm_nocollege emp_hispanic_mn_nocollege ///
(count) count=wtfinl [w=wtfinl], by(statefip year month) fast
// gives the weighted total number of people in each subgroup
************** Creating State-Year Panel Avg across months *********************
collapse (mean) employed lfp emp_black emp_white emp_hisp emp_black_wm ///
emp_black_mn emp_white_wm emp_white_mn emp_hispanic_wm emp_hispanic_mn ///
emp_yng_black_wm emp_yng_black_mn emp_yng_white_wm emp_yng_white_mn ///
emp_yng_hispanic_wm emp_yng_hispanic_mn emp_black_wm_nocollege ///
emp_black_mn_nocollege emp_white_wm_nocollege emp_white_mn_nocollege ///
emp_hispanic_wm_nocollege emp_hispanic_mn_nocollege, by(statefip year) fast
gen frac_black = emp_black/employed
While I’m unable to review your specific code, I was able to create an extract combining all of the 1980 Basic Monthly Surveys. Using it, I got an estimate of 1,519,989 employed people in Alabama, along with a labor force participation rate of 57% and an unemployment rate of 10%. Given the additional adjustments the BLS makes when calculating unemployment rates, this lines up very closely with the estimates provided by FRED.

Below is the code that I used. Note that you’ll want to divide your weights by the number of samples you are pooling (in this case I divided them by 12, one for each monthly survey).

As for your second question, the smaller your subsample becomes, the less precise your estimates for that subsample will be. This shouldn’t present any significant issues for finding, for instance, a breakdown of the state unemployment rate by race. If you decide to refine your subsamples further, I would recommend paying attention to the variance of your estimates.
* Employed: at work or with a job but not at work (empstat 10 or 12), excluding unpaid family workers (classwkr 29)
gen employed = 0
replace employed = 1 if classwkr != 29 & (empstat == 10 | empstat == 12)
egen stateworkers = sum(employed * wtfinl/12), by(statefip) // divide by 12 to average over the 12 monthly samples
gen unemployed = 0
replace unemployed = 1 if empstat == 20 | classwkr == 29 // unemployed, plus the unpaid family workers excluded above
egen stateunemployed = sum(unemployed * wtfinl/12), by(statefip)
gen state_unemploymentrate = stateunemployed / (stateworkers + stateunemployed)
* Labor force participation: labforce == 2 is in the labor force, labforce == 1 is not in the labor force
gen lf = 0
replace lf = 1 if labforce == 2
egen statelf = sum(lf * wtfinl/12), by(statefip)
gen nilf = 0
replace nilf = 1 if labforce == 1
egen statenilf = sum(nilf * wtfinl/12), by(statefip)
gen state_lfpr = statelf / (statenilf + statelf)
Thank you so much for your response! This was super helpful! I have two follow-up questions: (1) Can you explain a little more why you divide the weight by 12? Should this always be done when using weights?
(2) If I want to calculate the probability of being in the labor force for a particular subgroup in a state (e.g. the probability of black men age 24-34 being in the labor force in Alabama), I would typically do this using a dummy variable and collapsing (mean) down to the state-year-race level (after narrowing the sample to only men aged 24-34). Typically, I would use wtfinl in the collapse command. Does your comment mean that I should actually use wtfinl/12 in this command too?
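For concreteness, here is a rough sketch of what I have in mind (the sample restriction and the inlf dummy are just illustrative, and I’m assuming the usual IPUMS CPS codes for sex, race, and labforce):

* illustrative sketch only: black men aged 24-34 (assumed IPUMS CPS codes)
keep if sex == 1 & race == 200 & inrange(age, 24, 34)
gen inlf = (labforce == 2) if labforce != 0 // in labor force; missing for NIU
collapse (mean) inlf [w=wtfinl], by(statefip year) // weighted share in the labor force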
I should specify that dividing by the number of samples is only necessary if you’re looking to find population-level totals (e.g. the number of black men aged 24-34 in the labor force in Alabama). It will not affect your estimates of ratios (e.g. the labor force participation rate for black men age 24-34 in Alabama).
WTFINL estimates the number of people from the population represented by a specific observation in your sample. Suppose that summing WTFINL across all black men aged 24-34 in the labor force in Alabama in January 1980 gives you an estimate of 200,000. This tells you there were approximately this many people in this group in January 1980. Now suppose your sample consists of all basic monthly surveys in 1980 and you sum WTFINL across your population of interest. You may find that the sum is now 2,400,000 and erroneously conclude that this is the size of the population. However, you’ve simply aggregated estimates from all of the months into one figure. Dividing by the number of samples, 12 in this case, will give you the average size of the population across the year. This isn’t necessary when calculating ratios because both the numerator and denominator will be inflated equally.
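To make that concrete, here is a minimal sketch on the pooled 1980 monthly files, using two hypothetical dummies: lf_bm (black man aged 24-34 in the labor force) and bm (black man aged 24-34):

* hypothetical dummies lf_bm and bm mark the group of interest
egen avg_monthly_lf = total(lf_bm * wtfinl/12), by(statefip) // level: dividing by 12 matters here
egen num = total(lf_bm * wtfinl), by(statefip)
egen den = total(bm * wtfinl), by(statefip)
gen lfpr_bm = num/den // ratio: a /12 would appear in both numerator and denominator and cancel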