I’ve got an IPUMS USA extract from the 2019 ACS that I’m using to try to extrapolate some state-level trends in demographics and occupation. Here is an example for illustration. Let’s say I’m trying to estimate how many women and men work in finance across different income brackets in Illinois.
I want to make sure I am understanding the PERWT variable correctly.
Let’s say, for example, that the first row/observation has the following variable values:
SEX == 2 (Female)
INCWAGE == 5500
OCC2020 == 5165 (Financial clerks)
PERWT == 67
PWSTATE2 == 17 (working in Illinois)
(I’ve just made these up for the sake of my understanding.)
Am I then correct in extrapolating that in 2019, there were an estimated 67 people who identify as women, earn $5,500 in wages, and work as financial clerks in Illinois?
If that is correct, is there any reason not to “expand” the dataset using R/Stata etc. so that each row gets duplicated X times, where X == PERWT? For instance, the above row would be duplicated 67 times. If our only interest is in the variables above, is there any reason not to do so?
Obviously this makes your dataset a lot larger, but if you’re subsetting it at the state level (and by profession, etc.), it’s manageable. I think it provides some benefits for visualization purposes.
You can interpret the PERWT value for this record as the individual representing 67 persons in the total population. I do not recommend expanding the dataset as you describe; instead, you should leverage weight commands (in either Stata or R) to weight your analyses and estimate standard errors (I am linking to resources about applying weights, clustering, and standard errors, as well as generating standard errors using replicate weights). Expanding your dataset as you describe will give you accurate counts, but you won’t estimate the correct variance around those point estimates. The weight commands account for the uncertainty of survey data (e.g., where a single record represents 67 persons); by expanding your dataset to instead include 67 records, you treat your data more like a census (i.e., as if there were no sampling uncertainty because your data included the entire population).
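For example, here is a minimal sketch of the weighted approach in R with the survey package, assuming your extract is loaded as a data frame named acs and includes the IPUMS variance estimation variables STRATA and CLUSTER (the variable values are the placeholders from your example):

```r
library(survey)  # install.packages("survey") if needed

# 'acs' is assumed to be your IPUMS extract loaded as a data frame
# (e.g., via the ipumsr package). STRATA and CLUSTER support variance
# estimation; using ids = ~1 instead leaves point estimates unchanged
# but gives less accurate standard errors.
des <- svydesign(ids = ~CLUSTER, strata = ~STRATA,
                 weights = ~PERWT, data = acs, nest = TRUE)

# Weighted count, with a standard error, of women working as
# financial clerks in Illinois. Expanding rows by PERWT reproduces
# this total but not a valid standard error. OCC is used here;
# substitute your extract's occupation variable as needed.
svytotal(~as.numeric(SEX == 2 & OCC == 5165 & PWSTATE2 == 17), des)
```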
Additionally, while I understand that you included an example as a reference point only, I want to include a caution about small sample sizes as well as some information about occupation and industry.
Small Sample Sizes
Your example is a very targeted subgroup; I only see 26 women in the 2019 1-year ACS PUMS data who work as financial clerks (per the OCC variable) in the state of Illinois, and that is across all income groups. There is no bright-line rule for how small a sample is too small, but you want to avoid making population-level inferences from too small a sample. For example, I would be hesitant to make statements about the income distribution of women working in finance in the entire state of Illinois from these 26 cases, and would not want to subdivide by income bracket. You might augment your unweighted case counts by collapsing related occupations together or by pooling multiple years of data (e.g., using the multi-year files; if you instead pool multiple single-year files, note that the weights will total to the sum of the population for each year of data you are combining, so you should divide the weights by the number of years, as in the sketch below).
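To illustrate that weight adjustment, here is a minimal sketch assuming three hypothetical single-year extracts (acs2017, acs2018, acs2019) with identical columns:

```r
library(dplyr)

# PERWT in each single-year file sums to that year's total
# population; dividing by the number of pooled years keeps
# weighted totals on an average-annual scale.
n_years <- 3
pooled <- bind_rows(acs2017, acs2018, acs2019) %>%
  mutate(PERWT_ADJ = PERWT / n_years)
```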
Occupation & Industry
Occupation reports the type of work a person does, whereas industry is the type of activity at a person’s place of work. There are multiple ways you might define “women working in finance” as outlined in your example. For example, occupation code 5165 identifies “Other financial clerks” (note that this does not include financial managers, financial and investment analysts, personal financial advisors, tellers, payroll and timekeeping clerks, billing and posting clerks, or bill and account collectors, to name a few). Industry codes 6870-6992 identify people of all occupations who work in “Finance and Insurance” (you can further subdivide into more targeted industries within this group). Note that industry codes may include persons whose occupations fall outside of your occupation of interest (e.g., janitorial staff) but who work in this industry. Some combination of occupation and industry may be useful as you consider how to define groups like “women working in finance”; see the sketch below.
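As one illustration (not a recommended definition), this sketch applies an industry-based screen, again assuming your extract is a data frame named acs that includes the IND variable:

```r
library(dplyr)

# A hypothetical operationalization of "women working in finance"
# in Illinois: female respondents working in Finance and Insurance
# industries, regardless of their occupation.
finance_women <- acs %>%
  filter(SEX == 2,
         PWSTATE2 == 17,            # place of work: Illinois
         IND >= 6870, IND <= 6992)  # Finance and Insurance industries
```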
If you are pooling multiple years of data, it is important to note that beginning in 2018 there are new codes for occupation and industry that affect the comparability of some codes over time. IPUMS has harmonized occupation (OCC1990, OCC2010) and industry (IND1990) variables that address these (and previous) changes in the underlying coding schemes.
Wow! First off - thank you so much for this incredibly detailed answer.
I very much appreciate the links to the further information on weights and the survey package in R.
May I ask a follow-up question? Do you happen to have any resources for when the intent is simply to visualize descriptive information (via Tableau) using IPUMS data while being mindful of weighting? I.e., creating bar charts/histograms of demographic breakdowns by profession, etc., with no inference or modeling.
In other words (and this is likely outside the scope of IPUMS, so please ignore if need be): are accurate counts sufficient for descriptive visualization purposes?
Re: small sample sizes and occupation and industry - this is extremely helpful. Thank you AGAIN!
We don’t have resources on visualizing IPUMS data via Tableau and this is beyond the scope of our User Support Team, but I am linking a few pages that may be of interest to you. Based on my cursory research, it looks like Tableau’s ability to handle weights (as well as variance estimation variables) is fairly limited.
Integration with R/MATLAB/Python: I think these integrations may be restricted to paid versions, but they leverage the other software’s ability to handle complex survey design and weighting variables.
Regarding your question about counts: if you display frequencies for simple descriptive statistics without standard errors (or display counts for which you haven’t calculated the standard errors), there is no way to gauge the accuracy of the statistic. While your count will be “accurate” based on the weights, it is still an estimate because it is derived from a survey (which infers the true population total from a sample) rather than a census (which includes the entire population). Standard errors communicate how much variation there is around your estimated count. They are also a helpful tool for the “how small is too small a subpopulation” question; for very small subsamples, your standard errors will become relatively large and limit any informative interpretation of the data. It is not uncommon to suppress data with either a low sample size or a high coefficient of variation (the two often go hand in hand); see the sketch below.
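Continuing the earlier survey-design sketch, here is one hypothetical way to compute those standard errors and apply a coefficient-of-variation screen in R (the 30% threshold is illustrative, not an IPUMS rule):

```r
# Weighted counts of women by occupation, with standard errors;
# 'des' is the svydesign object from the earlier sketch.
est <- svyby(~as.numeric(SEX == 2), ~OCC, des, svytotal)

# Coefficient of variation: standard error divided by the estimate.
cvs <- cv(est)
suppress <- cvs > 0.30  # flag estimates too noisy to display
```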
One way around Tableau’s limitations on this front might be to use R to generate the standard errors and determine the thresholds appropriate for your application of the data. You could then share your R notebook with anyone interested in the additional detail, note the ranges of your estimates in the documentation, and/or note any groups excluded because of small sample size.
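Continuing that sketch, the hand-off to Tableau could be as simple as exporting the estimates, standard errors, and suppression flags to a CSV (the file and column names here are hypothetical):

```r
# Gather the svyby results into a flat table for Tableau.
out <- data.frame(occ = est$OCC,
                  women = coef(est),
                  se = SE(est),
                  cv = as.numeric(cvs),
                  suppress = as.numeric(cvs) > 0.30)
write.csv(out, "finance_counts_with_se.csv", row.names = FALSE)
```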