I’m trying to construct a 1900 - current time series at the state level for counts of several OCC1950 codes, and am curious whether there is general advice about best practices for manually building time series data from IPUMS samples. I don’t think that NHGIS has the occupation categories that I need for this one.

More detail: I want a population estimate at the state-year level for these OCC1950 codes: “093 Teachers (n.e.c.)”; “079 Social and welfare workers, except group” using decennial censuses for 1900 - 2010. What’s the best way to build this?

I’m observing some large differences in counts derived from the full population data and the 1% samples, which isn’t terribly surprising, but raises questions about how to estimate appropriate uncertainty / standard errors for 1% sample derived estimates for years where the 100% data are not available through IPUMS. I’m assuming best practice would be to use 100% data for 1900 - 1940, and 1% data for 1950 forward, is that right?

As you suggest, large differences in estimates may be due to sampling error. The fewer respondents in the data who fall into a given category, the higher the standard error of your estimate will be. This issue is exacerbated when using smaller samples such as the 1% files, particularly when they are combined with highly specific subpopulations. With some exceptions, it is preferable to use the larger sample (in this case, the full count rather than the 1% sample). You can calculate standard errors and build confidence intervals for your estimates using statistical software packages such as Stata, SAS, SPSS, and R. For samples prior to 2005, the IPUMS analysis and variance estimation user guide recommends that users cluster their standard errors by household (CLUSTER) and incorporate STRATA into their survey design. For ACS samples from 2005 onwards, IPUMS provides replicate weights in the variables REPWT and REPWTP so that users can derive empirically robust standard errors.
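To make the replicate-weight approach concrete, here is a minimal sketch in Python (the same arithmetic carries over to R). It assumes the usual ACS convention of 80 person-level replicate weights and the formula SE = sqrt(4/80 × Σ (replicate estimate − full estimate)²). The function names and the toy data are hypothetical and only illustrate the calculation.

```python
import math

def weighted_count(in_group, weights):
    """Sum the weights of the records flagged as belonging to the group."""
    return sum(w for flag, w in zip(in_group, weights) if flag)

def replicate_weight_se(in_group, full_weights, replicate_weights):
    """Weighted count and its SE from replicate weights.

    replicate_weights[r][i] is the r-th replicate weight of record i
    (e.g. REPWTP1..REPWTP80). Assumes the ACS-style formula
    SE = sqrt(4/R * sum_r (est_r - est_full)^2), with R = 80.
    """
    full = weighted_count(in_group, full_weights)
    reps = [weighted_count(in_group, rw) for rw in replicate_weights]
    sum_sq = sum((est - full) ** 2 for est in reps)
    return full, math.sqrt(4.0 * sum_sq / len(reps))

# Toy data (made up): three persons, two of them in the occupation of interest.
in_occ = [True, False, True]
perwt = [100.0, 100.0, 50.0]
# 80 toy replicate weight sets that alternately shrink and inflate person 1.
repwtp = [[90.0, 100.0, 50.0] if r % 2 == 0 else [110.0, 100.0, 50.0]
          for r in range(80)]

estimate, se = replicate_weight_se(in_occ, perwt, repwtp)
print(estimate, se)
```

For a state-year series, the same function would be applied once per state and year, with `in_occ` flagging the OCC1950 code of interest.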

In this specific case, however, differences in estimates might also be due to different methods of coding occupations in the full count and the 1% samples. The samples and full count files were transcribed and processed at different times, using slightly different processes for variable assignment. In general, the smaller samples are likely to be of superior quality because they were transcribed with social science research applications in mind, a process that often relies on principles or leaves documentation that makes it easier to track down and make decisions about edge cases. However, with a variable as detailed as occupation, it’s unclear how much this extra refinement contributes to the difference, because hand-checking occupation in either data file is a challenge. I would expect certain general and unambiguous job categories, such as teachers, to be pretty stable between the two files. There may, however, be more coding variability for social and welfare workers, since they represent fewer observations and their job titles are not necessarily unambiguous.

I am sure you were hoping for a more definitive response, but I hope this helps you contextualize the differences between the estimates you are seeing, as well as the strengths and challenges of the different data files in any given year.


Thanks for the response! I’ve reviewed the variance estimation guide closely, but don’t fully understand the recommended methods for computing standard errors when REPWT is unavailable. I’m an R user, and the R syntax only computes means with the weighted.mean() function, not variances / SEs. I’ve also reviewed the paper linked in the user guide, and beyond guidance to use Huber-White sandwich estimators, I couldn’t find much guidance for a manual implementation.

Is there anywhere you could point me to a formula / method for manually computing SEs using PERWT, STRATA, and CLUSTER? I’ve given the R survey::svydesign() route a try, but the function buckles under the large amount of data I’m working with. Happy to roll my own function to compute the SE (ignoring nonsampling error obviously) if you can point me to the underlying recommended SE formulas using the provided variables.

Per your suggestions, I can anchor the time series in the 1900 - 1940 full pop data, so that certainly helps things, and will use the 5% samples where available. For ACS, REPWT will work just fine.

IPUMS does not have guidance for manually computing standard errors. You can refer to the source code on GitHub to see how the functions are defined in the package you are using. If svydesign() is giving you too much trouble due to large file sizes, you might try to subdivide your data into more bite-sized chunks. For example, you can use the select cases feature on the IPUMS website to download extracts that include only respondents with your OCC1950 codes of interest. You might also try using the srvyr package, which allows users to specify survey design factors and calculate standard errors for large survey datasets.
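For anyone who still wants to see what a manual calculation could look like, here is a minimal sketch in Python (easy to port to R) of the textbook with-replacement "ultimate cluster" linearization estimator for a weighted total. This is not IPUMS guidance. It is the standard survey-sampling formula Var(T) = Σ_h n_h/(n_h − 1) Σ_i (z_hi − z̄_h)², where z_hi is the weighted total in cluster i of stratum h, and it should correspond to what survey packages compute for totals when no finite population correction is given. The function name, the tuple layout, and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def linearized_total_se(records):
    """Weighted total and its linearized SE.

    records: iterable of (strata, cluster, weight, y) tuples, where y is 1
    if the person has the occupation of interest and 0 otherwise
    (e.g. STRATA, CLUSTER, PERWT, and an indicator built from OCC1950).
    """
    # Weighted total of y within each cluster, nested inside its stratum.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for strata, cluster, weight, y in records:
        cluster_totals[strata][cluster] += weight * y

    total = 0.0
    variance = 0.0
    for clusters in cluster_totals.values():
        z = list(clusters.values())
        n_h = len(z)
        total += sum(z)
        if n_h < 2:
            # Assumption: a stratum with a single cluster adds no variance here.
            continue
        mean_z = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((v - mean_z) ** 2 for v in z)
    return total, math.sqrt(variance)

# Toy data (made up): two strata with two clusters each.
toy = [
    (1, "A", 100.0, 1), (1, "A", 100.0, 0),
    (1, "B", 100.0, 0),
    (2, "C", 50.0, 1),
    (2, "D", 50.0, 1), (2, "D", 50.0, 1),
]
total, se = linearized_total_se(toy)
print(total, se)
```

One design note: clusters with nobody in the occupation still enter the calculation with a total of zero. A function like this is therefore best run on all records in a state-year, with `y` set to 0 for everyone outside the occupation, rather than on a file already restricted to the occupation. Because it makes only a single pass and keeps one running total per cluster, it can also be fed the data in chunks.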
