IPUMS redesign research project

What is the status of the redesign project funded by the National Institutes of Health (HD 043392-03S1)? I believe its objective is to provide a way to estimate variances when using IPUMS data. Am I correct, and does that research still hold promise for the estimation of variances?

Hi Chip,

IPUMS has implemented all three of the goals of this project as outlined in the redesign summary. You are correct that the objective of the project was to provide researchers with the tools to derive empirically accurate variances for estimates using IPUMS data. These tools are available in different forms across IPUMS projects, from strata and cluster variables to replicate weights and sample code. If you are interested in calculating variances for estimates from a specific IPUMS project, please feel free to share more details about your project and I will explain the variance estimation process recommended by IPUMS using these tools.

Thanks Ivan. Below is the description of my project. Hope this is what you are looking for.

The variance I would like to calculate relates to the unserved market for federal rental housing tax credits. It is for the percentage of income-qualified households for the tax credit program that are paying above the tax credit maximum rent for the apartment they occupy.

The data used for a given program year and geography:

  • Maximum qualifying income level based on household size (from HUD)
  • The maximum rent level for a given apartment size, in number of bedrooms (from HUD)
  • For each household: the annual household income, number of people in the household, their apartment size in terms of numbers of bedrooms, and gross rent paid for the apartment (from IPUMS)

My process:

  1. Determine whether a household is income qualified (this is done through an if-statement iteration based on household size)
  2. Determine whether the income qualified household is paying above the tax credit maximum allowed rent for the apartment size they occupy (this is done through an if-statement iteration based on the number of bedrooms occupied by the household)
  3. Sum the number of income qualified households paying above the allowed program rent and divide that by a sum of the number of income qualified households.
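
A minimal sketch of these three steps in Python, using lookup tables in place of chained if-statements. Every number below is a made-up placeholder rather than a real HUD limit, the function and field names are hypothetical, and treating a household exactly at the income limit as qualified is an assumption. Check the codebook for how each IPUMS variable is coded before keying a lookup on it.

```python
# Hypothetical limits for illustration only; real values come from HUD
# for the program year and geography of interest.
INCOME_LIMIT_BY_HH_SIZE = {1: 30000, 2: 34000, 3: 38000, 4: 42000}
MAX_RENT_BY_BEDROOMS = {0: 700, 1: 750, 2: 900, 3: 1040}


def is_income_qualified(hh_income, hh_size):
    """Step 1: compare household income with the limit for its size."""
    limit = INCOME_LIMIT_BY_HH_SIZE.get(hh_size)
    # Assumption: income exactly at the limit still qualifies.
    return limit is not None and hh_income <= limit


def pays_above_max_rent(gross_rent, bedrooms):
    """Step 2: compare gross rent with the maximum for the unit size."""
    max_rent = MAX_RENT_BY_BEDROOMS.get(bedrooms)
    return max_rent is not None and gross_rent > max_rent


def share_above_max(households):
    """Step 3: weighted share of qualified households paying above the maximum."""
    qualified = [h for h in households
                 if is_income_qualified(h["income"], h["size"])]
    total_weight = sum(h["weight"] for h in qualified)
    if total_weight == 0:
        return None
    above_weight = sum(h["weight"] for h in qualified
                       if pays_above_max_rent(h["rent"], h["bedrooms"]))
    return above_weight / total_weight


# Made-up households, one row per household.
sample_households = [
    {"income": 25000, "size": 1, "rent": 800, "bedrooms": 0, "weight": 100},
    {"income": 33000, "size": 2, "rent": 700, "bedrooms": 1, "weight": 50},
    {"income": 60000, "size": 2, "rent": 1500, "bedrooms": 2, "weight": 80},
    {"income": 40000, "size": 4, "rent": 950, "bedrooms": 2, "weight": 50},
]
share = share_above_max(sample_households)
print(share)
```

Keeping the limits in tables keyed by household size and bedroom count means a new program year or geography only requires swapping the tables, not rewriting the conditions.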

To generate empirically derived standard error estimates with IPUMS ACS samples, you will need to use IPUMS-provided replicate weights in a statistical software package such as Stata, R, SPSS, or SAS. Below is sample code you might run in Stata (the income and rent thresholds I used are arbitrary fill-ins for this example). It will generate point estimates for your ratios, standard errors, and 95% confidence intervals. Replicate weights are explained in further detail on this page, which also provides code for R and SAS. You will also need to add the variables REPWT, RENT, HHINCOME, NUMPREC, and BEDROOMS to your extract.


svyset [pweight=hhwt], vce(brr) brrweight(repwt1-repwt80) fay(.5) mse
*This line comes directly from the replicate weight user guide linked above. The only changes are that 
*household weights (HHWT) replace person weights and household replicate weights (REPWT) replace person 
*replicate weights, since your outcome is at the household level.

gen income_qualified = 0
replace income_qualified = 1 if rent != 0 & (hhincome < 30000 & numprec == 1 | hhincome < 50000 & numprec > 1)
*Use any criteria to determine whether a household is income qualified. The rent !=0 condition ensures that 
*your analysis only includes rented housing units.

gen above_tax_max = 0
replace above_tax_max = 1 if income_qualified == 1 & (rent > 700 & bedrooms ==1 | rent > 1200 & bedrooms > 1)
*Use any criteria to determine whether the income qualified household is paying above the tax credit maximum 
*allowed rent for the apartment size they occupy.

svy, subpop(if income_qualified == 1 ): tab above_tax_max, se ci
*Use the subpop option to restrict your analysis to only income qualified households without losing sample 
*design information. 
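
For intuition, here is a minimal sketch (in Python, and not taken from the IPUMS user guide) of the arithmetic behind a replicate-weight standard error. The estimate is recomputed once with each replicate weight, and the squared deviations from the full-sample estimate are summed and scaled by 4 divided by the number of replicates, which is 4/80 for the ACS. The function names and the tiny four-replicate data set are made up for illustration; in practice the statistical package does this for you.

```python
import math


def weighted_share(flags, weights):
    """Weighted share of records whose flag is 1."""
    total = sum(weights)
    return sum(w for f, w in zip(flags, weights) if f == 1) / total


def replicate_se(flags, main_weights, replicate_weights):
    """Return (estimate, standard error) using replicate weights.

    replicate_weights is a list of weight columns, one per replicate.
    """
    estimate = weighted_share(flags, main_weights)
    n_reps = len(replicate_weights)
    squared_devs = [(weighted_share(flags, rep) - estimate) ** 2
                    for rep in replicate_weights]
    # Scale factor 4 / R (4/80 with the 80 ACS replicate weights).
    variance = (4.0 / n_reps) * sum(squared_devs)
    return estimate, math.sqrt(variance)


# Toy data: four households and only four replicates (real ACS data has 80).
above_tax_max = [1, 0, 1, 0]
hhwt = [10, 10, 10, 10]
repwt = [
    [12, 8, 10, 10],
    [8, 12, 10, 10],
    [10, 10, 12, 8],
    [10, 10, 8, 12],
]
estimate, se = replicate_se(above_tax_max, hhwt, repwt)
print(estimate, se)
```

The deviations are taken around the full-sample estimate rather than around the mean of the replicate estimates, which is what the mse option requests in the svyset line above.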

Ivan, thanks for this detail. I need some more help, however. My database activities have been pretty much limited to Excel and Power BI. Of the packages you mentioned I have used R, though recently only for data conversions. I’ll dig into R for getting at variances and standard errors, but when I follow your link I just see the following code for R once the srvyr package is accessed. I know I would change PERWT to HHWT. Is there some place to go to get the other code that you provided for Stata?

svy <- as_survey(data, weight = PERWT, repweights = matches("REPWTP[0-9]+"), type = "JK1", scale = 4/80, rscales = rep(1, 80), mse = TRUE)

I wrote the Stata code above based on the description of the project you provided; I have not written similar code in R. You will have to write the R code for this analysis yourself, and it will need to be modified according to your specific needs. Our team answers questions relating to documentation, but we do not provide code review. You might, however, find this forum post and the accompanying code helpful. You will need to use the ipumsr package to input and analyze IPUMS data. A guide on using the package, with additional vignettes, is provided on this documentation page.

Thanks for these follow-up steps, Ivan. FYI, as an experiment (for me), I asked Bing AI to convert your Stata code to R. Below is the response.

library(survey)
# create a survey design object with BRR weights
design <- svydesign(id=~1, weights=~hhwt, fpc=~repwt1, data=mydata)
# generate income_qualified variable
design$income_qualified <- 0
design$income_qualified[design$rent != 0 & (design$hhincome < 30000 & design$numprec == 1 | design$hhincome < 50000 & design$numprec > 1)] <- 1
# generate above_tax_max variable
design$above_tax_max <- 0
design$above_tax_max[design$income_qualified == 1 & (design$rent > 700 & design$bedrooms == 1 | design$rent > 1200 & design$bedrooms > 1)] <- 1
# subset the design object by income_qualified
sub_design <- subset(design, income_qualified == 1)
# tabulate above_tax_max with standard errors and confidence intervals
svytable(~above_tax_max, sub_design, SE=TRUE, CI=TRUE)

I’m now back to this challenge of using R to get the margin of error. How do I get the REPWT1, REPWT2, etc. data? Is it built into the srvyr package?

REPWT (household-level replicate weights) and REPWTP (person-level replicate weights) are variables that you can add to your extract. They each provide 80 separate weights for all respondents (i.e., variables REPWT1 through REPWT80). See the IPUMS-USA replicate weights FAQ page for further details. Since you mention that you are estimating “the percentage of income qualified households”, you will want to use REPWT to generate your standard errors.

In reviewing your analysis, I noticed that I missed an additional piece of code that’s required for generating household-level estimates. When running your analysis, you should restrict your sample to one observation per household so that households with more people are not overweighted. This is typically done by selecting only the first respondent in each household using PERNUM. Based on the R code that you shared, I believe this condition should go into your subset() call:

sub_design <- subset(design, income_qualified == 1 & pernum == 1)
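
To see why this restriction matters, here is a small Python sketch with made-up person-level records (SERIAL identifies the household in IPUMS data; the values and the helper name are hypothetical). Every person in a household carries the same household weight, so a three-person household counts three times unless only one record per household is kept.

```python
# Made-up person-level records: household 1 has one person, household 2 has three.
person_records = [
    {"serial": 1, "pernum": 1, "hhwt": 100, "above_tax_max": 1},
    {"serial": 2, "pernum": 1, "hhwt": 100, "above_tax_max": 0},
    {"serial": 2, "pernum": 2, "hhwt": 100, "above_tax_max": 0},
    {"serial": 2, "pernum": 3, "hhwt": 100, "above_tax_max": 0},
]


def weighted_share(records):
    """Share of household weight attached to records flagged above_tax_max."""
    total = sum(r["hhwt"] for r in records)
    flagged = sum(r["hhwt"] for r in records if r["above_tax_max"] == 1)
    return flagged / total


# Using every person record counts household 2 three times.
share_all_persons = weighted_share(person_records)

# Keeping only the first person in each household counts each household once.
first_person_records = [r for r in person_records if r["pernum"] == 1]
share_households = weighted_share(first_person_records)

print(share_all_persons, share_households)
```

With one of the two households above the maximum, the household-level share is one half; the person-level calculation understates it because the larger household dominates.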

Thanks Ivan. In doing the extract I didn’t know that I had to add the repwt variable. Is adding the variable by checking the box in the image below the way to get this done?

Yes, clicking on the plus sign under “Add to cart” will add the variable to your extract. You can also click on the variable name “REPWT”, which will lead you to the documentation page for this variable (there is an option to add the variable to your cart at the top of the variable documentation page as well).

Got it, and I added the replicate weights. However, with the pernum subdesign I get the following error. Might I be placing this line of code in the wrong location? (I placed it after the as_survey function.)

> sub_design <- subset(design, income_qualified == 1 & pernum == 1)                  
Error: object 'design' not found

It appears that you have not defined “design” before running this subset function. I’ve provided R code below that you may adapt for your research.

#Load packages
library(ipumsr)
library(srvyr)
library(survey)
#Set your working directory to where your .dat and .xml files are located
setwd("C:...")
#Read the .dat and .xml files in, entering the corresponding extract number you are using
ddi <- read_ipums_ddi("usa_[enter number].xml")
data <- read_ipums_micro(ddi)
#Define income qualified households and those paying above the tax credit maximum using your criteria (I use sample thresholds here)
data$income_qualified <- 0
data$income_qualified[data$RENT !=0 & data$HHINCOME < 30000] <- 1
data$above_tax_max <- 0
data$above_tax_max[data$income_qualified == 1 & data$RENT > 700] <- 1
#Now define the survey design and the subsample
svy <- as_survey(data, weight = HHWT , repweights = matches("REPWT[0-9]+"), type = "JK1", scale = 4/ 80 , rscales = rep(1, 80 ), mse = TRUE)
sub_design <- subset(svy, income_qualified == 1 & PERNUM==1)
#You can generate the percentage of income qualified households paying above the tax credit max (and the replicate standard error) by getting the mean of above_tax_max
svymean(~above_tax_max, sub_design, SE=TRUE)

Thank you, Ivan, for taking the time to guide me with the code. I’ve put the IPUMS data into an Excel file, so this new code won’t work as is. What I didn’t go into before is that there is not a single income maximum; it differs based on the number of people in each household. Likewise, there is not a single maximum rent; it is based on the number of bedrooms in each household’s apartment. What I did was take all the permutations of household income, household size, rent, and number of bedrooms and create an Excel file that includes only income-qualified households and shows, with a single variable, whether each household is paying above the qualifying rent for its apartment. That variable is OverLIHTC, which is 0 if the household is not paying over the maximum rent and 1 if it is. I also filtered the file so there is only one row per household (so I don’t think the sub_design is needed).

The Excel file now includes all the household replicate weights. From R, I need to end up with the MOE for the percentage of qualified households (all the rows) that are paying above LIHTC rents (have a 1 in the variable OverLIHTC). So in addition to not needing the sub_design, I don’t need the ddi call because I’m loading the Excel file. Should the remaining code work, or will other parts of the code you’ve provided fail or need further adjustment because I’m not reading the IPUMS data directly? After loading all the libraries I have this code:

hhRaw <- read_excel("D:/mhp3/qualnvhhs2.xlsx")

LIHTC_svy <- as_survey(hhRaw,
  weight = HHWT,
  repweights = matches("REPWTP[0-9]+"),
  type = "JK1",
  scale = 4/80,
  rscales = rep(1, 80),
  mse = TRUE)

LIHTC_svy <- filter(OverLIHTC == 1) %>%
  summarise(LIHTC = survey_total(vartype = "ci")) %>%
  mutate(LIHTC_Percent = LIHTC / survey_total() * 100) %>%
  select(LIHTC_Percent, LIHTC)

Ivan, a volunteer who knows R said that the line of code below is producing the error message I’ve pasted beneath it. Do you think this is because my data table is an Excel file rather than data taken directly from the IPUMS extract? If so, is there a modification that would make the code work with Excel?

repweights = matches("REPWTP[0-9]+")

Error in str2lang(x) : <text>:2:0: unexpected end of input
1: ~ 
   ^

This is going a bit beyond what I am able to help with from user support. I’ve provided code for inputting data using ipumsr, using replicate weights in R, and subsetting the data since these are all part of the replicate weights user note. Your specification of the survey design in LIHTC_svy references REPWTP, when it should call for REPWT. Since your file contains the household replicate weights (REPWT1 through REPWT80), the REPWTP pattern likely matches no columns, which appears to be the reason for the error message. However, I cannot provide a full code review to confirm whether each function is correctly specified and whether it will give you the results that you are looking for.

Ivan, I apologize for taking up your time with a careless mistake on my part. There is just one thing I’m hoping you can confirm: does the approach in the code you provided also work when the source data is an Excel file (containing the replicate weights) rather than an IPUMS extract?

No worries, Chip. To answer your question, you can run the as_survey(), subset(), and svymean() functions that I provided on the data in your Excel file (your edited version of the IPUMS extract) without using ipumsr or the DDI file. You will, however, still need to load the survey and srvyr packages.

Great, thanks Ivan. I am getting closer!