Random selection of people (how to with survey package)

Hi all, (Or Hi IPUMS Staff :D)

I’m stuck on something that I’m wondering if anyone has any conceptual and/or applied feedback on.

To note: I use R to analyze ACS extracts with the survey package. Given the structure of ACS data - using either PERWT or REPWTP - is it possible to take random selections of the population?

For purely illustrative purposes, let’s say I have an ACS extract of respondents with STATEICP and INCTOT. Instead of just aggregating income by state (mean, median, etc.), what if I wanted to take a random selection of people per state and then show summary stats and/or run an analysis on them? The example itself doesn’t matter; what I need is a way to take the sample provided by an ACS extract and randomly select X% of it for an analysis.

The trick here is how to randomly select a portion of your sample (as a subset) to run a new analysis on. The problem I can’t seem to wrap my head around is that, given the nature of the data - where 1 row may represent 3 or 80 people etc. - how can you take a random selection?

My intuition is to create a binary variable in the data object (before converting it to a survey object) called “randomlySelected” that has some pre-defined probability of being 1 or 0. Then, once I create a survey object, I can subset to only the data where randomlySelected == 1. The problem is this: say I want to randomly select 50% of the sample. The variable may flag roughly 50% of the rows in the data object, but once I convert it to a survey object and the weights are applied, the flagged rows may not represent 50% of the weighted sample.
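A minimal sketch of this flag idea, on made-up data, may help show the gap between the row share and the weighted share. Everything here is an assumption for illustration: the toy data frame `acs`, the use of SERIAL as the household identifier, and the simplified design call (a real ACS analysis would normally use the extract's own design variables or replicate weights). The flag is drawn once per household so that household members stay together.

```r
set.seed(42)  # reproducible toy example

# Hypothetical stand-in for an ACS extract: one row per person.
n_hh    <- 500
hh_size <- sample(1:4, n_hh, replace = TRUE)
n       <- sum(hh_size)
acs <- data.frame(
  SERIAL = rep(seq_len(n_hh), times = hh_size),  # household id
  PERWT  = sample(3:80, n, replace = TRUE),      # person weight
  INCTOT = round(rlnorm(n, meanlog = 10, sdlog = 1))
)

# Draw the 0/1 flag once per household, then copy it to every member.
p       <- 0.5
hh_ids  <- unique(acs$SERIAL)
hh_flag <- rbinom(length(hh_ids), size = 1, prob = p)
acs$randomlySelected <- hh_flag[match(acs$SERIAL, hh_ids)]

# Share of rows flagged vs. share of the weighted population flagged.
unweighted_share <- mean(acs$randomlySelected)
weighted_share   <- weighted.mean(acs$randomlySelected, acs$PERWT)
print(c(unweighted = unweighted_share, weighted = weighted_share))

# Optional: subset the survey object instead of the data frame.
# Simplified design (an assumption), clustered on household only.
if (requireNamespace("survey", quietly = TRUE)) {
  des     <- survey::svydesign(ids = ~SERIAL, weights = ~PERWT, data = acs)
  des_sel <- subset(des, randomlySelected == 1)
  print(survey::svymean(~INCTOT, des_sel))
}
```

Both shares should land near p without matching it exactly, since the flag is random and the weights vary across rows. If a fixed share is needed, drawing a fixed number of households rather than flipping a coin per household is one alternative.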

Any insight would be super appreciated!

It is possible to take random selections of the population using the Customize Sample Size feature during data extract creation, which automatically re-adjusts the weights to properly represent the population for a smaller sample size. Note that, in the ACS, individuals are clustered at the household level; this feature randomly selects households, with all members, which is the proper way to select a subset of cases in the ACS.


Hi @Grace_Cooper -

First off, thank you!

I was reading about customizing the sample size and the ability to randomly select households; this is helpful.

I think the problem I’m facing may be admittedly out of the scope of IPUMS support; however, I wanted to ask it in case there was a clear answer that had been discussed elsewhere.

I’m more so wondering how we can randomly select observations (or households) after we’ve cleaned and edited the data according to our own design/interests. For example, if I download an extract and then identify all households where:

  • someone in the household is working X job (could be anything)
  • has X+ kids
  • etc. [this is just purely hypothetical example, but meant to illustrate that our selection of our population from the extract is probably outside of the scope of the IPUMS interface and has to be done manually via R]

And then from this subset I want to randomly select 60% of households. The problem I’m facing is that the selection of our population of interest has to happen in R, but once we turn the dataset into a survey object I can’t think of any feasible way to randomly select households, given the structure of the data frame itself.

Your question may be outside the scope of IPUMS support, as you mentioned, but I will offer a suggestion. If you can subset your data to the households of interest before turning it into a survey object, you can use the base R sample() function to take a random sample of the subsetted data. If you are using dplyr to manipulate your data, the sample_n() function (superseded by slice_sample() in current dplyr versions) is an even simpler way to take a random sample. Once you have subsetted your data and taken the random sample, convert it into a survey object for analysis. You might also find this blog post useful: Simple Random Sampling Analysis in R.
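The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not an IPUMS-recommended procedure: the toy data frame `acs`, the "2+ children" rule, the SERIAL household identifier, and the new weight column `PERWT_SUB` are all hypothetical. Households are sampled whole, in line with the earlier note about household clustering, and the weights are scaled by the inverse of the sampling fraction so that weighted totals still refer to the full set of households of interest.

```r
set.seed(2024)  # reproducible toy example

# Hypothetical stand-in for a cleaned ACS extract: one row per person.
n_hh    <- 400
hh_size <- sample(1:5, n_hh, replace = TRUE)
n       <- sum(hh_size)
acs <- data.frame(
  SERIAL = rep(seq_len(n_hh), times = hh_size),  # household id
  PERWT  = sample(3:80, n, replace = TRUE),      # person weight
  AGE    = sample(0:90, n, replace = TRUE),
  INCTOT = round(rlnorm(n, meanlog = 10, sdlog = 1))
)

# Step 1: households of interest (hypothetical rule: 2+ members under 18).
kids_per_hh <- tapply(acs$AGE < 18, acs$SERIAL, sum)
keep_ids    <- as.integer(names(kids_per_hh)[kids_per_hh >= 2])

# Step 2: draw 60% of those households, without replacement.
# Indexing via sample.int() avoids sample()'s special case for length-1 input.
frac      <- 0.6
n_draw    <- round(frac * length(keep_ids))
drawn_ids <- keep_ids[sample.int(length(keep_ids), n_draw)]
sub       <- acs[acs$SERIAL %in% drawn_ids, ]  # every member of drawn households

# Step 3: inflate weights by the inverse of the realized sampling fraction.
sub$PERWT_SUB <- sub$PERWT * length(keep_ids) / n_draw

# Step 4: build the survey object on the subsample.
# Simplified design (an assumption), clustered on household only.
if (requireNamespace("survey", quietly = TRUE)) {
  des_sub <- survey::svydesign(ids = ~SERIAL, weights = ~PERWT_SUB, data = sub)
  print(survey::svymean(~INCTOT, des_sub))
}
```

One caveat on the design choice: building the survey object from an already-subsetted data frame treats the subsample as the whole sample, which can affect standard errors. If that matters for the analysis, building the design on the full extract and then calling subset() on the design object is the alternative to consider.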
