Is there a way to use the API to extract only linked samples? I can do this from the web UI, of course, but I’d prefer to do so through R code so that other researchers can reproduce my results.
I’m working with the USA full-count data for CT for 1850 and 1860, and I’ve noticed that the population pyramid shows a few towns that have a disproportionate number of native-born young males in both censuses. The 1860 age pyramid is largely the same as for 1850, suggesting that these people are migrating elsewhere. I’m curious whether this sub-population is more likely than others to migrate, but I’d rather not download 31 million records and filter out those without an HIK value (and I assume that the full-count license terms prohibit me from sharing the filtered records).
Unfortunately, extracting only linked samples is not currently an option using the API. However, to address your concern about downloading an extract with many millions of records, IPUMS USA offers a “Select Cases” feature that lets you filter your extract (and limit its size) before you submit it by keeping only the cases that meet criteria you choose. Here is documentation on how to use case selection in R using the API. For example, you could filter on STATEFIP and LINK1850 or LINK1860 to only get persons in CT who have a link in either census. These variables (LINK1850 and LINK1860) indicate whether a person is linked to the 1850 or 1860 census, respectively. Linked persons receive a common Historical Identification Key (HIK) value across all censuses in which they are identified. A LINK value of 1 indicates that the person is linked to another census year (in some cases, to multiple census years), so the LINK variables are useful for limiting the dataset to the linked population and reducing its size.
After some continued thinking on this, please disregard my advice to use STATEFIP to filter on case selection. We caution users to ONLY use the LINK* variables for case selection. From our documentation on using linked census data:
Users should be cautious about adding case selections beyond those that are applied automatically by the system to identify linked individuals across census years. Performing case selection on time-variant characteristics — such as age, marital status, or state of residence — risks excluding some observations for a person. An individual may be linked in a census year, but the observation will be dropped if they do not meet the additional selection criteria in that specific census.
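For anyone following along in R, a minimal sketch of a LINK-only case selection with the ipumsr package might look like the following. The sample ID `us1860c` (assumed here to be the 1860 full-count file), the extra variables requested, and the helper `run_linked_extract()` are illustrative assumptions rather than a confirmed recipe. `get_sample_info("usa")` lists the actual sample IDs, and submitting requires an IPUMS API key registered with `set_ipums_api_key()`.

```r
library(ipumsr)

# Assumption: "us1860c" is the 1860 full-count sample ID; confirm it with
# get_sample_info("usa") before submitting.
linked_extract <- define_extract_micro(
  collection = "usa",
  description = "1860 full count, persons linked to 1850 (sketch)",
  samples = "us1860c",
  variables = list(
    # Case-select only on the LINK variable, per the caution quoted above
    var_spec("LINK1850", case_selections = "1"),
    var_spec("HIK"),
    var_spec("STATEFIP"),
    var_spec("AGE"),
    var_spec("RELATE")
  )
)

# Hypothetical helper: submit the extract, wait for it, download it, read it.
# It is defined but not called here because it needs an API key and network
# access.
run_linked_extract <- function(extract, download_dir = tempdir()) {
  submitted <- submit_extract(extract)
  ready <- wait_for_extract(submitted)
  ddi_path <- download_extract(ready, download_dir = download_dir)
  read_ipums_micro(ddi_path)
}
```

In this sketch STATEFIP is requested as an ordinary variable, so any filtering on state of residence would happen in R after the data are read rather than in the case selection.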
Thanks; using LINK1850 accomplished exactly what I wanted.
What’s interesting is that the distribution of 1850 ages for the heads-of-household of people who moved out of state is older than I would have expected: the mode is a bit more than 40 and the median is 46.5. If I examine only the five towns with anomalous age pyramids, the distribution is bimodal, but the two peaks are still a bit older than I would have expected, and the median increases slightly to 47.5. I hadn’t expected that.
I tried distinguishing 1850 minors who came of age by 1860 from those who moved along with their family, but it shifted the age distribution only a tad to the left. Thanks to R’s Tidyverse, it was relatively easy to add a column that used the individual’s age if they were a head of household in 1860, and the head-of-household’s age if they were still part of a family.
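For reference, the column described above could be built roughly as follows. This is only a sketch on made-up toy records, and it rests on two assumptions: that RELATE == 1 marks the head of household, and that the ages being compared are 1850 ages (the person's own if they head a household in 1860, otherwise that of their 1850 household head). The column names `head_age_1850`, `head_in_1860`, and `ref_age_1850` are hypothetical.

```r
library(dplyr)

# Toy linked records (hypothetical values): one row per person per census year.
census <- tibble::tribble(
  ~HIK, ~YEAR, ~SERIAL, ~RELATE, ~AGE,
  "A",  1850,  1,       1,       45,
  "B",  1850,  1,       3,       15,
  "C",  1850,  1,       3,       8,
  "A",  1860,  10,      1,       55,
  "B",  1860,  11,      1,       25,
  "C",  1860,  10,      3,       18
)

# 1850 age of the head of each 1850 household
heads_1850 <- census %>%
  filter(YEAR == 1850, RELATE == 1) %>%
  select(SERIAL, head_age_1850 = AGE)

# Whether each linked person heads a household in 1860
status_1860 <- census %>%
  filter(YEAR == 1860) %>%
  transmute(HIK, head_in_1860 = RELATE == 1)

# Own 1850 age for people who head a household by 1860,
# otherwise the 1850 age of their 1850 household head
reference_ages <- census %>%
  filter(YEAR == 1850) %>%
  left_join(heads_1850, by = "SERIAL") %>%
  left_join(status_1860, by = "HIK") %>%
  mutate(ref_age_1850 = if_else(head_in_1860, AGE, head_age_1850))
```

In the toy data, person B becomes a head of household by 1860 and so contributes their own 1850 age, while person C stays in the family and contributes the head's 1850 age.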