HISTID for the 1950 1% sample

I am interested in having the HISTID variable for the individuals included in the 1% 1950 census sample. I want to match these individuals with the 1940 census, and the HISTID is the best option available.

Although the 1950 full count data do include the HISTID variable, I don’t want to use the full count because its wage/income variables are not fully reliable. The 1% sample is accurate in this respect.

Of course, there is always the possibility of using both datasets (full count and 1% sample) and imputing the HISTID variable to individuals in the latter whenever a set of characteristics (year of birth, gender, city, etc.) match. However, this alternative seems too costly and less precise than having the HISTID directly.


It is inadvisable to attempt linkage from the 1950 1% sample to the 1950 full count file due to the high probability of making a false match. For this reason, IPUMS has not linked this sample with any others. The high probability of mismatch is driven by the lack of data on respondent names as well as limited geographic identifiers in the data shared with IPUMS.

HISTID is a consistent individual-level identifier that was introduced to make it easier for users to trace how revisions to the historical full count data affected individual observations. For this reason, it was never made available for the representative (sub-100%) samples. However, even if HISTID values were available for this sample, they could not be used to link respondents to other years because of the data limitations stated above.

In contrast, first (NAMEFRST) and last (NAMELAST) names are included in the 1850-1930 IPUMS representative samples. These names are the main driver of the linking algorithm in the linked representative samples (LRS) database, which links records from the 1880 complete-count database to 1% census samples from 1850-1930.

The absence of enumeration district (ENUMDIST) or other granular geographic identifiers in the data also makes it unwise to link this sample to other samples, even from the same year. Instead of enumeration districts, the lowest geographic identifier available for all respondents is the SEA (State Economic Area). SEAs are generally either single large urban counties or groups of contiguous counties within the same state. Where the SEA consists of a single county, that county is also identified in COUNTYICP. Regarding the variable CITY, its comparability tab notes that (in the 1950 1% sample) only central cities of metropolitan areas with a population of at least 200,000 (in 1980) are identified, and metropolitan areas with multiple central cities (e.g., New York City) are combined into a single code.
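To make the false-match concern concrete, here is a minimal sketch of how one might check whether a matching key such as (BIRTHYR, SEX, SEA) picks out a single person. The records below are invented toy values, not IPUMS data, and the helper names are hypothetical; the choice of key is an assumption for illustration only.

```python
from collections import Counter

# Hypothetical toy records standing in for full-count rows,
# each given as (BIRTHYR, SEX, SEA). Values are invented.
full_count = [
    (1915, 1, 101),
    (1915, 1, 101),  # same key as the row above: an ambiguous match
    (1915, 2, 101),
    (1920, 1, 205),
]

def candidate_counts(records):
    """Count how many records share each (BIRTHYR, SEX, SEA) key."""
    return Counter(records)

def share_uniquely_identified(records):
    """Return the fraction of records whose key appears exactly once."""
    counts = candidate_counts(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)

share = share_uniquely_identified(full_count)
print(f"Share of records with a unique key: {share:.2f}")
```

Any record whose key is shared with another record cannot be assigned a HISTID with certainty. In real data, without names or fine geography, many people will share the same birth year, sex, and SEA, which is why ambiguous keys are expected to be common.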

I wish that I could have shared better news, but I hope this information helps you make the best decision on how to proceed with your study.

Thanks for the detailed response.
