Hello, and thank you for taking the time to review my question. When I calculate national population estimates of Mexican and Central American immigrant cohorts for each census year between 1970 and 2000, the estimate for some cohorts is substantially smaller in the first census following their arrival than in the second. For example, the population estimate of Mexican and Central American immigrants who arrived between 1965 and 1969 is substantially smaller in the 1970 1% Form 1 State sample than in the 1980 5% sample. The same pattern appears for Mexican and Central American immigrants who arrived between 1975 and 1979: the 1980 5% state sample estimate for this cohort is substantially smaller than the 1990 5% state sample estimate (and it occurs again for the 1985-89 cohort between the 1990 and 2000 samples). Can someone explain why this is happening? Could it be due to systematic misreporting of the year of arrival in the U.S., or perhaps a flaw in the way the person weights (variable perwt) are constructed? For reference, I am estimating each cohort's population in Stata as the weighted sum of a 0/1 indicator, using pweights with the variable perwt. Individuals in my sample get a 1 if they are in the cohort of interest and a 0 otherwise. The code I am using looks something like this: collapse (sum) cohort_of_interest [pw=perwt], by(year)
I’m not sure which variables you are using to construct these estimates, so please correct me if my answer is not entirely relevant. One possibility is that this pattern is driven by limitations of the year-of-immigration variable, YRIMMIG.
First, as noted on the description tab, in some census samples the year of immigration is given as a range. For the 1900-1930 samples and the 2000-2004 ACS, YRIMMIG reports the exact year of immigration. For 1970-1990, the respondent was asked to report the range of years that included their year of arrival. For the 2000 census and the ACS from 2005 onward, exact years are reported back to 1935; some years prior to 1935 are collapsed into categories.
Second, these ranges change across the 1970-1990 samples. Specifically, “between 1965 and 1969” is coded as “1969” in the 1980 file, but as “1970” in the 1970 file. Similarly, “1975-1979” is coded as “1979” in the 1990 file, but as “1980” in the 1980 file. Looking at the 1980 and 1990 files for Mexican and Central American records (identified via BPL), it seems that the number for this arrival cohort does go up between 1980 and 1990. This increase is present in the raw Census Bureau files and is not related to any editing by IPUMS.
Finally, as noted on the comparability tab, the specific enumeration instructions defining “year of immigration” change over time. The 1910, 1920, and 1930 US censuses asked for the year of the person’s first arrival in the United States (or Puerto Rico). The 1970, 1980, and 1990 censuses asked when the person came to stay; the 2000 census and the ACS asked when the person came to live in the United States.
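One way to see how the interval codes line up across samples is to cross-tabulate YRIMMIG against the census year for the records of interest. The following is only a minimal sketch: the rows are made-up toy values standing in for an IPUMS extract, and it assumes the extract carries the variables year, yrimmig, bpl and perwt. It uses Stata's default line delimiter rather than a semicolon delimiter.

```stata
* Toy rows only: invented to show the layout, not real IPUMS records.
clear
input year yrimmig bpl perwt
1970 1970 200 100
1970 1960 200 100
1980 1969 200  20
1980 1980 210  20
1990 1979 200  20
1990 1990 210  20
1990 1979 453  20
end

* Keep records born in Mexico (bpl 200) or Central America (bpl 210).
keep if inlist(bpl, 200, 210)

* Unweighted record counts: which YRIMMIG codes appear in which sample.
tabulate yrimmig year

* The same table with each cell holding the sum of perwt
* (iweights are used because tabulate does not accept pweights).
tabulate yrimmig year [iweight = perwt]
```

On a real extract, the rows of the first table show which codes each census year actually uses, which makes it easier to confirm that a cohort definition covers comparable arrival periods in both samples.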
First of all, I want to thank you for providing a response to my question. I have accounted for the different yrimmig codes that identify the different ranges in the different samples and have structured my cohort groups so that they are consistent with these ranges, so I know that is not the issue. More specifically, I have used the following code in Stata:
#delimit ;
***Generate an immigrant dummy variable, and keep only;
***natives and Mexican/Central American Immigrants;
generate byte imm=(citizen==2 | citizen==3);
keep if imm==0 | (imm==1 & (bpl==200 | bpl==210));
***Identify the survey year;
gen survey = year;
***Drop if year of immigration cannot be identified;
***Only drops observations in 1970;
drop if yrimmig==996;
***Relabel the code that equals the survey year to the last year of the cohort;
replace yrimmig=1969 if survey==1970 & yrimmig==1970;
replace yrimmig=1979 if survey==1980 & yrimmig==1980;
replace yrimmig=1989 if survey==1990 & yrimmig==1990;
replace yrimmig=1999 if survey==2000 & yrimmig==2000;
***Cohort codes: 0 = native, 4 = 1965-69, 6 = 1975-79, 8 = 1985-89 arrivals;
gen cohort=0 if imm==0;
replace cohort=4 if imm==1 & yrimmig>=1965 & yrimmig<=1969;
replace cohort=6 if imm==1 & yrimmig>=1975 & yrimmig<=1979;
replace cohort=8 if imm==1 & yrimmig>=1985 & yrimmig<=1989;
***0/1 cohort indicators;
gen coh65 = (cohort==4);
gen coh75 = (cohort==6);
gen coh85 = (cohort==8);
***Weighted sum of each indicator by survey year gives the population estimates;
collapse (sum) coh65 coh75 coh85 [pw=perwt], by(survey);
Although the way I have defined the 1965-69 cohort includes the year 1970 in the 1970 sample, this should favor a larger cohort population estimate for that year, which is not what I find. The same is true of the 1975-79 and 1985-89 cohorts. I also drop observations with yrimmig code 996 in 1970, which could contribute to the smaller cohort population estimate in the 1970 sample. But even if all 415 Mexican and Central American immigrant observations in 1970 that did not report a year of immigration happened to belong to the 1965-69 cohort, the sum of their person weights (41,500) still could not account for the discrepancy of roughly 110,000 between the 1970 and 1980 samples. Since the issue persists for the 1975-79 and 1985-89 cohorts, it is clearly not just an issue with the 1970 sample. The discrepancy also occurs twice before the change in wording from “stay” to “live,” so it seems that the questionnaire language is not the issue here. Could it be that the samples are just unlucky, in the sense that these cohorts are not well represented in the samples, leading to noisy estimates? Can you provide any other resources or recommendations about where I might find an explanation for this phenomenon? In any case, I sincerely appreciate your response and am grateful that you always provide prompt and relevant replies.
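The bound described above can be written out as a few lines of Stata. This is only a sketch of the arithmetic using the figures quoted in the post; the flat person weight of 100 is an assumption implied by 41,500 / 415, and nothing is re-estimated from the data.

```stata
* Figures quoted in the post; the weight of 100 is an assumption (41,500 / 415).
* 1970 records dropped because yrimmig == 996
local n_missing = 415
* assumed person weight for each of those records
local perwt_1970 = 100
* approximate gap between the 1970 and 1980 cohort estimates
local gap = 110000

* Largest population the dropped records could add to the 1965-69 cohort.
local upper_bound = `n_missing' * `perwt_1970'
display "Upper bound from dropped records: " `upper_bound'

* Part of the gap that remains even in that extreme case.
display "Remainder left unexplained:       " `gap' - `upper_bound'
```

Even under this extreme assumption, roughly 68,500 of the gap would remain unexplained.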
Thanks for your detailed and understanding reply. As for where to find an explanation for this phenomenon, I’d suggest contacting the Census Bureau.
Thanks Jeff. I’ll contact them to see if I can figure this out. I appreciate your help.