Why doesn't the number of households by income bin from individual-level data match the Census publication?

We have been unable to reproduce the published 1950 statistics on the income distribution from the 1950 individual-level data (Source: https://www.census.gov/prod/www/decennial.html, 1950 Census, page 1-104, table 57). (I have also attached a screenshot of the Census publication.)

Basically, our approach was to keep sample-line households only and weight by the sample-line weight, which should give us the number of families and unrelated individuals.

Here is the code we used to generate the number of families and unrelated individuals by income bin:

#delimit ;

* keep only the -sample-line person-, for whom there is income data *;
keep if slpernum == pernum;

* drop N/A income codes *;
drop if ftotinc == 9999999 | ftotinc == 9999998;

* create income bins for 1950, according to the census published bin categories *;
gen bin = .;
replace bin = 1 if ftotinc < 500;
replace bin = 2 if ftotinc >= 500 & ftotinc <= 999;
replace bin = 3 if ftotinc >= 1000 & ftotinc <= 1499;
replace bin = 4 if ftotinc >= 1500 & ftotinc <= 1999;
replace bin = 5 if ftotinc >= 2000 & ftotinc <= 2499;
replace bin = 6 if ftotinc >= 2500 & ftotinc <= 2999;
replace bin = 7 if ftotinc >= 3000 & ftotinc <= 3499;
replace bin = 8 if ftotinc >= 3500 & ftotinc <= 3999;
replace bin = 9 if ftotinc >= 4000 & ftotinc <= 4499;
replace bin = 10 if ftotinc >= 4500 & ftotinc <= 4999;
replace bin = 11 if ftotinc >= 5000 & ftotinc <= 5999;
replace bin = 12 if ftotinc >= 6000 & ftotinc <= 6999;
replace bin = 13 if ftotinc >= 7000 & ftotinc <= 9999;
replace bin = 14 if ftotinc >= 10000;

label define bin 1 "Less than $500"
2 "$500 to $999"
3 "$1,000 to $1,499"
4 "$1,500 to $1,999"
5 "$2,000 to $2,499"
6 "$2,500 to $2,999"
7 "$3,000 to $3,499"
8 "$3,500 to $3,999"
9 "$4,000 to $4,499"
10 "$4,500 to $4,999"
11 "$5,000 to $5,999"
12 "$6,000 to $6,999"
13 "$7,000 to $9,999"
14 "$10,000 or more", replace;
label value bin bin;

* collapse count of families by income bin, with weights *;
collapse (count) ftotinc [iw=slwt], by (bin);
rename ftotinc num_family;
sum num_family bin, detail;

* calculate percentages *;
egen percent=pc(num_family);
format percent %9.1f;
format num_family %11.0gc;

* create table: number of families and unrelated individuals by income bin *;
table bin, c(sum num_family sum percent) format(%11.2gc) center row;

The total number of families and unrelated individuals we get (42,807,270) is smaller than the figure in the Census publication (46,489,090), and the bottom bins in particular are quite far off.

We are not sure whether ftotinc is defined for unrelated individuals, so we tried to disaggregate families and unrelated individuals. In this second approach, we kept sample-line households only, kept only records where relate = 1, and weighted by the sample-line weight, which should give us the number of families. We then took sample-line individuals where relate is 11 or 12, again weighting by the sample-line weight. Neither count matched the Census publication. We also tried inctot for unrelated individuals, but it still did not match.
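For reference, a minimal sketch of this second approach in Stata is below. It is written for the default line delimiter and assumes the extract contains pernum, slpernum, slwt, relate, ftotinc and inctot. The variable names unit, has_ftotinc and has_inctot are made up for illustration, and treating values of 9999998 and above as unusable income codes is an assumption carried over from the drop statement in the first approach.

```stata
#delimit cr
* Sketch only: the unit coding and the income cut-off below are assumptions.
preserve
keep if slpernum == pernum

* 1 = head (counted here as one family), 2 = non-relative (relate 11 or 12)
generate byte unit = 1 if relate == 1
replace unit = 2 if inlist(relate, 11, 12)
label define unit 1 "Families (relate 1)" 2 "Unrelated individuals (relate 11 or 12)"
label values unit unit

* Weighted count of each unit type, using the sample-line weight
tabulate unit [iweight = slwt]

* Flag records whose income value is below the assumed N/A codes
generate byte has_ftotinc = ftotinc < 9999998
generate byte has_inctot  = inctot < 9999998

* Weighted share of each unit type with a usable ftotinc or inctot value
tabulate unit has_ftotinc [iweight = slwt], row
tabulate unit has_inctot [iweight = slwt], row
restore
```

The two cross-tabulations are there only to show how much of each group drops out when the N/A codes are excluded, which is one place where the totals could diverge from the published table.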

We are writing in the hopes that you’ll have some insight into why this might be the case. Perhaps we’ve done something wrong in using the 1950 data? Or perhaps there is some other reason the 1950 individual data do not aggregate to the published statistics?

Thank you,

Yuan

The 1950 PUMS file was originally constructed by the Census Bureau and the FTOTINC variable is a direct representation of what the Census Bureau published. A detailed description of how the Census Bureau drew the 1950 1% sample can be found on the 1950 SAMPLING PROCEDURES page. Unfortunately, I am not able to say why discrepancies like the one you have pointed out exist. I would encourage you to reach out to the Census Bureau as they may have more specific information regarding how the 1950 PUMS file is expected to deviate from the summary tables.