Duplicate individual IDs in the Vietnam 2009 census?


#1

I’m working with the 1989, 1999, and 2009 Vietnam census data.

I’ve created household IDs and person IDs as follows:

gen double hhid=sample*10^8+serial

gen double pid=hhid*100+pernum

After doing this, more than 1 million observations in the 2009 sample share their pid with other observations, so pid is not unique.

. bys pid: gen obs=_N

. tab obs

        obs |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 | 17,376,097       90.63       90.63
          2 |          2        0.00       90.63
          3 |  1,796,643        9.37      100.00
         43 |         43        0.00      100.00
------------+-----------------------------------
      Total | 19,172,785      100.00

. tab year if obs==3

       Year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2009 |  1,796,643      100.00      100.00
------------+-----------------------------------
      Total |  1,796,643      100.00

Do you have an idea of what may be happening?

Thank you!


#2

Your line “gen double hhid=sample*10^8+serial” does not leave enough digits for serial, so sample and serial overlap within hhid. Whenever serial has more than eight digits, its leading digits are added into the trailing digits of sample, and different households can end up with the same hhid. Instead, you should multiply sample by 10^10. When I make this change to your code, I get 19,172,742 unique values for pid across the three Vietnam census samples. In other words, there are zero duplicate individual IDs.
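
If it helps, here is a minimal sketch of another way to check for duplicates without packing digits into a single number. It assumes your extract contains sample, serial, and pernum, as in your code, and that these three variables together identify a person. The names hhid_alt, pid_alt, and obs_alt are made up for this example.

```stata
* Sketch only: assumes sample, serial, and pernum are in the extract.

* Report duplicates on the component variables directly.
duplicates report sample serial pernum

* Build sequential integer IDs from the components.
* hhid_alt and pid_alt are hypothetical names, not IPUMS variables.
egen long hhid_alt = group(sample serial)
egen long pid_alt  = group(sample serial pernum)

* Repeat the check from the original post on the new person ID.
bysort pid_alt: gen obs_alt = _N
tab obs_alt
```

Because group() numbers each distinct combination 1, 2, 3, and so on, this does not depend on picking the right multiplier, and it avoids storing very large values in a double. The trade-off is that the resulting IDs carry no readable sample or serial digits.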

Hope this helps.