How to take into account the complex survey design (primary sampling unit) of the Brazilian Censuses?


Hello, I have read that the Brazilian censuses had a complex survey design with municipios being the primary sampling units.

I would like to take this into account in order to obtain accurate standard errors.

My question is the following: can I use GEO2B_BR (municipios with inconsistent boundaries over time) to identify the primary sampling unit given that this variable groups municipios with fewer than 20,000 inhabitants into a single category for each state?

Thank you for your help!



Brazil in 1991, 2000, and 2010 used a complex stratification design (you can find details about each design here), while earlier years used systematic sampling. Since both of these are forms of stratification, rather than simple random samples, you technically should adjust your sampling error estimates to account for the sample design. On the other hand, the sample size of the Brazil census is quite large, which means the risk of drawing invalid inferences from unadjusted standard errors is minimal.

If you are looking at smaller subgroups or relationships that are marginally statistically significant, then it may be necessary to adjust your standard errors. While municipalities were the smallest geography in the Brazil census, households were actually the sampling unit. Thus, you should adjust for the clustering of persons within households by using SERIAL (Household ID) as the cluster variable. The additional use of household or person weights should account for any oversampling of geographies due to the complex stratification design.

This User Note on sampling error and variance estimation provides more information on accounting for sample design, including strategies at the end of the note.

Hope this helps.



Thank you Tim Moreland, this is really helpful.

In my case, I am interested, for each municipio, in the mean income of individuals according to their race.

Can you please tell me if I understood your reply, and the documentation you advised me to read, correctly?

a) If I am solely interested in computing these means (i.e., in point estimates) without making any statistical inference about them, then I just have to take care of weights and I don’t need to bother with clusters, stratification, or special subpopulation estimation (because the latter only affect the standard errors of the point estimates).

b) If I am interested in statistical inference (e.g., testing whether the mean income of black people is statistically different from that of white people in a given municipio A), then taking care of clustering and special subpopulation estimation becomes very important, because I could otherwise mistakenly obtain significant results (type I error). There’s no stratification variable in IPUMS, but this is less of a concern because not adjusting for it yields conservative standard errors (so the worst thing that could happen is a type II error). To summarize, for case b), I should write in Stata:

svyset serial [pweight=wtper], vce(linearized)

svy, subpop(muncipioA): mean inctot, over(race)

Have a nice day, thanks a lot.



Regarding Part A of your question, that is correct. If you are only interested in the mean, then just using weights will be sufficient.
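For the weights-only case, a minimal Stata sketch might look like the following. The toy data are made up purely so the commands run on their own, and muncipioA is assumed to be a 0/1 indicator for the municipio of interest (as in your code); with a real extract you would skip the input step.

```stata
* Toy data (made up) so the example runs without an IPUMS extract
clear
input long serial byte race double inctot double wtper byte muncipioA
1 1 1200 10 1
1 1  900 10 1
2 2  600 12 1
3 2  750 12 1
4 1 1500  8 0
5 2  400  8 0
end

* Weighted mean income by race within municipio A; no survey design declared.
* The point estimates are the same as with svy; only the standard errors differ.
mean inctot if muncipioA == 1 [pweight=wtper], over(race)
```

The standard errors printed by this command ignore the household clustering, so they should not be used for inference.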

As for Part B, that is also correct. Clustering decreases the precision of the sample estimates, so ignoring it understates the standard errors and risks a Type I error; you certainly want to account for it. It is not currently possible to account for stratification. However, stratification acts to increase the precision of the sample estimates, so ignoring it overstates the standard errors and is less of a concern (the risk is a Type II error).
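If you want the test of the difference directly, one possible approach (a sketch, not the only way) is to regress income on race indicators under the same svyset declaration; the coefficient on the race indicator is then the difference in means with design-based standard errors. The toy data and the race codes 1 and 2 below are illustrative assumptions, not actual IPUMS values, and muncipioA is again assumed to be a 0/1 indicator.

```stata
* Toy data (made up) so the example runs without an IPUMS extract
clear
input long serial byte race double inctot double wtper byte muncipioA
1 1 1200 10 1
1 1  900 10 1
2 2  600 12 1
2 2  650 12 1
3 2  750 12 1
4 1 1500  8 1
5 1 1100  9 1
6 2  400  8 0
7 1  950  8 0
end

* Households (serial) are the clusters; wtper is the person weight
svyset serial [pweight=wtper], vce(linearized)

* Restrict with subpop() rather than "if", so the variance calculation
* still uses the full set of sampled households
svy, subpop(muncipioA): regress inctot i.race

* The coefficient on 2.race is the difference in mean income
* (race 2 minus race 1) within municipio A; its t test is the
* design-based test that the two means are equal
```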

Finally, your Stata code also looks correct.



Great! Thanks a lot for your time and help.