I’m using Colombia 2005 data for URBAN in R, and have noticed that the frequencies for the unweighted data and weighted data using PERWT differ by a factor of 7, and the proportions can differ by a factor of 2. Which one is correct, which one should I use, or how do I interpret them?
svy.w4 <- svydesign(ids = ~1, data = data05, weights = data05$PERWT/100)
prop.table(table(data05$URBAN))
1 2
0.4190888 0.5809112
prop.table(svytable(~data05$URBAN, design = svy.w4))
data05$URBAN
1 2
0.2416189 0.7583811
Weights are necessary to produce nationally representative estimates, so the second (weighted) table is the one to use. While the Colombia 2005 census data distributed by IPUMS International includes 10% of the records from the full census, this is not a simple random subsample of the full data. As mentioned on the sample characteristics page, the data were drawn from a stratified sample shared with IPUMS by the Colombian Departamento Administrativo Nacional de Estadística (DANE). As a result, persons and households with some characteristics are overrepresented in the sample, while others are underrepresented. The documentation additionally notes that the census undercounted the population by an estimated 3.7%. Applying PERWT adjusts for both issues, so that weighted person records are representative of the sample universe (alternatively, HHWT allows researchers to produce representative estimates for households).
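To illustrate why the two sets of proportions can differ so much, here is a minimal base R sketch. The data frame `toy` and all of its values are made up for illustration (they are not taken from the Colombia 2005 sample); the only point is that when one category is sampled at a higher rate and therefore carries smaller weights, the unweighted shares describe the sample records, while the weighted shares estimate the population.

```r
# Hypothetical toy data, NOT actual Colombia 2005 records.
# Category 1 is oversampled, so each of its records carries a smaller weight.
toy <- data.frame(
  URBAN = c(1, 1, 1, 1, 2, 2),
  PERWT = c(5, 5, 5, 5, 20, 20)
)

# Unweighted shares: describe the sample records only.
unweighted <- prop.table(table(toy$URBAN))

# Weighted shares: sum PERWT within each category, then normalise.
weighted <- prop.table(xtabs(PERWT ~ URBAN, data = toy))

print(unweighted)
print(weighted)
```

In this toy example category 1 makes up two thirds of the records but only one third of the weighted total, which is the same kind of reversal you see between your two tables.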
This is very helpful. I should clarify that I am hoping to use municipal-level data. Since the sample is only 10% of the census data, there is probably a lot of missingness. The question I have now is whether missingness occurs because whole municipalities or departments were omitted, or because only some individuals or sub-municipal units were omitted. Would you recommend seeking municipal-level data from the source itself (in this case DANE), or is it ok to use IPUMS data at the municipal-level? Thank you Ivan!
The department of residence for this sample is identified in the variable GEO1_CO2005 and the municipality is identified in GEO2_CO2005. It appears that all departments are represented in the data. Regarding municipalities, the comparability tab for GEO2_CO2005 notes that municipalities with populations less than 20,000 (based on 1993 counts) were regionalized (combined) with neighboring municipalities within the same department to create units with populations greater than 20,000. For example, residents of Ciudad Bolívar, Hispania, and Betania municipalities in Antioquia are all assigned the same code in GEO2_CO2005 (5008). All municipalities are represented in the sample, but they are not all identified individually because of the grouping together of smaller municipalities in the GEO2_CO2005 variable.
If you need to identify every municipality individually, you might reach out to DANE to see whether you can obtain more precise geographic identifiers. To use the IPUMS International microdata, you will have to either restrict your analysis to municipalities with 20,000+ inhabitants or aggregate your analysis to these regionalized units.
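If you go the route of aggregating to the regionalized units, a rough base R sketch of weighted URBAN shares by GEO2_CO2005 might look like the following. The data frame `toy`, its weights, and the code 5001 are hypothetical values used only for illustration; 5008 is the grouped code mentioned above. With your extract you would replace `toy` with your own data frame.

```r
# Hypothetical toy data; the weights and the code 5001 are illustrative only.
toy <- data.frame(
  GEO2_CO2005 = c(5001, 5001, 5001, 5008, 5008, 5008),
  URBAN       = c(1, 2, 2, 1, 1, 2),
  PERWT       = c(10, 10, 20, 5, 15, 20)
)

# Weighted counts: sum of PERWT in each unit-by-URBAN cell.
weighted_counts <- xtabs(PERWT ~ GEO2_CO2005 + URBAN, data = toy)

# Row-wise shares: the URBAN distribution within each regionalized unit.
weighted_shares <- prop.table(weighted_counts, margin = 1)

print(weighted_shares)
```

Keep in mind that estimates for smaller units rest on fewer sample records, so they will be noisier than the national figures.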