Large difference between weighted and unweighted frequencies

Patrick_McQuestion · October 16, 2024, 4:17pm

I’m using Colombia 2005 data for URBAN in R, and have noticed that the frequencies for the unweighted data and weighted data using PERWT differ by a factor of 7, and the proportions can differ by a factor of 2. Which one is correct, which one should I use, or how do I interpret them?

library(survey)

svy.w4 <- svydesign(ids = ~1, data = data05, weights = data05$PERWT/100)

# Unweighted proportions
prop.table(table(data05$URBAN))
        1         2
0.4190888 0.5809112

# Weighted proportions
prop.table(svytable(~data05$URBAN, design = svy.w4))
data05$URBAN
        1         2
0.2416189 0.7583811

Ivan_Strahof · October 17, 2024, 7:51pm

Weights are necessary to produce nationally representative estimates, so the weighted proportions in your second table are the ones to use. Although the Colombia 2005 census data distributed by IPUMS International includes 10% of the records from the full census, it is not a random subsample of the full data. As noted on the sample characteristics page, the data were drawn from a stratified sample shared with IPUMS by Colombia's Departamento Administrativo Nacional de Estadística (DANE). As a result, persons and households with some characteristics are overrepresented in the sample, while others are underrepresented. The documentation additionally notes that the census undercounted the population by an estimated 3.7%. Applying PERWT adjusts for this so that persons in the data are representative of the sample universe (alternatively, HHWT allows researchers to produce representative estimates for households).
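
To see why the two tables can differ this much, here is a minimal sketch in base R using made-up numbers (not the Colombia data). It assumes a hypothetical population of 1,000 people in which rural residents are sampled at a higher rate than urban residents. The data frame `toy`, its columns, and the sampling rates are all invented for illustration.

```r
# Hypothetical population: 800 urban, 200 rural (made-up numbers).
# Urban residents are sampled at 1 in 20 and rural residents at 1 in 4,
# so each sampled person carries a weight of 20 or 4 respectively.
toy <- data.frame(
  urban = c(rep("urban", 40), rep("rural", 50)),
  wt    = c(rep(20, 40),      rep(4, 50))
)

# Unweighted proportions describe the sample, not the population.
unweighted <- prop.table(table(toy$urban))

# Weighted proportions: sum the weights within each category, then normalize.
weighted <- prop.table(xtabs(wt ~ urban, data = toy))

print(round(unweighted, 3))
print(round(weighted, 3))
```

In this toy case the unweighted urban share is 40/90, while the weighted share recovers the assumed population value of 800/1000. The same mechanism, with a more complex design, would explain a gap like the one between your two tables.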

Patrick_McQuestion · November 8, 2024, 8:58pm

This is very helpful. I should clarify that I am hoping to use municipal-level data. Since the sample is only 10% of the census data, there is probably a lot of missingness. My question now is whether that missingness occurs because whole municipalities or departments were omitted, or because only some individuals or sub-municipal units were omitted. Would you recommend seeking municipal-level data from the source itself (in this case DANE), or is it OK to use IPUMS data at the municipal level? Thank you Ivan!

Ivan_Strahof · November 11, 2024, 6:57pm

The department of residence for this sample is identified in the variable GEO1_CO2005 and the municipality is identified in GEO2_CO2005. It appears that all departments are represented in the data. Regarding municipalities, the comparability tab for GEO2_CO2005 notes that municipalities with populations less than 20,000 (based on 1993 counts) were regionalized (combined) with neighboring municipalities within the same department to create units with populations greater than 20,000. For example, residents of Ciudad Bolívar, Hispania, and Betania municipalities in Antioquia are all assigned the same code in GEO2_CO2005 (5008). All municipalities are represented in the sample, but they are not all identified individually because of the grouping together of smaller municipalities in the GEO2_CO2005 variable.

If you need to identify each municipality individually, you might reach out to DANE to see whether you can obtain more precise geographic identifiers. To use the IPUMS International microdata, you will have to either restrict your analysis to municipalities with 20,000+ inhabitants or aggregate your analysis to these regionalized units.
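
As a rough sketch of the second option, weighted estimates can be aggregated to the regionalized units by summing PERWT within each GEO2_CO2005 code. The rows below are invented for illustration: only the code 5008 comes from the example above, while 5001 and all URBAN values and weights are placeholders. The sketch also assumes PERWT is already on the scale you need for your extract.

```r
# Invented rows: GEO2_CO2005 code 5008 is the regionalized unit mentioned
# above; 5001 is a placeholder code, and all weights are made up.
data05_toy <- data.frame(
  GEO2_CO2005 = c(5008, 5008, 5008, 5001, 5001),
  URBAN       = c(1,    2,    2,    2,    1),
  PERWT       = c(12,   8,    10,   9,    11)
)

# Weighted person counts by geographic unit and URBAN category.
counts <- xtabs(PERWT ~ GEO2_CO2005 + URBAN, data = data05_toy)

# Row-wise proportions: the URBAN distribution within each unit.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 3))
```

Each row of `shares` then describes a regionalized unit as a whole, not any single small municipality inside it.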
