Mexico municipality averages


We are working with the Mexican IPUMS data from 1990-2015. We observe surprising patterns over time in the data, e.g. in the labor market outcomes of kids.

See for example the attached fig. 1. It shows, for different years, the municipality averages of the share of kids age 15-18 who are unemployed (weighted means; y axis), plotted against the number of observations of 15-18 year olds in the municipality (only up to 100; x axis). What strikes us is the apparent negative relationship between the number of observations and the unemployment share, and how regular that pattern is. In your view, what could drive this pattern? Should we be concerned about data quality? Fig. 2 repeats fig. 1 without limiting the number of observations and uses the log number of observations.

Relatedly, we observe that the share of municipalities with more than 50 or 100 observations of individuals aged 15-18 varies a lot across waves. Do you have a suggestion for how to correct for these differences when running analyses that use averages such as average enrollment or employment rates for 15-18 year olds?

Thanks for your help!

The relationship you are seeing is expected for municipalities with small sample sizes. The "smooth curve" patterns you see (in 1990, 1995, 2005), especially in the lower size categories, look like the functions 1/n and 2/n, where n is the number of observations in the municipality. This is what you get when only one or two of the people in the sample are unemployed, which is likely when the sample is very small: the share can only take the values 0, 1/n, 2/n, and so on. I think it would be more informative to look at the mean unemployment rate across municipalities in a given size category, rather than at a scatter plot. This would tell you whether the size-unemployment relationship is a real correlation or just a visual artifact of the scatter plot.
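To illustrate the point about 1/n curves, here is a minimal Python sketch on simulated data, not on the IPUMS extract. It assumes every municipality has the same true unemployment rate (5%, an arbitrary choice) and uses unweighted shares, so any pattern in a scatter plot would be purely a small-sample artifact. The function names are made up for this example.

```python
import random
from collections import defaultdict

def simulate_municipalities(num_munis=2000, max_n=100, true_rate=0.05, seed=0):
    """Return (n, unemployed_count) pairs; the true rate is the same everywhere."""
    rng = random.Random(seed)
    out = []
    for _ in range(num_munis):
        n = rng.randint(1, max_n)  # sample size of the municipality
        k = sum(rng.random() < true_rate for _ in range(n))  # unemployed in sample
        out.append((n, k))
    return out

def mean_share_by_size(munis, bin_width=20):
    """Unweighted mean of the municipality shares k/n within each size category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for n, k in munis:
        b = (n - 1) // bin_width  # sizes 1-20 -> bin 0, 21-40 -> bin 1, ...
        sums[b] += k / n
        counts[b] += 1
    return {
        (b * bin_width + 1, (b + 1) * bin_width): sums[b] / counts[b]
        for b in sorted(sums)
    }

munis = simulate_municipalities()

# Municipalities with exactly one unemployed respondent lie on the curve 1/n.
one_unemployed = [(n, k / n) for n, k in munis if k == 1]

# Mean share per size category: roughly flat if there is no real relationship.
binned = mean_share_by_size(munis)
for (lo, hi), mean_share in binned.items():
    print(f"n in {lo}-{hi}: mean share = {mean_share:.3f}")
```

In this simulation the scatter of share against n would show the same downward-sloping curves, while the binned means stay close to the common rate. If the binned means in the real data also look flat, the negative relationship in the figure is most likely an artifact.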

Regarding your second question, the Mexican samples used different sampling strategies. For example, the 1990 sample was a 10% flat sample, while the 1995 sample was a stratified cluster sample of 0.4% of households. You can read more detail about the samples here. The survey weights (PERWT and HHWT) will correct for the different sampling rates for geographic units of different sizes across samples. Also make sure that you are using a harmonized geography variable so that your municipality boundaries are constant over time. In your case you would want to use GEO2_MX.
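As a rough sketch of how the weights enter a municipality average, the Python snippet below computes the PERWT-weighted unemployment share per YEAR and GEO2_MX cell and also records the unweighted number of observations. The rows, the municipality code, the `unemployed` indicator, and the `min_obs` threshold are all made up for illustration; flagging small cells is one possible way to handle them, not an IPUMS recommendation.

```python
from collections import defaultdict

def weighted_shares(rows, min_obs=50):
    """PERWT-weighted unemployment share per (YEAR, GEO2_MX) cell.

    Also returns the unweighted cell size and whether it reaches min_obs
    (a hypothetical threshold chosen by the analyst).
    """
    weight_total = defaultdict(float)
    weight_unemployed = defaultdict(float)
    n_obs = defaultdict(int)
    for r in rows:
        key = (r["YEAR"], r["GEO2_MX"])
        weight_total[key] += r["PERWT"]
        weight_unemployed[key] += r["PERWT"] * r["unemployed"]
        n_obs[key] += 1
    return {
        key: {
            "share": weight_unemployed[key] / weight_total[key],
            "n": n_obs[key],
            "enough_obs": n_obs[key] >= min_obs,
        }
        for key in weight_total
    }

# Toy person records (invented values; GEO2_MX code 1 is a placeholder).
rows = [
    {"YEAR": 1990, "GEO2_MX": 1, "PERWT": 10.0, "unemployed": 1},
    {"YEAR": 1990, "GEO2_MX": 1, "PERWT": 10.0, "unemployed": 0},
    {"YEAR": 1995, "GEO2_MX": 1, "PERWT": 250.0, "unemployed": 0},
    {"YEAR": 1995, "GEO2_MX": 1, "PERWT": 150.0, "unemployed": 1},
]

result = weighted_shares(rows)
for (year, muni), stats in sorted(result.items()):
    print(year, muni, round(stats["share"], 3), stats["n"], stats["enough_obs"])
```

The weights make the shares comparable across samples with different sampling rates, but they do not reduce the noise in cells with few observations, which is why the cell size is kept alongside each share.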