I am analysing the India surveys and want to svyset the data in Stata.
I was wondering which variable I should use as the psu/cluster variable? I have seen ‘cluster’ mentioned for this use but this is not available on the international data.
Many thanks
Nearly all IPUMS International samples, and all India samples, consider households to be the primary sampling unit (PSU). Therefore, it is usually advised to use the household identifier variable SERIAL for clustering. Note, however, that depending on the type of analysis you are performing, it may be more appropriate to cluster at a different level. For example, if you are aggregating data across regions before performing your analysis, then controlling for inter-region correlation of your outcome variable may be advisable. More details about variance estimation with IPUMS International data are available here. Specific sample characteristics for Indian samples are available here.
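As a rough illustration of the advice above, a minimal Stata sketch might look like the following. It assumes the extract contains the person weight PERWT and the variables SAMPLE and AGE alongside SERIAL, with variable names in lowercase; those names are assumptions on my part, so please check them against your own extract.

```stata
* Minimal sketch: declare households (serial) as the PSU.
* perwt, sample and age are assumed to be present in the extract.
svyset serial [pweight=perwt]

* If several samples are pooled, serial may repeat across samples,
* so a combined identifier may be needed instead (assumption):
* egen hh_psu = group(sample serial)
* svyset hh_psu [pweight=perwt]

* Check the declared design, then run a design-based estimate.
svydescribe
svy: mean age
```

The commented-out lines are only relevant when more than one sample is analysed together; for a single sample, SERIAL on its own should be enough.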
Thank you for the quick response. Sorry for my confusion but the sampling strategy states that it is a multistage design, with the first round selecting rural villages and urban wards. So should an identifier for these not be the PSU? Or was that not available in the data?
That is correct: villages and urban wards are not identifiable in the data. The lowest level of geography for India is the region level. Household IDs are available because they do not identify geographic location.