I am a researcher interested in combining the PRCS with the ACS to learn how nativity differences among Puerto Ricans are associated with different outcomes. I learned that the PRCS, as downloaded from the IPUMS USA website, does not have cluster and strata variables, so I am wondering how best to account for the complex sample design of both surveys in my analysis.
Would you recommend that I create cluster and strata values that identify respondents from the PRCS, so I can use the cluster and strata in my analysis? That is, should I give the strata and cluster variables a random value for those interviewed by the PRCS? Or would you recommend that I use the replicate weights instead?
I used the replicate weights for the mean estimates and the regression analysis, which works fine. However, I am having problems running post-estimation commands after using the replicate weights, because the Stata post-estimation commands assume that the standard errors are linearized, not BRR.
I would appreciate your help with how best to account for the complex survey design of a sample from a dataset that combines the PRCS with the ACS.
To create cluster and strata variables for the PRCS samples, you can follow the same method used to create them for the ACS samples. For strata, concatenate STATEFIP and PUMA. For cluster, use the formula 1000000000*year + (serial*10) + datanum. Because datanum is unique across samples, this yields unique clusters after the ACS and PRCS samples are combined.
You are correct that STRATA and CLUSTER variables are not available for PRCS data (as shown in the availability tabs for each variable). The Census Bureau provides replicate weights for PRCS data; these should enable you to estimate standard errors. If, however, your analysis combines PRCS and ACS, or you require postestimation commands that are not available when using replicate weights, you will need to calculate STRATA and CLUSTER variables using the methods mentioned above, with one modification: the variable DATANUM has been replaced by SAMPLE. Details are available on the SAMPLE variable description page. I can’t be sure, but I assume the “1000000000” in the CLUSTER formula is designed to ensure that values are unique, since SERIAL uniquely identifies a household only within a sample.
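For anyone who wants to check the arithmetic before building these variables in their own software, here is a minimal Python sketch of the two constructions described above. The function names, the zero-padding widths used for STATEFIP and PUMA, and the toy household records are my own assumptions for illustration, not part of the IPUMS documentation. The formula gives the sample identifier only the final digit, so if the identifier you substitute has more than one digit, it is worth confirming that no two households end up sharing a cluster value; the last function does that check.

```python
def make_strata(statefip: int, puma: int) -> str:
    """Concatenate STATEFIP and PUMA into one strata code.

    Assumption: STATEFIP is padded to 2 digits and PUMA to 5 digits so
    that different (state, PUMA) pairs cannot produce the same string.
    """
    return f"{statefip:02d}{puma:05d}"


def make_cluster(year: int, serial: int, sample_id: int) -> int:
    """Apply the formula from the thread:
    1000000000*year + (serial*10) + sample identifier."""
    return 1_000_000_000 * year + serial * 10 + sample_id


def colliding_clusters(households):
    """Return cluster values shared by more than one distinct household.

    `households` is an iterable of (year, serial, sample_id) tuples, one
    per household. An empty result means the cluster values are unique.
    """
    by_cluster = {}
    for year, serial, sample_id in set(households):
        cluster = make_cluster(year, serial, sample_id)
        by_cluster.setdefault(cluster, []).append((year, serial, sample_id))
    return {c: hhs for c, hhs in by_cluster.items() if len(hhs) > 1}


if __name__ == "__main__":
    # Hypothetical records: the same SERIAL appears in two samples of one year.
    toy = [(2019, 1, 1), (2019, 1, 2)]
    print(make_strata(72, 101))               # strata code for a toy PUMA
    print([make_cluster(*hh) for hh in toy])  # one cluster value per household
    print(colliding_clusters(toy))            # {} when all values are unique
```

If the check returns anything other than an empty dictionary for your extract, the sample identifier is spilling into the digits reserved for SERIAL, and the multiplier on SERIAL would need to be widened accordingly. Once the variables are unique, they can be declared to your survey software (in Stata, through `svyset`).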