Hi! I am trying to compare changes from 2007-2009 to 2010-2012, so I merged the ACS multi-year datasets for 2007-2009 and 2010-2012. I ran logit regressions in Stata with and without weights, and the coefficient for the year variable changed substantially. To be clear, I used the syntax “logit i.year i.DV. predictors… [pw=perwt]” to apply weights. For the unweighted regression, I simply dropped “[pw=perwt]”. In the weighted regression, the year variable is highly significant and positive, but in the unweighted regression, it is negative and not significant (p=0.445).
Why is there such a big difference? Which result is right? When pooling data from multiple ACS datasets, can we still apply year-specific/dataset-specific weights in the pooled sample? BTW, I know I should have also used replicate weights, but I did not. I have the same question there: how can we apply replicate weights if we pool multiple ACS datasets?
Since these are complex survey data, it is important to use weights to make the sample representative of the entire population and account for some sampling error. Since the person weights (PERWT) are specific to each person within each sample, it is not problematic to combine samples. Each person keeps the same person weight after pooling, so they still represent the same portion of the population in their sample year.
Similarly, replicate weights are generated for each individual and household and included on each individual and household record. However, you will want to make sure you are accounting for year in any analysis, since replicate weights are meant to be used for generating standard errors for a specific year’s population.
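To make the role of the replicate weights more concrete, here is a minimal sketch in Python (not Stata) of the successive difference replication variance calculation that the ACS replicate weights are designed for: the estimate is recomputed once per replicate weight, and the squared deviations from the full-sample estimate are summed and scaled by 4 divided by the number of replicates (4/80 for the 80 ACS replicate weights). The function name and the numbers are made up for illustration, and the toy call uses only four replicate estimates to keep it short.

```python
import math

def replicate_se(full_estimate, replicate_estimates):
    """Standard error from replicate estimates (successive difference replication).

    full_estimate: the estimate computed with the main weight (e.g. PERWT).
    replicate_estimates: the same estimate recomputed once per replicate weight.
    The 4 / R scaling equals 4/80 when all 80 ACS replicate weights are used.
    """
    r = len(replicate_estimates)
    variance = (4.0 / r) * sum(
        (rep - full_estimate) ** 2 for rep in replicate_estimates
    )
    return math.sqrt(variance)

# Hypothetical toy values: a full-sample estimate and four replicate estimates.
toy_se = replicate_se(10, [11, 9, 10, 12])
print(toy_se)
```

In a pooled file, each record still carries the replicate weights from its own sample, so the same calculation can be applied to an estimate computed on the pooled data, as long as the year is accounted for in the analysis as noted above.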
Thank you, Joe! But why did the results change so substantially and only for the “year” variable? Should I trust the results with weights? The results that changed so substantially are very unsettling… Thank you in advance for your time!
Without seeing your analysis I cannot say for sure, but it is possible that applying the weights changed the estimated effect of the year variable. For example, if the population of interest is underrepresented in the unweighted data and grew substantially in the second multi-year file, then adding the weights would better reflect the fact that there were substantially more people in this population of interest, making YEAR a significant predictor.
This is related to the fact that the YEAR variable in multi-year files gives the most recent year (e.g., in the 2008-2012 ACS file, every person is coded as YEAR==2012, since that is the year of the dataset). The variable MULTYEAR gives the interview year for multi-year files. This means that the YEAR change in your analysis is one 3-year change (which is an accurate representation of the data), not a series of six 1-year changes. So, if the increase in the population of interest was slow from 2007-2009 but then picked up from 2010-2012, the relationship between YEAR and the population of interest would be a very steep line connecting the two time points of 2009 and 2012. We would normally expect this to show up in the unweighted data as well, but if this population is underrepresented there, and therefore carries large weight values so that the few cases in the data represent the true population, the trend would become apparent only once the data are weighted.
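To illustrate how this can happen, here is a small Python sketch (not Stata, and not your data) with entirely made-up numbers. It compares the change in an outcome rate between two periods with and without weights. With a single period indicator and no other predictors, the sign of a logit coefficient on the period matches the sign of this change in rates, so it is a simplified stand-in for the regression. The group labels, counts, and weights are all hypothetical.

```python
def make_rows(period, group, n_total, n_outcome, weight):
    """Build n_total records; the first n_outcome of them have outcome = 1."""
    return [
        (period, group, 1 if i < n_outcome else 0, weight)
        for i in range(n_total)
    ]

# Hypothetical pooled sample. The "interest" group is sampled thinly, so it
# carries large weights, and its weight doubles in the second period because
# the group it represents grew.
rows = (
    make_rows(2009, "majority", 100, 50, 10)
    + make_rows(2009, "interest", 10, 2, 50)
    + make_rows(2012, "majority", 100, 42, 10)
    + make_rows(2012, "interest", 10, 8, 100)
)

def outcome_rate(rows, period, use_weights):
    """Share of records with outcome = 1 in a period, optionally weighted."""
    numerator = 0.0
    denominator = 0.0
    for row_period, _group, outcome, weight in rows:
        if row_period != period:
            continue
        w = weight if use_weights else 1.0
        numerator += w * outcome
        denominator += w
    return numerator / denominator

unweighted_change = outcome_rate(rows, 2012, False) - outcome_rate(rows, 2009, False)
weighted_change = outcome_rate(rows, 2012, True) - outcome_rate(rows, 2009, True)
print(f"unweighted change: {unweighted_change:+.3f}")
print(f"weighted change:   {weighted_change:+.3f}")
```

In this toy setup the unweighted change is slightly negative (about -0.02) while the weighted change is clearly positive (about +0.21), because the few heavily weighted cases dominate only once the weights are applied.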
This is all just speculative. To see if this might be the case, I would recommend checking the weight values for your population of interest and comparing them between the two multi-year datasets. Does one of the datasets include larger weights for this population than the other? Is the number of cases representing the population of interest very small in both files? These and similar questions may get you closer to the answer.
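As a rough sketch of that check, the Python snippet below (again not Stata) counts the cases in the population of interest and summarizes their weights separately for each multi-year file. The record layout, the `in_group` flag, and the values are hypothetical; in practice the flag would come from whatever condition defines your population of interest, and the weight would be PERWT.

```python
from collections import defaultdict

def summarize_weights(records):
    """Summarize weights for the population of interest by dataset year.

    records: iterable of (year, in_group, perwt) tuples, where in_group is a
    hypothetical True/False flag marking the population of interest.
    Returns {year: {"cases", "weight_sum", "mean_weight"}}.
    """
    summary = defaultdict(lambda: {"cases": 0, "weight_sum": 0.0})
    for year, in_group, perwt in records:
        if in_group:
            summary[year]["cases"] += 1
            summary[year]["weight_sum"] += perwt
    for stats in summary.values():
        stats["mean_weight"] = stats["weight_sum"] / stats["cases"]
    return dict(summary)

# Made-up records standing in for the two pooled multi-year files.
records = [
    (2009, True, 40), (2009, True, 60), (2009, False, 10),
    (2012, True, 90), (2012, True, 110), (2012, True, 100), (2012, False, 12),
]

result = summarize_weights(records)
for year in sorted(result):
    print(year, result[year])
```

A small case count combined with a large mean weight, or a big jump in the weight sum between the two files, would be consistent with the explanation above, though it would not prove it.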