Dear IPUMS teams,
I have some questions about which weights should be used when creating a panel that links observations from the Annual Social and Economic Supplement (ASEC) and the Basic Monthly Sample (BMS). For my project, I have created a panel with CPS data, using CPSIDV to link participants from 2002 to 2024. My aim is to observe the wage outcomes of migrants who naturalize within the available observation period. After linkage, cleaning, etc., I ended up with around 600 unique individuals who naturalized within the observation window, with a total of 1,375 observations.
Now, my questions are: (1) which weights should I use when running an individual fixed effects regression, and (2) how should I apply them, given that not all of my observations have a value for panlwt?
I understand that each weight is created to account for different things. However, after I consulted some IPUMS members during office hours, their recommendation was to probably use a combination of earnwt (because my outcome variable is wages from labour), panlwt, and asecwt. For instance, after reading this very informative post, I understand that I should probably use panlwt; however, not all of my observations have a value for this weight.
What I have attempted already is a sort of hierarchical approach, prioritizing:
- 1st panlwt (for those who have it available)
- 2nd earnwt
- 3rd asecwt
- 4th wtfinl
This is the syntax I used:
// Start with the panel weight where it is available
gen analysis_weight = panlwt
// For observations without a panel weight, use the earner weight
replace analysis_weight = earnwt if missing(analysis_weight) & !missing(earnwt)
// For ASEC observations without a panel or earner weight, use the ASEC weight
replace analysis_weight = asecwt if missing(analysis_weight) & !missing(asecwt)
// Otherwise, fall back on the final basic weight
replace analysis_weight = wtfinl if missing(analysis_weight) & !missing(wtfinl)
Would you say this is a reasonable approach to apply given the panel structure of the data?
Furthermore, once this was done, I attempted to normalize the weights in two different ways.
- First, by dividing each weight by its mean, r(mean).
- Second, as I have read in other forums, such as this one, by dividing it by the number of samples in my extract (297 in my case).
Would either of these approaches be correct?
I hope this was relatively clear and please let me know if you need more information from me to provide a complete answer. Thank you very much in advance for your help and support.
Hi @Tomas_Bascolo,
Thanks for checking out the discussion I had with IPUMS staff. Before going further, a quick disclaimer: I am not an IPUMS expert or staff member, so please take my words with caution.
From my understanding, the discussion focused on how to annualize the monthly weights provided in BMS into annual weights. If I were you, I would recalculate the annual weights for BMS before merging with ASEC, so both datasets align in terms of weighting before moving forward.
Another takeaway from the discussion was that whenever you plan to include earnings in your analysis, earnwt should be your first choice since it is designed for income-related calculations. In my work, I use wtfinl for demographic information and earnwt for analyses involving earnings.
Not sure if this is helpful, but that is my two cents.
Thank you @kobkabnaja for your input! Recalculating the weights before merging makes sense, and so does the order of priority.
Appreciate your input!
Rather than prioritizing different weights based on availability, our IPUMS data team recommends using a single longitudinal weight. You might use one of the longitudinal weights that are offered or create your own by modifying the sample code. I’ll explain what different weights are used for so that you can determine the best approach for your study.
EARNWT, ASECWT, and WTFINL are all offered in order to make a cross-sectional sample representative of the US (civilian non-institutional) population. The correct weight for a cross-sectional analysis depends on the variables that you include, though only a single weight should be used at a time. For instance, EARNWT is used for analyses that include Earner Study variables, which are administered to respondents in one of the outgoing rotation groups. EARNWT adjusts for the fact that analyses of these variables are conditioned on respondents being in one of the outgoing rotation groups. Note that EARNWT is not used for analyses of income variables in ASEC samples since those are administered to all ASEC participants. If your analysis does not include any of the Earner Study variables and instead uses income data from the ASEC, ASECWT will be the correct cross-sectional weight.
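As a rough illustration of using one cross-sectional weight at a time, here is a minimal Stata sketch. The variables EARNWEEK, INCWAGE, and ASECFLAG, and the not-in-universe cutoffs, are assumptions for illustration and should be checked against the codebook for your extract.

```stata
// Earner Study variable (outgoing rotation groups only): weight with EARNWT.
// Assumption: top codes such as 9999.99 mark not-in-universe records.
mean earnweek if earnwt > 0 & earnweek < 9999 [pweight = earnwt]

// ASEC income variable: weight with ASECWT.
// Assumption: asecflag == 1 marks ASEC records and 99999998/99999999
// are missing/not-in-universe codes.
mean incwage if asecflag == 1 & incwage < 99999998 [pweight = asecwt]
```

Each command uses exactly one weight, chosen by the variable being analyzed rather than by which weight happens to be non-missing.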
These cross-sectional weights, however, are insufficient for longitudinal analyses, since a linked sample by definition only includes people who can be linked between samples. There are many reasons why someone might not be linked between samples, including cases where they stop responding to the survey or move out of the sampled household. Someone who is not linked would be dropped from your sample, biasing it in a way that cross-sectional weights do not address. A longitudinal weight corrects for this by ensuring that the linked sample is representative of your population of interest. IPUMS offers a number of different longitudinal weights. LNKFW1YWT will likely be of particular interest since it is used for weighting linked persons across one year (such as between ASEC samples or between the outgoing rotation groups). Note that while PANLWT is another longitudinal weight, it is only used for weighting flows between employment statuses in adjacent months, which does not appear to be the focus of your study.
One complication with LNKFW1YWT is that this weight is currently not offered for ASEC oversample respondents (ASECOVERP = 1). The oversample makes up ~1/3 of the ASEC sample in each year and also adds a large number of Hispanic respondents to the ASEC. However, LNKFW1YWT can still be used for weighting the remainder of linked ASEC respondents. Incorporating oversample respondents is possible, but requires recreating the weight by editing and running the Stata replication files that are offered in the linking the CPS user guide. The key edits include adapting the code to run for adjacent years instead of adjacent months and replacing mentions of WTFINL with ASECWT. A similar edit can accommodate analyses of Earner Study variables by replacing WTFINL with EARNWT. This should be carefully considered since EARNWT is additionally constructed to reproduce labor force stocks (refer to CPS technical paper 66), which can affect estimates of transitions between employment statuses.
There is no need to divide the weights by r(mean). You may divide the final weights by the number of samples in your extract; this is only necessary if you want to produce aggregate estimates (e.g., the number of naturalizations that occurred), and is unnecessary for estimates that involve ratios (e.g., the percent of naturalized persons who saw an increase in their earnings).
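To make the distinction concrete, here is a minimal Stata sketch. It assumes LNKFW1YWT is the weight chosen for the analysis, takes the 297 samples from the question, and uses a hypothetical 0/1 indicator called naturalized.

```stata
// Assumption: the extract pools 297 samples (figure from the question).
local n_samples = 297
generate double wt_scaled = lnkfw1ywt / `n_samples'

// Aggregate estimate (a weighted count): use the rescaled weight.
// naturalized is a hypothetical 0/1 indicator.
total naturalized [pweight = wt_scaled]

// Ratio estimates are unaffected by rescaling:
// both commands return the same weighted proportion.
mean naturalized [pweight = lnkfw1ywt]
mean naturalized [pweight = wt_scaled]
```

Dividing every weight by the same constant changes weighted totals but cancels out of any weighted mean or share.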
Dear Ivan,
Thank you very much for your thorough explanation!
The approach that I explained in my post (combining different weights) came from reading previous posts and bringing my questions to IPUMS office hours. However, in light of your explanation, I do see why this approach would not be the best.
So, from your response, it seems the best approach would be to create my own weights following the Stata replication files. In fact, I have attempted to do so following the replication examples; however, given the large extracts I have, this has proven extremely computationally heavy (Stata crashes for lack of processing power). Nonetheless, I will attempt this again, adapting the code with the substitutions you mentioned:
- Individuals who link between 2 adjacent years.
- ASECWT instead of WTFINL
I will try again, but I am not optimistic that my computer will have the space or processing power necessary (I have already tried a few times). If creating my own weights is not possible, what other approaches could be taken to correctly account for the panel structure of my data? The main issue is that people who link in my sample do not have values for LNKFW1YWT in all of their observations.
You might try running the code to create weights with one linked sample at a time in order to limit the computational load. To start, you can restrict your data file to just the 2002 and 2003 ASEC and run the code to create longitudinal weights for this linked sample. Save this file. Then, open a new file with only the 2003 and 2004 ASEC and run the code again. Repeat this until you have created longitudinal weights for your entire panel and then append all of the files together.
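The pair-by-pair procedure above could be sketched roughly as follows. The file name cps_asec.dta and the do-file make_link_weight.do are hypothetical placeholders: the first stands for your ASEC extract and the second for your edited copy of the replication code, written to operate on whatever data are in memory.

```stata
// Step 1: build the longitudinal weight one adjacent-year pair at a time.
forvalues y = 2002/2023 {
    local y1 = `y' + 1

    // Load only the two years needed, to limit memory use.
    // cps_asec.dta is a hypothetical ASEC-only extract.
    use if inlist(year, `y', `y1') using "cps_asec.dta", clear

    // make_link_weight.do is a hypothetical name for the edited
    // replication code that creates the weight for this pair.
    do "make_link_weight.do"

    save "linkwt_`y'_`y1'.dta", replace
}

// Step 2: stack the pair files into one panel file.
clear
forvalues y = 2002/2023 {
    local y1 = `y' + 1
    append using "linkwt_`y'_`y1'.dta"
}
save "linkwt_all.dta", replace
```

Middle years appear in two pair files (for example, 2003 is in both the 2002-2003 and 2003-2004 files), so how to treat those duplicated records depends on how the weight is stored by the code you run.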
If you’re unable to create your own longitudinal weight, then my recommendation is to use one of the pre-constructed longitudinal weights such as LNKFW1YWT and work with the sample that has values for LNKFW1YWT. This means that you will need to drop all ASEC oversample observations since they will have missing values for this weight. Note that you cannot use this weight to link between samples that are not exactly one year apart or for weighting Earner Study variables. For additional insight, you might consider searching the IPUMS bibliography for other papers that make similar links to see what other researchers have done.
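As a minimal sketch of this fallback, the sample restriction and a weighted individual fixed effects model might look like the following. The names ln_wage and naturalized are hypothetical, and using areg instead of xtreg is a choice made here because xtreg, fe requires weights to be constant within person.

```stata
// Keep only records that carry the pre-constructed one-year weight.
drop if asecoverp == 1
keep if !missing(lnkfw1ywt) & lnkfw1ywt > 0

// Compact person identifier built from CPSIDV.
egen long pid = group(cpsidv)

// Check how many records are left without a second record for the same
// person; these contribute nothing to a fixed effects estimate.
bysort pid: generate n_obs = _N
count if n_obs < 2

// Hypothetical outcome (ln_wage) and treatment (naturalized).
// areg absorbs the person effect and allows the weight to vary within person.
areg ln_wage naturalized i.year [pweight = lnkfw1ywt], absorb(pid) vce(cluster pid)
```

If the weight turns out to be populated on only one record of each linked pair, the restriction above would leave one record per person, and the weight would first need to be carried across that person's records.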
Dear Ivan,
Thank you for your insights. I will try this, even though I have already attempted it with two-year extracts and the weights are still hard to compute.
Just a note on LNKFW1YWT availability: it is not only those in the oversample who have missing values on this weight variable. There are also observations that do not come from the ASEC that have missing LNKFW1YWT values. My guess is that these individuals have valid linkages across months but not across years, and thus do not have this weight.
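One way to check this guess is to tabulate where the weight is missing. This is only a sketch, and it assumes ASECFLAG and ASECOVERP are in the extract.

```stata
// Flag records that carry a usable one-year longitudinal weight.
generate byte has_lnkwt = !missing(lnkfw1ywt) & lnkfw1ywt > 0

// Availability by data source (ASEC vs. basic monthly)
// and by oversample status.
tabulate asecflag has_lnkwt, missing
tabulate asecoverp has_lnkwt, missing
```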
Anyways, thank you for your input and support.