Weights Panel Data (CPS + BMS)

Dear IPUMS teams,

I have some questions about which weights should be used when building a panel that links observations from the Annual Social and Economic Supplement (ASEC) and the Basic Monthly Sample (BMS). For my project, I have created a panel with CPS data, using CPSIDV to link participants from 2002 to 2024. My aim is to observe the wage outcomes of migrants who naturalize within the available observation period. After linkage, cleaning, etc., I ended up with around 600 unique individuals who naturalized within the observation window, with a total of 1,375 observations.

Now, my questions are: (1) which weights should I use when running an individual fixed effects regression, and (2) how should I apply them, given that not all of my observations have a value for panlwt?

I understand that each weight is created to account for different things. However, when I consulted some IPUMS members during office hours, their recommendation was to probably use a combination of earnwt (because my outcome variable is wages from labour), panlwt, and asecwt. After reading this very informative post, I understand that I should probably use panlwt, but not all of my observations have a value for this weight.

What I have attempted already is to do a sort of hierarchical approach by prioritizing:

  • 1st panlwt (for those who have it available)
  • 2nd earnwt
  • 3rd asecwt
  • 4th wtfinl

This is the syntax I used:

    gen analysis_weight = panlwt

    * For observations without a panel weight, use the earnings weight
    replace analysis_weight = earnwt if missing(analysis_weight) & !missing(earnwt)

    * For ASEC observations without a panel or earnings weight, use the ASEC weight
    replace analysis_weight = asecwt if missing(analysis_weight) & !missing(asecwt)

    * For any remaining observations, use the final basic weight
    replace analysis_weight = wtfinl if missing(analysis_weight) & !missing(wtfinl)

Would you say this is a reasonable approach to apply given the panel structure of the data?

Furthermore, once this was done, I attempted to normalize the weights in two different ways.

  1. Firstly, by dividing the weight by its mean, r(mean).
  2. Secondly, as I have read in other forums, such as this one, by dividing it by the number of samples I extracted my data from (297 in my case). Would either of these approaches be correct?

I hope this was relatively clear and please let me know if you need more information from me to provide a complete answer. Thank you very much in advance for your help and support.

Hi @Tomas_Bascolo,

Thanks for checking out the discussion I had with IPUMS staff. Before going further, a quick disclaimer: I am not an IPUMS expert or staff member, so please take my words with caution.

From my understanding, the discussion focused on how to annualize the monthly weights provided in BMS into annual weights. If I were you, I would recalculate the annual weights for BMS before merging with ASEC, so both datasets align in terms of weighting before moving forward.

Another takeaway from the discussion was that whenever you plan to include earnings in your analysis, earnwt should be your first choice since it is designed for income-related calculations. In my work, I use wtfinl for demographic information and earnwt for analyses involving earnings.

Not sure if this is helpful, but that is my two cents.

Thank you @kobkabnaja for your input! Recalculating the weights before merging makes sense, and so does the order of priority.

Appreciate your input!

Rather than prioritizing different weights based on availability, our IPUMS data team recommends using a single longitudinal weight. You might use one of the longitudinal weights that are offered or create your own by modifying the sample code. I’ll explain what different weights are used for so that you can determine the best approach for your study.

EARNWT, ASECWT, and WTFINL are all offered in order to make a cross-sectional sample representative of the US (civilian non-institutional) population. The correct weight for a cross-sectional analysis depends on the variables that you include, though only a single weight should be used at a time. For instance, EARNWT is used for analyses that include Earner Study variables, which are administered to respondents in one of the outgoing rotation groups. EARNWT adjusts for the fact that analyses of these variables are conditioned on respondents being in one of the outgoing rotation groups. Note that EARNWT is not used for analyses of income variables in ASEC samples since those are administered to all ASEC participants. If your analysis does not include any of the Earner Study variables and instead uses income data from the ASEC, ASECWT will be the correct cross-sectional weight.

These cross-sectional weights however are insufficient for longitudinal analyses since a linked sample by definition only includes people who can be linked between samples. There are many reasons why someone might not be linked between samples including cases where they stop responding to the survey or move out of the sampled household. Someone who is not linked would be dropped from your sample, biasing it in a way that cross-sectional weights do not address. A longitudinal weight however corrects for this by ensuring that the sample that is linked is representative of your population of interest. IPUMS offers a number of different longitudinal weights. LNKFW1YWT will likely be of particular interest since it is used for weighting linked persons across one year (such as between ASEC samples or between the outgoing rotation groups). Note that while PANLWT is another longitudinal weight, it is only used for weighting flows between employment status for adjacent months which does not appear to be the focus of your study.

One complication with LNKFW1YWT is that this weight is currently not offered for ASEC oversample respondents (ASECOVERP = 1). The oversample makes up ~1/3 of the ASEC sample in each year and adds a large number of Hispanic respondents to the ASEC. LNKFW1YWT can still be used for weighting the remainder of linked ASEC respondents. Incorporating oversample respondents into LNKFW1YWT is possible, but requires recreating this weight by editing and running the Stata replication files that are offered in the linking the CPS user guide. The key edits include adapting the code to run for adjacent years instead of adjacent months and replacing mentions of WTFINL with ASECWT. A similar edit can be done to accommodate analyses of Earner Study variables by replacing WTFINL with EARNWT. This should be carefully considered since EARNWT is additionally constructed to reproduce labor force stocks (refer to CPS Technical Paper 66), which can affect estimates of transitions between employment status.

There is no need to divide the weights by r(mean). You may divide the final weights that you get by the number of samples in your extract; this is only necessary if you want to produce aggregate estimates (e.g., the number of naturalizations that occurred), and is unnecessary for estimates that involve ratios (e.g., the percent of naturalized persons who saw an increase in their earnings).
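To illustrate the point about ratios versus aggregates, here is a minimal Stata sketch on made-up data. The variable names (naturalized, wt, wt_avg), the weight values, and the count of two pooled samples are all assumptions for illustration, not values from any IPUMS extract.

```stata
* Toy data: four hypothetical person records with made-up weights
clear
input byte naturalized double wt
1 1500
0 2500
1 1000
0 3000
end

* Assumption: the extract pools 2 samples, so average the weights over them
local nsamples = 2
gen double wt_avg = wt / `nsamples'

* Weighted counts of naturalized persons under each version of the weight
gen double wx     = wt     * naturalized
gen double wx_avg = wt_avg * naturalized

quietly summarize wx
local total_raw = r(sum)
quietly summarize wt
local share_raw = `total_raw' / r(sum)

quietly summarize wx_avg
local total_avg = r(sum)
quietly summarize wt_avg
local share_avg = `total_avg' / r(sum)

* The aggregate changes with the rescaling; the share does not
display "total, raw weights:      " `total_raw'
display "total, averaged weights: " `total_avg'
display "share, raw weights:      " `share_raw'
display "share, averaged weights: " `share_avg'
```

Dividing by any constant, including r(mean), cancels out of the share in the same way, which is why that step adds nothing for ratio-type estimates.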

Dear Ivan,

Thank you very much for your thorough explanation!

The approach that I explained in my post (combining different weights) came from reading previous posts and bringing my questions to IPUMS office hours. However, in light of your explanation, I do see why this approach would not be the best.

So, from your response, it seems that the best approach would be to create my own weights following the Stata replication files. In fact, I have attempted to do so following the replication examples. However, given the large extracts I have, this has proven to be extremely computationally heavy (Stata crashes due to lack of processing power). Nonetheless, I will attempt this again, adapting the code with the substitutions you mentioned:

  1. Individuals who link between 2 adjacent years.
  2. ASECWT instead of WTFINL

I will attempt to do it again. However, I am not optimistic that my computer will have the space or processing power necessary (I have tried a few times already). In case creating my own weights is not possible, what other approaches could be taken to correctly account for the panel structure of my data? The main issue that I have is that people who link in my sample do not have values for LNKFW1YWT in all of their observations.

You might try running the code to create weights with one linked sample at a time in order to limit the computational load. To start, you can restrict your data file to just the 2002 and 2003 ASEC and run the code to create longitudinal weights for this linked sample. Save this file. Then, open a new file with only the 2003 and 2004 ASEC and run the code again. Repeat this until you have created longitudinal weights for your entire panel and then append all of the files together.
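The pair-by-pair workflow described above could be organized roughly as follows. This is only a sketch on toy data: the variables id, year, and pair_start are hypothetical, and the line that flags linked persons is a stand-in for the adapted replication code, which is not reproduced here.

```stata
* Toy person-year file standing in for a pooled ASEC extract
clear
input long id int year double asecwt
1 2002 1000
1 2003 1100
2 2003  900
2 2004  950
3 2002 1200
end
tempfile full stacked
save `full'

* Start an empty file that accumulates the results for each pair of years
clear
save `stacked', emptyok

forvalues y = 2002/2003 {
    * Load only two adjacent years at a time to limit memory use
    use `full', clear
    keep if inlist(year, `y', `y' + 1)

    * Placeholder: the adapted replication code would run here.
    * This sketch only keeps persons observed in both years of the pair.
    bysort id: gen byte linked = (_N == 2)
    keep if linked
    gen int pair_start = `y'

    * Add this pair's results to the accumulated file
    append using `stacked'
    save `stacked', replace
}

use `stacked', clear
sort pair_start id year
list id year pair_start
```

A person who links in two consecutive pairs would appear once per pair in the stacked file, so pair_start is kept to tell those records apart.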

If you’re unable to create your own longitudinal weight, then my recommendation is to use one of the pre-constructed longitudinal weights such as LNKFW1YWT and work with the sample that has values for LNKFW1YWT. This means that you will need to drop all ASEC oversample observations since they will have missing values for this weight. Note that you cannot use this weight to link between samples that are not exactly one year apart or for weighting Earner Study variables. For additional insight, you might consider searching the IPUMS bibliography for other papers that make similar links to see what other researchers have done.

Dear Ivan,

Thank you for your insights. I will try to do this, even though I have already attempted it with two-year extracts only and the weights are still hard to compute.

Just a note on the LNKFW1YWT availability. It is not only those in the oversample who have missing values on this weight variable. There are also observations that do not come from ASEC who have missing LNKFW1YWT values. My guess is that these individuals might have valid linkages across months but not across years, thus do not have weights for this.

Anyways, thank you for your input and support.

Hi Tomas, based on your reply I have some additional information and suggestions to share that I think would be helpful.

Overall, it’s crucial to understand that each longitudinal weight corresponds to a specific type of link. These are categorized based on the number of links and the period of time over which the links occur. When an observation links to a different sample(s), only a specific type of longitudinal weight can be used to analyze these linked observations. Therefore, any type of longitudinal link that you use will necessarily only include a subset of your total observations.

The time period between observations affects the number of possible and actual links, which affects the longitudinal weights. To take a step back, I am going to describe the CPS rotation pattern, discuss how the rotation pattern impacts longitudinal weights, and share a few comments on the provided longitudinal weights. I am hopeful this will provide you with a frame of reference for identifying how to use or adapt the existing materials for your specific application.

CPS Rotation Pattern

Households in the CPS are interviewed using a 4-8-4 rotation pattern where they are in the panel for four months, take an eight month break, and are interviewed for a final four months before exiting the panel. Besides the 8-month break when persons are not interviewed, individuals may drop out of the panel at any time if they move out of the sampled housing unit or reappear in the panel if they rejoin the housing unit (the CPS samples dwelling/housing units, not people; it will not follow individuals if they move). As a result, a person record in any given month of the CPS microdata may be linked anywhere from zero to seven times across the 16-month period when their household is in the CPS. You can visualize the CPS rotation pattern and how many people will link between months using the IPUMS CPS RoPES tool.
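As a quick illustration of the 4-8-4 pattern, the following Stata sketch lays out the eight interview months for one household. The variable names (mis, offset, interview_month) and the January 2023 entry date are assumptions chosen for the example.

```stata
* One row per month-in-sample (1 through 8) for a single household
clear
set obs 8
gen byte mis = _n

* Months since the first interview: 0-3 for the first four interviews,
* then an eight-month break, then 12-15 for the last four
gen int offset = cond(mis <= 4, mis - 1, mis + 7)

* Assumption: the household enters the CPS in January 2023
gen int interview_month = ym(2023, 1) + offset
format interview_month %tm

list mis offset interview_month, noobs
```

The offsets show why one-year links are possible: each of the first four interviews falls exactly 12 months before the corresponding interview in the second set of four, and the whole schedule spans 16 months.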

Longitudinal Weights in IPUMS CPS

The IPUMS CPS longitudinal weight variables provide a weight for person records that actually link between the specified time periods based on the population counts of the people who were eligible to link in those time periods. The goal is for the weights to inflate the sample of observed linkages to be representative of the population that was eligible to be linked. For example, LNKFW1YWT is used for weighting analyses for individuals linked across two observations separated by exactly one year. An example of this would be linking between adjacent years of the ASEC (e.g., 2023 and 2024). There will not be a longitudinal weight for persons who are observed in the 2023 ASEC and eligible to be in the 2024 ASEC but who are not observed in the 2024 ASEC, though their information will be used to generate the longitudinal weights. Persons observed in both the 2023 and 2024 ASECs will have a longitudinal weight. Persons who are not eligible to link (e.g., those who participated in the 2022 ASEC and will rotate out of the panel before the 2024 ASEC) will not have a longitudinal weight nor will they be used in the construction of the weight. Another example is LNKFWMIS14WT, which is used for a sample where all people are linked across four observations (their first four months in the panel) that are each exactly one month apart. The same logic applies, but the linking requirements are more stringent as individuals must be observed in all four months; this means fewer cases will link.

It’s still not fully clear to me what linking approach you are using to construct your panel, so I will explain two of our most widely used longitudinal weights:

My previous recommendation was to use two adjacent ASEC samples (excluding BMS) to generate LNKFW1YWT since only the ASEC samples have annual income data. This restricts your analysis to observations that are linked once exactly one year apart in the ASEC, retaining two out of the eight maximum number of appearances of each person in the panel (and only if they appear in the ASEC). You could also use LNKFW1YWT for weighting BMS samples that are one year apart (e.g., January 2023 and January 2024). It is not possible to use LNKFW1YWT for weighting linked samples that are not exactly one year apart.

Another approach that may be less restrictive for your analysis would be to use the BMS longitudinal weight for two adjacent months (LNKFW1MWT). This weight is provided anytime a person is linked between two adjacent months, which allows you to retain more of their individual observations across the panel than LNKFW1YWT. This BMS longitudinal weight can also be used for March BMS (ASEC non-oversample respondents), but it cannot be generated for ASEC oversample respondents since they will never appear in an adjacent month.

I encourage you to consult this PowerPoint on weights in CPS to help clarify which approach will work best for you.