Weights for linking CPS basic monthly data

Hi,

I'm trying to link basic monthly CPS data across two consecutive months in order to create gross flows from 1976 to 2018 by age and sex. I'd like to know which weight I should use and how you recommend using it.

I see Shimer (2012) has code that does this. He uses WTFINL and applies it by adding the weights across the two months and dividing by two:

* wtfinl_m1 and wtfinl_m2 are placeholder names for WTFINL in month 1 and month 2
gen double weight = (wtfinl_m1 + wtfinl_m2)/2
egen double flows = total(weight), by(lfs2)
replace flows = flows/100

Should I divide the weight by 100? Or any further adjustment?

I see the universe of WTFINL is all persons for 1976-1988 and civilians for 1989-2018. I also see that this is a cross-sectional weight, while I'm creating a time series. So, can I use this weight from 1976 to 2018 and get representative estimates of the population? How would the change in universe (between the 1976-88 and 1989-2018 samples) affect my results and their representativeness? Can I use it to create estimates for certain groups of the population, say young people?

If I should use one of the longitudinal weights that CPS provides, which one do you recommend? If I want to estimate flows by age and sex, will these weights be representative?

I see LNKFW1MWT is a weight I could use from 1976 to 2018, though it has some breaks in the series. Does that mean I won't be able to use my data for those years?

On the other hand, I have PANLWT from 1994 to 2018, which seems to serve the purpose of analyzing panel data.

Finally, if I use any of the longitudinal weights to link people across two consecutive months, should I also add the weights across the two months and divide by two?

Thanks in advance for the answer!

From what you are explaining here, it sounds like you should use the LNKFW1MWT sampling weight. This variable is designed specifically for situations where users link adjacent basic monthly samples. You are correct that LNKFW1MWT is not available in some years: it is provided only in samples that can be linked to the following month using CPSIDP. See here for details on these samples.

Hi Jeff, thanks a lot for your reply.
Regarding my other questions, would it be wrong to use WTFINL as well? In what way would using the wrong weight bias my results?

And how should I use LNKFW1MWT between two months? Should I add the weights of the two months and divide by 2? What about dividing that by 100 as Shimer does? Is that necessary, or is any further adjustment needed?

Thank you!

If you use WTFINL and do not adjust the value for the fact that you are (a) pooling samples together and (b) using the CPS data as a panel rather than a cross-section, the weights will inflate any population count to be larger than the actual US population size. The method you discuss above (i.e., adding the weights across months and dividing by two) is approximately appropriate, but really does not account for the longitudinal features of your data in the way LNKFW1MWT does. Regarding how to use LNKFW1MWT: no additional adjustments are needed. The IPUMS command files automatically make all of the necessary adjustments.
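
The inflation from pooling can be seen in a small sketch. The data below are made up for illustration (three people observed in two months, with invented weights); only the variable name wtfinl mirrors the real weight. It shows that summing the cross-sectional weight over the pooled file counts everyone twice, and that halving the weight brings the total back to the average monthly population.

```stata
* Toy data: 3 people observed in 2 months; all values are invented.
clear
input long id byte month double wtfinl
1 1 1000
2 1 1500
3 1 2500
1 2 1100
2 2 1400
3 2 2500
end

* Summing the cross-sectional weight over the pooled file
* counts each person once per month: 10000 here.
quietly summarize wtfinl
display "pooled sum of weights:      " r(sum)

* Halving the weight recovers the average monthly population: 5000 here.
generate double w_half = wtfinl/2
quietly summarize w_half
display "average monthly population: " r(sum)
```

This only illustrates the scaling issue. It does not reproduce the adjustment for unlinked respondents that a longitudinal weight makes.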

Hi Jeff, thanks a lot!

With respect to your answer: I'm constructing a gross flow, which means the transition of an individual from month 1 to month 2. I still have to handle LNKFW1MWT by adding the weight across the two months and then dividing by two, right? Is there any other way to handle the weights?

The specific answer to your question depends on what you intend to calculate. If you are aiming to estimate some sort of population average that is representative of the two months, then dividing the weight by two is a reasonable approach. If you are performing regression analysis, then dividing the weight by two is not necessary. For more detail on weighting in regression analysis, see the attached paper by Solon et al. (2015).
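
As a rough illustration of why the division matters for population totals but not for averages, the sketch below uses made-up data (the variables employed and w, and all values, are hypothetical). Rescaling every weight by the same constant leaves a weighted mean unchanged, and the same holds for coefficients from a regression run with aweights or pweights.

```stata
* Toy data: an employment indicator and an invented sampling weight.
clear
input byte employed double w
1 1200
0 800
1 2000
end

generate double w_half = w/2

* Weighted share employed with the original weights: 3200/4000 = .8
quietly summarize employed [aweight = w]
display "share employed, original weights: " r(mean)

* Same share with halved weights: 1600/2000 = .8
quietly summarize employed [aweight = w_half]
display "share employed, halved weights:   " r(mean)
```

The sum of the weights does change (4000 versus 2000 here), which is why the rescaling matters when the goal is a population count.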

In investigating this question, I came across some resources that you might find helpful. First, the core difference between PANLWT and LNKFW1MWT is that the first comes from the Bureau of Labor Statistics and the second is generated by IPUMS. The BLS doesn't make PANLWT available in the data before 1994. Second, both weights are intended for linking between two months, but PANLWT uses the population controls and weights from the second month, whereas LNKFW1MWT uses those from the first month to calculate the longitudinal weight. Therefore, it is permissible to use LNKFW1MWT in years where PANLWT is not available. More info on PANLWT can be found in Technical Paper 66, p10-14 (85/175 in the pdf).

Solon et al. (2015 JHR).pdf (272.0 KB)

Hi Jeff, thank you for your answer, it was very useful!

I have another question regarding the representation of my sample.

  1. Let's say I have 65,000 unweighted observations in my CPS file and I want to know the distribution by age in the population. Is this a correct way to obtain that number?

* declare LNKFW1MWT as the sampling weight (a bare variable name would be read as the PSU)
svyset [pweight=LNKFW1MWT]
* iweight is used because fweight only accepts integer weights
tabulate age [iweight=LNKFW1MWT]

  2. And if, in order to check whether I have worked out my data correctly, I want to compare the statistics I get with, let's say, Census data, can I do that? For example, suppose that using this weight I get 1,020,000 young persons aged 19-20. Can I compare that with the number of 19-20 year olds from a Census population table? If not, how could I check that I get the right statistics?

Thanks a lot!

After talking more about this with some folks around the office, the consensus is not to use LNKFW1MWT when PANLWT is not available. The reason is that PANLWT essentially adjusts time 2 weights whereas LNKFW1MWT uses time 1 weights. So, these sampling weight values will be close but not exactly the same. Therefore, the "best" way forward is to construct a sampling weight for the samples before 1994 using the methods used to create PANLWT. Ultimately, we'd like to make the PANLWT variable available for years prior to 1994, but at this time I am not able to forecast when this will become available via IPUMS CPS. In the meantime, we do have some resources available that will help you create this weight yourself. The attached paper has much more information about gross flow analysis using CPS data. Additionally, our longitudinal weights page has some starter code that can be helpful as you create this weight.

Hi Jeff, thanks again for your answer!

I wanted to ask, though, if you could please clarify a bit more. I found the answer a bit confusing in relation to what you answered before.

In your answer of April 18th, you recommended using the LNKFW1MWT weight for longitudinal series such as the one I'm constructing. So I'm confused about when to use LNKFW1MWT and when to use PANLWT.

thanks!

Sorry for the confusion. After hearing more about your intended analysis and talking with some folks around the office, our best advice is to use the PANLWT variable when performing gross flow analysis. Because the PANLWT variable is not available prior to 1994, you'll need to create this variable yourself in the pre-1994 samples. The resources provided in the previous post should be helpful. The LNKFW1MWT weight, on the other hand, is not specifically designed for gross flow analysis. The key difference between these weights is that PANLWT uses time period 2 sampling weights whereas LNKFW1MWT uses time period 1 weights. So, although the values of these weights will be similar, they will not be exactly the same.

Hi Jeff,

thanks a lot! Now I understand.

I have one more question regarding the use of the weight PANLWT. As you know, I'm constructing gross flows between employment states in two consecutive months. So, the question is this:

In order to get how many people transition from employment (month1) to inactivity (month2), is the right approach to use the average weight of the first and second month? or should I use the weight of the second month?

Thanks!

Based on the discussion in the attached paper, I think that you do not need to divide the PANLWT values when performing gross flow analysis. I'd encourage you to read through the attached paper as it seems like it will be very helpful for your work.
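
On that reading, a gross flow table is simply the sum of the longitudinal weight within each origin/destination cell. Here is a minimal sketch under stated assumptions: the data are invented, the file is assumed to hold one row per linked person, and lfs_m1 and lfs_m2 are hypothetical names for labor force status in month 1 and month 2 (E, U, N).

```stata
* Toy linked file: one row per person matched across two months.
* All values are invented; lfs_m1 and lfs_m2 are hypothetical names.
clear
input long id str1 lfs_m1 str1 lfs_m2 double panlwt
1 "E" "E" 1800
2 "E" "N" 1200
3 "U" "E" 900
4 "N" "N" 2100
5 "E" "E" 2000
end

* Gross flows: sum the longitudinal weight within each
* (origin, destination) cell, without averaging or halving it.
collapse (sum) flow = panlwt, by(lfs_m1 lfs_m2)
list, noobs
```

In this toy file the E-to-N cell, for example, is the estimated number of people moving from employment to inactivity between the two months.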

HarleyJFrazisEdwinLRobiso.pdf (595.2 KB)

Hi there,

Sorry to resurrect an old thread but I am facing a very similar issue here; I want to estimate the fraction of people losing their job in one month for given characteristics (e.g. age).

In principle, I would expect that it would be equivalent to use PANLWT looking backwards (employed in t-1 and unemployed in t) with LNKFW1MWT looking forward (employed in t and unemployed in t+1). That is not at all the case though; PANLWT seems to be giving reasonable aggregates at a point in time (very similar to using WTFINL) but LNKFW1MWT is way off.

To give an example, in a sample where I have dropped those under 15 and those in the armed forces:

sum employed if time==tm(2018m6) [aw=LNKFW1MWT], which is equivalent to
sum f.employed if time==tm(2018m6) [aw=LNKFW1MWT]

gives a weight of around 196 million.

While

sum l.employed if time==tm(2018m6) [aw=PANLWT], which is equivalent to
sum employed if time==tm(2018m6) [aw=PANLWT],

gives a weight of 257 million, very close to the 261 million weight from WTFINL. In this example, employed=(empstat==10 | empstat==12).

Can you please advise on how to deal with this? The above discussion only partially addresses my issue as I don’t only want a simple gross flow analysis, but I am also interested in an individual-level analysis (people who had jobs and lost them), so mere aggregates will not suffice.

As long as I understand your application correctly, you should be using the PANLWT variable. This weight variable is specifically designed for the sort of analysis you are implementing. This is because PANLWT adjusts the weight value based on the fact that not all people are linked between months 1 and 2 and uses sampling weights from the second month (while LNKFW1MWT uses sampling weights from the first month). This is the key difference between PANLWT and LNKFW1MWT and is likely what is driving the discrepancy between each method you show here.
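
For the individual-level version of this, one possible setup is sketched below. It is only an illustration on invented data: it assumes a long person-month file, assumes PANLWT sits on the later month's record, and uses hypothetical variables (young, lost_job). Job loss is defined here as "employed last month, not employed this month", which may differ from your definition.

```stata
* Toy person-month panel; 701 and 702 are Stata monthly dates
* for 2018m6 and 2018m7. All values are invented.
clear
input long cpsidp int time byte employed byte young double panlwt
1 701 1 1 1000
1 702 0 1 1100
2 701 1 1 1500
2 702 1 1 1400
3 701 1 0 2000
3 702 1 0 2100
4 701 1 0 1000
4 702 0 0 900
end
format time %tm
xtset cpsidp time

* 1 if employed last month and not employed now, 0 if still employed,
* missing if not employed (or not observed) last month.
generate byte lost_job = (employed == 0) if L.employed == 1

* Weighted job-loss rate by group, using the later month's weight:
* .3 for young == 0 and .44 for young == 1 in this toy file.
tabstat lost_job [aweight = panlwt], by(young) statistics(mean)
```

The lost_job indicator stays at the person level, so it could also serve as the outcome in a regression rather than only in an aggregate table.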

Thanks Jeff; I tried both and the ratios I was looking for were very similar.

I have a related question now that I have started doing some exploratory regressions. You mentioned above the nice paper by Solon et al. on regression weighting; I think it has to be taken into account in my case given the different sampling probabilities. Beyond that, I am struggling to understand whether I need the ASEC replicate weights in my monthly samples. While it is very clear that I do need them for the March supplement, I have not found any direction on what to do with only the monthly samples.

Moreover, should I need to use the replicate weights, should I merge with my main file on CPSIDP and year?

Thanks!

This is a good question that I am not well-equipped to answer for several reasons. First, the standard of practice regarding sampling weights varies across disciplines within the social sciences. Therefore, it may be best to discuss this with a colleague within your specific field of study. Second, replicate weights are only made available for the ASEC samples within the CPS data. Therefore, although it may seem theoretically worthwhile to apply replicate weights to the basic monthly samples, there is no clean way to do this.

I will say that applying replicate weights typically does not change the size of the standard error around estimates dramatically. So, while it may seem feasible to merge sampling weights from the ASEC sample onto basic monthly samples, this merge may inject more bias into the variance estimates than is reduced by using replicate weights.

Got it, thanks a lot Jeff!