FILESTAT - Comparability

I am working on consolidating the CPS ASEC households to tax filing units.

For that purpose I use FILESTAT, and I noticed a striking discrepancy for 2004 and 2005: in those years, the share of joint filers (both less than 65; FILESTAT == 1) is much smaller than in other years. The difference seems to be absorbed by non-filers (FILESTAT == 6). See histograms below.
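
The comparison described above can be reproduced by tabulating the share of each FILESTAT code by survey year. The following is a minimal sketch in plain Julia, not the code behind the histograms. It assumes the extract has been reduced to a vector of named tuples with fields `year` and `filestat`; the function name `filestat_shares` and the toy records are illustrative, and person weights are ignored for brevity.

```julia
# Share of each FILESTAT code within each survey year (unweighted sketch).
# `records` is a hypothetical vector of (year, filestat) named tuples.
function filestat_shares(records)
    counts = Dict{Int, Dict{Int, Int}}()          # year => (code => count)
    for r in records
        by_code = get!(counts, r.year, Dict{Int, Int}())
        by_code[r.filestat] = get(by_code, r.filestat, 0) + 1
    end
    shares = Dict{Int, Dict{Int, Float64}}()      # year => (code => share)
    for (year, by_code) in counts
        total = sum(values(by_code))
        shares[year] = Dict(code => n / total for (code, n) in by_code)
    end
    return shares
end

# Toy data for illustration only, not real ASEC records
records = [
    (year = 2003, filestat = 1), (year = 2003, filestat = 1),
    (year = 2003, filestat = 5), (year = 2003, filestat = 6),
    (year = 2004, filestat = 1), (year = 2004, filestat = 6),
    (year = 2004, filestat = 6), (year = 2004, filestat = 5),
]
shares = filestat_shares(records)
```

Comparing `shares[y][1]` and `shares[y][6]` across years `y` would show whether joint filers under 65 drop while non-filers rise in a given year.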

From the variable description I learned that time comparability of FILESTAT might have been affected by the introduction of the new CPS tax model in 2004. However, it is not clear to me why (and how) this would have affected the variable only in 2004 and 2005 but not later years.

Could someone help me make sense of this? Thanks a lot.


The IPUMS CPS team has looked into this problem and determined that it is an error in the underlying microdata obtained from the Census Bureau. They will be communicating with the Bureau about the issue, but there is no guarantee that it will be fixed, or when.


@Matthew_Bombyk

Thanks a lot for looking into this! I am looking forward to the outcome of the communication with CB.

If you’re interested in suggestions on how to adjust the variable, I’ll be happy to discuss (I came up with an adjustment algorithm which aligns it with previous and later years).


That would be great if you can share the algorithm! If you feel comfortable sharing it on the forum, that may be best, so others can consider using it in their research. Otherwise you can email it to ipums@umn.edu and the IPUMS team will review it. Thank you!


Hello Johannes - I would also be interested in learning more about your algorithm. I use the TAXSIM tax calculator application maintained by NBER, which does not permit household units to have both a household head with FILESTAT == 1 and income from a spouse (so spouses who are non-filers). 2005 is the only year in which I have run into this issue, precisely because of what you’ve explained above. I was about to post independently here on the forum about the same issue when I came across your post. Any recommendations on how to overcome this would be greatly appreciated.

@Matthew_Bombyk @jvandernaald

Thanks for the interest - I’ll prepare a pseudocode version of the algorithm (the current one is written in Julia) and share results. Looking forward to your feedback.

The Julia code below implements my FILESTAT adjustment algorithm for 2004. I adjusted the syntax so it contains only standard flow control statements and logical indexing. I think it should be straightforward to reproduce in other languages - I am happy to answer questions if something is not clear of course.

I also wrote a brief technical report on the FILESTAT discrepancies, which includes more details on the adjustment algorithm and the adjusted FILESTAT values. The report and some more code are in this repository: ASEC_FILESTAT_adjustment

@jvandernaald If you generate an implementation in another language I would be happy to include it in the repository.

@Matthew_Bombyk: Obviously, adjusting the FILESTAT variable in the IPUMS database would obviate the need for adjustment by users. If this could be an option, I am ready to assist if I can be helpful.

(I also found FILESTAT inconsistencies in years after 2006; as you can see from table 2 in the technical report, the adjustment algorithm replicates the original values slightly less well for these years. To investigate, I looked at the observations for which the algorithm produces diverging FILESTAT values in two randomly chosen years (2006 and 2015). I found that if at least one of the joint filers is above 65, the original FILESTAT does not assign the same value to both spouses but always assigns 1 to one of them. This seems inconsistent to me, but maybe I am missing something? It would be great to clarify this with CB.)

## Prepare 2004 data
using DataFrames

# Keep only the needed columns; select (without !) leaves df_ASEC_2004 untouched
df_2004 = select(df_ASEC_2004, [:SERIAL, :RELATE, :AGE, :ADJGINC, :FILESTAT]);
# Assumption: FILESTAT_adj starts as a copy of FILESTAT, so every row that is
# not adjusted below keeps its original category
df_2004[!, :FILESTAT_adj] = copy(df_2004.FILESTAT);
df_2004[!, :num] = 1:size(df_2004, 1);
hhs_2004 = unique(df_2004.SERIAL);

for k in hhs_2004

    df_tmp = df_2004[df_2004.SERIAL .== k, :]

    # Households without a spouse (RELATE == 201): keep FILESTAT categories as they are
    if !(201 in df_tmp.RELATE)
        continue
    end

    # Row numbers of head (RELATE == 101) and spouse (RELATE == 201), looked up
    # by RELATE instead of assuming they are the first two rows of the household
    num_101 = df_tmp[df_tmp.RELATE .== 101, :num][1]
    num_201 = df_tmp[df_tmp.RELATE .== 201, :num][1]

    age_101 = df_2004[num_101, :AGE]
    age_201 = df_2004[num_201, :AGE]
    if age_101 < 65 && age_201 < 65                     # Both below 65
        filestat_new = 1
    elseif age_101 >= 65 && age_201 >= 65               # Both 65+
        filestat_new = 3
    else                                                # One above, one below
        filestat_new = 2
    end

    # Couples where both spouses have AGI == 0 do not need to file
    if df_2004[num_101, :ADJGINC] == 0 && df_2004[num_201, :ADJGINC] == 0
        filestat_new = 6
    end

    df_2004[num_101, :FILESTAT_adj] = filestat_new
    df_2004[num_201, :FILESTAT_adj] = filestat_new

    # Remaining household members keep their original FILESTAT (copied above)
end

The Census Bureau has confirmed that your fix for the 2006-2015 data is correct. Re-assign as follows:

FILESTAT = 1: both filers < 65
FILESTAT = 2: one filer < 65, one filer >= 65
FILESTAT = 3: both filers >= 65
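
The re-assignment rule above can be written as a small helper. This is a minimal sketch in plain Julia; the function name `joint_filestat` and its two-age signature are illustrative assumptions, and it covers only the three joint-filer codes listed above.

```julia
# Map the ages of two joint filers to the FILESTAT joint-filer code:
# 1 = both under 65, 2 = one under 65 and one 65+, 3 = both 65+.
# The name `joint_filestat` is hypothetical.
function joint_filestat(age_a::Integer, age_b::Integer)
    n_65_plus = (age_a >= 65) + (age_b >= 65)   # number of filers aged 65+
    return n_65_plus == 0 ? 1 : (n_65_plus == 1 ? 2 : 3)
end

code_young = joint_filestat(40, 50)
code_mixed = joint_filestat(64, 65)
code_old   = joint_filestat(65, 70)
```

Applying the same value to both spouses of a couple gives the consistent coding described in this reply.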

They are in the process of tracking down the code/issues behind the 2004-2005 error.

I will share your forum post with them as well, to assist in their efforts.

Thanks for the follow up!

As I wrote in the report, I think the 2004/2005 error was generated as follows:

“In 2004 and 2005, however, it [the imputation procedure generating FILESTAT values] assigned one of them the joint filer status and nonfiler status to the other.”

It might help the CB staff if you could share this hypothesis with them as well.

Thanks, I shared your hypothesis with the folks at CB.

This is great @Johannes_Fleck, thank you! A correction to my earlier reply in this thread: I said that the ASEC 2005 data was giving me trouble when I consolidated it into household units and ran it through the NBER TAXSIM program. It is actually ASEC 2006 that is causing problems. ASEC 2006 does not show the elevated share of non-filers that you pointed out for 2004 and 2005. However, when I consolidate the 2006 data into households, I end up with a number of households where the household head is coded as a single filer (FILESTAT == 5) while the spouse in the household is coded as a non-filer (FILESTAT == 6). (Please excuse the error in my earlier reply, where I stated that non-filers were FILESTAT == 1.) I would expect these couples to be joint filers. This seems to speak to your latter observation re: “inconsistencies in years after 2006”; however, I noticed this issue in 2006 only.
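
One way to look for the pattern described above is to cross-tabulate the FILESTAT code of the household head against that of the spouse. The following is a minimal sketch in plain Julia under stated assumptions: person records are named tuples with fields `serial`, `relate`, and `filestat`; RELATE == 101 marks the head and RELATE == 201 the spouse, as in the adjustment code earlier in the thread; the function name `couple_filestat_pairs` and the toy data are illustrative.

```julia
# Count (head FILESTAT, spouse FILESTAT) combinations across households.
# `persons` is a hypothetical vector of (serial, relate, filestat) named tuples.
function couple_filestat_pairs(persons)
    head = Dict{Int, Int}()     # SERIAL => FILESTAT of head (RELATE == 101)
    spouse = Dict{Int, Int}()   # SERIAL => FILESTAT of spouse (RELATE == 201)
    for p in persons
        if p.relate == 101
            head[p.serial] = p.filestat
        elseif p.relate == 201
            spouse[p.serial] = p.filestat
        end
    end
    pair_counts = Dict{Tuple{Int, Int}, Int}()
    for (serial, spouse_code) in spouse
        haskey(head, serial) || continue        # skip spouses without a head record
        key = (head[serial], spouse_code)
        pair_counts[key] = get(pair_counts, key, 0) + 1
    end
    return pair_counts
end

# Toy data for illustration only, not real ASEC records
persons = [
    (serial = 1, relate = 101, filestat = 5), (serial = 1, relate = 201, filestat = 6),
    (serial = 2, relate = 101, filestat = 1), (serial = 2, relate = 201, filestat = 1),
    (serial = 2, relate = 301, filestat = 6),   # other household member, ignored
    (serial = 3, relate = 101, filestat = 5), (serial = 3, relate = 201, filestat = 6),
    (serial = 4, relate = 101, filestat = 5),   # head without spouse, ignored
]
pair_counts = couple_filestat_pairs(persons)
```

A nonzero count for the key `(5, 6)` corresponds to the single-head, non-filer-spouse households described above, and a nonzero count for `(1, 6)` corresponds to the joint-filer and non-filer split discussed for 2004 and 2005.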

Thanks for the follow-up regarding the post-2006 inconsistencies.

I am looking forward to CB clarification on this.

Hi @Matthew_Bombyk !

Do you know if the FILESTAT variable for 2004 and 2005 has been updated? I am about to add some more variables to an older extract request so I am wondering if it might change.

Thanks!

I haven’t heard anything about this, but I just checked in with the Census Bureau about it. Will post here when I hear back.