I’m working with a data extract from 2010 - 2021 containing diabetes variables from the DCS sub-survey of MEPS, as well as prescribed medicine data. I’m finding some confusing inconsistencies in the data and the documentation.
According to the IPUMS documentation for the DCSDIABDX variable, "Additionally, DCSDIABDX only takes on values of “yes” or “out of universe.”" The codes page for DCSDIABDX also shows that there should not be values for “no” for this variable. However, when I load the data into R, group by DCSDIABDX, and count the number of rows in each category, I only have values for “Yes” or “No”. “Out of universe” or “Not in universe” do not show up.
Additionally, the documentation for DIABWEIGHT says "Only persons who answer “yes” to DCSDIABDX have a positive value for DIABWEIGHT." However, when I filter my data so that it only contains rows where DCSDIABDX = ‘no’, I find that about 5% of the rows have positive weights for DIABWEIGHT.
I found these discrepancies because I am interested in comparing the total number of prescriptions for different diabetes medications to other data sources I have on the same information. When I produce survey-weighted counts of these prescription medications (using the DIABWEIGHT variable and appropriate strata/id variables), the numbers don’t make sense. For example, for medications classified as sulfonylureas I’m getting 67,364,859 for people where DCSDIABDX is “No” (they have not been told they have diabetes) and 2,617,629 for people who have been told they have diabetes. The number from my other data source for sulfonylureas is approximately 53 million, which is closer to the first number (same order of magnitude). However, sulfonylurea medications are primarily prescribed to manage diabetes, so why would the number prescribed to people without diabetes match more closely than the number prescribed to those who have diabetes? Out of curiosity I also switched to using the full PERWEIGHT weighting variable from MEPS and see similar results: the number of prescribed medications from MEPS is in the ballpark of my other data source, but only for people who have been told they do NOT have diabetes.
Can I get insight into whether or not this is a data quality issue or just a documentation issue? Or perhaps a bit of both?
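For reference, the point estimate behind a survey-weighted count like the ones described above is the sum of each person's weight times their number of prescriptions, taken within each DCSDIABDX group. Here is a minimal base R sketch on made-up rows. The data frame `toy` and the column `n_rx` are hypothetical and every value is invented; a real analysis would use a survey design object with the strata and PSU variables to get standard errors.

```r
# Toy person-level rows; all values are invented for illustration only.
toy <- data.frame(
  DCSDIABDX  = c(0L, 0L, 2L, 2L, 2L),       # 0 = NIU, 2 = Yes
  DIABWEIGHT = c(0, 0, 5000, 12000, 8000),  # supplemental DCS weight
  n_rx       = c(3L, 1L, 4L, 0L, 2L)        # hypothetical prescriptions per person
)

# Weighted total = sum(weight * count) within each diagnosis group.
weighted_total <- function(df) {
  tapply(df$DIABWEIGHT * df$n_rx, df$DCSDIABDX, sum)
}

totals <- weighted_total(toy)
print(totals)
```

If the documentation holds, anyone who is not a “yes” on DCSDIABDX carries a DIABWEIGHT of zero, so their group should contribute nothing to a DIABWEIGHT-based total.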
I looked into the issues you are describing and I am not able to replicate them.
I downloaded an IPUMS MEPS extract including the samples from 2010 through 2021. I tabulated DCSDIABDX and I see only “yes” and “niu” as value labels, consistent with the codes section of DCSDIABDX:
And I tabulated DIABWEIGHT for respondents with DCSDIABDX=0 (niu):
I also do not see any missing values for DIABWEIGHT.
It is possible you are having an issue with R. These are the only value labels I see in R for DCSDIABDX:
If you can provide screenshots of what you’re seeing or a more detailed description of how you came to see different value labels for DCSDIABDX or positive values of DIABWEIGHT for respondents with DCSDIABDX values of 0, I may be able to provide more targeted help.
Sure, here are some screenshots. I’m reading in the data using the ipumsr package.
I’m only allowed to put one embedded image in a post apparently, so I’ll respond in multiple posts.
These are the values that the labelled data.frame tells me DCSDIABDX can take:
However, when I group by DCSDIABDX and tally up the counts these are the results:
I can filter the data to either DCSDIABDX == 1 or DCSDIABDX == 2, and in both cases there are rows with DIABWEIGHT >= 0:
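One way to tell whether the labels or the underlying codes are off is to tabulate the raw integer codes and look up their labels separately. The sketch below is base R only and assumes the column keeps its value labels in a `labels` attribute (a named vector whose names are the labels), as haven-style labelled vectors do. The vector `x` is a made-up stand-in, not real extract data, and `code_label_table` is a hypothetical helper.

```r
# Stand-in for a labelled column: integer codes plus a "labels" attribute
# (names are the labels, values are the codes). Values are invented.
x <- structure(
  c(0L, 2L, 2L, 0L, 2L),
  labels = c(NIU = 0L, No = 1L, Yes = 2L)
)

# Count the raw codes, then attach the label that belongs to each code.
code_label_table <- function(v) {
  labs   <- attr(v, "labels")
  codes  <- as.integer(unclass(v))   # drop class and attributes, keep codes
  counts <- table(codes)
  seen   <- as.integer(names(counts))
  data.frame(
    code  = seen,
    label = names(labs)[match(seen, labs)],
    n     = as.integer(counts)
  )
}

result <- code_label_table(x)
print(result)
```

If a code shows up here under a label that the IPUMS codes page does not list for it, the mismatch is happening when the file is read rather than in the data file itself.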
I’m also having trouble reproducing the issues you describe using a similar extract and ipumsr 0.7.0. What version of ipumsr are you running? It’s possible that this is related to a bug that has already been fixed in more recent versions of the package.
If you’re not running 0.7.0, I’d suggest updating the package and trying again. If you still see the same problems, then this might require some more detailed investigation.
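A quick way to check the installed version before updating is sketched below. The 0.7.0 threshold is simply the version mentioned above, not a confirmed cutoff for any particular fix, and `needs_update` is a hypothetical helper name.

```r
# Compare an installed version string against a minimum version.
needs_update <- function(installed, minimum = "0.7.0") {
  package_version(installed) < package_version(minimum)
}

if (requireNamespace("ipumsr", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("ipumsr"))
  if (needs_update(installed)) {
    message("ipumsr ", installed, " is older than 0.7.0; ",
            "consider running install.packages(\"ipumsr\")")
  } else {
    message("ipumsr ", installed, " is at least 0.7.0")
  }
} else {
  message("ipumsr is not installed")
}
```

After updating, restart the R session so the new version is the one that gets loaded.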
Here’s what I’m seeing:
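library(ipumsr)
library(dplyr)

dat <- read_ipums_micro(
  ddi = "~/Downloads/meps_00002.xml",
  data_file = "~/Downloads/meps_00002.dat"
)

# Distinct values of DCSDIABDX, printed with the variable's value labels
unique(dat$DCSDIABDX)
#> <labelled<integer>>: Respondent has been told they have diabetes
#>  0 2
#> value label
#> 0 NIU
#> 1 No
#> 2 Yes
#> 7 Unknown-refused
#> 8 Unknown-not ascertained
#> 9 Unknown-don't know

# Row counts by DCSDIABDX: only NIU and Yes appear
dat %>% group_by(DCSDIABDX) %>% count()
#> # A tibble: 2 × 2
#> # Groups: DCSDIABDX
#> DCSDIABDX n
#> <int+lbl> <int>
#> 1 0 [NIU] 690882
#> 2 2 [Yes] 40568

# DIABWEIGHT for DCSDIABDX == 1 (No): no rows, so no values to summarize
dat %>%
  filter(DCSDIABDX == 1) %>%
  pull(DIABWEIGHT) %>% summary()
#> Min. 1st Qu. Median Mean 3rd Qu. Max.

# DIABWEIGHT for DCSDIABDX == 2 (Yes): all weights are positive
dat %>%
  filter(DCSDIABDX == 2) %>%
  pull(DIABWEIGHT) %>% summary()
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 517.9 5319.0 9191.3 11374.2 14873.0 92388.7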
Thanks for the response - updating the ipumsr package fixed the issues I was reporting.
Glad to hear that updating fixed your issue! Would you mind sharing the extract number of the extract where you were experiencing the scrambled labels, so that we can investigate this issue further?