Odd "peaks" showing up for LITBRIG variable mean plots

Hi! When I create country plots to show how the mean of literacy for the variable LITBRIG has changed over time for each country, I notice that India, Bangladesh, and Zimbabwe have odd peaks, suggesting that literacy skyrocketed during one of the survey years and then went down drastically the following survey year. Is there an explanation for this? Thank you in advance!

LITBRIG is a categorical variable where values correspond to distinct responses rather than a specific literacy level (see the Codes tab for LITBRIG). Calculating the mean of LITBRIG will therefore not produce a meaningful literacy rate measure. For example, a value of LITBRIG = 20 identifies respondents who cannot read. Additionally, LITBRIG = 99 identifies respondents who are NIU (i.e., not-in-universe) for the variable. These are persons who were not asked the questions that are required to determine LITBRIG. I suspect that these NIU respondents are causing your literacy rate measure to spike, since there are a large number of NIU cases in the three samples you mention.

While there are a number of reasons that a respondent might not be asked a particular question, information on who was asked the question(s) that correspond to a variable is summarized in the variable's universe statement (see the Universe tab for LITBRIG). We recommend that new users review the detailed FAQ page on IPUMS-DHS; question 8 specifically deals with NIU data.

Based on my review of this information and the original DHS questionnaires, it appears that in India in 1998 and in Bangladesh in 2004, women who had completed grade 6 were not asked about their literacy. Additionally, in Zimbabwe in 1999, women who had ever attended secondary school were also not asked about their literacy.

In order to calculate the literacy rate using LITBRIG, you will need to recode respondents into a new dichotomous variable that reports whether the respondent was literate or not. This requires dividing your sample into three groups: respondents who are literate (LITBRIG = 10, 11, and/or 12, depending on the threshold for literacy that you impose) should be assigned a value of 1; respondents who are illiterate (LITBRIG = 20, and/or 12 under a stricter threshold) should be assigned a value of 0; and respondents whose literacy was not ascertained, who have missing data, or who are NIU (LITBRIG = 31, 32, 98, and 99) should be set to missing and excluded from the measure. While it is unclear whether a respondent who was NIU for LITBRIG was literate or not (which is why NIU data are typically set to missing), you might consider coding NIU responses as literate if you have reason to believe that to be the case.
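If it helps to see the recode logic spelled out, here is a minimal pure-Python sketch of the grouping described above. This is only an illustration (the function name and example codes are hypothetical, and you would apply the same logic in Stata or R against your extract):

```python
# Sketch: map LITBRIG codes to a 0/1 literacy indicator, with None for
# codes that should be dropped from the measure. Adjust the sets to match
# the literacy threshold you choose.
LITERATE = {10, 11, 12}       # e.g., drop 12 here for a stricter threshold
ILLITERATE = {20}             # cannot read (add 12 under a stricter threshold)
EXCLUDED = {31, 32, 98, 99}   # not ascertained, missing, NIU

def recode_literacy(code):
    """Return 1 (literate), 0 (illiterate), or None (exclude)."""
    if code in LITERATE:
        return 1
    if code in ILLITERATE:
        return 0
    if code in EXCLUDED:
        return None
    raise ValueError(f"unexpected LITBRIG code: {code}")

codes = [10, 11, 20, 12, 99, 31]
recoded = [recode_literacy(c) for c in codes]
# Drop missing values before computing any rate.
valid = [v for v in recoded if v is not None]
```

The key design point is that excluded cases are removed from both the numerator and the denominator, rather than being counted as illiterate.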

Using the recoding I have suggested above and PERWEIGHT to produce estimates that are representative of the country and survey universe, I estimate literacy rates for Bangladesh, India, and Zimbabwe, as well as Rwanda (see screenshot below). You can find sample R code in the IPUMS data training exercises to help you run a similar analysis.
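To make the weighting step concrete, here is a small Python sketch of a weighted rate calculation. It is purely illustrative (the function and sample values are hypothetical); in practice PERWEIGHT plays the role of the weight, and the IPUMS training exercises show the equivalent in R or Stata:

```python
# Sketch: weighted literacy rate from (indicator, weight) pairs,
# where the indicator is the 0/1 recode and missing cases have
# already been dropped.
def weighted_rate(records):
    """records: iterable of (indicator, weight) with indicator in {0, 1}."""
    total_weight = sum(w for _, w in records)
    positive_weight = sum(w for v, w in records if v == 1)
    return positive_weight / total_weight

# Hypothetical example: two literate respondents, one illiterate.
sample = [(1, 2.5), (0, 1.0), (1, 0.5)]
rate = weighted_rate(sample)  # (2.5 + 0.5) / 4.0 = 0.75
```

Grouping the records by sample (country-year) before calling a function like this would give you one weighted rate per survey, which is what the plotted trend lines represent.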

You should also note that LITBRIG is generally only available for ever married women aged 15-49. This means that your measure would only be measuring the literacy rate for this group. For analyses of literacy rates for the entire population of a particular country, you can use the variable LIT on IPUMS International.

[Screenshot: Literacy2]

Hello Ivan,

Thank you so much for your help! I was surprised how thorough the response was, so thank you for taking the time to respond. I really appreciate it.

I have followed your instructions in recoding. In my case, I am interested in illiteracy rates rather than literacy, so my recode for the LITBRIG variable was done as follows:

A value of ‘0’ was assigned to “yes, reads” and “reads easily/whole sentence.” A value of ‘1’ was assigned to “read with difficulty/part of sentence” and “no, cannot read.” A missing value (.) was assigned to “not ascertained (blind or diff. language),” “no card with required language,” “blind or visually impaired,” “missing,” and “NIU (not in universe).”

I am now trying to utilize the PERWEIGHT in my calculations to hopefully see the peaks for India, Bangladesh, and Zimbabwe disappear. I have scoured the internet and the IPUMS exercises you shared for how to correctly use PERWEIGHT in Stata coding, but I’m still struggling. This is what I’ve coded so far in Stata:

use "/Users/andeegempelerdevore/Desktop/Dissertation Proposal/DATA/datasets/variablesrecode.dta", clear

svyset [pw = perweight]

svydescribe

svy: mean illiterate

svy: mean illiterate, over(sample)

When I do this, my estimates for the aforementioned countries still show peaks. Do you have any suggestions for where I’m going wrong with my coding? I’ve tried a variety of combinations and I just can’t seem to get them to go away.

I apologize for asking such a dumb question. I feel quite stupid not knowing the answer to this. Usually I can find answers to this type of question elsewhere, but for some reason I’m really struggling on this one.

Thank you for your help in advance!

Best,

Andee


Hi Andee. I posted a response to your other forum post with the same question. I have copied and pasted my response below. Here is a link to the other post.

This help page from UCLA is a good introduction to the different types of weights in Stata. Typing "help weight" at the Stata command line will also open a help window on how to apply different weights. IPUMS also provides sample code for using weights: this page from the IPUMS NHIS user guide includes sample code for applying weights, subsetting your analysis (such as by age), and accounting for sample design. Finally, this page from The DHS Program is a good resource that walks through how to use weights step by step.

Even when using weights correctly, you may see large intertemporal changes within individual countries when calculating certain statistics. This can be due to a variable having a high level of measurement error, a small sample size in a particular country, or the universe of a variable changing over time. Literacy rates may also genuinely change substantially over time, especially if the DHS surveys are several years apart. I would also double-check your code to ensure you have recoded the literacy variable correctly. You can verify the recode by tabulating the variable and confirming that the only values present are the ones you expect and intend (0 and 1).
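As a quick illustration of that tabulation check, here is a hedged Python sketch (the variable names are hypothetical; `tab illiterate, missing` would be the Stata equivalent):

```python
# Sketch: a "tabulate"-style sanity check that the recoded variable
# contains only the values you intend (0, 1, and missing/None).
from collections import Counter

def tabulate(values):
    """Return a frequency count of each distinct value."""
    return Counter(values)

recoded = [0, 1, 1, None, 0, 1]   # hypothetical recoded illiteracy values
counts = tabulate(recoded)
unexpected = set(counts) - {0, 1, None}
assert not unexpected, f"unexpected values in recode: {unexpected}"
```

If the frequency table shows any codes other than 0, 1, and missing (for example, a stray 99), the recode did not run the way you intended.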