Hi all,

I’m trying to slice VALUEH into deciles from low to high, but I keep getting widely varying numbers of observations (houses) in each decile group. Instead of 10 groups with roughly equal numbers of houses, I’m getting 10 groups that differ in size by as much as 10%. I am using Stata and the sumdist package, which works well on other numeric-coded IPUMS variables such as HHINCOME (households). Since VALUEH appears as strings in Stata, I am running (or so I think) the commands to convert it to numeric. I am running this analysis on one geography, Philadelphia County, and have tried two different samples, 1Y2019 and 1Y2020, to no avail.
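For reference, the conversion step I’m attempting looks roughly like this (a sketch; my extract’s variable is lowercase `valueh`, and the exact commands may differ):

```stata
* If VALUEH loaded as a string variable, convert it to numeric.
* destring works when the underlying characters are digits.
destring valueh, replace

* If it is actually numeric and only the value labels are displaying,
* no conversion is needed; the labels can be suppressed instead:
* label values valueh .
```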

I’m wondering if the hitch with VALUEH is that it contains lots of categorical values such as “Less than $2,000” or “$150,000 to $199,999” or “$100,000+”. When I hide value labels in Stata, all the values change into seemingly arbitrary midpoints. For example, “Less than $2,000” becomes $1,000, and “$150,000 to $199,999” becomes $175,000. This is probably fine, except that the open-ended groupings like “$100,000+” all become exactly $100,000, and “$200,000+” all become $200,000, which isn’t logical. Not sure whether IPUMS or the Census created these values and value labels; I didn’t.

Any thoughts or advice?

Thank you, -Tom

Hi Tom,

Currently, VALUEH incorporates value labels from all available samples. This is why you are seeing interval labels even though the variable description for VALUEH notes that the variable should be continuous, without intervals, from 2008 onward. It shouldn’t impact your analysis, however, since the underlying values are correct.

One possibility is that your observations are at the person level, but you are creating your deciles at the household level. Since households have different numbers of person-level observations, the sizes of these household-level deciles will vary. Because VALUEH is the same across household members, I recommend keeping one observation per household and dropping the rest; usually this is done by keeping observations with PERNUM = 1. You will also want to use HHWT as your weight to generate deciles that are representative of your population of interest.
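In Stata, that could look something like the following sketch (per [D] pctile, xtile accepts pweights; variable names are the lowercase IPUMS defaults):

```stata
* Keep one record per household: PERNUM = 1 is the first person listed.
keep if pernum == 1

* Household-weighted deciles of home value, using the household weight HHWT.
xtile dec_valueh = valueh [pw=hhwt], nq(10)

* Inspect the (weighted) decile group sizes.
tab dec_valueh
```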

Many thanks Ivan, I did consider those points. My suspicion is that the categorical origins of this data are causing some deciles to receive huge numbers of observations with identical values while other deciles receive fewer, e.g. there are loads of 200000s from the broad category “$200,000+” but fewer 35000s from the category “$35,000+”, and so on. Stata’s xtile function must be putting all observations with an identical value into a single decile group, which prevents the deciles from each having uniform numbers of observations.
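That tie behavior is easy to demonstrate on made-up data (a sketch, not my actual extract):

```stata
* Simulate heavy ties: half the observations share one top-coded value,
* mimicking a "$200,000+" category recoded to exactly 200000.
clear
set obs 1000
gen value = runiform() * 100000
replace value = 200000 in 501/1000

* Several decile cutpoints coincide at 200000, so all tied observations
* fall into one oversized group and the groups above it stay empty.
xtile dec = value, nq(10)
tab dec
```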