I wrote the following code (in R) to produce the weighted average of HHINCOME across a metro area. However, the numbers produced are around $30,000 above the actual average. I'm not sure what is producing these skewed numbers, because I already took out the top-coded values.
data %>%
  filter(PERNUM == 1, MET2013 == 41700, YEAR == 2019) %>%
  summarize(weighted.mean(HHINCOME, HHWT, na.rm = TRUE))
            
When you say you took out the top-coded values, I am wondering whether you mean that you omitted the N/A values (here coded as 9999999) for HHINCOME, or whether you’re referring to the maximum values for the components of personal income, INCTOT. You should be removing the former, but it isn’t clear to me that you would want to remove the latter top-coded values in this situation.
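To illustrate why the N/A code matters, here is a minimal sketch in base R. The `households` data frame and its values are made up for illustration; only the variable names HHINCOME and HHWT and the 9999999 code come from the discussion above.

```r
# Toy household records; the values are invented for illustration only.
households <- data.frame(
  HHINCOME = c(40000, 60000, 9999999, 80000),
  HHWT     = c(100, 200, 50, 100)
)

# If the N/A code is left in, it is treated as a real income of 9,999,999
# and inflates the weighted mean.
mean_with_na_code <- weighted.mean(households$HHINCOME, households$HHWT)

# Drop the rows carrying the N/A code (and their weights) before averaging.
valid <- households[households$HHINCOME != 9999999, ]
mean_valid <- weighted.mean(valid$HHINCOME, valid$HHWT)

print(mean_with_na_code)
print(mean_valid)
```

Note that `na.rm = TRUE` only drops values that are actually `NA`, so a numeric code like 9999999 has to be filtered out or recoded first. In a dplyr pipeline like the one in the question, adding `HHINCOME != 9999999` to the `filter()` call should have the same effect.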
Other than some uncertainty around what you mean by top-coded values, your approach seems fine to me. It would be helpful to have more insight into the estimate you are getting and the source you are comparing it to for average household income. I’m wondering if you are calculating average values for HHINCOME but comparing them to estimates of median HHINCOME, since median household income seems to be the more common metric in summary statistics. I’m finding an unweighted median household income of $65,000 for San Antonio-New Braunfels, TX, which is close to the $62,000 reported here. High incomes in the upper tail might be causing the weighted average to diverge significantly from the median. While I wouldn’t expect a large difference (e.g., $30,000) between the two sources, I also wouldn’t expect your estimates from the public use microdata sample (PUMS) available via IPUMS to match Census Bureau estimates exactly, because the PUMS data are a sub-sample of the full ACS and have been top-coded.
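As a quick illustration of how a few high incomes can pull the mean well above the median, here is a small base R sketch. The income values are invented and are not drawn from the ACS.

```r
# Invented, right-skewed incomes: most values are moderate, one is very high.
incomes <- c(30000, 45000, 60000, 65000, 70000, 90000, 500000)

income_mean   <- mean(incomes)    # pulled upward by the single 500,000 value
income_median <- median(incomes)  # the middle value, unaffected by the tail

print(income_mean)
print(income_median)
```

With income data, a mean that sits far above a published median is therefore not necessarily a sign of a coding error.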
            
Thank you! You were right: I was calculating the average but comparing it to medians. Wouldn’t I need to use a weighted median, since not all sampled households are equally representative?
            
That is correct; if you want to produce a statistic that is representative of the population, you should use HHWT to generate a weighted median of HHINCOME. If you’re using Stata, the pctile command is probably the most straightforward approach, as it accepts (frequency) weights. I would also restrict your sample by group quarters status, keeping only cases with GQ = 1, 2, or 5, so that you’re only including households in your analysis. While most household-level variables are not available for group quarters or for vacant units, this can still help make your results more accurate.
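Since the original question used R rather than Stata, here is a minimal base R sketch of a weighted median combined with the GQ restriction. The helper `weighted_median()` is hypothetical (it is not part of base R or IPUMS tooling), the toy data are invented, and the definition used (the smallest income at which the cumulative weight share reaches 50%) is only one common convention; other tools may interpolate and give slightly different results.

```r
# Hypothetical helper: smallest x whose cumulative weight share reaches 50%.
weighted_median <- function(x, w) {
  keep <- !is.na(x) & !is.na(w) & w > 0
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)                      # sort incomes, carrying weights along
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)      # cumulative share of total weight
  x[which(cum_share >= 0.5)[1]]
}

# Toy household records; the values are invented for illustration only.
households <- data.frame(
  HHINCOME = c(20000, 50000, 65000, 120000, 9999999, 30000),
  HHWT     = c(100, 300, 200, 100, 50, 400),
  GQ       = c(1, 1, 2, 5, 1, 3)
)

# Keep households only (GQ = 1, 2, or 5) and drop the HHINCOME N/A code.
hh <- households[households$GQ %in% c(1, 2, 5) &
                   households$HHINCOME != 9999999, ]

wtd_median <- weighted_median(hh$HHINCOME, hh$HHWT)
wtd_mean   <- weighted.mean(hh$HHINCOME, hh$HHWT)

print(wtd_median)
print(wtd_mean)
```

In the real extract you would apply the same steps after the PERNUM, MET2013, and YEAR filters from the original code.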
            
In addition to the issues Ivan raised, another source of inconsistencies in summary statistics for metro areas is that IPUMS is unable to exactly identify all metro areas from public use microdata. See the MET2013 description for an explanation. For example, in the case of San Antonio-New Braunfels, the cases with MET2013 == 41700 omit about 3.6% of the actual residents of the metro area (the “omission error”), and about 0.8% of the identified cases do not reside in the metro area (the “commission error”).