The most disaggregated geography available


Dear Folks–
If I want to break CPS geography up into the largest number of distinct pieces possible, should I interact all the geography variables (except city size), or are some of them redundant?

Do you know if I am right in thinking that if anyone appears in the CPS sample in a particular geographic unit in a given year, everyone in that year who is in that geographic unit will also be shown there (i.e. not geographically anonymized)?



I’ll address each of your questions one at a time:

(1) If you interact multiple geographic variables available in IPUMS CPS (e.g. STATEFIP, METAREA, COUNTY), there will inevitably be overlap. In this three-variable example, every county lies within a single state, while metropolitan areas may cross state lines, so all three variables will overlap to some degree, mostly through STATEFIP. If you’d like the largest number of distinct pieces possible, I’d recommend working with either STATEFIP/COUNTY or METAREA. However, neither will give you full geographic coverage of the United States.

(2) What you are hypothesizing about individuals in a particular geographic unit in a given year is true: if a county or metropolitan area is identifiable for one individual then every person within that county or metropolitan area should be identified in the CPS.


I think I am using “interact” the way R users and some linear modelers do, where to interact variables is to take all the combinations, like if you have a set of state dummies, a set of county dummies, a set of metro dummies, and so forth, and then count all the distinct combinations that have people in them. I don’t think that will produce overlapping areas. But if any of these categories corresponds exactly to another, like I suspect STATEFIP and STATECENSUS do, then I’d get perfect multicollinearity and my computer would yell at me. (That’s why I keep the volume low.) If we had all the counties, I could drop STATE, since counties never cross state lines, and counties exhaust states. I cannot do that because there is a residual “rest of state” category.
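A minimal sketch of the counting idea described here, written in Python rather than R and using only the standard library. The helper names `count_cells` and `redundant_variables` are hypothetical, and the records are toy rows with illustrative codes, not real CPS data. A variable is flagged as redundant when dropping it leaves the number of occupied cells unchanged.

```python
def count_cells(rows, variables):
    """Number of distinct value combinations that actually contain people."""
    return len({tuple(row[v] for v in variables) for row in rows})


def redundant_variables(rows, variables):
    """Variables whose removal leaves the number of occupied cells unchanged."""
    full = count_cells(rows, variables)
    return [
        v for v in variables
        if count_cells(rows, [u for u in variables if u != v]) == full
    ]


# Toy records with illustrative codes (assumed, not taken from a CPS extract).
# COUNTY == 0 and METAREA == 0 stand in for "not identified".
people = [
    {"STATEFIP": 27, "STATECENSUS": 41, "COUNTY": 27053, "METAREA": 5120},
    {"STATEFIP": 27, "STATECENSUS": 41, "COUNTY": 0, "METAREA": 5120},
    {"STATEFIP": 27, "STATECENSUS": 41, "COUNTY": 0, "METAREA": 0},
    {"STATEFIP": 55, "STATECENSUS": 35, "COUNTY": 0, "METAREA": 5120},
    {"STATEFIP": 55, "STATECENSUS": 35, "COUNTY": 0, "METAREA": 0},
]

geo_vars = ["STATEFIP", "STATECENSUS", "COUNTY", "METAREA"]
n_cells = count_cells(people, geo_vars)
redundant = redundant_variables(people, geo_vars)
print(n_cells, redundant)
```

In this toy data STATEFIP and STATECENSUS are each flagged, because either one can be recovered from the other. Only one of the pair can be dropped, so it is safest to drop one variable at a time and rerun the check.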

But I think I can (and should) drop MSACMSZ, because interacting it with METAREA should just get me back the METAREAs. But I don’t know about CBSASZ, because there are no CBSAs in the CPS geography. So it is possible (though I do not know if it is true) that CBSAs could define collections of counties not all of which, and maybe not any of which, are individually identified. On the other hand, if the CBSAs are defined only over counties which are themselves identified, then each CBSA size category will define a group of counties that sum to , so we would have perfect collinearity again. That seems to be the hardest case, in terms of knowing what to do…


Thank you for the clarification – I think I have a better idea of what you are trying to do now.

Yes, STATEFIP and STATECENSUS would be perfectly correlated. The two variables just use different codes.

As for MSACMSZ/CBSASZ, I suppose you might have perfect correlation if the size of each metropolitan area or core based statistical area was entirely unique to that area (i.e. there were no two areas with the same exact size). Everyone in a metropolitan area will have the same CBSASZ value, but not everyone with the same CBSASZ will be in the same metropolitan area. Additionally, note that MSACMSZ is the old version of CBSASZ. There is more detailed information about CBSAs in the CBSASZ comparability section.
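One way to check the one-directional relationship described above is to test whether each value of one variable maps to a single value of the other. This is a sketch only: `determines` is a hypothetical helper, and the area and size codes below are invented for illustration.

```python
from collections import defaultdict


def determines(rows, a, b):
    """True if every observed value of `a` maps to exactly one value of `b`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return all(len(values) == 1 for values in seen.values())


# Invented codes: two areas share a size category, a third has its own.
sample = [
    {"METAREA": 101, "CBSASZ": 4},
    {"METAREA": 101, "CBSASZ": 4},
    {"METAREA": 202, "CBSASZ": 4},
    {"METAREA": 303, "CBSASZ": 6},
]

area_fixes_size = determines(sample, "METAREA", "CBSASZ")
size_fixes_area = determines(sample, "CBSASZ", "METAREA")
print(area_fixes_size, size_fixes_area)
```

When the first check holds, interacting the size variable with METAREA adds no new cells. The second check failing is why the size variable alone cannot stand in for METAREA.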


Thanks, Michelle, that’s very helpful.

Poking about the internet, I find two different and incompatible assertions about CBSAs, with different implications about :

  1. CBSAs include one or more Micropolitan areas of size 10,000 to 50,000 as their largest population center; and
  2. All MSAs are also CBSAs, but not conversely.

Given that the CBSASZ variable completely replaces the MSACMSZ variable in the CPS, plus your comments above, it sounds like the latter, and not the former, is true. Would you agree with that? But looking at METAREA, it also does not seem to have undergone anything like the expansion in number of codes you would expect if CBSAs for smaller urban cores were added to the ongoing MSAs. Since that did not happen, I’m not quite sure what did. It looks to me like they just fiddled around a little bit with the boundaries and the way they define subdivisions of larger urban areas.

That leaves me with (at least) one major area of uncertainty. I cannot tell how CBSASZ and MSACMSZ relate to METAREA. Do you know if everyone assigned to a specified METAREA also gets a specified value for one of the size variables? What about conversely: do some people with a size value nonetheless not have a specified METAREA?


The names of these geographies, and how they relate to one another, are fairly confusing because they are so similar, so I will start by naming all of the players:

  • Metropolitan Statistical Areas (MSAs) pre-2004

  • Primary Metropolitan Statistical Areas (PMSAs) pre-2004

  • Consolidated Metropolitan Statistical Areas (CMSAs) pre-2004

  • Core-Based Statistical Areas (CBSAs) May 2004 onward

  • Micropolitan Statistical Areas May 2004 onward (They didn’t give this one an acronym, probably because there were none left. They also stopped using the acronym “MSA” at this time and just referred to Metropolitan Statistical Areas, so I will do the same when talking about the new identification system here as well)

The first three, MSAs, PMSAs, and CMSAs, represent the old Metropolitan Area identification system (pre-2004), where PMSAs nest within CMSAs while MSAs are free-standing (see the MSACMSZ comparability tab).

Core-Based Statistical Areas (CBSAs) are the units of the newer identification system and can represent either Metropolitan Statistical Areas or Micropolitan Statistical Areas. However, since no Micropolitan Statistical Area is large enough to be identifiable in the CPS, only Metropolitan Statistical Areas are identified by METAREA and CBSASZ (you will notice the CBSASZ bins start at “100,000-249,999” population). Hopefully this addresses the first part of your question, but let me know if it does not.

To the second part of your question:

Not quite, is the answer. Every METAREA has an associated size value for CBSASZ in May 2004 onward samples, but for the earlier samples it depends on whether or not an area is an identifiable PMSA nested within an identified CMSA. If the PMSA is identifiable, individuals living within it will have independent values for both MSACMSZ (size of the larger CMSA) and MSAPMSZ (size of the specific PMSA within the CMSA); otherwise they should have the same value for both. Everyone with a size value will be identified to a specific METAREA.
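To check this against an actual extract, a simple two-way count of "area identified" against "size identified" would surface any mismatches. The sketch below is illustrative: `coverage_table` is a hypothetical helper, the rows are invented, and treating code 0 as "not identified" for both variables is an assumption that should be confirmed against the codebook for each variable.

```python
from collections import Counter


def coverage_table(rows, area_var, size_var, area_missing, size_missing):
    """Count people by (area identified?, size identified?)."""
    table = Counter()
    for row in rows:
        has_area = row[area_var] not in area_missing
        has_size = row[size_var] not in size_missing
        table[(has_area, has_size)] += 1
    return table


# Invented rows; 0 is ASSUMED to mean "not identified" for both variables.
extract = [
    {"METAREA": 101, "CBSASZ": 4},
    {"METAREA": 202, "CBSASZ": 6},
    {"METAREA": 0, "CBSASZ": 0},
]

table = coverage_table(extract, "METAREA", "CBSASZ", {0}, {0})
for (has_area, has_size), n in sorted(table.items()):
    print(f"area identified={has_area}, size identified={has_size}: {n}")
```

Any nonzero count in the mixed cells, `(True, False)` or `(False, True)`, would point to people who have one kind of value without the other. For pre-2004 samples the same function could be run with MSACMSZ as the size variable.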

I hope this helps.