Seeking Guidance on using the XML documentation for my sample

Dear Folks–

I am new to XML, and am trying to understand the XML documentation that relates to my sample – what’s in it, and how I get it out. I have been poring over my documentation with the XML file text and the related DDI text in one hand and an “XML for dummies” book in the other. I’ve been trying to understand what I am seeing, thinking I can pull what I want out of it with the R “XML” package or, if worst comes to worst, with regular expressions.

I know that eyeballing this stuff is not how it is intended to be used. Could you just give me the view from a great height of how these files are most often used by a typical user: how they view the information, or hand it to some automated proxy, using what software, etc.? I’ve known for a long time that I need to learn this, but I’m a little lost. If you could give me just a paragraph or two of guidance and overview, or point me to where it already exists on your site, I would be most grateful.

You folks rock. Thanks! —Andrew

The DDI codebook will not display properly outside the website, which is why you are having difficulty interpreting the XML file. There are some workarounds that make the DDI codebook easier to read outside the extract request page. The first option is to use Google Chrome and “print” the XML page as a PDF, which will retain all formatting. A second option is to save the complete page (with images) as an HTML file using the Firefox browser.

I hope this helps.

Is the primary function of the XSD to be a standard for constructing and checking a well-formed XML document which is itself essentially self-documenting? Or is it instructions to the application used to view or extract data from the XML document, such that the document is incomplete without the XSD?

Also, is there human readable documentation on the structure of IPUMS-CPS XML files in particular, or are those files considered to be self-documenting?

Should I be looking at my XML file with a browser? I have been using Notepad++ because it does syntax highlighting, which it seems to do pretty well (on the XSD file too). But if it needs to fetch the XSD file in order to display correctly, I don’t know whether it does that or not.

Chrome opens up the XSD file as indented text (without syntax highlighting). When I try to open the downloaded XML file for one of my samples with Chrome, nothing happens. Do I need a plug-in?

Let me tell you what I want to do, as that may clarify the meaning of my questions. I am hoping to write code to automatically extract metadata from the XML file for a sample that will, first, give PostgreSQL what it needs to read the file, and second, give R what it needs to interpret the data it then imports from PostgreSQL. Finally, I want to build functions to pull out anything in the file that is really aimed at a human, e.g. the text variable descriptions, and produce a little report on a single variable, or a big report on all the variables, on request. I should be able to do that, right? As in, “Can that be done and does it make sense,” not as in, “Am I a good enough programmer to do it?”

My initial response to your question was a bit off-base, and I hope this offers more clarity and guidance.

We find that users predominantly view the DDI codebook in its browser-viewable form and do not often read the XML itself. You should be fine continuing to view the XML and XSD in Notepad++ because, as you said, the syntax highlighting can be immensely helpful when trying to interpret the tags. My previous suggestions to use browsers were aimed at human readability, and are ultimately not helpful for the machine readability that you seek.

Our purpose in using an XSD (found here) is to systematically create a consistently formatted codebook for each extract request. I am not aware of any instances where a user used the XML file to pull variable metadata into a report, but this does not mean it is not possible. I recommend looking into the R “XML” package as you originally suggested, because it will most likely contain some guidance on how to parse XML that uses an XSD.
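To give a rough idea of what pulling variable metadata out of the codebook might look like, here is a minimal sketch in Python using only the standard library (the same approach should carry over to the R “XML” package). Please treat everything specific in it as an assumption: the XML fragment is hand-written for illustration, and the element names (var, location, labl, txt, catgry, catValu), the StartPos/EndPos attributes, and the namespace string are what I would expect from a DDI codebook, so check them against your own file before relying on them. The function names read_variables and describe are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment. A real codebook is much larger, and
# its element names and namespace should be checked against the actual file.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var ID="YEAR" name="YEAR">
      <location StartPos="1" EndPos="4" width="4"/>
      <labl>Survey year</labl>
      <txt>YEAR reports the year of the survey.</txt>
    </var>
    <var ID="SEX" name="SEX">
      <location StartPos="5" EndPos="5" width="1"/>
      <labl>Sex</labl>
      <txt>SEX gives each person's sex.</txt>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
</codeBook>"""

# Assumed namespace; copy the real one from the root element of your file.
NS = {"d": "ddi:codebook:2_5"}


def read_variables(xml_text):
    """Return one dict per <var> element with its position, label and codes."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.findall(".//d:var", NS):
        loc = var.find("d:location", NS)
        variables.append({
            "name": var.get("name"),
            "start": int(loc.get("StartPos")),
            "end": int(loc.get("EndPos")),
            # "d:labl" matches only the variable's own label, not the
            # labels nested inside each <catgry>.
            "label": var.findtext("d:labl", default="", namespaces=NS),
            "description": var.findtext("d:txt", default="", namespaces=NS),
            "categories": {
                c.findtext("d:catValu", namespaces=NS):
                    c.findtext("d:labl", namespaces=NS)
                for c in var.findall("d:catgry", NS)
            },
        })
    return variables


def describe(variable):
    """Build a short plain-text report for one variable."""
    lines = [
        f"{variable['name']}: {variable['label']}",
        f"  columns {variable['start']}-{variable['end']}",
        f"  {variable['description']}",
    ]
    for code, label in variable["categories"].items():
        lines.append(f"  {code} = {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    for v in read_variables(SAMPLE):
        print(describe(v))
        print()
```

The start and end columns are the kind of information a fixed-width import into PostgreSQL would need, and the labels and category codes are what R would need to interpret the values, so one pass over the file could feed both. Note that this sketch does not use the XSD at all; it only reads the XML.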

I’m sorry I cannot provide you with a more specific route to achieving what you are hoping to do. To cover my bases, I want to make sure you are aware of how to open IPUMS data in R such that variable labels and other minor adjustments will be applied.

  1. Generate your extract in Stata format (change “Data Format” on the extract request screen to “STATA”).
  2. Once your extract is complete, you can then access this formatted data extract through the “STATA” link that will be generated under the “Formatted Data” column on the “Download or Revise Extracts” page.
  3. After unzipping the file, you can then read this .dta file into R with the read.dta13 command found in the readstata13 package.

Please don’t hesitate to follow up with additional questions or comments.