More XML/XSD questions

Dear Folks–

This is a follow-up to my question with the subject line “Seeking Guidance on using the XML documentation for my sample.” I originally posted it there, but I noticed that it had been marked “closed”, so I was not sure anybody had seen my follow-up question.

Is the primary function of the XSD (in this case) to be a standard for constructing and checking a well-formed XML document which is itself essentially self-documenting? Or is it instructions to the application used to view or extract data from the XML document, such that the document is incomplete or cannot be interpreted without the XSD?

To the extent that the XSD contributes to the interpretation of the XML, is that contribution primarily formatting for display of the XML’s structure, or instructions on interpreting the XML’s meaningful contents?

Also, is there human readable documentation on the structure of IPUMS-CPS XML files in particular, or are those files considered to be self-documenting?

Should I be looking at my XML file with a browser? I have been using Notepad++ because it does XML syntax highlighting, which it seems to do pretty well (on the XSD file too). But if it needs to grab the XSD file from http://www.ddialliance.org/ in order to display correctly, I don’t know if it does that – probably not.

Chrome opens up the XSD file as indented text (without syntax highlighting). When I try to open up the downloaded XML file for one of my samples with Chrome, nothing happens. Do I need a plug-in?

Let me tell you what I want to do, as that may clarify the meaning of my questions. I am hoping to write code to automatically extract metadata from the XML file for a sample that will give PostgreSQL what it needs to read the file, and second, give R what it needs to interpret the data it then imports from PostgreSQL. Finally, I want to build functions to pull out anything in the file that is really aimed at a human, e.g. the text variable descriptions, and produce a little report on a single variable, or a big report on all the variables, on request.

Does this seem like a sensible, appropriate way to use the XML documentation? Or have I misinterpreted its purpose?

Looking forward to your response, Andrew H.

My initial response to your question was a bit off-base, and I hope this offers more clarity and guidance.

We find that users predominantly view the DDI codebook in its browser-viewable form and do not often interpret the XML coding. You should be fine continuing to view the XML and XSD through Notepad++, because, as you said, the syntax highlighting can be immensely helpful when trying to interpret the tags. My previous suggestions to use browsers were aimed at human readability, and they are ultimately not helpful for the machine readability that you seek.

Our purpose in using an XSD (found here) is to systematically create a consistently formatted codebook for each extract request. I am not aware of any instances where a user used the XML file to pull variable metadata into a report, but this does not mean it is not possible. I recommend looking into the R “XML” package, as you originally suggested, because it most likely will contain some guidance on how to parse XML that uses an XSD.
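To make the idea of pulling variable metadata out of the XML a bit more concrete, here is a minimal sketch in Python using only the standard library. It is not an IPUMS tool. The embedded sample is a hypothetical, heavily simplified fragment modeled on a DDI codebook, and the element and attribute names it relies on (var, location, StartPos, EndPos, labl, txt, catgry, catValu) as well as the namespace are assumptions that should be checked against your own downloaded file. The helper names (local, child, text_of, read_variables, describe) are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a downloaded codebook.
# Element names, attributes, and namespace are assumptions.
SAMPLE_DDI = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var ID="YEAR" name="YEAR">
      <location StartPos="1" EndPos="4" width="4"/>
      <labl>Survey year</labl>
      <txt>YEAR reports the year in which the survey was conducted.</txt>
    </var>
    <var ID="SEX" name="SEX">
      <location StartPos="5" EndPos="5" width="1"/>
      <labl>Sex</labl>
      <txt>SEX gives each person's sex.</txt>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def text_of(elem):
    """Return stripped element text, or '' if the element is missing."""
    return (elem.text or "").strip() if elem is not None else ""


def read_variables(xml_text):
    """Collect name, column positions, label, description, and value labels."""
    root = ET.fromstring(xml_text)
    variables = []
    for elem in root.iter():
        if local(elem.tag) != "var":
            continue
        loc = child(elem, "location")
        categories = {}
        for c in elem:
            if local(c.tag) == "catgry":
                value = text_of(child(c, "catValu"))
                categories[value] = text_of(child(c, "labl"))
        variables.append({
            "name": elem.get("name"),
            "start": int(loc.get("StartPos")) if loc is not None else None,
            "end": int(loc.get("EndPos")) if loc is not None else None,
            "label": text_of(child(elem, "labl")),
            "description": text_of(child(elem, "txt")),
            "categories": categories,
        })
    return variables


def describe(var):
    """Format a short human-readable report for one variable."""
    lines = [
        f"{var['name']}: {var['label']}",
        f"  columns {var['start']}-{var['end']}",
        f"  {var['description']}",
    ]
    for value, label in var["categories"].items():
        lines.append(f"  {value} = {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    for v in read_variables(SAMPLE_DDI):
        print(describe(v))
        print()
```

The start and end positions collected here are the kind of information a fixed-width load step (for example into PostgreSQL) would need, and the labels and descriptions are what a per-variable report would draw on. The same walk over the document can be written in R with the XML package.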

I’m sorry I cannot provide you with a more specific route to achieving what you are hoping to do. To cover my bases, I want to make sure you are aware of how to open IPUMS data in R such that variable labels and other minor adjustments will be applied.

  1. Generate your extract in Stata format (change “Data Format” on the extract request screen to “STATA”).
  2. Once your extract is complete, you can then access this formatted data extract through the “STATA” link that will be generated under the “Formatted Data” column on the “Download or Revise Extracts” page.
  3. After unzipping the file, you can then read this .dta file into R with the read.dta13 function found in the readstata13 package.

Please don’t hesitate to follow up with additional questions or comments.