Hi Luke,
Responding to the *second issue* first: how do the steps in the construction of the NHGIS 2000->2010 block crosswalk relate to the equations in the article (Schroeder 2007)?
In the case of the 2000->2010 crosswalk, the “source zones” (s) are 2000 blocks and the “target zones” (t) are 2010 blocks.
In the article, Y is a variable of interest measured on source zones. In the application case, it could be any feature for which we have 2000 block counts (e.g., population, housing units, population by race, etc.), and y[s] is a specific 2000 count for block s (e.g., the 2000 population of a 2000 block).
The final crosswalk does not include values of Y. It includes only weights indicating the portion of y[s] that we expect to lie in each target zone t. I.e., each weight gives us an estimate of (y[st] / y[s]) for a specific source zone s and target zone t.
To use the crosswalk file, we would ultimately multiply y[s] (a 2000 block count) by EST(y[st] / y[s]) (an interpolation weight) to obtain EST(y[st]) (the estimated count of a 2000 characteristic Y within an intersection between 2000 block s and 2010 block t). To get EST(y[t]), an estimate of the 2000 characteristic Y for 2010 block t, we sum EST(y[st]) within t.
In the article, Z is a variable measured on target zones that (we expect) has a spatial distribution similar to Y’s. For the NHGIS crosswalk, Z is the sum of 2010 population and housing units, and z[t] is the sum of 2010 population and housing units in a specific 2010 block t.
In the article, Equation 3 specifies that:
EST(y[st]) = (z[st] / z[s]) * y[s]
From this, the formula for an interpolation weight is:
EST(y[st] / y[s]) = z[st] / z[s]
In application, z[st] is the sum of 2010 population and housing in the area of intersection of 2000 block s and 2010 block t, and z[s] is the sum of 2010 population and housing in 2000 block s.
In the text describing the NHGIS crosswalk construction, Step 1 estimates z[st] values, Step 2 estimates z[s] values, and Step 3 computes all interpolation weights as z[st] / z[s].
The formula for Step 1 comes from Equation 6 in the article:
EST(z[st]) = (A[st] / A[t]) * z[t]
In application, this means that EST(z[st]) is equal to:
z[t] (the sum of 2010 population and housing units in 2010 block t)
* A[st] (the area of intersection between 2000 block s and 2010 block t)
/ A[t] (the area of 2010 block t)
We compute these areas as land areas _except_ when the land area of the 2010 block is 0, in which case we use water areas.
Finally, coming back to the *first issue*…
I see a few errors in Step 1 of your code. It appears to be using this equation:
EST(z[st]) = (A[st] / A[s]) * pop[t]
It should use:
EST(z[st]) = (A[st] / A[t]) * z[t]
Z should be set to the sum of 2010 population and housing units, not 2010 population alone.
The “a0” component should be the area of intersection divided by the _2010_ block’s area, not the 2000 block’s area.
And finally, additional special handling is required in instances where the land area of the 2010 block is zero, in which case water are should be used.
Later, in Step 3, more special handling is needed where EST(z[s]) = 0 (the estimated sum of 2010 population and housing units in 2000 block s is zero), in which case the interpolation weight should be set to A[st] / A[s] (with, again, special handling to use water area if the land area is zero).
Finally, note that the NHGIS crosswalk uses this TDW model only in the 95.7% of cases where the advanced model was not used. For the other 4.3% (the “blocks of interest”), the TDW weight will not generally equal the weight in the crosswalk.