I am working with census data from 1970 to 2010 to study immigration into the US. YRIMMIG groups years of immigration into intervals and codes each interval by its latest possible year of arrival. This complicates my regression models because some years are arbitrarily overrepresented in the samples. Could you please guide me on dealing with this? (YRSUSA1 and YRSUSA2 have far too many missing values and are unreliable.) Thank you!
While there is no official method for dealing with the intervals in YRIMMIG, there are two common approaches to intervalled data in general. First, you could simply assign each respondent the midpoint of the interval to which they belong. If the distribution within the interval is approximately uniform or bell-shaped around the midpoint, this should give you a reasonable approximation. Since the 2000-onward data are not intervalled, you can look at how single years of arrival are distributed in those samples and use that to judge whether the assumption is plausible for the earlier, intervalled years. Second, you could randomly assign a year to each respondent from within the interval to which they belong. This more closely approximates the variance of the non-intervalled data, since respondents in the same interval no longer all receive the same midpoint value. With this method you could also account for non-normal or non-uniform distributions by drawing years with unequal probabilities. Ultimately, the decision to use one of these methods, or another method, is left to the discretion of the researcher.
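As a rough illustration of the two approaches, here is a minimal Python sketch using only the standard library. The interval bounds in `INTERVALS`, the sample codes, and the function names `midpoint_year` and `random_year` are all hypothetical and chosen for the example; the actual intervals differ by sample and should be taken from the YRIMMIG codebook.

```python
import random

# Hypothetical mapping: reported code (latest possible year) -> (first, last) year.
# These bounds are illustrative only, not the real YRIMMIG intervals.
INTERVALS = {
    1964: (1960, 1964),
    1969: (1965, 1969),
    1974: (1970, 1974),
}

def bounds(code):
    """Return the interval for a code; non-intervalled codes map to themselves."""
    return INTERVALS.get(code, (code, code))

def midpoint_year(code):
    """Method 1: assign the midpoint of the interval."""
    lo, hi = bounds(code)
    return (lo + hi) / 2

def random_year(code, rng, weights=None):
    """Method 2: draw a year from the interval.

    The draw is uniform unless per-year weights are supplied, which allows
    a non-uniform distribution within the interval.
    """
    lo, hi = bounds(code)
    years = list(range(lo, hi + 1))
    return rng.choices(years, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the random assignment is reproducible
codes = [1964, 1969, 1969, 1974, 2003]  # made-up respondent values

midpoints = [midpoint_year(c) for c in codes]
draws = [random_year(c, rng) for c in codes]

# Example of a non-uniform draw that favors later years in the interval.
skewed = random_year(1969, rng, weights=[1, 1, 2, 3, 3])
```

If you use the random assignment, it may be worth re-running your models under several seeds to see how sensitive the estimates are to the particular draw.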
Hope this helps.