Error when reading ATUS data using ipumspy

I tried to read an ATUS extract I requested on the website using ipumspy, following the suggested read code. These are my exact outputs.

codebook = readers.read_ipums_ddi("/Users/marcusesteban/Documents/Data/atus_00001_ddi.xml")
/opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:49: CitationWarning: Use of data from IPUMS is subject to conditions including that users should cite the data appropriately.
See the ipums_conditions attribute of this codebook for terms of use.
See the ipums_citation attribute of this codebook for the appropriate citation.
warnings.warn(

atus_df = readers.read_microdata_chunked(codebook, filename="/Users/marcusesteban/Downloads/atus_00001.dat.gz", chunksize=1000)

ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Traceback (most recent call last):
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/integer.py:51 in _safe_cast
return values.astype(dtype, casting="safe", copy=copy)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Cell In[60], line 1
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Cell In[60], line 1 in
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:413 in read_microdata_chunked
yield from _read_microdata(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:167 in _read_microdata
yield from (_fix_decimal_expansion(df).astype(dtype) for df in data)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:167 in
yield from (_fix_decimal_expansion(df).astype(dtype) for df in data)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/generic.py:6226 in astype
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/generic.py:6240 in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/managers.py:448 in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/managers.py:352 in apply
applied = getattr(b, f)(**kwargs)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/blocks.py:526 in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:299 in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:230 in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:95 in astype_nansafe
return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/masked.py:132 in _from_sequence
values, mask = cls._coerce_to_array(scalars, dtype=dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/numeric.py:258 in _coerce_to_array
values, mask, _, _ = _coerce_to_data_and_mask(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/numeric.py:214 in _coerce_to_data_and_mask
values = dtype_cls._safe_cast(values, dtype, copy=False)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/integer.py:57 in _safe_cast
raise TypeError(
TypeError: cannot safely cast non-equivalent float64 to int64

I tried the same line again and this happened

ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Traceback (most recent call last):
Cell In[61], line 1
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/util/_decorators.py:331 in wrapper
return func(*args, **kwargs)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/reshape/concat.py:368 in concat
op = _Concatenator(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/reshape/concat.py:425 in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
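
The second error is most likely a side effect of the first rather than a separate problem. read_microdata_chunked() returns a generator, and a generator that has raised an exception is finished, so iterating over it again yields nothing and pd.concat receives an empty list. Here is a minimal sketch of that behavior using a stand-in generator (failing_chunks is a hypothetical name for illustration, not part of ipumspy):

```python
import pandas as pd


def failing_chunks():
    """Hypothetical stand-in for a chunked reader that fails on its first chunk."""
    raise TypeError("cannot safely cast non-equivalent float64 to int64")
    yield pd.DataFrame()  # never reached; its presence makes this a generator


chunks = failing_chunks()

# First pass: the error is raised from inside the generator.
try:
    pd.concat([chunk for chunk in chunks])
except TypeError as err:
    first_error = err

# Second pass over the same object: the generator is already finished,
# so the list is empty and pd.concat raises ValueError instead.
try:
    pd.concat([chunk for chunk in chunks])
except ValueError as err:
    second_error = err

print(type(first_error).__name__, "then", type(second_error).__name__)
```

Re-running the read_microdata_chunked(...) line creates a fresh generator, so the second error goes away once the underlying cast problem is fixed.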

Hi Marcus,

You're running into a bug in ipumspy: a variable that contains floating-point data is incorrectly designated as an integer type, and pandas refuses to cast it.
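
To see the pandas side of this in isolation, here is a small sketch that does not involve ipumspy at all. The weight values are made up and only stand in for a column with fractional values; the point is that the nullable integer dtype rejects them while the nullable float dtype accepts them:

```python
import pandas as pd

# Made-up values with fractional parts, standing in for a weight column.
weights = pd.Series([1234.5, 987.25, 4321.0])

# Casting to the nullable integer dtype fails because the fractions would be lost.
try:
    weights.astype(pd.Int64Dtype())
    cast_error = None
except TypeError as err:
    cast_error = err

# Casting to the nullable float dtype works.
as_float = weights.astype(pd.Float64Dtype())

print(cast_error)
print(as_float.dtype)
```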

This bug will be fixed in the next release of the ipumspy library. In the meantime, you can work around it by creating a dict of data types, designating the WT06 variable as a floating-point type in that dict, and passing the dict to the read_microdata() method. Below is some example code that you can adapt by replacing "padded_id" with your ATUS extract number:

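import pandas as pd
from ipumspy import readers

padded_id = "00001"  # your extract number, zero-padded as in the file names
atus_ddi = readers.read_ipums_ddi(f"atus_{padded_id}.xml")
# start from the pandas dtypes the codebook declares for every variable
dtypes_fixed = atus_ddi.get_all_types(type_format="pandas_type")
# this is the offending variable
dtypes_fixed.update({"WT06": pd.Float64Dtype()})
df_a = readers.read_microdata(
    atus_ddi,
    f"atus_{padded_id}.dat.gz",
    dtype=dtypes_fixed,
)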

Note that this example does not read the data in chunks. However, because the ATUS files are relatively small, you should not see any meaningful difference in performance by reading in your entire extract and then filtering the data frame down to your states of interest.
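
As a rough sketch of that last filtering step, the snippet below uses a small made-up data frame in place of the loaded extract. Only the STATEFIP column name and the three state codes come from the original post; the rows are invented:

```python
import pandas as pd

# Made-up rows standing in for a fully loaded extract.
df_a = pd.DataFrame(
    {
        "STATEFIP": [8, 27, 30, 48, 6],
        "WT06": [1.5, 2.0, 3.25, 4.0, 5.5],
    }
)

states_of_interest = [8, 30, 48]  # the FIPS codes from the original post

# Keep only the rows whose state code is in the list.
ab_df = df_a[df_a["STATEFIP"].isin(states_of_interest)]
print(ab_df)
```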

Hope this helps!


Hey renae,

Thanks for the help. My file covers all the years, though, so it is taking a long time to load. Do you have a chunked solution?

Thanks again!

Hi Marcus,

Sure, you can pass the dtype kwarg to the read_microdata_chunked() method as well. That would look something like this:

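iter_atus = readers.read_microdata_chunked(
    atus_ddi,
    # pass the data file explicitly, as in the read_microdata() call above
    f"atus_{padded_id}.dat.gz",
    chunksize=1000,
    dtype=dtypes_fixed,
)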

df_a = pd.concat([df[df["STATEFIP"] == 27] for df in iter_atus])
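
The same filter-each-chunk pattern extends to several states at once by swapping the equality test for isin. The sketch below is only an illustration: it uses pd.read_csv on a small in-memory CSV as a stand-in for the ipumspy chunk iterator, and the rows are made up. Any iterator of data frame chunks should behave the same way:

```python
import io

import pandas as pd

# Made-up data; pd.read_csv with chunksize stands in for iter_atus here.
csv_text = "STATEFIP,WT06\n8,1.5\n27,2.0\n30,3.25\n48,4.0\n6,5.5\n27,6.0\n"
iter_chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

states_of_interest = [8, 30, 48]  # the FIPS codes from the original post

# Filter each chunk as it arrives so only the matching rows are kept in memory.
filtered = pd.concat(
    [chunk[chunk["STATEFIP"].isin(states_of_interest)] for chunk in iter_chunks],
    ignore_index=True,
)
print(filtered)
```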

Hope this helps!

Thanks, it works but still takes a long time.