Error when reading atus data using ipumspy

Marcus_Esteban · June 6, 2024, 2:36am

I tried to read an ATUS extract I requested on the website using ipumspy, following the suggested read code. These are my exact outputs.

codebook = readers.read_ipums_ddi("/Users/marcusesteban/Documents/Data/atus_00001_ddi.xml")
/opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:49: CitationWarning: Use of data from IPUMS is subject to conditions including that users should cite the data appropriately.
See the ipums_conditions attribute of this codebook for terms of use.
See the ipums_citation attribute of this codebook for the appropriate citation.
warnings.warn(

atus_df = readers.read_microdata_chunked(codebook, filename="/Users/marcusesteban/Downloads/atus_00001.dat.gz", chunksize=1000)

ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Traceback (most recent call last):
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/integer.py:51 in _safe_cast
return values.astype(dtype, casting="safe", copy=copy)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Cell In[60], line 1
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Cell In[60], line 1 in
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:413 in read_microdata_chunked
yield from _read_microdata(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:167 in _read_microdata
yield from (_fix_decimal_expansion(df).astype(dtype) for df in data)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/ipumspy/readers.py:167 in
yield from (_fix_decimal_expansion(df).astype(dtype) for df in data)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/generic.py:6226 in astype
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/generic.py:6240 in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/managers.py:448 in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/managers.py:352 in apply
applied = getattr(b, f)(**kwargs)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/internals/blocks.py:526 in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:299 in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:230 in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:95 in astype_nansafe
return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/masked.py:132 in _from_sequence
values, mask = cls._coerce_to_array(scalars, dtype=dtype, copy=copy)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/numeric.py:258 in _coerce_to_array
values, mask, _, _ = _coerce_to_data_and_mask(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/numeric.py:214 in _coerce_to_data_and_mask
values = dtype_cls._safe_cast(values, dtype, copy=False)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/arrays/integer.py:57 in _safe_cast
raise TypeError(
TypeError: cannot safely cast non-equivalent float64 to int64

I ran the same line again and got a different error:

ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
Traceback (most recent call last):
Cell In[61], line 1
ab_df = pd.concat([ab_df[ab_df['STATEFIP'].isin([8, 30, 48])] for ab_df in atus_df])
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/util/_decorators.py:331 in wrapper
return func(*args, **kwargs)
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/reshape/concat.py:368 in concat
op = _Concatenator(
File /opt/miniconda3/envs/default/lib/python3.11/site-packages/pandas/core/reshape/concat.py:425 in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

renae · June 6, 2024, 6:36pm

Hi Marcus,

You're running into a bug in ipumspy: a variable that holds floating-point data is incorrectly designated as an integer type by ipumspy, and pandas refuses to make that cast.
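
To see the failure in isolation, here is a minimal pandas-only sketch. The values are made up for illustration and are not taken from any ATUS extract; the only assumption is that a column containing non-integer floats is being cast to pandas' nullable integer type, as the traceback suggests.

```python
import pandas as pd

# Hypothetical weight-like values; the fractional parts are what matter.
weights = pd.Series([1234.5, 987.25, 4321.0])

try:
    weights.astype(pd.Int64Dtype())
    cast_error = None
except TypeError as err:
    # pandas will not silently drop the fractional part
    cast_error = err

# Casting to a nullable float type instead keeps the values intact.
as_float = weights.astype(pd.Float64Dtype())
print(type(cast_error).__name__, as_float.dtype)
```

The integer cast raises a TypeError, while the float cast goes through, which is why overriding the type of the affected variable works around the problem.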

This bug will be fixed in the next release of the ipumspy library. In the meantime, you can work around it by creating a dict of data types, designating the WT06 variable as a floating-point type, and passing this dict to read_microdata(). Below is some example code that you can adapt by replacing "padded_id" with your ATUS extract number:

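import pandas as pd
from ipumspy import readers

# example value: replace with your own zero-padded extract number
padded_id = "00001"

atus_ddi = readers.read_ipums_ddi(f"atus_{padded_id}.xml")
# start from the data types ipumspy derives from the DDI
dtypes_fixed = atus_ddi.get_all_types(type_format="pandas_type")
# this is the offending variable
dtypes_fixed.update({"WT06": pd.Float64Dtype()})
df_a = readers.read_microdata(
    atus_ddi,
    f"atus_{padded_id}.dat.gz",
    dtype=dtypes_fixed,
)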

Note that this example does not read the data in chunks. However, because the ATUS files are relatively small, you should not see any meaningful difference in performance by reading in your entire extract and then filtering the data frame down to your states of interest.
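
As a rough sketch of that filtering step, assuming the extract has already been read into a data frame: the frame below is a small made-up stand-in for df_a, and the FIPS codes 8, 30, and 48 are the ones from your original filter.

```python
import pandas as pd

# Made-up stand-in for the data frame returned by read_microdata()
df_a = pd.DataFrame(
    {
        "STATEFIP": [8, 27, 30, 48, 6],
        "WT06": [1500.5, 2300.25, 1800.0, 950.75, 3100.5],
    }
)

# Keep only the states of interest
states = [8, 30, 48]
ab_df = df_a[df_a["STATEFIP"].isin(states)].reset_index(drop=True)
print(ab_df)
```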

Hope this helps!

Marcus_Esteban · June 6, 2024, 7:36pm

Hey renae,

Thanks for the help. My file covers all the years, though, so it is taking a long time to load. Do you have a chunked solution?

Thanks again!

renae · June 6, 2024, 8:03pm

Hi Marcus,

Sure, you can pass the dtype kwarg to the read_microdata_chunked() method as well. That would look something like this:

iter_atus = readers.read_microdata_chunked(
     atus_ddi, 
     chunksize=1000, 
     dtype=dtypes_fixed
)

df_a = pd.concat([df[df["STATEFIP"] == 27] for df in iter_atus])
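
One more thing to watch for: the chunked reader hands back a generator, and a generator can only be consumed once. That is most likely why re-running your pd.concat line gave "No objects to concatenate": the first run had already used up the generator when it failed, so the second run found nothing left to read. Here is a pandas-only sketch of the same behavior, where fake_chunks() is a hypothetical stand-in for the reader and the error is simulated:

```python
import pandas as pd


def fake_chunks():
    """Hypothetical stand-in for a chunked reader that fails partway through."""
    yield pd.DataFrame({"STATEFIP": [8, 27]})
    raise TypeError("simulated cast failure")


chunks = fake_chunks()

# First pass: the error surfaces while the generator is being consumed.
try:
    pd.concat([c for c in chunks])
    first_error = None
except TypeError as err:
    first_error = err

# Second pass over the same generator: it is finished, so nothing is left.
leftover = [c for c in chunks]
try:
    pd.concat(leftover)
    second_error = None
except ValueError as err:
    second_error = err

print(len(leftover), second_error)
```

So if a run fails, call read_microdata_chunked() again to get a fresh iterator before retrying the concat.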

Hope this helps!

Marcus_Esteban · June 6, 2024, 10:31pm

Thanks, it works but still takes a long time.
