USA API variable attach_characteristics

jcolond · October 20, 2023, 3:29am

Hi, I’m just getting started with the API, it’s been very smooth so far. I appreciate the option to make my process easier to replicate.

I’m trying to get attach_characteristics to work with the USA collection, but the code from this post: https://forum.ipums.org/t/how-to-get-access-to-characteristics-from-the-cps-api/5176 is not working for me with UsaExtract. Please see my code and output at bottom.

I can successfully run an extract without attached characteristics by just including a list of variables, e.g.: EXT=UsaExtract(['us2020a'],['RACED'],data_format="stata",description="Description",). When I then try to attach characteristics to the object, it runs without complaint but does not attach the supplemental variables: EXT.attach_characteristics('RACED',["mother","father"])

NameError Traceback (most recent call last)
Cell In[56], line 2
1 DOWNLOAD_DIR=pathlib.WindowsPath('c:/users/jcolo/Box/Dissertation/Descriptive/Data')
----> 2 EXT=UsaExtract(['us2020a'],[Variable(name='RACED', attach_characteristics=["mother","father"])],data_format="stata",description="API retrieval",)

NameError: name 'Variable' is not defined

fran · October 23, 2023, 7:05pm

Hi Jay!

From the traceback you shared, it looks like you’re calling Variable like so:

Variable(name='RACED', attach_characteristics=["mother","father"])

The attached characteristics parameter is named attached_characteristics rather than attach_characteristics. Try changing that and hopefully your issue will be resolved!

jcolond · October 23, 2023, 9:16pm

Hi Fran, thank you so much for replying!

I updated my code, but it unfortunately appears to hang up at the same place, output at bottom.

Here is how I load it:
from ipumspy import IpumsApiClient, UsaExtract, readers, ddi

NameError Traceback (most recent call last)
Cell In[4], line 1
----> 1 EXT=UsaExtract(['us2020a'], [Variable(name='RACED', attached_characteristics=["mother","father"])],data_format="stata",description="API retrieval",)

NameError: name 'Variable' is not defined
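
A NameError like the one above means the name Variable has no binding in the current session; the import line shown brings in IpumsApiClient, UsaExtract, readers and ddi, but not Variable. The sketch below is one way to check where, if anywhere, an installed ipumspy exposes that name. The two candidate module paths are assumptions and are not confirmed anywhere in this thread; find_variable_class is a hypothetical helper, not part of ipumspy.

```python
import importlib

# Candidate locations are assumptions; this thread does not say where
# (or whether) a given ipumspy version exposes `Variable`.
CANDIDATES = ("ipumspy", "ipumspy.api.extract")


def find_variable_class(candidates=CANDIDATES, name="Variable"):
    """Return (module_name, obj) for the first importable module exposing `name`, else None."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or the whole package) is not installed; try the next one.
            continue
        if hasattr(module, name):
            return module_name, getattr(module, name)
    return None


found = find_variable_class()
if found is None:
    print("Variable not found; check the installed ipumspy version")
else:
    print(f"from {found[0]} import Variable")
```

If the probe finds nothing, the EXT.attach_characteristics(...) method used elsewhere in this thread avoids the need for Variable altogether.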

renae · October 23, 2023, 11:54pm

Hi Jay,

I suspect the problem may actually be in your download step. I was able to successfully create an extract with race of the mother and father attached with the following code:

import os
import pandas as pd
from ipumspy import IpumsApiClient, UsaExtract

ipums_api = IpumsApiClient(os.getenv("IPUMS_API_KEY"))

EXT=UsaExtract(
    ['us2020a'],['RACED'],
    data_format="stata",
    description="Description",)

EXT.attach_characteristics('RACED',["mother","father"])

ipums_api.submit_extract(EXT)
ipums_api.wait_for_extract(EXT)
ipums_api.download_extract(EXT)

df = pd.read_stata(f"{EXT.collection}_{str(EXT.extract_id).zfill(5)}.dta.gz", compression="gzip")

Looking at the race of the mother I see the following:

df["race_mom"].value_counts()

white                               431864
two major races                      64781
black/african american               63636
other race, nec                      52276
other asian or pacific islander      39021
chinese                              11229
american indian or alaska native     10938
three or more major races             4544
japanese                              2139
Name: race_mom, dtype: int64

Are you able to successfully get your expected extract using this code snippet?

If you are still having trouble, please try upgrading to the most recent version of ipumspy (if you haven’t already) which is v0.4.1.

I hope that helps!
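
The read_stata call above rebuilds the downloaded file name from the collection and a five-digit, zero-padded extract number. A minimal sketch of that naming step as a reusable function follows; extract_path and its download_dir and suffix parameters are hypothetical names introduced here, and the pattern itself is taken only from the f-string in this post.

```python
from pathlib import Path


def extract_path(collection, extract_id, download_dir=".", suffix=".dta.gz"):
    """Build the path of a downloaded extract, e.g. usa_00042.dta.gz.

    Mirrors f"{EXT.collection}_{str(EXT.extract_id).zfill(5)}.dta.gz"
    from the post above; the helper itself is illustrative only.
    """
    file_name = f"{collection}_{str(extract_id).zfill(5)}{suffix}"
    return Path(download_dir) / file_name


# Usage with made-up values: collection "usa", extract number 42.
print(extract_path("usa", 42).name)
```

With a submitted extract object, the call would presumably be extract_path(EXT.collection, EXT.extract_id).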

jcolond · October 24, 2023, 4:45am

Thank you!! Such a relief to have some success, I am able to retrieve race_mom and race_pop using your code.

When I select multiple variables, I do not see attached variables. I am not super experienced in Python, please pardon me if this is my Python and not ipumspy. The code below will run without errors:

EXT=UsaExtract(
    ['us1910k','us1920a','us1930a','us1940a','us1950a','us1960a',
     'us1970c','us1980b','us1990b','us2000d','us2010a','us2020a'],
    ['YEAR','SAMPLE','SERIAL','CBSERIAL','HHWT','CLUSTER','STRATA','GQ','PERNUM','PERWT',
     'SEX','AGE','MARST','BIRTHYR','RACE','RACED','HISPAN','HISPAND','BPL','BPLD',
     'EDUC','EDUCD','FTOTINC','INCWAGE','OCCSCORE'],
    data_format="stata",
    description="API retrieval",)

EXT.attach_characteristics('RACED',["mother","father"])
#EXT.attach_characteristics('BPLD',["mother","father"])
#EXT.attach_characteristics('EDUCD',["mother","father"])

When I look at the dataframe, however, I don’t have attached variables:

DOWNLOAD_DIR=pathlib.WindowsPath('c:/users/jcolo/Box/Dissertation/Descriptive/Data')
ipums_api.submit_extract(EXT)
ipums_api.wait_for_extract(EXT)
ipums_api.download_extract(EXT, download_dir=DOWNLOAD_DIR)
gz_file=(f"{DOWNLOAD_DIR}/{EXT.collection}_{str(EXT.extract_id).zfill(5)}.dta.gz")
with gzip.open(gz_file, 'rb') as f_in:
    with open('extract.dta','wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

df=pd.read_stata('extract.dta', convert_categoricals=False)
df.describe

<bound method NDFrame.describe of           year  sample   serial      cbserial   hhwt       cluster
0         1910  191002      101           NaN  100.0  1.910000e+12
1         1910  191002      101           NaN  100.0  1.910000e+12
2         1910  191002      102           NaN  100.0  1.910000e+12
3         1910  191002      201           NaN  100.0  1.910000e+12
4         1910  191002      201           NaN  100.0  1.910000e+12
...        ...     ...      ...           ...    ...           ...
21141062  2020  202001  1193466  2.020001e+12  112.0  2.020012e+12
21141063  2020  202001  1193466  2.020001e+12  112.0  2.020012e+12
21141064  2020  202001  1193467  2.020001e+12   50.0  2.020012e+12
21141065  2020  202001  1193467  2.020001e+12   50.0  2.020012e+12
21141066  2020  202001  1193468  2.020001e+12  172.0  2.020012e+12

               strata   gq  pernum  perwt  ...  raced  hispan  hispand  bpl  \
0         110100100.0    1       1  100.0  ...    200       0        0    1
1         110100100.0    1       2  100.0  ...    210       0        0    1
2         110100100.0    1       1  100.0  ...    210       0        0    1
3         110100100.0    1       1  100.0  ...    100       0        0    1
4         110100100.0    1       2  100.0  ...    100       0        0    1
...               ...  ...     ...    ...  ...    ...     ...      ...  ...
21141062      50056.0    1       5  103.0  ...    100       0        0   49
21141063      50056.0    1       6  107.0  ...    100       0        0   56
21141064      20056.0    1       1   50.0  ...    100       0        0   46
21141065      20056.0    1       2   53.0  ...    100       0        0   31
21141066      30056.0    1       1  172.0  ...    100       0        0   56

          bpld  educ  educd   ftotinc   incwage  occscore
0          100   NaN    NaN       NaN       NaN         6
1          100   NaN    NaN       NaN       NaN         4
2          100   NaN    NaN       NaN       NaN         6
3          100   NaN    NaN       NaN       NaN        80
4          100   NaN    NaN       NaN       NaN         0
...        ...   ...    ...       ...       ...       ...
21141062  4900   1.0   17.0  108000.0  999999.0         0
21141063  5600   1.0   14.0  108000.0  999999.0         0
21141064  4600   6.0   63.0   27000.0       0.0         0
21141065  3100   6.0   63.0   27000.0       0.0         0
21141066  5600  11.0  114.0   22000.0   22000.0        20

[21141067 rows x 25 columns]>

EDIT I realized my describe was not showing all vars – here’s the Stata output:

. describe

Contains data from C:\Users\jcolo\Box\Dissertation\Descriptive\extract.dta
Observations: 21,141,067
Variables: 25                  24 OCT 2023 04:07

Variable  Storage Display Value
name      type    format  label    Variable label

year      int     %8.0g   YEAR     census year
sample    long    %12.0g  SAMPLE   ipums sample identifier
serial    long    %12.0g           household serial number
cbserial  double  %12.0g           original census bureau household
                                   serial number
hhwt      double  %12.0g           household weight
cluster   double  %12.0g           household cluster for variance
                                   estimation
strata    double  %12.0g           household strata for variance
                                   estimation
gq        byte    %8.0g   GQ       group quarters status
pernum    byte    %8.0g            person number in sample unit
perwt     double  %12.0g           person weight
sex       byte    %8.0g   SEX      sex
age       int     %8.0g   AGE      age
marst     byte    %8.0g   MARST    marital status
birthyr   int     %8.0g            year of birth
race      byte    %8.0g   RACE     race [general version]
raced     int     %8.0g   RACED    race [detailed version]
hispan    byte    %8.0g   HISPAN   hispanic origin [general version]
hispand   int     %8.0g   HISPAND  hispanic origin [detailed version]
bpl       int     %8.0g   BPL      birthplace [general version]
bpld      long    %12.0g  BPLD     birthplace [detailed version]
educ      byte    %8.0g   EDUC     educational attainment [general
                                   version]
educd     int     %8.0g   EDUCD    educational attainment [detailed
                                   version]
ftotinc   long    %12.0g           total family income
incwage   long    %12.0g  INCWAGE  wage and salary income
occscore  byte    %8.0g            occupational income score
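
Since a truncated display can hide columns, a quicker check is to scan the column names for the attached-characteristic suffixes. The sketch below assumes only the two suffixes that actually appear in this thread (_mom and _pop); other suffixes may exist but are not covered here, and attached_columns is a hypothetical helper. With a pandas DataFrame, list(df.columns) would supply the names.

```python
# Suffixes seen in this thread (race_mom, race_pop, bpld_mom);
# treating these two as the full set is an assumption.
ATTACHED_SUFFIXES = ("_mom", "_pop")


def attached_columns(columns, suffixes=ATTACHED_SUFFIXES):
    """Return the column names that end in one of the given suffixes."""
    return [c for c in columns if c.lower().endswith(suffixes)]


# The 25 variable names from the Stata `describe` output above.
columns = [
    "year", "sample", "serial", "cbserial", "hhwt", "cluster", "strata",
    "gq", "pernum", "perwt", "sex", "age", "marst", "birthyr", "race",
    "raced", "hispan", "hispand", "bpl", "bpld", "educ", "educd",
    "ftotinc", "incwage", "occscore",
]

found = attached_columns(columns)
print(found if found else "no attached characteristics in this file")
```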

renae · October 24, 2023, 1:52pm

Hi Jay,

I’m glad we solved your first problem! It looks like your lines of code to attach the mother’s and father’s birth place and education are commented out:

EXT.attach_characteristics('RACED',["mother","father"])
#EXT.attach_characteristics('BPLD',["mother","father"])
#EXT.attach_characteristics('EDUCD',["mother","father"])

The # in front of the line is equivalent to a * or // in a Stata .do file and will keep those lines from being executed. I was able to get an extract with both RACED and BPLD for the mother and the father.

ipums_api = IpumsApiClient(os.getenv("IPUMS_API_KEY"))

EXT=UsaExtract(
    ['us2020a'],
    ['RACED', 'BPLD'],
    data_format="stata",
    description="Description",)

EXT.attach_characteristics('RACED',["mother","father"])
EXT.attach_characteristics('BPLD', ["mother", "father"])

ipums_api.submit_extract(EXT)
ipums_api.wait_for_extract(EXT)
ipums_api.download_extract(EXT)

df = pd.read_stata(f"{EXT.collection}_{str(EXT.extract_id).zfill(5)}.dta.gz", compression="gzip")

Here are the 10 most common values of mother’s birthplace in the data I downloaded:

df["bpld_mom"].value_counts()[:10]

california      53332
mexico          43848
new york        38769
texas           32547
pennsylvania    30686
illinois        28275
ohio            25912
michigan        24301
florida         13875
new jersey      13859
Name: bpld_mom, dtype: int64

Hope that helps!

jcolond · October 24, 2023, 6:44pm

Hi Renae, thank you again for your help! My problem is completely resolved.

I think I have figured out what was happening to me, I’m sharing in case it is useful.

I don’t get attached characteristics unless the variables I attach to are listed first in my query (RACED, etc. must come first in the extract’s variable list). The problem was a little thorny because I was also explicitly requesting several default / mandatory extract variables (SERIAL, etc.).

Here is what worked:

EXT=UsaExtract(['us2010a','us2020a'], ['RACED','BPLD','EDUCD','SEX','AGE','MARST','BIRTHYR','HISPAN','HISPAND','FTOTINC','INCWAGE','OCCSCORE'], data_format="stata", description="API retrieval", )

I so appreciate your help!

renae · October 24, 2023, 7:37pm

Great! Glad you got things working for you.

Yes, the variables to which you want to attach characteristics must be in the list of variables you request in your original extract definition. For the record, there should be no problem if you explicitly include a pre-selected variable (such as SERIAL) in your list of requested variables, but you don’t have to list it for it to be included in your extract file.
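
That requirement can be checked before submitting anything. The following is a minimal sketch of such a pre-flight check using plain lists; missing_attach_targets is a hypothetical helper, not an ipumspy function, and the case-insensitive comparison is an assumption made for convenience.

```python
def missing_attach_targets(requested, attach_targets):
    """Return the attach targets that are absent from the requested variables.

    Comparison ignores case (an assumption; IPUMS variable names in this
    thread are written in upper case).
    """
    requested_upper = {name.upper() for name in requested}
    return [name for name in attach_targets if name.upper() not in requested_upper]


requested = ["RACED", "BPLD", "SEX", "AGE"]
attach_targets = ["RACED", "BPLD", "EDUCD"]

# EDUCD is not in the request, so it is reported as missing.
print(missing_attach_targets(requested, attach_targets))
```

An empty list means every variable passed to attach_characteristics is also in the extract definition.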

jcolond · October 24, 2023, 8:21pm

Thank you again. I offer this in the spirit of attention to detail and don’t mean to quibble:

The variables with attached characteristics had to be ordered first in my original extract definition; otherwise the extract would run without attachments.

In any case, nothing more satisfying than a mystery resolved.
