Confidence interval calculation using replicate weights in Python

Hi - I’m working on an analysis of person counts in states by citizenship, industry and possibly age group. I’ve found that for some states I end up with a very small number of samples for a given combination of variables. Therefore I want to create some confidence interval so we know when these estimates are more or less reliable.

I do not have access to SAS, or STATA and am primarily a python coder. I have reviewed this page And have have adapted the following code from the word doc at the end. “Estimating ASEC Variances with Replicate Weights”

Really just looking for a sanity check that the approach is valid/correct.

import pandas as pd
import numpy as np

# Load the merged data
data = rep.copy()

# Limit to wyoming, naturalized citizens in the labor force 
filtered_data = data[((data.STATEFIP == 56) & (data.CITIZEN == 4) & (data.LABFORCE == 2))]

# Calculate the sum of the specified columns
columns_to_sum = ['ASECWT'] + [f'REPWTP{i}' for i in range(1, 161)]
sum_data = filtered_data[columns_to_sum].sum().to_frame().T

# Rename the columns
sum_data.columns = ['est'] + [f'rw{i}' for i in range(1, 161)]

# Load the data from the previous step
data2 = sum_data

# Initialize variables
sdiffsq = 0
repwts = ['est'] + [f'rw{i}' for i in range(1, 161)]

# Calculate sdiffsq
for i in range(1, 161):
    sdiffsq += (data2[repwts[i]].iloc[0] - data2[repwts[0]].iloc[0])**2

# Calculate variance, standard error, and coefficient of variation
var = (4/160) * sdiffsq
se = var**0.5
cv = se / data2['est'].iloc[0]

# Calculate the margin of error and confidence interval
Z = 1.96  # Z-score for 95% confidence
margin_of_error = Z * se
confidence_interval = (data2['est'].iloc[0] - margin_of_error, data2['est'].iloc[0] + margin_of_error)

# Create the final DataFrame
data3 = pd.DataFrame({
    'est': [data2['est'].iloc[0]],
    'var': [var],
    'se': [se],
    'cv': [cv],
    'lower_ci': [confidence_interval[0]],
    'upper_ci': [confidence_interval[1]]

This approach seems to return reasonable looking results:
est: 2475.750
var: 903891.969
se: 950.732
cv: 0.384
lower_ci: 612.315

But just would love the sanity check here that this is correct.

Other questions

  • there are actually 161 rep weights in there (REPWTP → REPWTP160) - what is the first one REPWTP which seems to not be used.
  • Also why is it 4/160?

Any help would be greatly appreciated. Many thanks

While I don’t have experience with python, I was able to run this same analysis in Stata on the 2023 ASEC sample with the svy sdr command in the sample Stata code that we provide and got the same result that you report. Specifically, the lower bound of my 95% confidence interval was 612.3 and the upper bound was 4339. This seems to confirm that your code correctly implements the replicate weights for estimating confidence intervals.

The unused REPWTP is for the Census Bureau to internally verify that the files are properly merged to the public use survey data. In IPUMS CPS, it merely indicates the presence of replicate weights and is coded 1 for every case.

I believe the 4/160 is the multiplier that is analogous to setting Fay’s coefficient = 0.5 (the 4 in the numerator is 1/(1-rho)^2). For more information, I recommend reviewing the references cited in the word document that you link to (particularly Judkins, 1990).

@Ivan_Strahof Such a huge help many thanks !