Hi - I’m working on an analysis of person counts in states by citizenship, industry and possibly age group. I’ve found that for some states I end up with a very small number of samples for a given combination of variables. Therefore I want to create some confidence interval so we know when these estimates are more or less reliable.
I do not have access to SAS, or STATA and am primarily a python coder. I have reviewed this page https://usa.ipums.org/usa/repwt.shtml. And have have adapted the following code from the word doc at the end. “Estimating ASEC Variances with Replicate Weights”
Really just looking for a sanity check that the approach is valid/correct.
import pandas as pd
import numpy as np
# Load the merged data
data = rep.copy()
# Limit to wyoming, naturalized citizens in the labor force
filtered_data = data[((data.STATEFIP == 56) & (data.CITIZEN == 4) & (data.LABFORCE == 2))]
# Calculate the sum of the specified columns
columns_to_sum = ['ASECWT'] + [f'REPWTP{i}' for i in range(1, 161)]
sum_data = filtered_data[columns_to_sum].sum().to_frame().T
# Rename the columns
sum_data.columns = ['est'] + [f'rw{i}' for i in range(1, 161)]
# Load the data from the previous step
data2 = sum_data
# Initialize variables
sdiffsq = 0
repwts = ['est'] + [f'rw{i}' for i in range(1, 161)]
# Calculate sdiffsq
for i in range(1, 161):
sdiffsq += (data2[repwts[i]].iloc[0] - data2[repwts[0]].iloc[0])**2
# Calculate variance, standard error, and coefficient of variation
var = (4/160) * sdiffsq
se = var**0.5
cv = se / data2['est'].iloc[0]
# Calculate the margin of error and confidence interval
Z = 1.96 # Z-score for 95% confidence
margin_of_error = Z * se
confidence_interval = (data2['est'].iloc[0] - margin_of_error, data2['est'].iloc[0] + margin_of_error)
# Create the final DataFrame
data3 = pd.DataFrame({
'est': [data2['est'].iloc[0]],
'var': [var],
'se': [se],
'cv': [cv],
'lower_ci': [confidence_interval[0]],
'upper_ci': [confidence_interval[1]]
})
This approach seems to return reasonable looking results:
est: 2475.750
var: 903891.969
se: 950.732
cv: 0.384
lower_ci: 612.315
upper_ci:
But just would love the sanity check here that this is correct.
Other questions
- there are actually 161 rep weights in there (REPWTP → REPWTP160) - what is the first one REPWTP which seems to not be used.
- Also why is it 4/160?
Any help would be greatly appreciated. Many thanks