Hello,
I am doing regression analysis in Stata and building descriptive statistics tables for my sample of mothers in NYC for 2005-2017. I have 3 questions:
Q1) Between the two regressions below, I cannot see any difference in either the coefficients or the standard errors:
reg d_emp td [aweight=perwt],r
reg d_emp td [pweight=perwt],r
where I define d_emp as an employment dummy and td as a treatment dummy. What should I understand from this?
Q2) Also, when I create summary statistics with perwt, I use the following code. Is that the correct approach? (summarize with [pweight=perwt] is not allowed in Stata.)
sum d_emp [aweight=perwt] (then I get the mean value of employment for my subsample)
Q3) I tried to use the replication weights to correct standard errors as suggested.
I am not sure if I should run this code every time before I run my regressions:
Thanks for your questions. The role of IPUMS User Support is to answer questions about data and documentation available through our websites. I will do my best to respond to your questions that fall within our purview; perhaps others on this forum may have more to add about your comparison of weight types, or you might try Statalist for your Stata-specific questions. I will also address your questions out of order to put the most important information first.
The Census Bureau recommends using replicate weights for analyses of the ACS data. This IPUMS USA overview of replicate weights in the ACS/PRCS includes sample code for implementing replicate weights. Your initial line of code setting these up matches what is provided on the website for using Stata’s svy suite. However, your second line of code does not apply the replicate weights, and your fourth line does not restrict the sample in the appropriate way. The replicate weights page linked above gives an example of how to do both of these using the svy: prefix with the subpop() option. In your case the correct command would be:
svy, subpop(if age<45): reg d_emp td
(The ,r option is left off because the svy: prefix does not accept it; the variance estimator is taken from svyset.)
It is generally appropriate to use PERWT as a probability weight when making estimates with standard errors. As you note, Stata does not allow for use of pweights with the summarize command; there are generally options for using the svy suite of commands with pweights to get at information provided by summarize. Summarize with aweights will give you the correct mean, but not the correct standard deviation (and no standard error). You may be interested in this summary of different weight types and this discussion of using aweights with the summarize command. I leave it to you to determine what is most appropriate based on common practice in your area of research and to compare results and make a determination about which to use.
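As a rough illustration of the point about summarize, the sketch below contrasts the aweight shortcut with the svy route. It assumes d_emp and perwt are already in memory and, to keep it short, declares only the probability weight rather than the replicate weights, so treat it as a starting point and not as the recommended final setup.

```stata
* Sketch only: assumes d_emp and perwt exist in the data in memory
summarize d_emp [aweight=perwt]   // weighted mean; the SD shown is not the design-based one

svyset [pweight=perwt]            // declare perwt as a probability weight
svy: mean d_emp                   // same weighted mean, now with a standard error
estat sd                          // standard deviation based on the svy: mean results
```

The two means should agree; only the svy version reports a standard error.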
Hello, I really appreciate your answer. I corrected many things, and since, as you said, other people may get help from this thread, I will describe what I did and what I corrected.
For the estimations, I did the following to correct my standard errors:
eststo clear
svyset [pweight=perwt], strata(strata) vce(brr) brrweight(repwtp1-repwtp80) fay(.5) mse
eststo: svy: reg d_emp td
(I also tried it without the strata option and it did not give me any different results. Could you explain why, in terms of how the ACS data are set up?)
For my summaries I did the following:
eststo clear
svyset [pweight=perwt], strata(strata) vce(brr) brrweight(repwtp1-repwtp80) fay(.5) mse
eststo est1: svy, subpop(mysubgroup): mean d_emp
estat sd //since I needed standard errors in my table as well.
I hope I corrected my ways, and this helps other people.
One last question: I know IPUMS suggests using the subpop option (as you wrote, subpop( if age<45 )). However, I download my data and clean it first. For instance, I work specifically on females in NYC with age<45, but I want to keep the other females in my data as well. What I do:
Get NYC people (i.e. drop the rest)
Get females only (i.e. drop the rest)
Then summary for my subgroup:
eststo est1: svy, subpop(if age<45): mean d_emp
Is this the correct approach, or should I keep all the ACS data (i.e., not drop anyone from the raw data) and then create a subgroup for females in NYC with age<45? That seems like a very painful thing to do.
I can share the broad recommendations from original data providers and information about the ACS that may be useful to you as you make decisions, but leave it to you to determine what is common in your field and appropriate for your analysis.
It is generally recommended that you retain all observations and specify your analytical subsample using the subpop option so you retain replicate weight values for all cases that are used to estimate standard errors. Note that if you are using the replicate weights to estimate standard errors, you should not apply the sample design variables (e.g., strata) as the replicate weights contain the relevant information about the complex sample design.
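A minimal sketch of that recommendation, assuming the extract contains perwt and repwtp1-repwtp80 and that d_emp and td are defined as earlier in this thread. The svyset line follows the replicate-weight setup already posted above, with strata() left out.

```stata
* Sketch only: keep every observation and let subpop() define the analytical sample
* strata() is omitted because the replicate weights already carry the design information
svyset [pweight=perwt], vce(brr) brrweight(repwtp1-repwtp80) fay(.5) mse
svy, subpop(if age<45): regress d_emp td
```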
That being said, ACS data are stratified at the county level and person weights are post-stratified at the county level as well; accordingly, person-level data should be representative at the county-level (if counties can be identified in the data; see this blog post about “missing” counties in the ACS and how IPUMS identifies counties based on PUMAs which is what is identified in the original Census Bureau PUMS files). In the case of NYC, the five boroughs (which map onto Bronx County, Kings County, New York County, Queens County, and Richmond County) are all identifiable for your years of interest. Accordingly, it seems reasonable that you could drop all observations not in NYC using the CITY variable (I won’t delve into details, but will note that this is specifically possible because the PUMAs for NYC perfectly align with the counties and the city boundaries for your years of interest; this is not true for many other cities). I would run the analysis both ways (e.g., with all records and with NYC only records) to ensure your results are consistent. However, I would not drop males from your analytical sub-sample (as you suggest in your second point) based on the general guidance and stratification/post-stratification of the ACS; creating a binary variable that indicates if someone is/is not a female under the age of 45 should be fairly straightforward.
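One possible way to build that binary variable is sketched below. The names are assumptions: in_nyc is a hypothetical 0/1 indicator you would construct yourself from the CITY variable, and sex == 2 is assumed to code females, so check both against the codebook for your extract.

```stata
* Sketch only: in_nyc is hypothetical; sex == 2 is assumed to mean female
generate byte mysubgroup = (sex == 2) & (age < 45) & (in_nyc == 1)

svyset [pweight=perwt], vce(brr) brrweight(repwtp1-repwtp80) fay(.5) mse
svy, subpop(mysubgroup): mean d_emp
estat sd
```

Because males and people outside the subgroup stay in the data with mysubgroup equal to 0, their replicate weights remain available for the variance calculation.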
One more small thing: I compared my results using all of NY state and using my subsample, and the standard errors in the estimations do not change much. I believe it is a good sign to get the same results for the 10K subsample vs. the 3 million NY state sample.
However when I use the summary as below
eststo clear
svyset [pweight=perwt], vce(brr) brrweight(repwtp1-repwtp80) fay(.5) mse
eststo est1: svy, subpop(mysubgroup): mean d_emp
estat sd
the means are the same in the NY state sample and the NYC sample, but the standard deviations are smaller in the NY state sample. Is this unusual in IPUMS? I thought using a wider population might increase the standard deviations, not decrease them.
In your application, all cases would be person records in the 2005-2017 ACS samples. Because the data are representative at the state level, and because the ACS is stratified and the weights are post-stratified at the county level, I think you could subset to those geographic units. However, I would run analyses using all three and compare results. I would not expect your means or point estimates to change, but you may get different standard errors: when you drop cases, the sample design information in their replicate weight values, which is used to estimate standard errors, can no longer be used.
My expertise is in IPUMS data and I am not in a position to provide further comment on your questions about expected changes to standard deviations depending on sample size. We leave it up to each individual researcher to interpret differences in approaches based on their application and norms in their field; I hope the resources I shared in my previous post are helpful as you look into replicate weights further. You may also be interested in this paper on replicate weights from the Stata Journal, which explicitly covers replicate weights with subpopulations.