# Import packages
library(tidyverse)
library(plm)
library(lmtest)
library(estimatr)
library(Hmisc)
library(RItools)
library(MatchIt)
library(knitr)
library(kableExtra)
April 10, 2024
This project was completed for my Policy Evaluation class, taken as part of my Master’s program at UC Santa Barbara. Provided with data and questions, I carried out this analysis using appropriate causal inference modeling techniques.
What was the average treatment effect (ATE) of the 1998 Progresa cash-transfer program on the value of animals owned by a household?
Compared pre-treatment characteristics in the treatment and control groups of the 1998 Progresa cash-transfer program. Estimated the Average Treatment Effect (ATE) of the program on a household’s value of owned animals with the First-Difference, Fixed-Effects, and Difference-in-Differences estimators.
Our data comes from a 2012 research paper published in the American Economic Journal looking at the Progresa cash-transfer program, which was implemented in rural Mexican villages in 1998. Eligible households that were randomly selected to be part of the program were provided bi-monthly cash-transfers of up to 550 pesos per month. These cash-transfers were conditional on children attending school and on family members obtaining preventive medical care and attending health-related education talks. In total, over 17,000 households were part of the Progresa program.
The outcome and treatment variables are:
vani = value of animals owned by household (in 1997 USD)
treatment = dummy variable indicating whether an individual was part of the cash-transfer program (equal to 1 if the individual was part of the program)
There are 55 control variables, including:
dirtfloor97 = dummy variable indicating whether a household had a dirt floor in 1997
electricity97 = dummy variable indicating whether a household had electricity in 1997
homeown97 = dummy variable indicating whether a household owned a house in 1997
female_hh = dummy variable indicating whether a household has a female head of household
age_hh = head of household age
educ_hh = head of household years of education
## Load the datasets
progresa_pre_1997 <- read_csv(here::here("data", "2024-3-6-post-data", "progresa_pre_1997.csv"))
progresa_post_1999 <- read_csv(here::here("data", "2024-3-6-post-data", "progresa_post_1999.csv"))
## Append post to pre dataset
progresa <- rbind(progresa_pre_1997, progresa_post_1999)
# Remove all families who were treated/controls in the program, but did not get measured in the second year
progresa <- progresa %>%
  group_by(hhid) %>%
  filter(n() == 2) %>%
  ungroup()
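The filter above keeps only households that appear in both survey waves. A minimal sketch of the same idea in base R, on a made-up data frame (the `toy` data and its values are illustrative assumptions, not the Progresa data):

```r
# Toy panel: household 3 is missing its 1999 observation (illustrative values only)
toy <- data.frame(
  hhid = c(1, 1, 2, 2, 3),
  year = c(1997, 1999, 1997, 1999, 1997),
  vani = c(100, 80, 50, 70, 30)
)

# Count the observations per household and keep households with exactly two rows
obs_per_hh <- ave(toy$year, toy$hhid, FUN = length)
balanced <- toy[obs_per_hh == 2, ]

# Cross-tabulate household by year to confirm one row per household per wave
table(balanced$hhid, balanced$year)
```

Counting rows alone would also keep a household recorded twice in the same year, so the cross-tabulation is a useful extra check that each remaining household has one observation in each wave.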
# Subset data for treatment and control groups
treatment_group <- progresa[progresa$treatment == 1, ]
control_group <- progresa[progresa$treatment == 0, ]
# Compare proportion of units that had dirt floor in 1997 (binary variable) between treatment and control group
prop.test(x = c(sum(treatment_group$dirtfloor97, na.rm = TRUE), sum(control_group$dirtfloor97, na.rm = TRUE)),
n <- c(length(treatment_group$dirtfloor97), length(control_group$dirtfloor97)))
2-sample test for equality of proportions with continuity correction
data: c(sum(treatment_group$dirtfloor97, na.rm = TRUE), sum(control_group$dirtfloor97, na.rm = TRUE)) out of n <- c(length(treatment_group$dirtfloor97), length(control_group$dirtfloor97))
X-squared = 99.041, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.04486458 0.06704500
sample estimates:
prop 1 prop 2
0.6394000 0.5834452
# Compare proportion of units that had electricity in 1997 (binary variable) between treatment and control group
prop.test(x = c(sum(treatment_group$electricity97, na.rm = TRUE), sum(control_group$electricity97, na.rm = TRUE)),
n <- c(length(treatment_group$electricity97), length(control_group$electricity97)))
2-sample test for equality of proportions with continuity correction
data: c(sum(treatment_group$electricity97, na.rm = TRUE), sum(control_group$electricity97, na.rm = TRUE)) out of n <- c(length(treatment_group$electricity97), length(control_group$electricity97))
X-squared = 147.97, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
-0.08027241 -0.05800589
sample estimates:
prop 1 prop 2
0.5583031 0.6274422
2-sample test for equality of proportions with continuity correction
data: c(sum(treatment_group$homeown97, na.rm = TRUE), sum(control_group$homeown97, na.rm = TRUE)) out of n <- c(length(treatment_group$homeown97), length(control_group$homeown97))
X-squared = 51.805, df = 1, p-value = 6.128e-13
alternative hypothesis: two.sided
95 percent confidence interval:
0.02251224 0.03967849
sample estimates:
prop 1 prop 2
0.8460096 0.8149142
2-sample test for equality of proportions with continuity correction
data: c(sum(treatment_group$female_hh, na.rm = TRUE), sum(control_group$female_hh, na.rm = TRUE)) out of n <- c(length(treatment_group$female_hh), length(control_group$female_hh))
X-squared = 9.0087, df = 1, p-value = 0.002687
alternative hypothesis: two.sided
95 percent confidence interval:
-0.016058579 -0.003296729
sample estimates:
prop 1 prop 2
0.07980780 0.08948546
Welch Two Sample t-test
data: age_hh by treatment
t = 12.657, df = 24438, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
2.008817 2.744967
sample estimates:
mean in group 0 mean in group 1
46.70221 44.32532
Welch Two Sample t-test
data: educ_hh by treatment
t = -1.2186, df = 25414, p-value = 0.223
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.10171053 0.02372436
sample estimates:
mean in group 0 mean in group 1
2.658340 2.697333
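The code behind the home-ownership, female-head, age, and education comparisons is not shown above. Judging from the printed output, they look like `prop.test()` calls of the same form as the first two and `t.test()` calls using the formula interface. A minimal sketch under that assumption, run on simulated data (the `sim` data frame and all of its values are made up for illustration):

```r
set.seed(1)

# Simulated stand-in for the household data (assumed values, not the Progresa data)
n <- 2000
sim <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  homeown97 = rbinom(n, 1, 0.8),
  age_hh    = round(rnorm(n, mean = 45, sd = 15))
)
sim$homeown97[sample(n, 50)] <- NA  # a few missing values, as is common in survey data

treated <- sim[sim$treatment == 1, ]
control <- sim[sim$treatment == 0, ]

# Binary covariate: two-sample test for equality of proportions.
# Counting only non-missing rows keeps the numerator and denominator consistent.
prop_out <- prop.test(
  x = c(sum(treated$homeown97, na.rm = TRUE), sum(control$homeown97, na.rm = TRUE)),
  n = c(sum(!is.na(treated$homeown97)), sum(!is.na(control$homeown97)))
)

# Continuous covariate: Welch two-sample t-test via the formula interface
t_out <- t.test(age_hh ~ treatment, data = sim)

prop_out$estimate
t_out$estimate
```

Using `sum(!is.na(...))` for the denominators is a design choice in this sketch. The calls shown earlier use `length()`, which also counts rows with missing values, so the two approaches would give slightly different proportions whenever a covariate has missing entries.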
For several of the covariates, there are statistically significant differences between the pre-treatment characteristics of the treated and untreated groups. Of the six covariates tested, only head-of-household education shows no significant difference (p = 0.223). This suggests that there were systematic differences between households that were part of the cash-transfer program and those that were not. If that is the case, simply controlling for the observed covariates is an insufficient method for estimating the ATE, because systematic differences of this kind may extend to characteristics that we cannot observe or control for.
Because there seem to be systematic differences between the pre-treatment characteristics of the treated and untreated groups, we will use more advanced techniques to estimate the ATE. Since we are working with panel data, our options include the First-Difference (FD), Fixed-Effects (FE), and Difference-in-Differences (DiD) estimators.
An FD model regresses the change in the outcome between two time periods on the explanatory variables, so any household characteristic that stays constant between the periods is differenced away. The estimator is therefore effective at removing bias from omitted variables (i.e., variables that influence our outcome variable but are not included as covariates) that are fixed within a household over time. It does not remove bias from omitted variables that change between periods. For example, if the head of household had to deal with a family emergency in one of the two years, that shock would remain in the differenced outcome and would bias the estimate if it were related to treatment.
# i. Sort the panel data in the order in which you want to take differences, i.e. by household and time.
progresa_sorted <- progresa %>%
  arrange(hhid, year) %>%
  group_by(hhid) %>%
  # ii. Calculate the first difference using the lag function from the dplyr package.
  mutate(vani_fd = vani - dplyr::lag(vani)) %>%
  ungroup()
# iii. Estimate the manual first-difference regression using the newly created variable.
# The 1997 rows have vani_fd = NA (no earlier period) and are dropped by lm().
fd_manual <- lm(vani_fd ~ treatment, data = progresa_sorted)
# Extracting the coefficients table
summary_reg <- summary(fd_manual)
summary_reg$coefficients %>%
kbl(caption = "FD estimator") %>% # Generate table
kable_classic(full_width = FALSE)
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | -1156.752 | 64.4938 | -17.935859 | 0.0000000 |
treatment | 287.905 | 85.6020 | 3.363297 | 0.0007723 |
Our FD regression tells us that program participants experienced a change in the value of their animal holdings that was, on average, 287.90 dollars greater than the change experienced by non-participants from 1997 to 1999. Our standard error is 85.60 dollars, and our low p-value means we reject the null hypothesis that the difference is zero at an alpha level of 0.01.
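With a balanced two-period panel, the treatment coefficient from a first-difference regression is numerically the same as the interaction coefficient from a regression of the outcome on treatment, a post-period dummy, and their product. The sketch below checks this on simulated data. The `panel` data frame, its effect sizes, and the helper names are assumptions made up for illustration.

```r
set.seed(42)

# Simulated balanced two-period panel (assumed values, not the Progresa data)
n_hh <- 500
hh <- data.frame(
  hhid      = 1:n_hh,
  treatment = rbinom(n_hh, 1, 0.5),
  alpha     = rnorm(n_hh, sd = 200)  # unobserved, time-invariant household effect
)
panel <- rbind(
  transform(hh, year = 1997, vani = 1000 + alpha - 100 * treatment + rnorm(n_hh, sd = 50)),
  transform(hh, year = 1999, vani =  700 + alpha + 150 * treatment + rnorm(n_hh, sd = 50))
)

# First difference: 1999 value minus 1997 value for each household
wide <- merge(panel[panel$year == 1997, c("hhid", "treatment", "vani")],
              panel[panel$year == 1999, c("hhid", "vani")],
              by = "hhid", suffixes = c("_1997", "_1999"))
wide$vani_fd <- wide$vani_1999 - wide$vani_1997
fd_coef <- coef(lm(vani_fd ~ treatment, data = wide))["treatment"]

# Same quantity from a regression with a treatment-by-period interaction
panel$post <- as.numeric(panel$year == 1999)
did_coef <- coef(lm(vani ~ treatment * post, data = panel))["treatment:post"]

# The two estimates should agree up to floating-point error
all.equal(unname(fd_coef), unname(did_coef))
```

The household effect `alpha` drops out of `vani_fd` entirely, which is why differencing protects against omitted variables that are fixed within a household.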
An FE model directly controls for omitted variables that do not change over time, so the estimator is effective at removing bias that comes from time-invariant characteristics. If we think that the most important potential omitted variables (i.e., variables that influence our outcome variable but are not included as covariates) are likely to stay constant over different time periods, then using the FE estimator is the best approach for estimating ATE. For example, if the head of household being an only child was by far the most influential omitted variable, the FE estimator would likely be the best approach for estimating ATE, since being an only child as an adult is unlikely to change over time.
# Estimate the basic 'within' fixed-effects regression with state and year effects
# Note: summary() on a plm object reports conventional standard errors by default;
# cluster-robust ones have to be requested explicitly (e.g. through its vcov argument)
within_reg <- plm(vani ~ treatment, index = c("state", "year"), model = "within", effect = "twoways", data = progresa)
# Extracting the coefficients table
summary_reg <- summary(within_reg)
summary_reg$coefficients %>%
  kbl(caption = "FE estimator") %>% # Generate table
  kable_classic(full_width = FALSE)
Term | Estimate | Std. Error | t-value | Pr(>\|t\|) |
---|---|---|---|---|
treatment | -234.0142 | 56.65699 | -4.130368 | 3.63e-05 |
Our FE regression tells us that, after accounting for state and year effects, program participants had animal holdings worth on average 234.01 dollars less than those of non-participants in the same state. Because the treatment indicator does not change within a household between 1997 and 1999, this coefficient compares levels rather than changes over time. Our standard error is 56.66 dollars, and our low p-value means we reject the null hypothesis that the difference is zero at an alpha level of 0.01. Since this is the default standard error from summary(), it does not account for the fact that observations in the same state will have results that are not entirely independent of one another; a cluster-robust standard error would.
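What a within estimator with state and year effects does can be illustrated in base R. The sketch below uses simulated data (the `sim` data frame, the state effects, and the `demean()` helper are all assumptions made up for illustration) and is not a reproduction of `plm`'s internals. It shows that including state and year dummies gives the same treatment coefficient as subtracting state means from every variable.

```r
set.seed(7)

# Simulated data: households nested in six states, observed in two years (assumed values)
n <- 1200
sim <- data.frame(
  state     = sample(paste0("S", 1:6), n, replace = TRUE),
  year      = sample(c(1997, 1999), n, replace = TRUE),
  treatment = rbinom(n, 1, 0.5)
)
state_effect <- c(S1 = 0, S2 = 300, S3 = -200, S4 = 500, S5 = 100, S6 = -400)
sim$post <- as.numeric(sim$year == 1999)
sim$vani <- 2000 + unname(state_effect[sim$state]) - 800 * sim$post +
  250 * sim$treatment + rnorm(n, sd = 100)

# (a) Dummy-variable regression: one dummy per state and per year
lsdv <- lm(vani ~ treatment + factor(state) + factor(year), data = sim)

# (b) Within transformation: subtract state means from every variable,
#     keeping the year dummy as an ordinary (demeaned) regressor
demean <- function(x, g) x - ave(x, g)
within_fit <- lm(demean(vani, state) ~ 0 + demean(treatment, state) + demean(post, state),
                 data = sim)

c(lsdv = unname(coef(lsdv)["treatment"]), within = unname(coef(within_fit)[1]))
```

Either way, the coefficient only uses variation in treatment among observations in the same state, which is why characteristics that are fixed at the state level cannot bias it.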
If we think that our omitted variables are likely to be a mix of characteristics that stay constant over time and factors that change across time periods in the same way for both groups (and treatment occurs only at a single point in time), we are best off using the DiD estimator. It calculates the ATE as the change in the mean outcome in the treated group from before to after the time of treatment, minus the corresponding change in the mean outcome in the untreated group.
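Written out, the estimator described above and its regression form look as follows. The notation is an assumption introduced here for illustration: $\bar{Y}_{T,t}$ and $\bar{Y}_{C,t}$ are the mean outcomes of the treated and control groups in year $t$, $D_i$ is a treatment-group dummy, and $P_t$ is a dummy for the post-treatment year.

```latex
\hat{\tau}_{\text{DiD}}
  = \left(\bar{Y}_{T,1999} - \bar{Y}_{T,1997}\right)
  - \left(\bar{Y}_{C,1999} - \bar{Y}_{C,1997}\right)

Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 P_t + \beta_3 \,(D_i \times P_t) + \varepsilon_{it},
\qquad \hat{\beta}_3 = \hat{\tau}_{\text{DiD}}
```

In the regression form, $\beta_1$ captures the pre-treatment gap between the groups, $\beta_2$ captures the change over time in the control group, and $\beta_3$ is the difference-in-differences estimate.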
# Create the dummy variables
progresa$treatment_dummy <- ifelse(progresa$treatment == 1, 1, 0)
progresa$post_treatment_time_dummy <- ifelse(progresa$year == 1999, 1, 0)
progresa$interaction_dummy <- progresa$treatment_dummy * progresa$post_treatment_time_dummy
# OLS regression
ols_reg <- lm(vani ~ treatment_dummy + post_treatment_time_dummy + interaction_dummy, data = progresa)
# Present Regressions in Table
summary_reg <- summary(ols_reg)
summary_reg$coefficients %>%
kbl(caption = "DiD estimator") %>% # Generate table
kable_classic(full_width = FALSE)
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2848.2175 | 60.61455 | 46.989010 | 0.0000000 |
treatment_dummy | -237.6927 | 80.45311 | -2.954425 | 0.0031352 |
post_treatment_time_dummy | -1156.7517 | 85.72191 | -13.494235 | 0.0000000 |
interaction_dummy | 287.9050 | 113.77788 | 2.530413 | 0.0113985 |
From our regression, we estimate the average effect of the cash transfer program on value of animal holdings to be about 287.91 USD, with a standard error of about 113.78 USD. The positive sign means that, between 1997 and 1999, the value of program participants' animal holdings changed by about 287.91 USD more than that of non-participants (in this case, it fell by less). To interpret this result as the ATE, we have to assume that the control group (units that did not participate in the program) provides a valid counterfactual for what would have happened to units in our treatment group (program participants) had they not participated in the program, which is the parallel trends assumption. Furthermore, our p-value of 0.011 tells us that, at an alpha level of 0.05, we reject the null hypothesis that there was no average effect of the cash transfer program on value of animal holdings.
The coefficient on our treatment dummy variable tells us that we estimate the mean difference in the outcome variable (value of animal holdings) between the treatment group (program participants) and the control group (non-participants) before the program started to have been 237.69 USD (where program participants had a lower average value of animal holdings than non-participants prior to the start of the program), with a standard error of 80.45 USD. Our p-value of <0.01 tells us that, at an alpha level of 0.01, we reject the null hypothesis that there was a mean difference of zero.
The coefficient on our post treatment time dummy variable tells us that we estimate the mean change in the outcome variable (value of animal holdings) between the beginning and end of the program for the control group (non-participants in program) to be -1,156.75 USD (that is, non-participants had a lower average value of animal holdings when the program ended than when it started), with a standard error of 85.72 USD. Our p-value of <0.01 tells us that, at an alpha level of 0.01, we reject the null hypothesis that there was a mean change of zero.
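The coefficients in the DiD table can be recombined into the four group means that the estimator compares. The short calculation below does this with the reported estimates; only the variable names (`b0` to `b3`, `means`, `did`) are introduced here for illustration.

```r
# Coefficients copied from the DiD regression table above
b0 <- 2848.2175   # (Intercept)
b1 <- -237.6927   # treatment_dummy
b2 <- -1156.7517  # post_treatment_time_dummy
b3 <- 287.9050    # interaction_dummy

# Implied mean value of animal holdings in each group and year
means <- c(
  control_1997 = b0,
  treated_1997 = b0 + b1,
  control_1999 = b0 + b2,
  treated_1999 = b0 + b1 + b2 + b3
)
round(means, 2)

# Change in the treated group minus change in the control group
did <- (means[["treated_1999"]] - means[["treated_1997"]]) -
  (means[["control_1999"]] - means[["control_1997"]])
did
```

The implied means fall in both groups: from about 2,848.22 to 1,691.47 USD for the control group and from about 2,610.52 to 1,741.68 USD for the treatment group. The treated households' holdings fell by about 868.85 USD versus 1,156.75 USD for the controls, and the gap between those two changes is the 287.91 USD estimate.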
Overall, the cash-transfer program appears to have had a positive effect on the value of animals owned by a household. In this cash-transfer program, all treatment occurred at the same time and we expect our omitted variables to be a mix of variables that change and stay constant across time periods, so the DiD estimator is likely the best method for estimating ATE in our situation. Using the DiD estimator, we estimated the average effect of the cash transfer program on value of animal holdings to be about 287.91 USD, with a standard error of about 113.78 USD. For this to be interpreted as the ATE, we have to assume that the control group (units that did not participate in the program) provides a valid counterfactual for what would have happened to units in our treatment group (program participants) had they not participated in the program.
@online{ghanadan2024,
author = {Ghanadan, Linus},
title = {Impact {Analysis} of a 1998 {Cash-Transfer} {Program} in
{Rural} {Mexico}},
date = {2024-04-10},
url = {https://linusghanadan.github.io/blog/2023-12-10-post/},
langid = {en}
}