Why am I interested in causal methods?
A while back, I attended a seminar where Professor Eleanor Murray talked about “causal inference for complex & time-varying exposures, or why the g-formula is the only formula you need.” Since then, I have been looking for a good project through which to learn how to apply these methods to my own research. Causal inference methods appeal to me not least because they offer a more sophisticated way of understanding our world in causal terms, not merely associative ones, which can in turn (hopefully) help us develop better interventions.
In particular, I am excited about the ability to model time-varying covariates in analyses of observational data, which is the kind I routinely use. In the clinic, these covariates could represent any number of interesting phenomena, and modeling them more accurately reflects the nature of real-world data. Failing to account for them can bias estimates of risk. ^{1}
What is the parametric g-formula?
My understanding is that the g-formula is a way to estimate standardized outcome distributions from covariate-specific estimates of the outcome distribution, provided specific assumptions of causal inference are met. The assumptions of this approach include:
- Conditional exchangeability: The counterfactual outcome risk under every exposure level is the same in the exposed and the unexposed (i.e. the data simulate an RCT, so there is no unmeasured confounding and no selection bias).
- Positivity: Every individual has a nonzero probability of receiving each value of the treatment variable.
- Consistency: The counterfactual outcome for whether an individual receives treatment or not is observable (which means that the treatment is well-specified).
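As a quick sanity check on positivity, one can tabulate how often each treatment value occurs within strata of the measured covariates. Here is a minimal sketch of my own using the package's sample data (this is an illustration, not part of the gfoRmula workflow; a real positivity assessment would also consider continuous covariates and time):

```r
library(dplyr)

# Crude empirical positivity check: both values of the treatment A
# should actually occur within each level of the binary covariate L1.
positivity_check <- gfoRmula::basicdata_nocomp %>%
  count(L1, A) %>%
  group_by(L1) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

positivity_check
```

If any `L1`-by-`A` cell were empty (or nearly so), that stratum would give us no information about the untried treatment value, and the g-formula's extrapolation there would rest entirely on the parametric models.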
The g-formula proceeds in the following way: ^{2}
- Model conditional probabilities in the observed empirical data
- Run a Monte Carlo simulation of time-varying exposures, covariates, and outcomes
- Estimate and compare effect measures
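To build intuition before touching the real package, here is a toy sketch of those three steps for a single time point with one confounder (simulated data and coefficients of my own choosing; the gfoRmula package handles the much harder time-varying case):

```r
# Toy point-treatment g-formula (standardization), not the gfoRmula package.
set.seed(42)
n <- 5000
L <- rnorm(n)                               # confounder
A <- rbinom(n, 1, plogis(L))                # treatment depends on L
Y <- rbinom(n, 1, plogis(-1 + L - 1 * A))   # outcome depends on L and A

# Step 1: model conditional probabilities in the observed data
fit <- glm(Y ~ A + L, family = binomial)

# Step 2: simulate (predict) outcomes under "never treat" and "always treat",
# holding each person's confounder value fixed
p_never  <- predict(fit, newdata = data.frame(A = 0, L = L), type = "response")
p_always <- predict(fit, newdata = data.frame(A = 1, L = L), type = "response")

# Step 3: compare standardized risks across interventions
c(risk_never = mean(p_never),
  risk_always = mean(p_always),
  risk_ratio = mean(p_always) / mean(p_never))
```

With time-varying exposures, step 2 becomes a genuine Monte Carlo simulation: covariates and treatment are drawn forward in time from the fitted models rather than predicted in one shot.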
Example
Today I’m going to work through a fairly simple example using sample data offered by the gfoRmula package in R. ^{3}
Suppose we have a dataset which contains 13,170 observations on 2,500 individuals with a maximum of 7 follow-up times. No individuals are censored in this dataset. The variables in this dataset are:
- t0: The time index
- id: A unique identifier for each individual
- L1: A time-varying covariate; binary
- L2: A time-varying covariate; continuous
- L3: A baseline covariate; continuous
- A: The treatment variable; binary
- Y: The outcome of interest; binary
First, let’s turn to examine the dataset itself.
```r
library(dplyr)
library(finalfit)

explanatory <- c("A", "L1", "L2", "L3")
dependent <- "Y"

table1 <- gfoRmula::basicdata_nocomp %>%
  mutate(A = as.factor(A)) %>%
  mutate(L1 = as.factor(L1)) %>%
  mutate(Y = as.factor(Y)) %>%
  summary_factorlist(
    dependent,
    explanatory,
    p = TRUE,
    column = TRUE,
    add_dependent_label = TRUE,
    dependent_label_prefix = "",
    total_col = TRUE
  )

table1 %>%
  head(n = 6) %>%
  knitr::kable("html")
```
| Y  |           | 0           | 1          | Total       | p      |
|:---|:----------|------------:|-----------:|------------:|-------:|
| A  | 0         | 3319 (27.9) | 967 (76.5) | 4286 (32.5) | <0.001 |
|    | 1         | 8587 (72.1) | 297 (23.5) | 8884 (67.5) |        |
| L1 | 0         | 6718 (56.4) | 988 (78.2) | 7706 (58.5) | <0.001 |
|    | 1         | 5188 (43.6) | 276 (21.8) | 5464 (41.5) |        |
| L2 | Mean (SD) | 0.1 (1.0)   | 0.3 (0.9)  | 0.2 (1.0)   | <0.001 |
| L3 | Mean (SD) | 7.6 (1.7)   | 7.5 (1.7)  | 7.6 (1.7)   | 0.768  |
Great! It looks like we’ve got 13,170 observations as expected. We also note a few statistically significant differences between the groups defined by the binary outcome. Next, let’s look at the Kaplan-Meier plot for this dataset.
```r
explanatory <- c("A")
dependent <- "Surv(t0, Y)"

gfoRmula::basicdata_nocomp %>%
  surv_plot(dependent, explanatory)
```
So it looks like fewer of the untreated group are “surviving.” Just to be safe, let’s also check the proportional hazards assumption.
```r
explanatory <- c("A", "L1", "L2", "L3")
dependent <- "Surv(t0, Y)"

gfoRmula::basicdata_nocomp %>%
  mutate(A = as.factor(A)) %>%
  mutate(L1 = as.factor(L1)) %>%
  coxphmulti(dependent, explanatory) %>%
  cox.zph() %>%
  {zph_result <<- .} %>%
  plot(var = 1)
```
That line looks pretty flat! We would not be alone in concluding that the hazards look proportional. So, let’s fit a traditional Cox proportional hazards model to estimate the difference in risk between those with the intervention and those without.
```r
explanatory <- c("A", "L1", "L2", "L3")
dependent <- "Surv(t0, Y)"

cox <- gfoRmula::basicdata_nocomp %>%
  mutate(A = as.factor(A)) %>%
  mutate(L1 = as.factor(L1)) %>%
  finalfit(dependent, explanatory)

cox %>%
  head(n = 6) %>%
  knitr::kable("html")
```
| Dependent: Surv(t0, Y) |           | all         | HR (univariable)          | HR (multivariable)        |
|:-----------------------|:----------|------------:|:--------------------------|:--------------------------|
| A                      | 0         | 4286 (32.5) | -                         | -                         |
|                        | 1         | 8884 (67.5) | 0.14 (0.13-0.16, p<0.001) | 0.15 (0.13-0.17, p<0.001) |
| L1                     | 0         | 7706 (58.5) | -                         | -                         |
|                        | 1         | 5464 (41.5) | 0.41 (0.36-0.47, p<0.001) | 0.39 (0.34-0.45, p<0.001) |
| L2                     | Mean (SD) | 0.2 (1.0)   | 1.06 (1.00-1.12, p=0.064) | 1.35 (1.26-1.44, p<0.001) |
| L3                     | Mean (SD) | 7.6 (1.7)   | 0.99 (0.96-1.02, p=0.659) | 0.99 (0.96-1.02, p=0.587) |
Now let’s set up and fit the parametric g-formula with the gfoRmula package.

```r
library(gfoRmula)

outcome_type <- 'survival'
id <- 'id'
time_points <- 7
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
covtypes <- c('binary', 'bounded normal', 'binary')
histories <- c(lagged, lagavg)
histvars <- list(c('A', 'L1', 'L2'), c('L1', 'L2'))
covparams <- list(
  covmodels = c(
    L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0,
    L2 ~ lag1_A + L1 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0,
    A ~ lag1_A + L1 + L2 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0
  )
)
ymodel <- Y ~ A + L1 + L2 + L3 + lag1_A + lag1_L1 + lag1_L2 + t0
intvars <- list('A', 'A')
interventions <- list(
  list(c(static, rep(0, time_points))),
  list(c(static, rep(1, time_points)))
)
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000

gform_basic <- gformula(
  obs_data = basicdata_nocomp,
  outcome_type = outcome_type,
  id = id,
  time_points = time_points,
  time_name = time_name,
  covnames = covnames,
  outcome_name = outcome_name,
  covtypes = covtypes,
  covparams = covparams,
  ymodel = ymodel,
  intvars = intvars,
  interventions = interventions,
  int_descript = int_descript,
  histories = histories,
  histvars = histvars,
  basecovs = c("L3"),
  nsimul = nsimul,
  seed = 1234
)

gform_basic
```
```
## PREDICTED RISK UNDER MULTIPLE INTERVENTIONS
##
## Intervention   Description
## 0              Natural course
## 1              Never treat
## 2              Always treat
##
## Sample size = 2500, Monte Carlo sample size = 10000
## Number of bootstrap samples = 0
## Reference intervention = natural course (0)
##
## k  Interv.  NP Risk  g-form risk  Risk ratio  Risk difference
## 6  0        0.5056   0.5048280    1.0000000     0.0000000
## 6  1        NA       0.7314631    1.4489355     0.2266352
## 6  2        NA       0.2339747    0.4634741    -0.2708533
```
In our results, we see the simulated risk estimates from the g-formula for the natural course (where some people receive treatment A and others don’t) as well as for the counterfactual scenarios where no one is ever treated with A and where everyone is treated with A at every time point. Notice also that we must specify models for all time-varying covariates, not just for the outcome. The advantages are that we can adjust for time-varying confounding affected by prior exposure (and we can also accommodate dynamic interventions).
In this example, the estimated risks for the natural course, never-treated, and always-treated populations are 0.505, 0.731, and 0.234, respectively (in the g-form risk column). The NP Risk value is the nonparametric risk estimate; because it is close to the g-formula risk for the natural course, we have some (but not complete) assurance that our models are not badly misspecified.
When we compare the risk in the always-treated group with that in the never-treated group, as Keil et al. (2014) do to allow comparison with hazard ratios, the g-formula approach yields a risk ratio of 0.32, whereas the Cox model earlier gave a hazard ratio of 0.15 between the treated and untreated groups. Perhaps this difference is due to bias which is unaccounted for in the Cox proportional hazards model but is addressed in the g-formula approach?
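For the record, the 0.32 quoted here is simply the ratio of the two simulated risks reported in the output above:

```r
# Ratio of the simulated risks under "always treat" vs. "never treat"
# (values taken from the gformula output above).
risk_always <- 0.2339747
risk_never  <- 0.7314631
risk_always / risk_never  # about 0.32
```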
Keil AP, Edwards JK, Richardson DR, Naimi AI, Cole SR. The parametric g-formula for time-to-event data: towards intuition with a worked example. Epidemiology (Cambridge, Mass.). 2014 Nov;25(6):889.↩︎
Johnson KW, Glicksberg BS, Hodos RA, Shameer K, Dudley JT. Causal inference on electronic health records to assess blood pressure treatment targets: an application of the parametric g formula. In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2018: Proceedings of the Pacific Symposium 2018 (pp. 180-191) and Taubman SL, Robins JM, Mittleman MA, Hernán MA. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International Journal of Epidemiology. 2009 Dec 1;38(6):1599-611.↩︎
McGrath S, Lin V, Zhang Z, Petito LC, Logan RW, Hernán MA, Young JG. gfoRmula: An R package for estimating the effects of sustained treatment strategies via the parametric g-formula. Patterns. 2020 Jun 12;1(3):100008.↩︎