1 Loading Obesity Paradox Data from MIMIC-IV

We begin by loading the data for investigating the obesity paradox. The key data that we need includes body mass index (BMI, treatment \(X\)), in-hospital mortality (outcome \(Y\)), and the confounding variables, including age and comorbidities in the Elixhauser score.

src <- "miiv" # MIMIC-IV data
data <- load_concepts(c("death", "bmi_all", "elix_components", "age"), src, 
                      id_type = "patient", verbose = FALSE)

# all those with missing death flag have survived the hospital stay
data <- data[is.na(death), death := FALSE]
data[, c(meta_vars(data), "elix_components") := NULL]

# focus on patients with BMI recorded, and ICD code available
data <- data[complete.cases(data)]

head(data)

We remove Obesity/Weight Loss comorbidities, as these are directly related to the treatment itself. We further combine diabetes with and without complications. We focus only on patients who have a BMI in the range \([18.5, 35)\, kg/m^2\), meaning that we exclude underweight (\(<18.5\)) and class II and III obese \((\geq 35)\) patients.

# remove Obesity and Weight Loss as they are closely related to BMI
data[, c("Obesity", "WeightLoss") := NULL]

# combine diabetes and diabetes with complications
data[, DM := pmax(DM, DMcx)]
data[, DMcx := NULL]

# remove underweight and class II/III obese patients
data <- data[bmi_all >= 18.5 & bmi_all < 35]

2 Variable Selection

For performing the analysis, to reduce dimensionality, we focus on comorbidities that are predictive for death.

# regress death onto comorbidities and age in a logistic model 
logreg <- glm(death ~ . - bmi_all + (bmi_all >= 25), data = data, 
                family = "binomial")
# get the coefficients
lr_coef <- summary(logreg)$coefficients
# order the variables by coefficient size
lr_coef <- lr_coef[order(lr_coef[, 1], decreasing = TRUE), ]

# focus on comorbidities that (significantly) increase mortality 
sel <- lr_coef[, 1] > 0 & lr_coef[, 4] < 0.05
cmb_sel <- setdiff(rownames(lr_coef)[sel], c("age"))

Purely for the purposes of getting the results faster in the classroom, we will use only the top five comorbidities, while you can run a larger experiment with more time.

# data <- data[1:1000] # subset for now!
R <- as.matrix(data[, cmb_sel[1:5], with=FALSE])
X <- as.integer(data$bmi_all >= 25)
Y <- as.integer(data$death)
W <- as.matrix(as.integer(data$age))
W <- (W - mean(W)) / sd(W) # normalize age

3 Investigating Measurement Error Patterns

Our first step is to look at the effect of different types of measurement error:

  • agnostic error, where \(\phi_{xy}\) does not depend on either \(X\) or \(Y\),
  • x-specific error, where \(\phi_{xy}\) depends on \(X\), but not \(Y\),
  • y-specific error, where \(\phi_{xy}\) depends on \(Y\), but not \(X\).

We look at different values of \(\phi_{xy}\) for each of the three cases.

plts <- spc_lst <- list()
method <- "IF"
solver <- "em"
for (pattern in c("agn", "x", "y")) {
  
  # specify the search space - agnostic, x-specific, and y-specific patterns
  spc <- search_space(pattern = pattern, method = method, 
                      fi = seq(0, 0.2, 0.05), k = k)
  
  # perform the search, using the Expectation-Maximization solver
  spc <- infer_search(dat = list(X=X, Y=Y, R=R), spc, solver = solver)
  
  # save the result
  spc_lst[[length(spc_lst) + 1]] <- spc
  
  # visualize the effect estimates
  plts[[length(plts) + 1]] <- spc_plot(spc)
}
cowplot::plot_grid(plotlist=plts, ncol = 3L)

Question: what are we observing in each of the plots. Comment on all three.

4 Age Association with Obesity

Our initial analysis does not take age into account. However, age is known to increase mortality (monotonically). Therefore, if age is associated with obesity, it may be an important confounder.

splm <- gam(bmi_all >= 25 ~ s(age), family = binomial, 
            data = data[, c("bmi_all", "age"), with=FALSE])
pse <- predict(splm, newdata = data.frame(age = 18:91), se.fit = TRUE)
expit <- function(x) exp(x) / (1 + exp(x))
gam_vals <- data.frame(
  px = expit(pse$fit), px_lwr = expit(pse$fit - 1.96 * pse$se.fit), 
  px_upr = expit(pse$fit + 1.96 * pse$se.fit), age = 18:91
)

raw_vals <- data[, .(px = mean(bmi_all >= 25), 
                     se = sqrt(mean(bmi_all >= 25) * (1 - mean(bmi_all >= 25)) / .N)), 
                 by = age]

ggplot(gam_vals, aes(x = age, y = px)) +
  geom_line(color = "orange", linewidth = 1.2) +
  geom_ribbon(aes(ymin = px_lwr, ymax = px_upr), alpha = 0.1) +
  theme_bw() + xlab("W (age in years)") + ylab("P(X = 1 | W = w)") +
  geom_point(data=raw_vals) +
  scale_y_continuous(labels = scales::percent)

Age has a U-shaped relationship with obesity.

5 Age-Based Analysis

We next re-run our analysis with age included. In particular, we make sure to include a \(W^2\) term in the logistic regression model for \(X\) – this is achieved by passing the tau_xw argument into the binsensate() function. We focus on the \(x\)-specific measurement error pattern, which previously showed a trend that may explain away the obesity paradox.

spc_dir_ifw <- search_space(pattern = "x", method = "IF", 
                            fi = seq(0, 0.1, 0.025), fi2 = 0, k = k)
  
spc_dir_ifw <- infer_search(list(X=X, Y=Y, R=R, W= W), spc_dir_ifw, solver = "em", 
                            se = TRUE, detect_viol = FALSE, mc_cores = 5,
                            tau_xw = function(w) w^2) # adds a W^2 term in X model
spc_plot(spc_dir_ifw)

Question: how do we interpret this plot?

Question: in summary, how likely are the different measurement error patterns?