← back to projects

ICU Deterioration Warning System

complete

Built an ICU early deterioration warning system on 110M+ hourly records from 50,920 patients (MIMIC-IV), using XGBoost with 12-hour rolling window features to predict vasopressor initiation, intubation, CRRT, or death within 12 hours

MIMICscikitpython

ICU Early Deterioration Warning System

The Problem

In the ICU, deterioration rarely announces itself. A patient who looks stable at 2 AM may be intubated by 6 AM. Clinical staff are stretched thin, monitoring dozens of patients simultaneously. Standard severity scores like SOFA are computed once a day — far too coarse to catch a patient sliding toward crisis.

The goal: produce a continuous, per-hour risk score for every ICU patient that answers the question:

In the next 12 hours, will this patient die, require vasopressors, be intubated, or need dialysis?

A model that flags deterioration hours in advance gives clinical teams a window to intervene.


Dataset — MIMIC-IV

SourceMIMIC-IV (Beth Israel Deaconess Medical Center, 2008–2019)
ICU stays50,920 patients
Hourly records~110 million rows
Uncompressed size27.9 GB
Median stay length44 hours
90th percentile stay171 hours
Mortality rate17.7%
Composite deterioration rate52.5%

Composite Outcome (what we're predicting)

The label fires if any of the following occur within the next 12 hours:

EventPrevalence
Death17.7%
New vasopressor initiation32.0%
New invasive ventilation35.5%
New CRRT (dialysis)2.2%

Each of these represents an irreversible escalation in care. Predicting them individually would be too sparse; combining them creates a clinically coherent composite with meaningful prevalence.


Pipeline

BigQuery (110M rows)
        │
        ▼
  SQL Window Functions
  ┌─────────────────────────────────────────────┐
  │  12-hour rolling stats per patient-hour     │
  │  (mean, std, min, max, last, missing-frac)  │
  └─────────────────────────────────────────────┘
        │
        ▼
  Download feature matrix (~27 MB parquet)
        │
        ▼
  Python: derived interaction features
  (SpO2/FiO2, pulse pressure, shock index,
   pH×pCO2, GCS delta)
        │
        ▼
  Patient-level stratified split
  (70% train / 15% val / 15% test — no row leakage)
        │
        ▼
  XGBoost training with early stopping
  (AUPRC on validation set)
        │
        ├──► SHAP TreeExplainer
        │    (global importance + per-prediction waterfall)
        │
        ├──► Threshold optimization
        │    (F-beta β=1.5, recall-prioritized)
        │
        └──► Serialized artifacts
             (model JSON + feature list + encoders)

Feature Engineering

Features are computed as rolling statistics over a 12-hour lookback window — capturing not just where a patient is, but where they've been trending.

GroupVariablesStatistics
Vitals (10)Heart rate, SpO2, SBP/DBP/MAP, temperature, glucoselast, mean, std, min, max, missing%
Blood gases (5)pH, pCO2, pO2, bicarbonate, SaO2last, mean, std, min, max, missing%
Ventilator (13)RR, PEEP, FiO2, tidal volume, peak pressure, I:E ratiolast, mean, std, min, max, missing%
Interventions (2)Vasopressor on/off, CRRT on/offlast, std
Clinical scores (7)GCS (total + components), SOFA, Sepsis-3last, mean, std
DerivedSpO2/FiO2, pulse pressure, shock index, pH×pCO2, ΔGCS
StaticAge, sex, race, ICU unit type, Elixhauser comorbidity score, height

Total: 212 features (v2, after SHAP-guided pruning from 245 in v1)

Why missingness is a feature

For labs like blood gases, whether a value was measured is itself clinical signal — these tests are ordered when a clinician already suspects something is wrong. A blood gas drawn at 3 AM is not random.


Results

Test Set Performance

MetricValue
AUROC0.995
AUPRC0.387
Optimal threshold (F-β, β=1.5)0.743

AUROC near 1.0 shows the model separates deteriorating from stable patients with high confidence. AUPRC of 0.387 is the harder metric — it accounts for class imbalance and is the standard for clinical early-warning models. A random classifier scores AUPRC equal to prevalence (~0.05 at the hourly level), so 0.387 represents a large lift.

ROC · Precision-Recall · Calibration

Test set evaluation — ROC, PR, and calibration curves
Test set evaluation — ROC, PR, and calibration curves

The ROC curve (left) hugs the top-left corner — near-perfect discrimination. The precision-recall curve (center) shows how the model maintains high precision at low recall thresholds, then gracefully trades precision for recall as the threshold drops. The calibration curve (right) shows the model is underconfident at high probabilities — predictions near 0.9 correspond to an actual positive rate of ~0.5 — worth noting for clinical deployment.

Threshold Selection

Precision / Recall / F-beta vs. decision threshold
Precision / Recall / F-beta vs. decision threshold

In ICU alerting, missing a deteriorating patient (false negative) is far worse than an unnecessary alarm (false positive). The threshold was optimized using F-beta with β=1.5, which weights recall 1.5× over precision. The optimal threshold lands at 0.743 — conservative enough to catch most true positives without drowning nurses in false alerts.


Model Explainability — SHAP

XGBoost was chosen in part because SHAP TreeExplainer computes exact (not sampled) Shapley values, making every prediction auditable. This matters in clinical settings: a model that says "high risk" but can't explain why is difficult to act on.

Global Feature Importance

SHAP mean absolute importance — top features
SHAP mean absolute importance — top features

The top drivers of deterioration risk:

FeatureMean abs SHAPClinical meaning
hr_since_admission1.23Longer ICU stays → sicker, more complex patients
spo2_mean0.60Average oxygen saturation over 12h — oxygenation trend
rr_max0.58Peak respiratory rate — respiratory distress signal
heart_rate_min0.36Minimum HR — bradycardia can signal shock or sedation
spo2_max0.24Ceiling SpO2 — helps distinguish transient vs. sustained desaturation
elixhauser_vanwalraven0.18Comorbidity burden — baseline frailty

Direction & Distribution — Beeswarm

SHAP beeswarm — feature direction and spread
SHAP beeswarm — feature direction and spread

Each dot is one patient-hour. Red = high feature value, blue = low. The x-axis shows how much that feature pushed the prediction toward risk (right) or safety (left).

  • High hr_since_admission (red, far right) — very long stays strongly increase risk
  • High spo2_mean (red, left) — good oxygenation reduces risk
  • High rr_max (red, right) — fast breathing increases risk
  • High heart_rate_min (red, left) — strong baseline HR actually slightly reduces risk (bradycardia is the red flag)

Single-Patient Explanation — Waterfall

SHAP waterfall — highest-risk patient-hour in test set
SHAP waterfall — highest-risk patient-hour in test set

This is the model's reasoning for the single highest-risk prediction in the test set (f(x) = 2.821, well above the 0.743 threshold). The baseline expected value is −2.688.

  • hr_since_admission = 2156 pushes risk up by +6.34 — this patient has been in the ICU for nearly 90 days
  • elixhauser_vanwalraven = 23 adds +0.48 — high comorbidity burden
  • SpO2 and RR features (NaN = not measured during this window) slightly reduce the score — absence of recent measurements here dampens the signal
  • Despite those offsets, the ICU tenure alone is enough to produce a very high-risk prediction

This level of transparency is what makes the model usable in clinical practice: a nurse can see why an alert fired, not just that it did.


Model Iteration: v1 → v2

Version 1Version 2
Training dateMay 3, 2026May 4, 2026
Features245212 (−14%)
Validation AUPRC0.39500.3924
What changedFull feature setRemoved low-SHAP vital-sign missingness flags
Model size786 KB826 KB
Key insightBaselineMissing-value indicators for heart rate, temperature, blood pressure carry near-zero SHAP — clinical intuition confirms these vitals are almost always recorded

The v1→v2 transition was driven by SHAP analysis, not guesswork. Removing 33 features caused a 0.3% AUPRC drop — within noise — while making the model more parsimonious, faster at inference, and easier to explain to clinicians.


Technical Stack

LayerTool
Data warehouseGoogle BigQuery
Feature engineeringBigQuery SQL window functions + Python pandas
ModelXGBoost (GPU-accelerated, CUDA)
ExplainabilitySHAP TreeExplainer (exact values)
Evaluationscikit-learn (AUROC, AUPRC, Brier score)
SerializationXGBoost JSON + joblib
Visualizationmatplotlib, seaborn

Limitations & Honest Caveats

  • Single-center data. MIMIC-IV is from one Boston hospital. Generalization to other health systems requires external validation.
  • Calibration gap. The model is underconfident at high probabilities — raw scores should not be interpreted as literal probabilities without recalibration.
  • No prospective validation. Performance on retrospective held-out data is not the same as real-world deployment. A prospective trial would be the next step.
  • Composite label obscures subtype. A patient about to be intubated and one about to start vasopressors look different clinically but get the same label. Outcome-specific sub-models may improve actionability.