ICU Early Deterioration Warning System

The Problem

In the ICU, deterioration rarely announces itself. A patient who looks stable at 2 AM may be intubated by 6 AM. Clinical staff are stretched thin, monitoring dozens of patients simultaneously. Standard severity scores like SOFA are computed once a day â€” far too coarse to catch a patient sliding toward crisis.

The goal: produce a continuous, per-hour risk score for every ICU patient that answers the question:

In the next 12 hours, will this patient die, require vasopressors, be intubated, or need dialysis?

A model that flags deterioration hours in advance gives clinical teams a window to intervene.

Dataset â€” MIMIC-IV


Source	MIMIC-IV (Beth Israel Deaconess Medical Center, 2008â€“2019)
ICU stays	50,920 patients
Hourly records	~110 million rows
Uncompressed size	27.9 GB
Median stay length	44 hours
90th percentile stay	171 hours
Mortality rate	17.7%
Composite deterioration rate	52.5%

Composite Outcome (what we're predicting)

The label fires if any of the following occur within the next 12 hours:

Event	Prevalence
Death	17.7%
New vasopressor initiation	32.0%
New invasive ventilation	35.5%
New CRRT (dialysis)	2.2%

Each of these represents an irreversible escalation in care. Predicting them individually would be too sparse; combining them creates a clinically coherent composite with meaningful prevalence.

Pipeline

BigQuery (110M rows)
        â”‚
        â–¼
  SQL Window Functions
  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
  â”‚  12-hour rolling stats per patient-hour     â”‚
  â”‚  (mean, std, min, max, last, missing-frac)  â”‚
  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
        â”‚
        â–¼
  Download feature matrix (~27 MB parquet)
        â”‚
        â–¼
  Python: derived interaction features
  (SpO2/FiO2, pulse pressure, shock index,
   pHÃ—pCO2, GCS delta)
        â”‚
        â–¼
  Patient-level stratified split
  (70% train / 15% val / 15% test â€” no row leakage)
        â”‚
        â–¼
  XGBoost training with early stopping
  (AUPRC on validation set)
        â”‚
        â”œâ”€â”€â–º SHAP TreeExplainer
        â”‚    (global importance + per-prediction waterfall)
        â”‚
        â”œâ”€â”€â–º Threshold optimization
        â”‚    (F-beta Î²=1.5, recall-prioritized)
        â”‚
        â””â”€â”€â–º Serialized artifacts
             (model JSON + feature list + encoders)

Feature Engineering

Features are computed as rolling statistics over a 12-hour lookback window â€” capturing not just where a patient is, but where they've been trending.

Group	Variables	Statistics
Vitals (10)	Heart rate, SpO2, SBP/DBP/MAP, temperature, glucose	last, mean, std, min, max, missing%
Blood gases (5)	pH, pCO2, pO2, bicarbonate, SaO2	last, mean, std, min, max, missing%
Ventilator (13)	RR, PEEP, FiO2, tidal volume, peak pressure, I:E ratio	last, mean, std, min, max, missing%
Interventions (2)	Vasopressor on/off, CRRT on/off	last, std
Clinical scores (7)	GCS (total + components), SOFA, Sepsis-3	last, mean, std
Derived	SpO2/FiO2, pulse pressure, shock index, pHÃ—pCO2, Î”GCS	â€”
Static	Age, sex, race, ICU unit type, Elixhauser comorbidity score, height	â€”

Total: 212 features (v2, after SHAP-guided pruning from 245 in v1)

Why missingness is a feature

For labs like blood gases, whether a value was measured is itself clinical signal â€” these tests are ordered when a clinician already suspects something is wrong. A blood gas drawn at 3 AM is not random.

Results

Test Set Performance

Metric	Value
AUROC	0.995
AUPRC	0.387
Optimal threshold (F-Î², Î²=1.5)	0.743

AUROC near 1.0 shows the model separates deteriorating from stable patients with high confidence. AUPRC of 0.387 is the harder metric â€” it accounts for class imbalance and is the standard for clinical early-warning models. A random classifier scores AUPRC equal to prevalence (~0.05 at the hourly level), so 0.387 represents a large lift.

ROC Â· Precision-Recall Â· Calibration

Test set evaluation â€” ROC, PR, and calibration curves

The ROC curve (left) hugs the top-left corner â€” near-perfect discrimination. The precision-recall curve (center) shows how the model maintains high precision at low recall thresholds, then gracefully trades precision for recall as the threshold drops. The calibration curve (right) shows the model is underconfident at high probabilities â€” predictions near 0.9 correspond to an actual positive rate of ~0.5 â€” worth noting for clinical deployment.

Threshold Selection

Precision / Recall / F-beta vs. decision threshold

In ICU alerting, missing a deteriorating patient (false negative) is far worse than an unnecessary alarm (false positive). The threshold was optimized using F-beta with Î²=1.5, which weights recall 1.5Ã— over precision. The optimal threshold lands at 0.743 â€” conservative enough to catch most true positives without drowning nurses in false alerts.

Model Explainability â€” SHAP

XGBoost was chosen in part because SHAP TreeExplainer computes exact (not sampled) Shapley values, making every prediction auditable. This matters in clinical settings: a model that says "high risk" but can't explain why is difficult to act on.

Global Feature Importance

SHAP mean absolute importance â€” top features

The top drivers of deterioration risk:

Feature	Mean abs SHAP	Clinical meaning
`hr_since_admission`	1.23	Longer ICU stays â†’ sicker, more complex patients
`spo2_mean`	0.60	Average oxygen saturation over 12h â€” oxygenation trend
`rr_max`	0.58	Peak respiratory rate â€” respiratory distress signal
`heart_rate_min`	0.36	Minimum HR â€” bradycardia can signal shock or sedation
`spo2_max`	0.24	Ceiling SpO2 â€” helps distinguish transient vs. sustained desaturation
`elixhauser_vanwalraven`	0.18	Comorbidity burden â€” baseline frailty

Direction & Distribution â€” Beeswarm

SHAP beeswarm â€” feature direction and spread

Each dot is one patient-hour. Red = high feature value, blue = low. The x-axis shows how much that feature pushed the prediction toward risk (right) or safety (left).

High hr_since_admission (red, far right) â€” very long stays strongly increase risk
High spo2_mean (red, left) â€” good oxygenation reduces risk
High rr_max (red, right) â€” fast breathing increases risk
High heart_rate_min (red, left) â€” strong baseline HR actually slightly reduces risk (bradycardia is the red flag)

Single-Patient Explanation â€” Waterfall

SHAP waterfall â€” highest-risk patient-hour in test set

This is the model's reasoning for the single highest-risk prediction in the test set (f(x) = 2.821, well above the 0.743 threshold). The baseline expected value is âˆ’2.688.

hr_since_admission = 2156 pushes risk up by +6.34 â€” this patient has been in the ICU for nearly 90 days
elixhauser_vanwalraven = 23 adds +0.48 â€” high comorbidity burden
SpO2 and RR features (NaN = not measured during this window) slightly reduce the score â€” absence of recent measurements here dampens the signal
Despite those offsets, the ICU tenure alone is enough to produce a very high-risk prediction

This level of transparency is what makes the model usable in clinical practice: a nurse can see why an alert fired, not just that it did.

Model Iteration: v1 â†’ v2

	Version 1	Version 2
Training date	May 3, 2026	May 4, 2026
Features	245	212 (âˆ’14%)
Validation AUPRC	0.3950	0.3924
What changed	Full feature set	Removed low-SHAP vital-sign missingness flags
Model size	786 KB	826 KB
Key insight	Baseline	Missing-value indicators for heart rate, temperature, blood pressure carry near-zero SHAP â€” clinical intuition confirms these vitals are almost always recorded

The v1â†’v2 transition was driven by SHAP analysis, not guesswork. Removing 33 features caused a 0.3% AUPRC drop â€” within noise â€” while making the model more parsimonious, faster at inference, and easier to explain to clinicians.

Technical Stack

Layer	Tool
Data warehouse	Google BigQuery
Feature engineering	BigQuery SQL window functions + Python pandas
Model	XGBoost (GPU-accelerated, CUDA)
Explainability	SHAP TreeExplainer (exact values)
Evaluation	scikit-learn (AUROC, AUPRC, Brier score)
Serialization	XGBoost JSON + joblib
Visualization	matplotlib, seaborn

Limitations & Honest Caveats

Single-center data. MIMIC-IV is from one Boston hospital. Generalization to other health systems requires external validation.
Calibration gap. The model is underconfident at high probabilities â€” raw scores should not be interpreted as literal probabilities without recalibration.
No prospective validation. Performance on retrospective held-out data is not the same as real-world deployment. A prospective trial would be the next step.
Composite label obscures subtype. A patient about to be intubated and one about to start vasopressors look different clinically but get the same label. Outcome-specific sub-models may improve actionability.