Sepsis Prediction using Clinical Data (PhysioNet Computing in Cardiology Challenge 2019)
This project implements an LSTM-based sepsis prediction model using various clinical data sources. Specifically, the model takes 10 hours of input data and predicts the probability of sepsis within the next hour. On the test set, the model has an AUC of 0.76.
The data used for this project comes from the 2019 PhysioNet Computing in Cardiology Challenge. More information about the data, along with a download link, is available here: https://physionet.org/content/challenge-2019/1.0.0/
The dataset is a collection of PSV (pipe-separated values) files, one per patient, where each row represents a single hour of data.
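As a rough illustration of the file format, a single PSV file can be read with `pandas.read_csv` by setting the separator to `|`. This is a minimal sketch, not the code from psv_to_df.ipynb: the helper name `load_psv`, the `patient_id` column, and the three sample columns are assumptions made for the example.

```python
import io

import pandas as pd


def load_psv(source, patient_id):
    """Read one pipe-separated file into a DataFrame with one row per hour.

    `patient_id` is a hypothetical identifier added so that rows from
    several files can later be concatenated and still told apart.
    """
    df = pd.read_csv(source, sep="|")
    df.insert(0, "patient_id", patient_id)
    return df


# A tiny in-memory stand-in for a real file (column subset is illustrative).
sample = io.StringIO(
    "HR|Temp|SepsisLabel\n"
    "80|36.6|0\n"
    "85|NaN|0\n"
    "90|37.9|1\n"
)
demo = load_psv(sample, "p000001")
print(demo)
```

With real data, `source` would be a file path, and the per-patient frames could be combined with `pd.concat`.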
To run the code in this project, run the following notebooks:
psv_to_df.ipynb: This notebook loads the PhysioNet PSV files and saves them into a Pandas DataFrame for ease of downstream analysis.
feature_engineering.ipynb: This notebook generates 10-hour windowed features and the corresponding labels.
feature_selection.ipynb: This notebook inspects feature correlations and removes any features that are highly correlated.
train_model.ipynb: This notebook defines the model, trains it, and evaluates its performance on the validation and test sets.
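The 10-hour windowing can be pictured as sliding a fixed-length window over one patient's hourly rows and pairing each window with the label of the hour that follows it. The sketch below is one possible way to do this with NumPy and is not taken from feature_engineering.ipynb; the function name `make_windows` and the toy data are assumptions.

```python
import numpy as np


def make_windows(features, labels, window=10):
    """Pair each run of `window` consecutive hours with the next hour's label.

    features: array of shape (hours, n_features) for a single patient
    labels:   array of shape (hours,) with that patient's hourly labels
    """
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window])  # hours start .. start+window-1
        y.append(labels[start + window])          # label of the following hour
    n_features = features.shape[1]
    # The reshape keeps a consistent 3-D shape even when no window fits.
    return np.array(X).reshape(-1, window, n_features), np.array(y)


# Toy patient record: 13 hours, 2 features, positive labels in the last 2 hours.
features = np.arange(26, dtype=float).reshape(13, 2)
labels = np.array([0] * 11 + [1] * 2)
X, y = make_windows(features, labels)
print(X.shape, y)
```

The resulting array has shape (samples, time steps, features), which is the layout an LSTM layer typically expects. Patients with 10 or fewer hours of data yield no windows under this scheme.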
The remainder of this readme will cover the different steps in the analysis pipeline.
According to the PhysioNet Challenge details, the labels for the provided data are as follows:
For sepsis patients, SepsisLabel is 1 if t ≥ t_sepsis − 6 and 0 if t < t_sepsis − 6
For non-sepsis patients, SepsisLabel is 0
In other words, SepsisLabel switches to 1 six hours before the onset of sepsis and stays at 1 from then on. For the purposes of this project, however, sepsis only needs to be predicted one hour in advance, so the labels are redefined such that:
For sepsis patients, SepsisLabel is 1 if t ≥ t_sepsis and 0 if t < t_sepsis
For non-sepsis patients, SepsisLabel is 0
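To implement this change, the first six rows where SepsisLabel equals 1 are set to 0 in each patient's data. Those six rows are the six hours immediately before sepsis onset, so after the change the label first becomes 1 at the onset hour itself.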
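The relabeling can be sketched in pandas as follows. This is an illustrative version under stated assumptions rather than the project's actual code: it assumes all patients sit in one DataFrame with rows in time order, and the function name `shift_labels` and the `patient_id` column are hypothetical.

```python
import pandas as pd


def shift_labels(df, n_hours=6, id_col="patient_id", label_col="SepsisLabel"):
    """Set the first `n_hours` positive labels of each patient to 0.

    Assumes rows are already sorted by time within each patient.
    """
    out = df.copy()
    is_pos = out[label_col].eq(1)
    # Running count of positive rows per patient: 1 for the first, 2 for the next, ...
    rank = is_pos.astype(int).groupby(out[id_col]).cumsum()
    out.loc[is_pos & (rank <= n_hours), label_col] = 0
    return out


demo = pd.DataFrame({
    "patient_id": ["A"] * 12 + ["B"] * 5 + ["C"] * 3,
    "SepsisLabel": [0] * 4 + [1] * 8 + [0] * 5 + [1] * 3,
})
relabeled = shift_labels(demo)
print(relabeled.groupby("patient_id")["SepsisLabel"].sum())
```

In this toy data, patient A keeps two positive rows, while patient C, who has fewer than six positive rows, ends up with none. Whether such patients should be kept or dropped is a design decision this sketch does not settle.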