Model Card: Variational AutoEncoder with Differential Privacy

Following Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we're providing some information about the Variational AutoEncoder (VAE) with Differential Privacy within this repository.

Model Details

The implementation of the Variational AutoEncoder (VAE) with Differential Privacy within this repository was created as part of an NHSX Analytics Unit PhD internship project undertaken by Dominic Danks (last commit to the repository: commit 88a4bdf). This model card describes the updated version of the model, released in March 2022. Further information about the previous version created by Dominic Danks and its implementation can be found in Section 5.4 of the associated report.

Model Use

Intended Use

This model is intended for use in experimenting with the interplay of differential privacy and VAEs.

Out-of-Scope Use Cases

This model is not suitable for providing privacy guarantees in a production environment.

Training Data

Experiments in this repository are run against the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT) dataset, accessed via the pycox Python library. We also performed further analysis on a single table extracted from MIMIC-III.
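As an illustrative sketch (not part of the repository's pipeline), the SUPPORT dataset can be loaded as a pandas DataFrame through pycox's datasets module:

```python
# Minimal sketch: load the SUPPORT dataset via pycox as a pandas DataFrame.
# pycox downloads the data on first use; any further preprocessing shown elsewhere
# in this repository is not reproduced here.
from pycox.datasets import support

df = support.read_df()   # pandas DataFrame including covariates, duration and event columns
print(df.shape)
print(df.head())
```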

Performance and Limitations

A from-scratch VAE implementation was compared against various models available within the SDV framework using a variety of quality and privacy metrics on the SUPPORT dataset. The VAE was found to be competitive with all of these models across the various metrics. Differential Privacy (DP) was introduced via DP-SGD and the performance of the VAE for different levels of privacy was evaluated. It was found that as the level of Differential Privacy introduced by DP-SGD was increased, it became easier to distinguish between synthetic and real data.
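For context, DP-SGD can be introduced into a standard PyTorch training loop with Opacus. The sketch below uses the legacy attach-style interface of the Opacus version vendored in this repository (v0.14.0); the model, optimizer and noise settings are placeholders, not the repository's actual configuration:

```python
# Illustrative only: attach DP-SGD to an existing optimizer with the Opacus v0.14-style API.
import torch
from opacus import PrivacyEngine

model = torch.nn.Linear(10, 10)                              # placeholder for the VAE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

privacy_engine = PrivacyEngine(
    model,
    sample_rate=0.01,        # batch_size / dataset size (assumed value)
    noise_multiplier=1.0,    # larger -> stronger privacy, typically lower utility
    max_grad_norm=1.0,       # per-sample gradient clipping bound
)
privacy_engine.attach(optimizer)  # optimizer.step() now performs DP-SGD updates

# After training, report the privacy budget spent for a chosen delta.
epsilon, best_alpha = privacy_engine.get_privacy_spent(1e-5)
```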

Proper evaluation of the quality and privacy of synthetic data is challenging. In this work, we utilised metrics from the SDV library due to their natural integration with the rest of the codebase. A valuable extension of this work would be to apply a variety of external metrics, including more advanced adversarial attacks, to evaluate the privacy of the considered methods more thoroughly, including how it changes as the level of DP is varied. It would also be of interest to apply DP-SGD and/or PATE to all of the considered methods and evaluate whether the performance drop as a function of the implemented privacy is similar or different across the models.
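As a rough illustration of the kind of SDV-based evaluation referred to above (module paths reflect pre-1.0 SDV releases; the file paths are placeholders, not the repository's code):

```python
# Illustrative sketch: score synthetic data against real data with SDV (pre-1.0 API).
import pandas as pd
from sdv.evaluation import evaluate                 # aggregate quality score
from sdv.metrics.tabular import LogisticDetection   # detection-style metric

real = pd.read_csv("real.csv")                      # placeholder paths
synthetic = pd.read_csv("synthetic.csv")

overall = evaluate(synthetic, real)                      # aggregate score in [0, 1]
detection = LogisticDetection.compute(real, synthetic)  # 1.0 = classifier cannot tell them apart
print(f"aggregate score: {overall:.3f}, logistic detection: {detection:.3f}")
```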

Currently the SynthVAE model only works for data which is 'clean', i.e. data with no missing values or NaNs in its input. It can handle continuous, categorical and datetime variables. Special types such as nominal data cannot be handled properly, although the model may still run. Column names have to be specified in the code for the variable group they belong to. A simple example CSV file is provided under example_input.csv.
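For illustration, the kind of pre-flight check implied above might look as follows (the column names and groupings are hypothetical; the real groupings live in the repository's code):

```python
# Illustrative sketch: check an input CSV is "clean" and declare column groups.
# Column names below are hypothetical examples, not those expected by SynthVAE.
import pandas as pd

df = pd.read_csv("example_input.csv")

# SynthVAE expects no missing values / NaNs anywhere in the input.
assert not df.isna().any().any(), "Input contains missing values; SynthVAE requires clean data."

# Columns are assigned to variable groups explicitly in the code.
continuous_cols = ["age", "num_comorbidities"]
categorical_cols = ["sex", "ethnicity"]
datetime_cols = ["admission_date"]

# Datetime columns must be parseable.
for col in datetime_cols:
    df[col] = pd.to_datetime(df[col])
```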

Hyperparameter tuning of the model can result in errors if certain parameter values are selected. Most commonly, changing the learning rate in our example results in errors during training. An extensive evaluation of plausible parameter ranges has not yet been performed. If you get errors during tuning, reconsider your hyperparameter values and adjust them accordingly.

Additional notes

Opacus

The experiments presented here use the modified copy of Opacus (v0.14.0) contained in this repository. The only difference between this edited version and the standard release is on line 96 of opacus/grad_sample/grad_sample_module.py, where we use register_full_backward_hook instead of register_backward_hook, following the PyTorch recommendation issued as a warning when the latter is used.
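For reference, the change corresponds to the general PyTorch pattern below (a standalone illustration, not the repository's patched Opacus internals): register_backward_hook is deprecated in favour of register_full_backward_hook, which reliably receives gradients for all inputs and outputs of a module.

```python
# Standalone illustration of the hook change; not the actual Opacus code.
import torch
import torch.nn as nn

def capture_grads(module, grad_input, grad_output):
    # In Opacus, a hook of this kind is used to capture gradient information per module.
    print(module.__class__.__name__, [g.shape for g in grad_output if g is not None])

layer = nn.Linear(4, 2)

# Deprecated: layer.register_backward_hook(capture_grads)
layer.register_full_backward_hook(capture_grads)   # recommended by PyTorch

out = layer(torch.randn(3, 4))
out.sum().backward()
```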