The purpose of this repository is to train promising LLMs on clinical text for named entity recognition (NER).
The first powerful model family we consider is GatorTron, a BERT-style model (https://huggingface.co/UFNLP/gatortron-medium).
Developed through a joint effort between the University of Florida and NVIDIA, GatorTron-Medium is a clinical language model with 3.9 billion parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM). GatorTron-Medium was pre-trained on a dataset consisting of 82B words of de-identified clinical notes from the University of Florida Health System, 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words of de-identified clinical notes from MIMIC-III.
The base model has 345 million parameters, while the medium one has a whopping 3.9 billion. More details are provided in the paper https://www.nature.com/articles/s41746-022-00742-2.
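As a minimal sketch (not part of this repo's training script), the checkpoints can be loaded with the Hugging Face `transformers` library and wrapped with a token-classification head for NER; the label count below is a placeholder, not this repo's actual label set.

```python
# Minimal sketch: load a GatorTron checkpoint and attach a token-classification (NER) head.
# num_labels=5 is a placeholder; the real label set depends on the dataset you fine-tune on.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "UFNLP/gatortron-base"  # swap in "UFNLP/gatortron-medium" for the 3.9B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=5)

# Tokenize a short clinical snippet and run a forward pass.
inputs = tokenizer("Patient denies chest pain or shortness of breath.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```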
GatorTronS is related to GatorTron but was trained on a different corpus.
Developed through a joint effort between the University of Florida and NVIDIA, GatorTronS is a clinical language model with 345 million parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM). GatorTronS was pre-trained on a dataset consisting of 22B synthetic clinical words generated by GatorTronGPT (a Megatron GPT-3 model), 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words of de-identified clinical notes from MIMIC-III.
The model has 345 million parameters. Details can be found here: https://www.nature.com/articles/s41746-023-00958-w#code-availability.
The MedMentions dataset contains 4,392 abstracts released on PubMed between January 2016 and January 2017. The abstracts were manually annotated for biomedical concepts. Details are provided in https://arxiv.org/pdf/1902.09476v1.pdf and the data is at https://github.com/chanzuckerberg/MedMentions.
The preprocessed data for LLM training can be found at https://github.com/mhmdrdwn/medm/tree/main/data/built_data.
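If you start from the raw MedMentions release instead, the corpus is distributed as a single PubTator-format file; below is a rough parsing sketch assuming the standard PubTator layout (a title line, an abstract line, then tab-separated mention lines per document), not code taken from this repo.

```python
# Rough sketch for reading a PubTator-format file such as MedMentions' corpus_pubtator.txt.
# Assumes "PMID|t|title" and "PMID|a|abstract" lines, then tab-separated mention lines,
# with documents separated by blank lines.
def read_pubtator(path):
    docs, text, mentions = [], {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line separates documents
                if text:
                    docs.append({"text": text, "mentions": mentions})
                    text, mentions = {}, []
            elif "\t" in line:                # mention: pmid, start, end, span, semantic type(s), concept id
                pmid, start, end, span, sem_types, concept = line.split("\t")
                mentions.append({"start": int(start), "end": int(end),
                                 "span": span, "types": sem_types, "concept": concept})
            else:                             # title or abstract line: "PMID|t|..." / "PMID|a|..."
                pmid, field, content = line.split("|", 2)
                text[field] = content
    if text:
        docs.append({"text": text, "mentions": mentions})
    return docs
```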
The NCBI Disease corpus contains disease name and concept annotations for 793 PubMed abstracts, fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Details are here: https://www.sciencedirect.com/science/article/pii/S1532046413001974?via%3Dihub.
The preprocessed data for LLM training can be found at https://huggingface.co/datasets/ncbi_disease.
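The Hugging Face copy can be pulled directly with the `datasets` library; a quick sketch is below (depending on your `datasets` version you may need to pass `trust_remote_code=True`).

```python
# Quick look at the NCBI Disease corpus via the Hugging Face hub.
from datasets import load_dataset

ds = load_dataset("ncbi_disease")
print(ds)                               # train / validation / test splits
example = ds["train"][0]
print(example["tokens"][:10])           # word-level tokens
print(example["ner_tags"][:10])         # integer BIO tag ids
label_names = ds["train"].features["ner_tags"].feature.names
print(label_names)                      # e.g. ['O', 'B-Disease', 'I-Disease']
```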
Details of this shared-task dataset on adverse drug events (ADEs) can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7489085/.
The data for this shared task consisted of 505 discharge summaries drawn from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database. These records were selected using a query that searched for an ADE in the International Classification of Diseases code description of each record. The identified records were manually screened to contain at least one ADE and were annotated for the task's concept and relation types. Each record in the dataset was annotated by two independent annotators, and a third annotator resolved conflicts.
Here I use Python version 3.9.2. All the dependencies are listed in `requirements.txt`. You also need to install the repo as a package: `pip install -e .`
An example command to run the training code:

```bash
python3 src/models/train_model.py \
    --model_name 'UFNLP/gatortrons' \
    --data_dir '/home/ec2-user/SageMaker/LLM-NER-clinical-text/data/public/MedMentions/preprocessed-data/' \
    --batch_size 4 \
    --num_train_epochs 5 \
    --weight_decay 0.01 \
    --new_model_dir "/home/ec2-user/SageMaker/LLM-NER-clinical-text/models/medmentions/gatortrons/" \
    --path_umls_semtype '/home/ec2-user/SageMaker/LLM-NER-clinical-text/data/public/MedMentions/SemGroups_2018.txt'
```
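The training script handles preprocessing internally; purely as an illustration of one key step in BERT-style NER fine-tuning (not necessarily how `train_model.py` implements it), word-level labels have to be aligned to subword tokens before training. The tag ids below are hypothetical.

```python
# Generic sketch of aligning word-level NER labels with subword tokens.
# Illustrative only; the repo's training script may do this differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortrons")

words = ["Patient", "has", "type", "2", "diabetes"]
word_labels = [0, 0, 1, 2, 2]  # hypothetical ids, e.g. O, O, B-Disease, I-Disease, I-Disease

enc = tokenizer(words, is_split_into_words=True, truncation=True)
aligned = []
for word_id in enc.word_ids():
    if word_id is None:
        aligned.append(-100)              # special tokens are ignored by the loss
    else:
        aligned.append(word_labels[word_id])  # every sub-token inherits its word's label
print(aligned)
```

Another common choice is to label only the first sub-token of each word and mask the rest with -100.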
The fine-tuned models and brief results can be found on my Hugging Face page: https://huggingface.co/longluu. You can also look at the `notebooks` folder for training and test results.
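As a usage sketch, a fine-tuned checkpoint can be served through the `transformers` NER pipeline; the model id below is a placeholder, so substitute one of the actual checkpoints from that page.

```python
# Inference sketch with a fine-tuned NER checkpoint.
# "longluu/<your-finetuned-model>" is a placeholder; pick a real checkpoint
# from https://huggingface.co/longluu.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="longluu/<your-finetuned-model>",
    aggregation_strategy="simple",   # merge subword pieces into whole entities
)
print(ner("The patient was started on metformin for type 2 diabetes."))
```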
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Project based on the cookiecutter data science project template. #cookiecutterdatascience