
Medical-QA-extractive

The purpose of this repository is to train LLMs on extractive Question-Answering in the biomedical domain.

1. Model

1.1 Extractive method

1.1.1 GatorTronS

Developed through a joint effort between the University of Florida and NVIDIA, GatorTronS is a clinical language model with 345 million parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM). GatorTronS is pre-trained on a dataset consisting of 22B synthetic clinical words generated by GatorTronGPT (a Megatron GPT-3 model), 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words of de-identified clinical notes from MIMIC-III.

More details can be found at https://www.nature.com/articles/s41746-023-00958-w#code-availability.
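As a minimal sketch (not taken from the repo's code), the checkpoint can be loaded with a question-answering head via Hugging Face transformers. Note that the QA head is freshly initialized and only produces useful answers after fine-tuning, e.g. with the training command in section 3.2; the question/context pair here is just an illustration.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "UFNLP/gatortrons"  # same checkpoint used in the training command below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)  # QA head is randomly initialized

question = "Which virus causes COVID-19?"
context = "COVID-19 is caused by the SARS-CoV-2 virus."
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)

# Decode the most likely answer span from the start/end logits
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))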

1.1.2 Deberta-v3-large-mrqa

DeBERTa improves on BERT and RoBERTa using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks when trained with 80GB of data. DeBERTa V3 further improves efficiency through ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, which significantly improves downstream performance over DeBERTa. The DeBERTa V3 large model has 24 layers and a hidden size of 1024; it has 304M backbone parameters plus a 128K-token vocabulary that adds 131M parameters in the embedding layer, and it was trained on the same 160GB of data as DeBERTa V2.

The base model here is deberta-v3-large fine-tuned on the MRQA dataset: https://huggingface.co/VMware/deberta-v3-large-mrqa.
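Since this checkpoint is already fine-tuned for extractive QA on MRQA, it can be tried directly with the transformers question-answering pipeline; a small sketch (the question/context pair is an illustrative example, not taken from the datasets below):

from transformers import pipeline

qa = pipeline("question-answering", model="VMware/deberta-v3-large-mrqa")
result = qa(
    question="Which virus is best known as the cause of infectious mononucleosis?",
    context="Infectious mononucleosis is most commonly caused by the Epstein-Barr virus (EBV).",
)
print(result["answer"], result["score"])  # extracted span and its confidence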

1.2 RAG + LLM

Here we used the RAG approach, which generally consists of a retriever and a reader. For the retriever, we used the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model. For the reader, we experimented with several LLMs including GPT 3.5, Zephyr 7B, Llama 2 chat 7B, Flan T5 large and BioMedLM. This approach is more powerful than the extractive method because it doesn't require supplying the correct document (the retriever finds it in the corpus) and it can synthesize information from multiple documents.
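Below is a minimal retriever sketch (an illustration, not the repo's implementation), assuming sentence-transformers is installed: it embeds a toy corpus with multi-qa-mpnet-base-dot-v1, scores documents by dot product, and builds a prompt for whichever reader LLM is used; the reader call itself is omitted.

from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Toy corpus standing in for the real document collection
corpus = [
    "COVID-19 is caused by the SARS-CoV-2 virus.",
    "Infectious mononucleosis is usually caused by the Epstein-Barr virus.",
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

question = "Which virus causes COVID-19?"
query_emb = retriever.encode(question, convert_to_tensor=True)

# This embedding model is trained for dot-product similarity
scores = util.dot_score(query_emb, corpus_emb)[0]
top_doc = corpus[int(scores.argmax())]

# The retrieved context is then handed to the reader LLM (GPT 3.5, Zephyr, Llama 2, ...)
prompt = f"Answer the question using the context.\nContext: {top_doc}\nQuestion: {question}"
print(prompt)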

2. Dataset

2.1. COVID-QA

This dataset contains 2,019 question-answer pairs annotated by volunteer biomedical experts on scientific articles regarding COVID-19 and other medical issues. The dataset can be found here: https://github.com/deepset-ai/COVID-QA. The preprocessed data can be found here: https://huggingface.co/datasets/covid_qa_deepset.
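A quick sketch of loading the preprocessed data with the Hugging Face datasets library (field names follow the covid_qa_deepset dataset card; the training command in section 3.2 uses a split variant, longluu/covid-qa-split):

from datasets import load_dataset

covid_qa = load_dataset("covid_qa_deepset")
print(covid_qa)  # 2,019 QA pairs

example = covid_qa["train"][0]
print(example["question"])
print(example["answers"])  # SQuAD-style answer text and character start position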

2.2 BioASQ

BioASQ comprises several biomedical tasks. For QA, it provides a dataset of question-answer pairs on PubMed articles.

Task 9B will use benchmark datasets containing development and test questions, in English, along with gold standard (reference) answers. The benchmark datasets are constructed by a team of biomedical experts from around Europe and contain four types of questions:

- Yes/no questions: questions that, strictly speaking, require "yes" or "no" answers, though in practice longer answers are often desirable. For example, "Do CpG islands colocalise with transcription start sites?" is a yes/no question.
- Factoid questions: questions that, strictly speaking, require a particular entity name (e.g., of a disease, drug, or gene), a number, or a similar short expression as an answer, though again a longer answer may be desirable in practice. For example, "Which virus is best known as the cause of infectious mononucleosis?" is a factoid question.
- List questions: questions that, strictly speaking, require a list of entity names (e.g., a list of gene names), numbers, or similar short expressions as an answer; again, in practice additional information may be desirable. For example, "Which are the Raf kinase inhibitors?" is a list question.
- Summary questions: questions that do not belong to any of the previous categories and can only be answered by producing a short text summarizing the most prominent relevant information. For example, "What is the treatment of infectious mononucleosis?" is a summary question.

More details can be found here: http://participants-area.bioasq.org/general_information/Task9b/
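As a small sketch, the factoid questions (the type closest to extractive QA) can be picked out of a BioASQ Task B training file, assuming the standard BioASQ JSON layout with a top-level "questions" list; the filename here is hypothetical.

import json

# Hypothetical local path to a BioASQ Task B training file
with open("BioASQ-training9b.json") as f:
    bioasq = json.load(f)

# Each question carries a "type": yesno, factoid, list, or summary
factoid = [q for q in bioasq["questions"] if q["type"] == "factoid"]
print(len(factoid), "factoid questions")
print(factoid[0]["body"])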

3. Training setup

3.1 Environment

Here I use Python version 3.9.2. All the dependencies are listed in requirements.txt. You also need to install the repo as a package: `pip install -e .`.

3.2 Run the code

An example of running the training code:

python src/models/run_qa.py \
    --model_name_or_path 'UFNLP/gatortrons' \
    --dataset_name 'longluu/covid-qa-split' \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 4 \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 512 \
    --doc_stride 250 \
    --max_answer_length 200 \
    --output_dir "/home/ec2-user/SageMaker/Medical-QA-extractive/models/COVID-QA/gatortrons/" \
    --overwrite_output_dir

4. Results

The fine-tuned models and brief results can be found on my Hugging Face page: https://huggingface.co/longluu. You can also look at the notebooks folder for training and test results.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
