
Data preparation

Data for training

We provide the processed data as follows.

| Datasets | Hugging Face | Baidu Disk |
|---|---|---|
| Multimodal Pre-training | Link | - |
| Joint Instruction Tuning | Link | - |
| ScienceQA | Link | - |

Data for validation

We provide the processed data as follows. The annotations are provided in eval/questions.

| Datasets | Hugging Face | Baidu Disk | Google Drive | Peking University Disk |
|---|---|---|---|---|
| Image_Understanding | Link | - | - | - |
| Video_Understanding | Link | - | - | - |
| ScienceQA | Link | - | - | - |
| Activitynet_Zero_Shot_QA | Link | Link | - | - |
| MSRVTT_Zero_Shot_QA | Link | Link | Link | - |
| MSVD_Zero_Shot_QA | Link | Link | Link | Link |
| TGIF_Zero_Shot_QA | Link | Link | Link | Link |
| POPE | Link | - | - | - |

Data parameters

Modify the data paths in `config/dataset_config.py`:

```python
Pretrain = {
    "chat_path": "${PATH}/CC3M-595K/chat.json",
    "CC3M": "${PATH}/CC3M-595K",
}

VIT = {
    "chat_path": "${PATH}/llava_instruct_150k.json",
    "COCO2017": "${PATH}/COCO2017/train2017",
}

MIMIC_imageonly = {
    "chat_path": "${PATH}/MIMIC-IT-imageonly.json",
    "CDG": "${PATH}/CGD/images",
    "LA": "${PATH}/LA/images",
    "SD": "${PATH}/SD/images",
}

COCO_CAP = {
    "chat_path": "${PATH}/COCO/coco_cap_chat.json",
    "COCO2014": "${PATH}/COCO2014/train2014",
}

COCO_REG = {
    "chat_path": "${PATH}/COCO/coco_reg_chat.json",
    "COCO2014": "${PATH}/COCO2014/train2014",
}

COCO_REC = {
    "chat_path": "${PATH}/COCO/coco_rec_chat.json",
    "COCO2014": "${PATH}/COCO2014/train2014",
}

VIDEO = {
    "chat_path": "${PATH}/video_chat.json",
    "VIDEO": "${PATH}/Activity_Videos",
}

SQA = {
    "chat_path": "${PATH}/llava_train_QCM-LEA.json",
    "ScienceQA": "${PATH}/scienceqa/train",
}
```
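
Each entry pairs one annotation file (`chat_path`) with one or more media root folders (the remaining keys). The sketch below shows one way such an entry could be consumed; it is only an illustration, not the repository's actual dataloader, and `iter_samples` is a hypothetical helper written under that assumption.

```python
# Illustrative sketch only -- not the repository's dataloader.
# Assumes config/dataset_config.py has been filled in with real paths.
import json
import os

from config.dataset_config import COCO_CAP


def iter_samples(dataset_cfg):
    """Yield (annotation, resolved media path) pairs for one dataset entry."""
    with open(dataset_cfg["chat_path"], "r") as f:
        annotations = json.load(f)
    # Every key except "chat_path" is treated as a media root folder.
    roots = [v for k, v in dataset_cfg.items() if k != "chat_path"]
    for ann in annotations:
        name = ann.get("image") or ann.get("video")
        candidates = (os.path.join(root, name) for root in roots)
        path = next((p for p in candidates if os.path.exists(p)), None)
        yield ann, path


for ann, path in iter_samples(COCO_CAP):
    print(ann["id"], path)
    break
```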

Prepare your own training dataset

Format of data

All conversation data is stored in JSON format, and each sample contains the following fields:

- `id`: Used to distinguish different samples
- `image` or `video`: The file name of the image or video
- `conversations`: The conversation turns between "human" and "gpt"
```json
[
  {
    "id": "COCO_CAP_0",
    "image": "COCO_train2014_000000222016.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nDescribe the main events or objects in the image."
      },
      {
        "from": "gpt",
        "value": "a big red telephone booth that a man is standing in"
      }
    ]
  }
]
```
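
Before training on a custom file, a quick structural check along these lines can catch malformed samples early. This is only a sketch; `check_chat_file` is a hypothetical helper, not a script shipped with the repository.

```python
# Minimal structural check for a custom annotation file -- illustrative only.
import json


def check_chat_file(path):
    with open(path, "r") as f:
        samples = json.load(f)
    for sample in samples:
        assert "id" in sample, "every sample needs an 'id'"
        assert "image" in sample or "video" in sample, \
            "every sample needs an 'image' or 'video' file name"
        for turn in sample["conversations"]:
            assert turn["from"] in ("human", "gpt")
            assert isinstance(turn["value"], str)
    print(f"{path}: {len(samples)} samples look well formed")


# Example: check_chat_file("${PATH}/chat.json")
```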

Data parameters

Add an entry for your dataset in `config/dataset_config.py` and set the data paths:

```python
New_data = {
    "chat_path": "${PATH}/chat.json",
    "new_data": "${PATH}/file",
}
```

Then register the dataset in `config/__init__.py`. You can also combine different datasets using the list format (a combined example follows the block below):

```python
DataConfig = {
    "New": [New_data],
}
```
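
To combine datasets, the list for a key can simply hold several of the entries defined in `config/dataset_config.py`. The key name `"NewMix"` below is arbitrary and chosen only for illustration:

```python
DataConfig = {
    "New": [New_data],
    # "NewMix" is an arbitrary name used here for illustration; selecting it
    # with --dataset_use NewMix would presumably train on all listed datasets.
    "NewMix": [New_data, COCO_CAP, VIDEO],
}
```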

To use the new dataset during training, you only need to set the `dataset_use` argument in the training command:

```bash
--dataset_use New
```