Evaluation for full-parameter tuning models

Note that conv-mode should be changed to minicpm, phi3 or llama for MODEL_TYPE = minicpm, phi-3 or llama3-8b, respectively.
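
A minimal sketch of that correspondence as a lookup table (illustrative only; the conv-mode argument itself is set inside the scripts under script/eval/full):

    # Correspondence between MODEL_TYPE and conv-mode described above
    # (illustrative sketch; the flag itself is passed inside the evaluation scripts).
    CONV_MODE_BY_MODEL_TYPE = {
        "minicpm": "minicpm",
        "phi-3": "phi3",
        "llama3-8b": "llama",
    }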

MME

  1. Refer to MME GitHub to download the benchmark dataset and put MME_Benchmark_release_version under eval/mme.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme.sh

The responses and scores can be found in eval/mme/answers_upload.

MMBench & MMBench-Chinese

  1. Refer to MMBench GitHub to download the benchmark dataset. We support MMBench-Dev, MMBench-Test, MMBench-Dev (cn) and MMBench-Test (cn). Please note that only the files downloaded via the legacy links are supported. Put MMBench_DEV_EN_legacy.tsv, MMBench_TEST_EN_legacy.tsv, MMBench_DEV_CN_legacy.tsv or MMBench_TEST_CN_legacy.tsv under eval/mmbench.
  2. Update SPLIT, LANG (en/cn), MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh

The response file can be found in eval/mmbench/answers_upload. You can submit the Excel file to the submission link to obtain the evaluation scores.

SEED-Bench-1

  1. Refer to SEED-Bench Instruction to download the images and videos, put the images under eval/seed-bench/SEED-Bench-image and the videos under eval/seed-bench/SEED-Bench-video. Then, extract the middle frame of each downloaded video (a sketch of this step is shown at the end of this section) by running:

    pip install av decord
    python eval/seed-bench/extract_video_frames.py
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/seedbench.sh

The response file can be found in eval/seed-bench/answers_upload and the scores can be found in eval/seed-bench/scores.
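
As a rough illustration of the frame-extraction step above, the sketch below pulls the middle frame of each video with decord. The output directory and file-naming scheme are assumptions for illustration, not necessarily what eval/seed-bench/extract_video_frames.py actually does.

    # Illustrative sketch: save the middle frame of every SEED-Bench video.
    # The output directory and naming below are assumptions, not the repository's.
    import os
    from decord import VideoReader, cpu
    from PIL import Image

    video_dir = "eval/seed-bench/SEED-Bench-video"
    frame_dir = "eval/seed-bench/SEED-Bench-video-frames"  # hypothetical output path
    os.makedirs(frame_dir, exist_ok=True)

    for name in os.listdir(video_dir):
        if not name.endswith(".mp4"):
            continue
        vr = VideoReader(os.path.join(video_dir, name), ctx=cpu(0))
        middle_frame = vr[len(vr) // 2].asnumpy()  # frame at the temporal midpoint
        Image.fromarray(middle_frame).save(
            os.path.join(frame_dir, os.path.splitext(name)[0] + ".png"))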

MMMU

  1. Refer to MMMU HuggingFace to download the benchmark dataset and put MMMU under eval/mmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmmu.sh

The response file can be found in eval/mmmu/answers_upload.

For the validation set, you can use eval_mmmu.py to obtain the scores.

python eval/mmmu/eval_mmmu.py \
	--output-path ./eval/mmmu/answers_upload/$SPLIT/$TARGET_DIR.json

For the test set, you can submit the JSON response file to the submission link to obtain the evaluation scores.

CMMMU

  1. Refer to CMMMU HuggingFace to download the benchmark dataset and put CMMMU under eval/cmmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/cmmmu.sh

The response file can be found in eval/cmmmu/answers_upload.

For the validation set, you can use eval_script.py to obtain the scores.

python eval/cmmmu/eval_script.py \
	--output_path ./eval/cmmmu/answers_upload/$SPLIT/$TARGET_DIR.jsonl

For the test set, you can submit the JSONL response file to the submission link to obtain the evaluation scores.

VQAv2

  1. Download COCO 2015 Test images and put test2015 under eval/vqav2. Then:

    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && \
    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/vqav2.sh

The response file can be found in eval/vqav2/answers_upload. You can submit the JSON response file to the submission link (Test-Dev Phase) to obtain the evaluation scores.

GQA

  1. Download the GQA images, unzip the archive and put images under eval/gqa. Then:

    tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa.sh

ScienceQA-IMG

  1. Refer to ScienceQA Google Drive to download test.zip, problems.json and pid_splits.json, unzip test.zip and put them under eval/scienceqa.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/scienceqa.sh

The responses and the scores can be found in eval/scienceqa/results.

POPE

  1. Download COCO 2014 Val images and put val2014 under eval/pope. Then, refer to POPE GitHub to download the benchmark dataset and put the three JSON files under eval/pope/coco.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/pope.sh

We report the F1-score averaged over the three categories (random, popular and adversarial).
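
A minimal sketch of that aggregation, assuming hypothetical per-category files of yes/no prediction-label pairs (the actual answer files produced by the POPE script may be organized differently):

    # Average the F1-score over the three POPE categories.
    # File names and record fields below are hypothetical, for illustration only.
    import json

    def f1_score(pairs):
        # pairs: list of (prediction, label) tuples with "yes"/"no" values
        tp = sum(p == "yes" and l == "yes" for p, l in pairs)
        fp = sum(p == "yes" and l == "no" for p, l in pairs)
        fn = sum(p == "no" and l == "yes" for p, l in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    scores = []
    for category in ("random", "popular", "adversarial"):
        with open(f"pope_{category}_pairs.json") as f:  # hypothetical file name
            pairs = [(r["prediction"], r["label"]) for r in json.load(f)]
        scores.append(f1_score(pairs))

    print("Averaged F1-score:", sum(scores) / len(scores))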

MM-Vet

  1. Refer to MM-Vet GitHub to download the benchmark dataset and put images under eval/mm-vet.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmvet.sh

The response file can be found in eval/mm-vet/answers_upload. You can submit the JSON response file to the submission link to obtain the evaluation scores.

SpatialBench

SpatialBench was introduced in SpatialBot. It evaluates models' spatial understanding and reasoning.

  1. Download the dataset from HuggingFace.
  2. Please refer to SpatialBot GitHub for the evaluation code.