Evaluation for full-parameter tuning models

Note that conv-mode should be changed to minicpm, phi3 or llama for MODEL_TYPE = minicpm, phi-3 or llama3-8b, respectively.
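
A minimal sketch of that correspondence as a lookup table (illustrative only; the conv-mode argument itself is set inside the scripts under script/eval/full):

    # Correspondence between MODEL_TYPE and conv-mode described above
    # (illustrative sketch; the flag itself is passed inside the evaluation scripts).
    CONV_MODE_BY_MODEL_TYPE = {
        "minicpm": "minicpm",
        "phi-3": "phi3",
        "llama3-8b": "llama",
    }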

MME

  1. Refer to MME GitHub to download the benchmark dataset and put MME_Benchmark_release_version under eval/mme.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme.sh

The responses and scores can be found in eval/mme/answers_upload.

MMBench & MMBench-Chinese

  1. Refer to MMBench GitHub to download the benchmark dataset. We support MMBench-Dev, MMBench-Test, MMBench-Dev (cn) and MMBench-Test (cn). Please note that only the files downloaded via the legacy links are supported. Put MMBench_DEV_EN_legacy.tsv, MMBench_TEST_EN_legacy.tsv, MMBench_DEV_CN_legacy.tsv or MMBench_TEST_CN_legacy.tsv under eval/mmbench.
  2. Update SPLIT, LANG (en/cn), MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh

The response file can be found in eval/mmbench/answers_upload. You can submit the Excel file to the submission link to obtain the evaluation scores.

SEED-Bench-1

  1. Refer to SEED-Bench Instruction to download the images and videos, put the images under eval/seed-bench/SEED-Bench-image and the videos under eval/seed-bench/SEED-Bench-video. Then, extract the middle frame of each downloaded video (a sketch of this step is shown at the end of this section) by running:

    pip install av decord
    python eval/seed-bench/extract_video_frames.py
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/seedbench.sh

The response file can be found in eval/seed-bench/answers_upload and the scores can be found in eval/seed-bench/scores.
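
As a rough illustration of the frame-extraction step above, the sketch below pulls the middle frame of each video with decord. The output directory and file-naming scheme are assumptions for illustration, not necessarily what eval/seed-bench/extract_video_frames.py actually does.

    # Illustrative sketch: save the middle frame of every SEED-Bench video.
    # The output directory and naming below are assumptions, not the repository's.
    import os
    from decord import VideoReader, cpu
    from PIL import Image

    video_dir = "eval/seed-bench/SEED-Bench-video"
    frame_dir = "eval/seed-bench/SEED-Bench-video-frames"  # hypothetical output path
    os.makedirs(frame_dir, exist_ok=True)

    for name in os.listdir(video_dir):
        if not name.endswith(".mp4"):
            continue
        vr = VideoReader(os.path.join(video_dir, name), ctx=cpu(0))
        middle_frame = vr[len(vr) // 2].asnumpy()  # frame at the temporal midpoint
        Image.fromarray(middle_frame).save(
            os.path.join(frame_dir, os.path.splitext(name)[0] + ".png"))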

MMMU

  1. Refer to MMMU HuggingFace to download the benchmark dataset and put MMMU under eval/mmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmmu.sh

The response file can be found in eval/mmmu/answers_upload.

For the validation set, you can use eval_mmmu.py to obtain the scores.

python eval/mmmu/eval_mmmu.py \
	--output-path ./eval/mmmu/answers_upload/$SPLIT/$TARGET_DIR.json

For the test set, you can submit the JSON response file to the submission link to obtain the evaluation scores.

CMMMU

  1. Refer to CMMMU HuggingFace to download the benchmark dataset and put CMMMU under eval/cmmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/cmmmu.sh

The response file can be found in eval/cmmmu/answers_upload.

For the validation set, you can use eval_script.py to obtain the scores.

python eval/cmmmu/eval_script.py \
	--output_path ./eval/cmmmu/answers_upload/$SPLIT/$TARGET_DIR.jsonl

For the test set, you can submit the JSONL response file to the submission link to obtain the evaluation scores.

VQAv2

  1. Download COCO 2015 Test images and put test2015 under eval/vqav2. Then:

    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && \
    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/vqav2.sh

The response file can be found in eval/vqav2/answers_upload. You can submit the JSON response file to the submission link (Test-Dev Phase) to obtain the evaluation scores.

GQA

  1. Download the GQA images, unzip the archive and put images under eval/gqa. Then:

    tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa.sh

ScienceQA-IMG

  1. Refer to ScienceQA Google Drive to download test.zip, problems.json and pid_splits.json, unzip test.zip and put them under eval/scienceqa.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/scienceqa.sh

The responses and the scores can be found in eval/scienceqa/results.

POPE

  1. Download COCO 2014 Val images and put val2014 under eval/pope. Then, refer to POPE GitHub to download the benchmark dataset and put the three JSON files under eval/pope/coco.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/pope.sh

We report the F1-score averaged over the three categories (random, popular and adversarial).
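
A minimal sketch of that aggregation, assuming hypothetical per-category files of yes/no prediction-label pairs (the actual answer files produced by the POPE script may be organized differently):

    # Average the F1-score over the three POPE categories.
    # File names and record fields below are hypothetical, for illustration only.
    import json

    def f1_score(pairs):
        # pairs: list of (prediction, label) tuples with "yes"/"no" values
        tp = sum(p == "yes" and l == "yes" for p, l in pairs)
        fp = sum(p == "yes" and l == "no" for p, l in pairs)
        fn = sum(p == "no" and l == "yes" for p, l in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    scores = []
    for category in ("random", "popular", "adversarial"):
        with open(f"pope_{category}_pairs.json") as f:  # hypothetical file name
            pairs = [(r["prediction"], r["label"]) for r in json.load(f)]
        scores.append(f1_score(pairs))

    print("Averaged F1-score:", sum(scores) / len(scores))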

MM-Vet

  1. Refer to MM-Vet GitHub to download the benchmark dataset and put images under eval/mm-vet.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmvet.sh

The response file can be found in eval/mm-vet/answers_upload. You can submit the JSON response file to the submission link to obtain the evaluation scores.

SpatialBench

SpatialBench was introduced in SpatialBot. It evaluates models' spatial understanding and reasoning.

  1. Download the dataset from HuggingFace.
  2. Please refer to SpatialBot GitHub for the evaluation code.