Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Code release for the paper Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding, ECCV 2024.
Authors: Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool

🌟 ConcreteNet is ranked 1st on the ScanRefer online benchmark (since 19.08.2023)!

Abstract: 3D visual grounding is the task of localizing the object in a 3D scene which is referred to by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues; next, we construct a contrastive training scheme to induce separation in the latent space; we then resolve view-dependent utterances via a learned global camera token; and finally, we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.


News

[2024-06] ConcreteNet is accepted to ECCV 2024!
[2023-09] ConcreteNet wins the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge!
[2023-08] ConcreteNet achieves 1st on the ScanRefer online benchmark!



Installation

For the installation, we recommend setting up a virtual environment:

python -m venv ~/venv/concretenet
source ~/venv/concretenet/bin/activate
pip install -r requirements.txt
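
Before compiling the external C++ and CUDA ops below, it can help to confirm that the environment provides a CUDA-enabled PyTorch build (assuming PyTorch is installed via requirements.txt); a quick check:

python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'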

Furthermore, install the following dependencies:

Compile the external C++ and CUDA ops:

cd ./lib/dknet_ops
export CPATH=$SPARSEHASH_ROOT/include:$CPATH
python setup.py build_ext develop
cd ../segmentator
cd csrc && mkdir build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` \
-DPYTHON_INCLUDE_DIR=$(python -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())")  \
-DPYTHON_LIBRARY=$(python -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") \
-DCMAKE_INSTALL_PREFIX=`python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'`
make && make install # After install, please do not delete this folder (as we only create a symbolic link)
python -c 'import dknet_ops, segmentator' # Check installation

For further information regarding the compilation of external ops, please consult respective repositories of DKNet and segmentator.
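
Since the build only creates a symbolic link back into this folder (see the comment above), you can additionally confirm that both extensions resolve from inside the repository by printing their module paths:

python -c 'import dknet_ops, segmentator; print(dknet_ops.__file__, segmentator.__file__)'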

Data Preparation

To download the ScanNetv2 dataset, please refer to the ScanNet project page. You will receive a download-scannet.py script after your request for the dataset is approved. ConcreteNet requires only a subset of ScanNetv2 (10 GB), which can be downloaded as follows:

python download-scannet.py -o SCANNET/DIR --type _vh_clean_2.ply
python download-scannet.py -o SCANNET/DIR --type .aggregation.json
python download-scannet.py -o SCANNET/DIR --type _vh_clean_2.0.010000.segs.json

Once the download is complete, process the training and validation splits:

cd dataloader
python prepare_scannet_data.py --scannet_dir SCANNET/DIR --scannet_metadata ./meta_data/scannetv2_train.txt
python prepare_scannet_data.py --scannet_dir SCANNET/DIR --scannet_metadata ./meta_data/scannetv2_val.txt

Fill in the dataset: root_dir field of each config file to point to SCANNET/DIR. To download the ScanRefer dataset, please refer to the ScanRefer project page; you will receive a download link after filling in the request form. Download the ScanRefer data as well as the annotated viewpoints following the instructions, and unzip the compressed files into SCANNET/DIR.
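
As a quick sanity check that the configs point to the prepared data, you can load them and verify that the root directory exists. The sketch below assumes the config files are plain YAML and that the field is nested as dataset: root_dir as described above; the exact layout may differ:

import os
import yaml  # PyYAML

for cfg_path in ("config/pretraining.yaml", "config/concretenet.yaml"):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    root_dir = cfg["dataset"]["root_dir"]  # assumed nesting, based on the field name above
    print(cfg_path, "->", root_dir, "exists:", os.path.isdir(root_dir))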

Training

As ConcreteNet is built on a grounding-by-selection strategy, learning meaningful mappings from natural language prompts to 3D instances is done in two steps: (1) we first pretrain the 3D instance segmentation backbone, and (2) only then introduce end-to-end training with verbo-visual fusion. ConcreteNet was trained on only a single Nvidia RTX 3090.

Step 1 can be run as follows. Alternatively, this step can be skipped by directly downloading the provided checkpoint file here.

python train.py --config_path config/pretraining.yaml

For Step 2, we provide the checkpoint from Step 1 via --ckpt_path. The resulting model state_dict can be downloaded here (43.84% Acc@50).

python train.py --config_path config/concretenet.yaml --ckpt_path STEP1/CKPT/PATH

The --resume argument can be used to resume training from the last saved checkpoint.
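
Since the provided checkpoints are saved as PyTorch state_dicts, they can be inspected before training or evaluation; a small sketch (whether the file stores a bare state_dict or wraps it under a key is an assumption):

import torch

ckpt = torch.load("STEP1/CKPT/PATH", map_location="cpu")  # placeholder path as used above
state_dict = ckpt.get("state_dict", ckpt)  # the checkpoint may wrap the state_dict (assumption)
print(len(state_dict), "entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)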

Evaluation

The final model as well as the provided intermediate checkpoints can be evaluated on both the ScanNetv2 and ScanRefer validation sets as follows:

python train.py --config_path config/pretraining.yaml --ckpt_path STEP1/CKPT/PATH --evaluate
python train.py --config_path config/concretenet.yaml --ckpt_path STEP2/CKPT/PATH --evaluate

The validation performance of the trained model is given below. As an alternative to providing a checkpoint path, the --resume argument can once again be used to evaluate the last saved checkpoint. The evaluation dataset is determined by the dataset: prefix field of the respective config file.

| Unique@25 | Unique@50 | Multiple@25 | Multiple@50 | Overall@25 | Overall@50 |
|-----------|-----------|-------------|-------------|------------|------------|
| 82.39     | 75.62     | 41.24       | 36.56       | 48.91      | 43.84      |
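
For reference, the @25 and @50 columns follow the standard ScanRefer protocol: the fraction of descriptions for which the predicted localization overlaps the ground truth with an IoU of at least 0.25 or 0.5, reported separately for targets that are unique within their semantic class ("Unique") and targets with same-class distractors ("Multiple"). A minimal, generic illustration of computing such an accuracy from per-sample IoUs (not the repository's evaluation code):

import numpy as np

def acc_at_iou(ious, threshold):
    # Fraction of samples whose IoU with the ground truth meets the threshold.
    return float((np.asarray(ious, dtype=np.float64) >= threshold).mean())

ious = [0.91, 0.48, 0.73, 0.12, 0.55]  # hypothetical per-description IoUs
print(acc_at_iou(ious, 0.25), acc_at_iou(ious, 0.50))  # 0.8 0.6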

Evaluation with multi-view ensembling (MVE) can be done as follows:

python eval.py --config_path config/pretraining.yaml --ckpt_path CKPT/PATH

| Unique@25 | Unique@50 | Multiple@25 | Multiple@50 | Overall@25 | Overall@50 |
|-----------|-----------|-------------|-------------|------------|------------|
| 86.40     | 82.05     | 42.41       | 38.39       | 50.61      | 46.53      |
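
As a rough intuition for the multi-view ensembling step, the sketch below averages per-point mask probabilities obtained from several augmented views of the same scene and thresholds the mean; this is only a generic illustration of the idea, with a hypothetical predict_mask callable, and not how eval.py implements MVE:

import numpy as np

def rotate_z(angle):
    # Returns a view transform that rotates an (N, 3) point cloud around the z-axis.
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return lambda pts: pts @ rot.T

def ensemble_masks(points, views, predict_mask, threshold=0.5):
    # predict_mask is a hypothetical callable mapping an (N, 3) point cloud to per-point probabilities (N,).
    probs = np.stack([predict_mask(view(points)) for view in views], axis=0)
    return probs.mean(axis=0) >= threshold  # boolean per-point mask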

Quick Access for Download Links:


Citation

If you use our dataset or our work in your research, please cite:

@InProceedings{unal2024ways,
    author = {Unal, Ozan and Sakaridis, Christos and Saha, Suman and Van Gool, Luc},
    title = {Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding},
    booktitle = {European Conference on Computer Vision (ECCV)},
    month = {October},
    year = {2024}
}

Acknowledgements

This work is funded by Toyota Motor Europe via the research project TRACE-Zurich (Toyota Research on Automated Cars Europe). We would additionally like to thank the authors of the open-source codebases ScanRefer, 3DVG-Transformer and DKNet.

License

This software is made available for non-commercial use under a Creative Commons license. You can find a summary of the license here.
