
Usage Question #5

Open
mranzinger opened this issue Aug 8, 2024 · 2 comments

Comments

@mranzinger

Hello, excellent work!

In the readme, I don't see any reference to how inputs need to be transformed before usage. Crawling through the code, I found this:
https://github.com/bdaiinstitute/theia/blob/main/src/theia/models/backbones.py#L337-L338

So, it suggests to me that the right way to use the model is to pass it an input tensor with values between 0 and 255. Is that the correct usage?

Also, do you have any studies on the resolution interpolation ability of your model? I'm testing it out in an ADE20k semantic segmentation linear probe harness with the following:

```python
# x.shape = (2, 3, 512, 512)
features = self.base_model.forward_feature(x, do_resize=False, interpolate_pos_encoding=True)
```

just so that it matches our settings for AM-RADIO. I've also tried it with 224px resolution. In both cases, I'm using a sliding window.
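For context, the sliding-window tiling follows the usual pattern for segmentation evaluation: stride across each axis and shift the final window so it ends at the image border. A minimal sketch (`sliding_windows` is an illustrative helper, not the actual harness code):

```python
def sliding_windows(size: int, window: int, stride: int) -> list[int]:
    """Start offsets for sliding a `window`-px crop over an axis of `size` px.

    The last window is shifted left so it ends exactly at the image border,
    matching common sliding-window evaluation for semantic segmentation.
    """
    if window >= size:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] + window < size:
        starts.append(size - window)  # cover the right/bottom edge
    return starts
```

For example, a 512 px side with 224 px windows at stride 224 gives offsets `[0, 224, 288]`, so the last tile overlaps the previous one rather than running past the border.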

My results:
224px: 35.61 mIOU
512px: 35.58 mIOU

Also, would you be willing to update the bibtex for your reference for AM-RADIO to

@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}

I nearly missed your paper because it didn't show up in my "Cited By" section, I think because the citation wasn't complete, and I was thrilled to see your work building in the agglomerative direction.

@elicassion
Contributor

elicassion commented Aug 8, 2024

Hi @mranzinger

Thanks! We think RADIO is great work as well.

Input:
Your usage is correct. We use huggingface's input processor, which accepts:

  1. list[np.ndarray] or list[PIL.Image], channel last, uint8
  2. torch.Tensor in (*, H, W, C) or (*, C, H, W), uint8

All of these take values in 0-255.
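As a concrete sketch, coercing a float image into that expected uint8 range might look like this (the `to_model_input` helper is hypothetical; in practice the huggingface processor handles this, and the assumption here is a float image scaled to [0, 1]):

```python
import numpy as np

def to_model_input(img) -> np.ndarray:
    """Coerce an image-like array to uint8 in [0, 255] (channel-last assumed).

    Illustrative helper only; the real preprocessing is done by the
    huggingface image processor, which expects uint8 inputs as listed above.
    """
    arr = np.asarray(img)
    if arr.dtype == np.uint8:
        return arr
    # Assume a float image in [0, 1] and rescale to [0, 255].
    return (np.clip(arr, 0.0, 1.0) * 255.0).round().astype(np.uint8)
```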

Resolution:
The interpolation method you found is correct (it comes from huggingface's implementation). However, it may not work ideally, since we did not train on resolutions larger than 224; we target robot learning tasks rather than dense prediction tasks. Thanks for sharing the ADE20K results. As shown in the RADIO paper, CPE may give improvements on segmentation tasks, and more training images could also help.
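For readers curious what `interpolate_pos_encoding=True` does conceptually, here is a minimal numpy sketch of resizing an (H, W, D) positional-embedding grid with separable linear interpolation (illustrative only; the real logic lives in huggingface's ViT implementation and operates on the flattened patch sequence):

```python
import numpy as np

def interp_pos_grid(pos: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Resize a (H, W, D) positional-embedding grid to (new_h, new_w, D).

    Uses separable linear interpolation via np.interp, a rough stand-in
    for the bilinear/bicubic resize done inside ViT implementations.
    """
    h, w, d = pos.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    # Interpolate along rows first, then along columns, per embedding dim.
    tmp = np.empty((new_h, w, d))
    for j in range(w):
        for k in range(d):
            tmp[:, j, k] = np.interp(ys, np.arange(h), pos[:, j, k])
    out = np.empty((new_h, new_w, d))
    for i in range(new_h):
        for k in range(d):
            out[i, :, k] = np.interp(xs, np.arange(w), tmp[i, :, k])
    return out
```

Since the model never saw embeddings interpolated to larger grids during training, the resized positions are out of distribution, which is why results at 512 px are not guaranteed to improve.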

BibTex:
We would love to update the bibtex, and congratulations on the CVPR publication. Our draft was finished before the CVPR conference.

@mranzinger
Author

Okay, great. Thank you. Given that the mIOU was roughly the same at both resolutions, your model seems reasonably resilient to changes in resolution. I'll keep experimenting with it.

Something I definitely learned from your work is that we should have considered a regular ViT as a teacher. I did not expect it to be so important, but you showed how valuable it is. When we started RADIO, we had no idea how difficult SAM was going to be for us. We figured "hey, it should help with segmentation," and then spent a long time working out how to integrate it without it poisoning the model.

Based on the format of your paper, are you targeting ICLR?
