
Usage Question #5

Open
mranzinger opened this issue Aug 8, 2024 · 2 comments

Comments

@mranzinger

Hello, excellent work!

In the readme, I don't see any reference to how inputs need to be transformed before usage. Crawling through the code, I found this:
https://github.com/bdaiinstitute/theia/blob/main/src/theia/models/backbones.py#L337-L338

So, it suggests to me that the right way to use the model is to pass it an input tensor with values between 0 and 255. Is that the correct usage?

Also, do you have any studies on the resolution interpolation ability of your model? I'm testing it out in an ADE20k semantic segmentation linear probe harness with the following:

```python
# x.shape = (2, 3, 512, 512)
features = self.base_model.forward_feature(x, do_resize=False, interpolate_pos_encoding=True)
```

just so that it matches our settings for AM-RADIO. I've also tried it with 224px resolution. In both cases, I'm using a sliding window.
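For context, the sliding-window tiling follows the usual pattern for segmentation evaluation: stride across each axis and shift the final window so it ends at the image border. A minimal sketch (`sliding_windows` is an illustrative helper, not the actual harness code):

```python
def sliding_windows(size: int, window: int, stride: int) -> list[int]:
    """Start offsets for sliding a `window`-px crop over an axis of `size` px.

    The last window is shifted left so it ends exactly at the image border,
    matching common sliding-window evaluation for semantic segmentation.
    """
    if window >= size:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] + window < size:
        starts.append(size - window)  # cover the right/bottom edge
    return starts
```

For example, a 512 px side with 224 px windows at stride 224 gives offsets `[0, 224, 288]`, so the last tile overlaps the previous one rather than running past the border.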

My results:
224px: 35.61 mIOU
512px: 35.58 mIOU

Also, would you be willing to update the bibtex for your reference for AM-RADIO to

@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}

I nearly missed your paper because it didn't show up in my "Cited By" section, I think because the citation wasn't complete, and I was thrilled to see your work building in the agglomerative direction.

@elicassion
Contributor

elicassion commented Aug 8, 2024

Hi @mranzinger

Thanks! We think RADIO is great work as well.

Input:
Your usage is correct. We use huggingface's input processor, which accepts:

  1. list[np.ndarray] or list[PIL.Image], channel last, uint8
  2. torch.Tensor in (*, H, W, C) or (*, C, H, W), uint8

All of these take values in 0-255.
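As a concrete sketch, coercing a float image into that expected uint8 range might look like this (the `to_model_input` helper is hypothetical; in practice the huggingface processor handles this, and the assumption here is a float image scaled to [0, 1]):

```python
import numpy as np

def to_model_input(img) -> np.ndarray:
    """Coerce an image-like array to uint8 in [0, 255] (channel-last assumed).

    Illustrative helper only; the real preprocessing is done by the
    huggingface image processor, which expects uint8 inputs as listed above.
    """
    arr = np.asarray(img)
    if arr.dtype == np.uint8:
        return arr
    # Assume a float image in [0, 1] and rescale to [0, 255].
    return (np.clip(arr, 0.0, 1.0) * 255.0).round().astype(np.uint8)
```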

Resolution:
The interpolation method you found is correct (it comes from huggingface's implementation). However, it may not work ideally, since we did not train on resolutions larger than 224; we target robot learning tasks rather than dense prediction tasks. Thanks for sharing the ADE20K results. As shown in the RADIO paper, CPE may give improvements on segmentation tasks, and more training images could also help.
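For readers curious what `interpolate_pos_encoding=True` does conceptually, here is a minimal numpy sketch of resizing an (H, W, D) positional-embedding grid with separable linear interpolation (illustrative only; the real logic lives in huggingface's ViT implementation and operates on the flattened patch sequence):

```python
import numpy as np

def interp_pos_grid(pos: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Resize a (H, W, D) positional-embedding grid to (new_h, new_w, D).

    Uses separable linear interpolation via np.interp, a rough stand-in
    for the bilinear/bicubic resize done inside ViT implementations.
    """
    h, w, d = pos.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    # Interpolate along rows first, then along columns, per embedding dim.
    tmp = np.empty((new_h, w, d))
    for j in range(w):
        for k in range(d):
            tmp[:, j, k] = np.interp(ys, np.arange(h), pos[:, j, k])
    out = np.empty((new_h, new_w, d))
    for i in range(new_h):
        for k in range(d):
            out[i, :, k] = np.interp(xs, np.arange(w), tmp[i, :, k])
    return out
```

Since the model never saw embeddings interpolated to larger grids during training, the resized positions are out of distribution, which is why results at 512 px are not guaranteed to improve.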

BibTex:
We would love to update the bibtex, and congratulations on the CVPR publication. Our draft was finished before the CVPR conference.

@mranzinger
Author

Okay, great. Thank you. Given that the mIOU was roughly the same at both resolutions, your model seems reasonably resilient to changes in resolution. I'll keep experimenting with it.

Something I definitely learned from your work is that we should have considered a regular ViT as a teacher. I did not expect it to be so important, but you showed how valuable it is. When we started RADIO, we had no idea how difficult SAM was going to be for us. We figured "hey, it should help with segmentation," and then spent a long time working out how to integrate it without it poisoning the model.

Based on the format of your paper, are you targeting ICLR?
