Fine-tuning Grounded Conversation Generation (GCG) Task #52

Open
hungnh1125 opened this issue Jun 6, 2024 · 4 comments

hungnh1125 commented Jun 6, 2024

Thanks for your amazing work. I would like to reproduce the results for the Grounded Conversation Generation task.
Could you please clarify the following?

  1. In your design (Figure 2 in the GLaMM paper), did you fine-tune just the LLM, or did you fine-tune the Region Encoder, LLM, and Pixel Decoder as well?
  2. How much time did you spend fine-tuning? I attempted to fine-tune, and 10 epochs took 50 hours to finish.
mmaaz60 (Member) commented Jun 15, 2024

Hi @hungnh1125,

I appreciate your interest in our work. Please note that during training, the global image encoder and grounding image encoder are kept frozen, while the region encoder, the projection layers (V-L and L-P), and the pixel decoder are fully fine-tuned; the LLM is LoRA fine-tuned with alpha = 8.
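For concreteness, here is a minimal sketch of that trainable/frozen split using Hugging Face peft. Only the frozen/trained split and alpha = 8 come from the paragraph above; the module names (global_enc, grounding_enc, region_enc, projections, pixel_decoder, llm), the LoRA rank, and the target modules are placeholder assumptions, not GLaMM's actual training code.

```python
# Minimal sketch, not the repository's actual training script.
from peft import LoraConfig, get_peft_model

def setup_trainable_parameters(model):
    # Keep the global and grounding image encoders frozen.
    model.global_enc.requires_grad_(False)      # placeholder module name
    model.grounding_enc.requires_grad_(False)   # placeholder module name

    # Fully fine-tune the region encoder, the V-L/L-P projections, and the pixel decoder.
    for module in (model.region_enc, model.projections, model.pixel_decoder):
        module.requires_grad_(True)

    # LoRA fine-tune the LLM with alpha = 8 (rank and target modules are assumed here).
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_config)
    return model
```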

The training instructions are provided in this readme. Please note that it took us around 20 hours to run GCG fine-tuning on 8 NVIDIA A100-40GB GPUs.

I hope this answers your questions. Good luck, and let me know if you have any further questions.

@hungnh1125 (Author)

@mmaaz60 Thank you so much for replying.
Could you explain why, in your function _create_seg_token_mask, you add torch.zeros((mask.shape[0], 575)).bool().cuda() and torch.zeros((mask.shape[0], 1)).bool().cuda()? 575 and 1 are specific values, and I don't understand why you use them.

Thank you so much.

mmaaz60 (Member) commented Jun 22, 2024

Hi @hungnh1125,

Thank you for your interest in our work. This is because we use images at 336x336 resolution with a patch size of 14.
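
For reference, 336 / 14 = 24, so each image contributes 24 × 24 = 576 patch embeddings to the sequence. Below is a rough sketch of the bookkeeping for a LISA-style _create_seg_token_mask; the alignment comments are an interpretation and the exact details should be checked against the code.

```python
import torch

IMAGE_SIZE = 336
PATCH_SIZE = 14
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 24 * 24 = 576 image tokens per image

def create_seg_token_mask(input_ids, seg_token_idx):
    # Mask over the text tokens, with the first token dropped (shape [B, L - 1]).
    mask = input_ids[:, 1:] == seg_token_idx
    # The single <image> placeholder in input_ids is expanded into NUM_PATCHES
    # patch embeddings inside the LLM, so the sequence grows by 576 - 1 = 575
    # positions. Padding 575 False values in front and one at the end makes the
    # mask length (L - 1) + 575 + 1 = L + 575 match the expanded sequence.
    zeros = lambda n: torch.zeros((mask.shape[0], n), dtype=torch.bool, device=mask.device)
    return torch.cat([zeros(NUM_PATCHES - 1), mask, zeros(1)], dim=1)
```

Under that reading, 575 and 1 are not independent constants: 575 is the number of patch embeddings minus the one placeholder position they replace, and the trailing 1 compensates for the [:, 1:] slice, so with a different resolution, patch size, or multiple images they would change with the total number of patch embeddings inserted into the sequence.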

hungnh1125 (Author) commented Jun 23, 2024

@mmaaz60 I checked and found that the image embedding has size [576, 4096]. Could you explain why you didn't add torch.zeros((mask.shape[0], 576)).bool().cuda() before the mask? What is the meaning of adding only torch.zeros((mask.shape[0], 575)).bool().cuda() before the mask and torch.zeros((mask.shape[0], 1)).bool().cuda() after it?
I want to extend your model to work with multiple images, so I need to understand the meaning of each parameter.

Thank you so much.
