Fine-tuning Grounded Conversation Generation (GCG) Task #52

Open
hungnh1125 opened this issue Jun 6, 2024 · 4 comments

hungnh1125 commented Jun 6, 2024

Thanks for your amazing work. I would like to reproduce the results for the Grounded Conversation Generation task.
Could you please clarify the following?

  1. In your design (Figure 2 in the GLaMM paper), did you fine-tune just the LLM, or did you fine-tune the Region Encoder, LLM, and Pixel Decoder as well?
  2. How much time did you spend fine-tuning? I attempted to fine-tune, and 10 epochs took 50 hours to finish.
mmaaz60 (Member) commented Jun 15, 2024

Hi @hungnh1125,

I appreciate your interest in our work. Please note that during training, the global image encoder and grounding image encoder are kept frozen, while the region encoder, the projection layers (V-L and L-P), and the pixel decoder are fully fine-tuned; the LLM is LoRA fine-tuned with alpha = 8.
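For concreteness, here is a minimal sketch of that trainable/frozen split using Hugging Face peft. Only the frozen/trained split and alpha = 8 come from the paragraph above; the module names (global_enc, grounding_enc, region_enc, projections, pixel_decoder, llm), the LoRA rank, and the target modules are placeholder assumptions, not GLaMM's actual training code.

```python
# Minimal sketch, not the repository's actual training script.
from peft import LoraConfig, get_peft_model

def setup_trainable_parameters(model):
    # Keep the global and grounding image encoders frozen.
    model.global_enc.requires_grad_(False)      # placeholder module name
    model.grounding_enc.requires_grad_(False)   # placeholder module name

    # Fully fine-tune the region encoder, the V-L/L-P projections, and the pixel decoder.
    for module in (model.region_enc, model.projections, model.pixel_decoder):
        module.requires_grad_(True)

    # LoRA fine-tune the LLM with alpha = 8 (rank and target modules are assumed here).
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_config)
    return model
```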

The training instructions are provided in this readme. Please note that it took us around 20 hours to run GCG fine-tuning on 8 NVIDIA A100-40GB GPUs.

I hope this answers your questions. Good luck, and let me know if you have any further questions.

@hungnh1125 (Author)

@mmaaz60 Thank you so much for replying.
Could you explain why, in your function _create_seg_token_mask, you add torch.zeros((mask.shape[0], 575)).bool().cuda() and torch.zeros((mask.shape[0], 1)).bool().cuda()? 575 and 1 are specific values, and I don't understand why you use them.

Thank you so much.

mmaaz60 (Member) commented Jun 22, 2024

Hi @hungnh1125,

Thank you for your interest in our work. This is because we use images at 336x336 resolution with a patch size of 14.
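
For reference, 336 / 14 = 24, so each image contributes 24 × 24 = 576 patch embeddings to the sequence. Below is a rough sketch of the bookkeeping for a LISA-style _create_seg_token_mask; the alignment comments are an interpretation and the exact details should be checked against the code.

```python
import torch

IMAGE_SIZE = 336
PATCH_SIZE = 14
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 24 * 24 = 576 image tokens per image

def create_seg_token_mask(input_ids, seg_token_idx):
    # Mask over the text tokens, with the first token dropped (shape [B, L - 1]).
    mask = input_ids[:, 1:] == seg_token_idx
    # The single <image> placeholder in input_ids is expanded into NUM_PATCHES
    # patch embeddings inside the LLM, so the sequence grows by 576 - 1 = 575
    # positions. Padding 575 False values in front and one at the end makes the
    # mask length (L - 1) + 575 + 1 = L + 575 match the expanded sequence.
    zeros = lambda n: torch.zeros((mask.shape[0], n), dtype=torch.bool, device=mask.device)
    return torch.cat([zeros(NUM_PATCHES - 1), mask, zeros(1)], dim=1)
```

Under that reading, 575 and 1 are not independent constants: 575 is the number of patch embeddings minus the one placeholder position they replace, and the trailing 1 compensates for the [:, 1:] slice, so with a different resolution, patch size, or multiple images they would change with the total number of patch embeddings inserted into the sequence.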

hungnh1125 (Author) commented Jun 23, 2024

@mmaaz60 I checked and found that the image embedding has size [576, 4096]. Could you explain why you didn't add torch.zeros((mask.shape[0], 576)).bool().cuda() before the mask? What is the meaning of adding only torch.zeros((mask.shape[0], 575)).bool().cuda() before the mask and torch.zeros((mask.shape[0], 1)).bool().cuda() after it?
I want to extend your model to work with multiple images, so I need to understand the meaning of each parameter.

Thank you so much.
