Can your model be fed with multiple images at once, such as different frames of a video? Or can it be modified so that the input to the language model is the tokens of multiple images at once?
Thank you for your interest in our work. The current GLaMM model is designed to work with a single image only. However, it could be modified to accept multiple images. On the LLM side, this would be relatively simple: we can treat the multiple images as video frames and concatenate their tokens. On the grounding side, we would likely need to introduce special tokens to indicate whether a generated <seg> token refers to the first or the second image. Alternatively, we could design a segmentation encoder-decoder architecture that works with multiple images directly.
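To illustrate the LLM-side idea, here is a minimal sketch of concatenating per-frame visual tokens with one special frame-marker embedding per image (e.g. a hypothetical `<img_1>`, `<img_2>`). All names, shapes, and the marker scheme are illustrative assumptions, not part of the GLaMM codebase:

```python
import numpy as np

# Hypothetical sketch: feed multiple images to the LLM by concatenating
# per-frame visual tokens, prefixing each frame with a special marker
# embedding so the model (and a <seg>-style grounding head) can tell
# which frame a token belongs to. Shapes are illustrative only.
NUM_FRAMES = 2
TOKENS_PER_IMAGE = 256   # e.g. a 16x16 grid of patch tokens per image
HIDDEN_DIM = 64          # LLM embedding width (kept small for the demo)

rng = np.random.default_rng(0)

# Stand-ins for the vision encoder + projector output, one per frame.
frame_tokens = [rng.standard_normal((TOKENS_PER_IMAGE, HIDDEN_DIM))
                for _ in range(NUM_FRAMES)]

# One special frame-index embedding per image (e.g. <img_1>, <img_2>),
# analogous to the special tokens suggested for disambiguating <seg>.
frame_markers = [rng.standard_normal((1, HIDDEN_DIM))
                 for _ in range(NUM_FRAMES)]

# Interleave: <img_1> tokens_1 <img_2> tokens_2 ...
sequence = np.concatenate(
    [np.concatenate([marker, tokens], axis=0)
     for marker, tokens in zip(frame_markers, frame_tokens)],
    axis=0)

print(sequence.shape)  # (NUM_FRAMES * (TOKENS_PER_IMAGE + 1), HIDDEN_DIM)
```

The resulting sequence would then be fed to the LLM in place of the single-image token span; the grounding decoder would still need a way to map each `<seg>` back to the correct frame.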
Please do share if you make any progress in this interesting research direction. Good luck!