diff --git a/images/figures/1-architecture.png b/images/figures/1-architecture.png new file mode 100644 index 0000000..ad9bca6 Binary files /dev/null and b/images/figures/1-architecture.png differ diff --git a/images/figures/audio-qual.png b/images/figures/audio-qual.png new file mode 100644 index 0000000..65f992e Binary files /dev/null and b/images/figures/audio-qual.png differ diff --git a/images/figures/comparison-prev_versions.png b/images/figures/comparison-prev_versions.png new file mode 100644 index 0000000..7052741 Binary files /dev/null and b/images/figures/comparison-prev_versions.png differ diff --git a/images/figures/grounding-qual.png b/images/figures/grounding-qual.png new file mode 100644 index 0000000..ffcd9c4 Binary files /dev/null and b/images/figures/grounding-qual.png differ diff --git a/images/figures/teaser.png b/images/figures/teaser.png new file mode 100644 index 0000000..9eb96ba Binary files /dev/null and b/images/figures/teaser.png differ diff --git a/images/logos/IVAL_logo.png b/images/logos/IVAL_logo.png new file mode 100644 index 0000000..5cd6523 Binary files /dev/null and b/images/logos/IVAL_logo.png differ diff --git a/images/logos/MBZUAI_logo.png b/images/logos/MBZUAI_logo.png new file mode 100644 index 0000000..1aededc Binary files /dev/null and b/images/logos/MBZUAI_logo.png differ diff --git a/images/logos/Oryx_logo.png b/images/logos/Oryx_logo.png new file mode 100644 index 0000000..745cbdf Binary files /dev/null and b/images/logos/Oryx_logo.png differ diff --git a/images/logos/logo.png b/images/logos/logo.png new file mode 100644 index 0000000..060457e Binary files /dev/null and b/images/logos/logo.png differ diff --git a/index.html b/index.html new file mode 100644 index 0000000..36b0f8d --- /dev/null +++ b/index.html @@ -0,0 +1,411 @@ + + + + + + + + + Video-LLaVA: Pixel Grounding in Large Multimodal Video Models + + + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+ Video-LLaVA_face +

Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

+
+ + + + +
+ + Muhammad Maaz, + + + Hanoona Rasheed, + + + Salman Khan, + +
+ + +
+ + Mubarak Shah, + + + Fahad S. Khan + +
+
+
+ Mohamed bin Zayed University of AI, Australian National University
+
+
+ Linköping University, University of Central Florida
+
+ + + +
+
+
+
+
+ + +
+
+
+

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose Video-LLaVA, the first video-based LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally ground objects in videos following user instructions. We evaluate Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance. Further, we propose using the open-source Vicuna LLM instead of GPT-3.5, as used in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern with the proprietary nature of GPT-3.5. Our code, pretrained models, and interactive demo will be made publicly available.

+
+
+
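To make the audio-as-text idea in the abstract concrete, here is a minimal, hypothetical sketch of how a transcript could be folded into the conversation prompt alongside the video tokens. The function name, the <video> placeholder, and the prompt format are illustrative assumptions, not the released implementation.

# Hypothetical sketch: combine the user question and the audio transcript into
# the text side of the prompt; the projected video features are injected where
# the <video> placeholder appears. Names and format are assumptions.
def build_prompt(question: str, transcript: str = "") -> str:
    audio_part = f"Audio transcript: {transcript}\n" if transcript else ""
    return f"<video>\n{audio_part}USER: {question}\nASSISTANT:"

print(build_prompt("What is the speaker announcing?",
                   "Good evening, and welcome to the nine o'clock news."))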
+ + + + + + +
+
+ +
+
+

🔥Highlights

+
+ The key contributions of this work are: + +
    +
  1. We propose Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility.
  2. By incorporating audio context, Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and well suited for scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).
  3. We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks utilize the open-source Vicuna LLM to ensure better reproducibility and transparency (a toy sketch of this evaluation setup follows this list). We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.
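As noted in contribution 3, here is a toy, hypothetical sketch of the Vicuna-based scoring idea: an open-source judge LLM is shown the question, the reference answer, and the model's prediction, and asked to return a 1-5 correctness score. The prompt wording below is an assumption for illustration, not the benchmark's exact prompt, and the call to the judge model itself is omitted.

# Hypothetical judge prompt for Vicuna-style evaluation; wording is illustrative.
def judge_prompt(question: str, reference: str, prediction: str) -> str:
    return (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Rate the predicted answer for correctness on a scale of 1 to 5 "
        "and reply with the number only."
    )

print(judge_prompt("What sport is shown?", "Gymnastics on a balance beam.",
                   "A gymnast performing on a balance beam."))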
+
+
+
+
+
+ + + + +
+
+
+

Video-LLaVA_face Video-LLaVA

+
+
+ +
+
+
+
+

+ +

+
+
+
+ +
+
+
+ +
Overview of the Video-LLaVA architecture, showcasing the integration of a CLIP-based visual encoder with a multimodal language model for video understanding. The CLIP visual encoder extracts spatio-temporal features from videos by averaging frame-level features across temporal and spatial dimensions. These features are then projected into the LLM's input space using a learnable Multi-Layer Perceptron (MLP). The system features a grounding module for spatially locating textual descriptions within video frames, a class-agnostic object tracker, and an entity-matching module. Audio processing incorporates Voice Activity Detection, phoneme modeling, and Whisper-based audio transcription, culminating in a multimodal pipeline that facilitates robust video question answering. The architecture is trained on a hybrid dataset of video instructions, enabling the handling of diverse conversational contexts with high accuracy.
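To make the feature-aggregation step concrete, below is a minimal PyTorch sketch of averaging frame-level CLIP features across the temporal and spatial dimensions and projecting them into the LLM's input space with a learnable MLP. Tensor sizes are purely illustrative, and this is an assumption-laden illustration rather than the released implementation.

# Sketch of spatio-temporal pooling and MLP projection (illustrative sizes only).
import torch
import torch.nn as nn

T, P, D, d_llm = 8, 256, 1024, 4096            # frames, patches, CLIP dim, LLM dim (not the paper's exact config)
frame_feats = torch.randn(T, P, D)             # frame-level CLIP patch features

per_patch = frame_feats.mean(dim=0)            # average across frames  -> (P, D)
per_frame = frame_feats.mean(dim=1)            # average across patches -> (T, D)
video_tokens = torch.cat([per_frame, per_patch], dim=0)   # (T + P, D) spatio-temporal tokens

# Learnable MLP projecting visual tokens into the LLM's input embedding space.
projector = nn.Sequential(nn.Linear(D, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))
llm_inputs = projector(video_tokens)           # (T + P, d_llm), prepended to the text embeddings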
+
+
+
+
+
+ + + +
+
+
+

Qualitative Results: Video Grounding

+
+
+ +
+ + +
+
+
+ +
Visual representation of the grounding capability built into Video-LLaVA's advanced video-conversational abilities. The highlighted regions in each video frame indicate the model's ability to identify and spatially locate key subjects mentioned in the textual description, such as the giraffe, the statue, and the gymnast on a balance beam.
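For intuition, the following is a deliberately simplified, hypothetical stand-in for the entity-matching step: referred phrases from the model's answer are matched against the labels of tracked objects, and the matched tracklet's boxes are returned. The real module relies on vision-language matching rather than the plain string similarity used here; the tracklet labels and box coordinates are made up for the example.

# Simplified, hypothetical grounding step: match a referred phrase to a tracklet
# by string similarity over labels (the real system uses vision-language matching).
from difflib import SequenceMatcher

tracklets = {                              # tracklet id -> (label, per-frame boxes)
    0: ("giraffe", [(40, 60, 220, 400)]),  # made-up box coordinates
    1: ("statue",  [(300, 80, 420, 390)]),
}

def ground_phrase(phrase: str):
    score = lambda label: SequenceMatcher(None, phrase.lower(), label).ratio()
    best_id = max(tracklets, key=lambda i: score(tracklets[i][0]))
    return best_id, tracklets[best_id][1]

print(ground_phrase("the tall giraffe"))   # -> (0, [(40, 60, 220, 400)])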
+
+
+
+
+
+ + + +
+
+
+

Qualitative Results: Including Audio Modality

+
+
+ +
+ + +
+
+
+ +
The figure illustrates the integrated audio processing pipeline that augments video question answering with audio cues. The side-by-side comparisons show how audio cues provide additional context, leading to a more accurate interpretation of the video content.
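For reference, a minimal sketch of the transcription step using the open-source Whisper package is shown below. The full pipeline also applies voice activity detection and phoneme-level filtering, which are omitted here; the model size and file name are arbitrary choices for the example.

# Minimal transcription sketch with openai-whisper (pip install openai-whisper).
# VAD and phoneme-based filtering from the full pipeline are omitted here.
import whisper

model = whisper.load_model("base")           # model size chosen arbitrarily
result = model.transcribe("news_clip.wav")   # hypothetical local audio file
transcript = result["text"].strip()

# The transcript is then placed into the conversation prompt alongside the
# projected video tokens (see the prompt sketch earlier on this page).
print(transcript)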
+
+
+
+
+
+ + + +
+
+
+

Video-ChatGPT vs Video-LLaVA

+
+
+ +
+ +
+
+
+ +
+ Qualitative analysis of video descriptions generated by Video-ChatGPT, Video-LLaVA (7B), and Video-LLaVA (13B) models. The evolution in model performance is evident, with enhancements in the accuracy of information, richness of descriptive detail, and alignment with the video’s context and sequence of events as we move from the baseline Video-ChatGPT to the more advanced Video-LLaVA (13B) model. +
+
+
+
+
+
+ + + + + + +
+
+

Acknowledgement

+

+ This website is adapted from Nerfies, licensed under a Creative + Commons Attribution-ShareAlike 4.0 International License. +

+
+
+ + + + + +
+ + IVAL Logo + + + Oryx Logo + + + MBZUAI Logo + +
diff --git a/static/css/index.css b/static/css/index.css new file mode 100644 index 0000000..338dbdd --- /dev/null +++ b/static/css/index.css @@ -0,0 +1,159 @@ +body { + font-family: 'Noto Sans', sans-serif; + } + + + .footer .icon-link { + font-size: 25px; + color: #000; + } + + .link-block a { + margin-top: 5px; + margin-bottom: 5px; + } + + .dnerf { + font-variant: small-caps; + } + + + .teaser .hero-body { + padding-top: 0; + padding-bottom: 3rem; + } + + .teaser { + font-family: 'Google Sans', sans-serif; + } + + + .publication-title { + } + + .publication-banner { + max-height: parent; + + } + + .publication-banner video { + position: relative; + left: auto; + top: auto; + transform: none; + object-fit: fit; + } + + .publication-header .hero-body { + } + + .publication-title { + font-family: 'Google Sans', sans-serif; + } + + .publication-authors { + font-family: 'Google Sans', sans-serif; + } + + .publication-venue { + color: #555; + width: fit-content; + font-weight: bold; + } + + .publication-awards { + color: #ff3860; + /* width: fit-content; */ + font-weight: bolder; + } + + .title + .publication-authors, + .subtitle + .publication-authors { + margin-top: -1.25rem; + } + + .publication-authors a { + color: hsl(204, 86%, 53%) !important; + } + + .publication-authors a:hover { + text-decoration: underline; + } + + .author-block { + display: inline-block; + } + + .publication-banner img { + } + + .publication-authors { + /*color: #4286f4;*/ + } + + .publication-video { + position: relative; + width: 100%; + height: 0; + padding-bottom: 56.25%; + + overflow: hidden; + border-radius: 10px !important; + } + + .publication-video iframe { + position: absolute; + top: 0; + left: 0; + width: 100%; + height: 100%; + } + + .publication-body img { + } + + .results-carousel { + overflow: hidden; + } + + .results-carousel .item { + margin: 5px; + overflow: hidden; + border: 1px solid #bbb; + border-radius: 10px; + padding: 0; + font-size: 0; + } + + .results-carousel video { + margin: 0; + } + + + .interpolation-panel { + background: #f5f5f5; + border-radius: 10px; + } + + .interpolation-panel .interpolation-image { + width: 100%; + border-radius: 5px; + } + + .interpolation-video-column { + } + + .interpolation-panel .slider { + margin: 0 !important; + } + + .interpolation-panel .slider { + margin: 0 !important; + } + + #interpolation-image-wrapper { + width: 100%; + } + #interpolation-image-wrapper img { + border-radius: 5px; + } \ No newline at end of file