
Multimodal LLMs: Unification, Efficiency, Interpretability

Friday, October 13, 2023


Dr. Mohit Bansal

Friday, Oct. 13, 2023 at 10 a.m. CT

Join via Zoom (click or copy and paste the link): https://auburn.zoom.us/j/81706339239



In this talk, I will present our journey toward large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts, etc.), and our work on enhancing important aspects such as unification, efficiency, and interpretability.

We will start by discussing early cross-modal vision-and-language pretraining models (LXMERT) and visually grounded text models with image/video knowledge distillation (Vokenization, VidLanKD). We will then present early unified models (VL-T5) that combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating them all as text generation. We will also look at recent unified models (with joint objectives and architectures) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), and composable any-to-any multimodal generation (CoDi).

Second, we will look at further parameter/memory efficiency via adapter (VL-Adapter), ladder side-tuning (LST), sparse sampling (ClipBERT), and audio replacement (ECLIPSE) methods.

I will conclude with interpretability and evaluation aspects of image generation models, based on fine-grained skill and bias evaluation (DALL-Eval) and on interpretable and controllable visual programs (VPGen+VPEval).
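The adapter idea mentioned above for parameter efficiency can be sketched as a small residual bottleneck module inserted into each frozen transformer layer, so that only the adapter weights are trained. The sketch below is a minimal illustration of that general pattern; the dimensions, initialization, and helper names are illustrative assumptions, not the actual VL-Adapter configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_adapter(d_model: int, bottleneck: int) -> dict:
    """Initialize a bottleneck adapter: a down-projection and an up-projection.
    (Hypothetical helper for illustration; sizes are assumptions.)"""
    return {
        "W_down": rng.normal(0, 0.02, size=(d_model, bottleneck)),
        "W_up": rng.normal(0, 0.02, size=(bottleneck, d_model)),
    }

def adapter_forward(h: np.ndarray, params: dict) -> np.ndarray:
    """Residual bottleneck: h + up_project(relu(down_project(h)))."""
    z = np.maximum(h @ params["W_down"], 0.0)  # down-project + ReLU
    return h + z @ params["W_up"]              # up-project + residual add

d_model, bottleneck = 768, 48                  # assumed sizes for illustration
params = make_adapter(d_model, bottleneck)

h = rng.normal(size=(4, d_model))              # 4 token hidden states
out = adapter_forward(h, params)

# Only the adapter weights would be updated during fine-tuning; the
# backbone stays frozen, so the trainable-parameter count is small.
trainable = params["W_down"].size + params["W_up"].size
print(out.shape, trainable)
```

Because the bottleneck width is much smaller than the model width, the trainable parameters per layer (here 2 × 768 × 48 = 73,728) are a small fraction of a full transformer layer's weights, which is the source of the memory savings.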


More information on the Fall Forums can be found on Auburn’s website.