Zhengyuan Yang

[2025 H1] A holistic view of our recent NeurIPS, ICCV, and ICML works on multimodal reasoning and agentic models:

Effective Visual-Centric Reasoning as the Foundation: ViCrit: RL task for incentivizing perception; ThinkLite-VL: RL sample selection; Point-RFT; Vision Value Model; ImageGen-CoT; Editing as CoT.
Multi-Turn Agentic RL Training with Vision Tools: RAGEN; VAGEN; OpenThinkIMG.
Benchmarks for New Requirements on Models: Perceive all visual details: ViCrit-Bench; Synergize visual and textual reasoning: EMMA; Spatial Intelligence: SITE; Agentic eval in interactive environments: V-MAGE; Multi-faceted video reasoning: MMWorld.

[2025/09] Five papers accepted to NeurIPS 2025: (1) ViCrit: a challenging yet verifiable RL task for incentivizing visual perception; (2) ThinkLite-VL: scaling sample selection for effective RL; (3) Point-RFT: grounded CoT scales better in RL; (4) VAGEN: reinforcing visual state reasoning for multi-turn VLM agents; (5) OLA-VLM: distilling visual tokens for better perception.

[2025/08] Two papers accepted to EMNLP 2025: (1) GLIMPSE: Do Large Vision-Language Models Truly Think With Videos; (2) Audio-Aware Large Language Models as Judges for Speaking Styles.

[2025/06] Three papers accepted to ICCV 2025: (1) Vision Value Model (VisVM), for guiding VLM inference-time search; (2) ImageGen-CoT, for reasoning in visual generation; (3) SITE, a spatial intelligence benchmark.

[2025/06] Check out our CVPR 2025 Tutorial on "Recent Advances in Vision Foundation Models".

[2025/05] Two papers accepted to ICML 2025: (1) ReFocus, using image tools to better think for structured image understanding; (2) EMMA, an enhanced multimodal reasoning benchmark (Oral presentation).

[2025/02] Two papers accepted to CVPR 2025: (1) ShowUI for GUI visual agent, (2) LiVOS for light video object segmentation.

[2025/01] Five papers accepted to ICLR 2025: (1) SlowFast-VGen for dual-speed action-driven video generation, (2) PSO for tuning timestep-distilled diffusion models, (3) GenXD for 3D and 4D scene generation, (4) MMWorld for world model evaluation in videos, (5) EditRoom for composable 3D room layout editing.

[2025/01] Design2Code accepted to NAACL 2025.

[2024/12] ShowUI received the Outstanding Paper Award at NeurIPS 2024 Open-World Agents workshop.

[2024/09] Three papers accepted to NeurIPS 2024: (1) Motion Consistency Model for video diffusion distillation, (2) What can Foundation Models' Embeddings do?, (3) VideoGUI, benchmarking GUI automation from instructional videos.

[2024/07] Openleaf accepted to ACMMM 2024 BNI as an Oral presentation. List Items One by One accepted to COLM 2024.

[2024/07] Three papers accepted to ECCV 2024: (1) Idea2Img, an LMM-based agent system for visual design and creation, (2) GRiT, a general and open-set object understanding framework, (3) IDOL, joint video-depth generation for human dance videos.

[2024/06] Check out our CVPR 2024 Tutorial on "Recent Advances in Vision Foundation Models". Slides and recordings are now available.

[2024/06] I will serve as an Area Chair for EMNLP 2024, and an SPC member for AAAI 2025.

[2024/05] Two papers accepted to ICML 2024: (1) MM-Vet, a modern evaluation benchmark for large multimodal models; (2) StrokeNUWA, generating vector graphics with LLMs.

[2024/02] Four papers accepted to CVPR 2024: (1) MM-Narrator, audio descriptions (AD) generation with GPT-4, (2) DisCo, human dance generation with disentangled controls, (3) Tuning diffusion models towards diverse image generation, (4) MMSum, a dataset for video multimodal summarization.

[2023/12] I will serve as an Area Chair for ACMMM 2024, and an Exhibits and Demos Chair for ICME 2024. Welcome to submit your demo papers!

[2023/11] What if LMMs could interact with smartphones as humans do? Check out GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [Article]

[2023/11] How could LMMs contribute to social good? Check out GPT-4V(ision) as A Social Media Analysis Engine.

[2023/10] How might LMMs revolutionize the understanding of video and streaming content? Check out MM-Vid: Advancing Video Understanding with GPT-4V(ision).

[2023/10] How well can image generation models assist visual design? Check out DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.

[2023/10] How can LMM-based agents achieve human-like multimodal iterative exploration? Check out our initial study on a generative agent, named Idea2Img, focusing on automatic image design and generation. Thanks for the great video!

[2023/09] What are the current state and promising future directions for large multimodal models (LMMs)? Please check out our report, The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).

[2023/09] Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants. [Slides] [YouTube] [Bilibili]

[2023/08] MM-Vet is an LMM evaluation benchmark that evaluates Large Multimodal Models' integrated VL capabilities. [MM-Vet Leaderboard]

[2023/07] Two papers accepted to ICCV 2023: (1) PromptCap, prompt controlled visual captioning; (2) EQBen, a new diagnostic VLM benchmark.

[2023/06] I will serve as an SPC member for AAAI 2024.

[2023/06] Check out our CVPR 2023 Tutorial on "Recent Advances in Vision Foundation Models". Slides and recordings are available.

[2023/03] We build MM-REACT, a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.

[2023/02] ReCo, our new text-to-image model that allows precise region control through input text queries, is accepted to CVPR 2023. See a teaser here.

[2022/10] My Ph.D. thesis "Visual Grounding: Building Cross-Modal Visual-Text Alignment" wins the 2022 ACM SIGMM Award for Outstanding Ph.D. Thesis.