Zhengyuan Yang

I am currently a Senior Researcher at Microsoft. I received my Ph.D. degree in Computer Science from the University of Rochester, advised by Prof. Jiebo Luo. I received my bachelor's degree from the University of Science and Technology of China. I have received the ACM SIGMM Award for Outstanding Ph.D. Thesis, the Twitch Research Fellowship, and the ICPR 2018 Best Industry Related Paper Award. My research interests lie at the intersection of computer vision and natural language processing, including multi-modal vision-language understanding and generation.


Email  /  CV  /  GitHub  /  Google Scholar  /  LinkedIn  /  Name Pronunciation  /  Publications


  • [2024/02]   Four papers accepted to CVPR 2024: (1) MM-Narrator, audio description (AD) generation with GPT-4; (2) DisCo, human dance generation with disentangled controls; (3) tuning diffusion models towards diverse image generation; (4) MMSum, a dataset for video multimodal summarization.
  • [2023/12]   I will serve as an Area Chair for ACMMM 2024, and an Exhibits and Demos Chair for ICME 2024. You are welcome to submit your demo papers!
  • [2023/11]   How would it be if LMMs could interact with smartphones as humans do? Check out GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [Article]
  • [2023/11]   How could LMMs contribute to social good? Check out GPT-4V(ision) as A Social Media Analysis Engine.
  • [2023/10]   How might LMMs revolutionize the understanding of video and streaming content? Check out MM-Vid: Advancing Video Understanding with GPT-4V(ision).
  • [2023/10]   How well can image generation models assist visual design? Check out DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.
  • [2023/10]   How can LMM-based agents achieve human-like multimodal iterative exploration? Check out our initial study on a generative agent, named Idea2Img, focusing on automatic image design and generation. Thanks for the great video!
  • [2023/09]   What is the current state of large multimodal models (LMMs), and what are the promising future directions? Please check out The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
  • [2023/09]   Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants. [Slides] [YouTube] [Bilibili]
  • [2023/08]   MM-Vet is a benchmark that evaluates the integrated vision-language (VL) capabilities of large multimodal models (LMMs). [MM-Vet Leaderboard]
  • [2023/07]   Two papers accepted to ICCV 2023: (1) PromptCap, prompt controlled visual captioning; (2) EQBen, a new diagnostic VLM benchmark.
  • [2023/06]   I will serve as an SPC member for AAAI 2024.
  • [2023/06]   Check out our CVPR 2023 Tutorial on "Recent Advances in Vision Foundation Models". Slides and recordings are available.
  • [2023/03]   We build MM-REACT, a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.
  • [2023/03]   IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) special issue on "AI-Generated Content for Multimedia." Submission deadline: July 1st, 2023.
  • [2023/02]   ReCo is our new text-to-image model that enables precise region control through input text queries, accepted to CVPR 2023. See a teaser here.
  • [2023/01]   Prompting GPT-3 To Be Reliable accepted to ICLR 2023.
  • [2022/10]   My Ph.D. thesis "Visual Grounding: Building Cross-Modal Visual-Text Alignment" wins the 2022 ACM SIGMM Award for Outstanding Ph.D. Thesis.
  • [2022/10]   I was selected as one of the Outstanding Reviewers for ECCV 2022.
  • [2022/07]   UniTAB accepted to ECCV 2022 as an Oral presentation.
  • [2022/07]   I will serve as an SPC member for AAAI 2023.
  • [2022/06]   Check out our CVPR 2022 Tutorial on "Recent Advances in Vision-and-Language Pre-training". Slides and recordings are available.
  • [2022/05]   The new multimodal generative foundation model Florence-GIT achieves a new state of the art across 12 image/video VL tasks, including the first human parity on TextCaps. GIT achieves 88.79% ImageNet-1k accuracy using a generative scheme. See a teaser here.
  • [2022/01]   I will serve as an Associate Editor for IEEE TCSVT.

  • Selected Publications

    My current research mainly focuses on multimodal generation and understanding. Please check Google Scholar for a more complete and up-to-date publication list.


    arXiv preprints

  • Chenglei Si*, Yanzhe Zhang*, Zhengyuan Yang, Ruibo Liu, Diyi Yang, "Design2Code: How Far Are We From Automating Front-End Engineering?" [PDF][Code][Project page]
  • Zhengyuan Yang*, Linjie Li*, Kevin Lin*, Jianfeng Wang*, Chung-Ching Lin*, Zicheng Liu, Lijuan Wang, "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)." [PDF][Acknowledgments] (Exploratory work cataloguing use of GPT-4V)
  • Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, "Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation." [PDF][Code][Project page][Video]
  • An Yan*, Zhengyuan Yang*, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang, "GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation." [PDF][Code]
  • Kevin Lin*, Faisal Ahmed*, Linjie Li*, Chung-Ching Lin*, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang, "MM-Vid: Advancing Video Understanding with GPT-4V(ision)." [PDF][Project page]
  • Weihao Yu*, Zhengyuan Yang*, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang, "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities." [PDF][Code][Leaderboard]
  • Hanjia Lyu*, Jinfa Huang*, Daoan Zhang*, Yongsheng Yu*, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo, "GPT-4V(ision) as A Social Media Analysis Engine." [PDF][Code]
  • Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan, "StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis." [PDF]
  • Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo, "Bring Metric Functions into Diffusion Models." [PDF]
  • Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou, "COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training." [PDF][Project page]
  • Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang, "Interfacing Foundation Models' Embeddings." [PDF][Code]
  • Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo, "OpenLeaf: Open-Domain Interleaved Image-Text Generation and Evaluation." [PDF]
  • Kevin Lin*, Zhengyuan Yang*, Linjie Li, Jianfeng Wang, Lijuan Wang, "DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design." [PDF][Project page]
  • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal, "Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation." [PDF][Project page]
  • Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang, "GRiT: A Generative Region-to-text Transformer for Object Understanding." [PDF][Code]

  • 2024

  • Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024. [PDF][Project page]
  • Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "DisCo: Disentangled Control for Referring Human Dance Generation in Real World," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024. [PDF][Code][Project page]
  • Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, Zicheng Liu, "Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024.
  • Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, Bo Li, Lijuan Wang, "MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024. [PDF][Project page]

  • 2023

  • Chunyuan Li*, Zhe Gan*, Zhengyuan Yang*, Jianwei Yang*, Linjie Li*, Lijuan Wang, Jianfeng Gao, "Multimodal Foundation Models: From Specialists to General-Purpose Assistants," Foundations and Trends in Computer Graphics and Vision, 2023. (A survey book on multimodal foundation models) [PDF]
  • Zhengyuan Yang*, Linjie Li*, Jianfeng Wang*, Kevin Lin*, Ehsan Azarnasab*, Faisal Ahmed*, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action." [PDF][Code][Project page]
  • Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo, "PromptCap: Prompt-Guided Task-Aware Image Captioning," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023. [PDF][Code]
  • Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "Equivariant Similarity for Vision-Language Foundation Models," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023. (Oral Presentation) [PDF][Code]
  • Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "ReCo: Region-Controlled Text-to-Image Generation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, June 2023. [PDF][Code]
  • Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan, "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 2023. (Oral Presentation) [PDF][Project page]
  • Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan, "Learning 3D Photography Videos via Self-supervised Diffusion on Single Images," The 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, August 2023. [PDF]
  • Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang, "Prompting GPT-3 To Be Reliable," The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 2023. [PDF][Code]
  • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang, "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023. [PDF][Code]

  • 2022

  • Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang, "GIT: A Generative Image-to-text Transformer for Vision and Language," Transactions on Machine Learning Research (TMLR), 2022. [PDF][Code]
  • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang, "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 2022. (Oral Presentation) [PDF][Code]
  • Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang, "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning." [PDF]
  • Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang, "Scaling Up Vision-Language Pre-training for Image Captioning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, June 2022. [PDF]
  • Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo, "Cross-modal Contrastive Distillation for Instructional Activity Anticipation," International Conference on Pattern Recognition (ICPR), Montreal, Quebec, Canada, August 2022. (Oral Presentation) [PDF]
  • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang, "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA," The 36th AAAI Conference on Artificial Intelligence (AAAI), February 2022. (Oral Presentation) [PDF][Code]

  • PhD Thesis

  • Zhengyuan Yang, "Visual Grounding: Building Cross-Modal Visual-Text Alignment," University of Rochester. (ACM SIGMM Award for Outstanding Ph.D. Thesis) [PDF]

  • 2021

  • Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo, "SAT: 2D Semantics Assisted Training for 3D Visual Grounding," International Conference on Computer Vision (ICCV), Oct 2021. (Oral Presentation) [PDF][Code]
  • Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li, "TransVG: End-to-End Visual Grounding with Transformers," International Conference on Computer Vision (ICCV), Oct 2021. [PDF][Code]
  • Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo, "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. (Oral Presentation) [PDF][Code]
  • Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu, "Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. [PDF][Code]

  • 2020

  • Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo, "Improving One-stage Visual Grounding by Recursive Sub-query Construction," European Conference on Computer Vision (ECCV), Glasgow, UK, August 2020. [PDF][Code]
  • Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo, "Dynamic Context-guided Capsule Network for Multimodal Machine Translation," ACM Multimedia Conference (ACMMM), Seattle, WA, October 2020. (Oral Presentation) [PDF][Code]
  • Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo, "A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation," Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, July 2020. [PDF][Code]
  • Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo, "Grounding-Tracking-Integration," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). [PDF]

  • 2019 and Earlier

  • Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo, "A Fast and Accurate One-Stage Approach to Visual Grounding," International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. (Oral Presentation) [PDF][Code]
  • Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo, "Weakly Supervised Body Part Parsing with Pose based Part Priors," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020. [PDF] [Demo]
  • Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo, "Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020. [PDF]
  • Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo, "Attentive Relational Networks for Mapping Images to Scene Graphs," Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June 2019. [PDF]
  • Zhengyuan Yang, Yixuan Zhang, Jiebo Luo, "Human-Centered Emotion Recognition in Animated GIFs with Facial Landmarks," International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019. [PDF] [Data]
  • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). [PDF] [Data]
  • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Visual Attention on Skeleton Images," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation) [PDF]
  • Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo, "End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation) Best Industry Related Paper Award (BIRPA). [PDF] [Demo]
  • Zhengyuan Yang, Wendi Cross, Jiebo Luo, "Personalized pose estimation for body language understanding," International Conference on Image Processing (ICIP), Beijing, China, September 2017. (Oral Presentation)

  • Professional Experience

    Senior Researcher, Microsoft, Redmond, WA
    June 2021 - Current.
    Research on multimodal understanding and generation.

    Research Intern, Microsoft, Redmond, WA
    May - Aug 2020. Advisors: Yijuan Lu, Jianfeng Wang, Xi Yin.
    Project: Text-aware pre-training for Text-VQA and Text-Caption.

    Research Intern, Tencent AI Lab, Bellevue, WA
    Jan - Apr 2019. Advisors: Boqing Gong, Liwei Wang.
    Project: Visual Grounding with Natural Language Queries.

    Research Intern, SnapChat, Venice, CA
    May - Aug 2018. Advisors: Yuncheng Li, Linjie Yang, Ning Zhang.
    Project: Weakly Supervised Human Part Parsing.

    Research Intern, SAIC Innovation Center, San Jose, CA
    Jun - Aug 2017. Advisor: Jerry Yu.
    Project: Steering Angle Control with End-to-end Neural Networks.

  • Awards

  • 2022 ACM SIGMM Award for Outstanding Ph.D. Thesis
  • Winner of CVPR 2021 TextCaps Challenge
  • Winner of CVPR 2021 ReferIt3D Challenge
  • Twitch Research Fellowship
  • Best Industry Related Paper Award at ICPR 2018

  • Service

  • Outstanding Reviewer, ECCV 2022; Outstanding Reviewer, CVPR 2021
  • Exhibits and Demos Chair: IEEE International Conference on Multimedia and Expo (ICME) 2024
  • Area Chair: ACM Multimedia Conference (ACMMM) 2024
  • Senior Program Committee (SPC): 37th and 38th AAAI Conference on Artificial Intelligence (AAAI-23, AAAI-24)
  • Associate Editor: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
  • Journal Reviewer: TPAMI, IJCV, TIP, TMM, TCybernetics, TCSVT, Pattern Recognition, Neurocomputing, TBioCAS, IEEE Access.
  • Conference Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, ACL, EMNLP, AAAI, ACCV, WACV, ICME, ICIP.

  • © 2024 Zhengyuan Yang. All rights reserved.
    Template borrowed from Jon Barron. Thanks!