[2024/07]   Openleaf accepted to ACMMM 2024 BNI as an Oral presentation. List Items One by One accepted to COLM 2024.
[2024/07]   Three papers accepted to ECCV 2024: (1) Idea2Img, an LMM-based agent system for visual design and creation; (2) GRiT, a general and open-set object understanding framework; (3) IDOL, joint video-depth generation for human dance videos.
[2024/06]   I will serve as an Area Chair for EMNLP 2024, and an SPC member for AAAI 2025.
[2024/05]   Two papers accepted to ICML 2024: (1) MM-Vet, a modern evaluation benchmark for large multimodal models; (2) StrokeNUWA, generating vector graphics with LLMs.
[2024/02]   Four papers accepted to CVPR 2024: (1) MM-Narrator, audio descriptions (AD) generation with GPT-4; (2) DisCo, human dance generation with disentangled controls; (3) Tuning diffusion models towards diverse image generation; (4) MMSum, a dataset for video multimodal summarization.
[2023/12]   I will serve as an Area Chair for ACMMM 2024, and an Exhibits and Demos Chair for ICME 2024. Demo paper submissions are welcome!
[2023/03]   We build MM-REACT, a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.
[2023/03]   IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) special issue on "AI-Generated Content for Multimedia." Submission deadline: July 1st, 2023.
[2023/02]   ReCo is our new text-to-image model that allows precise region control for input text queries, accepted to CVPR 2023. See a teaser here.
[2022/07]   UniTAB accepted to ECCV 2022 as an Oral presentation.
[2022/05]   The new multimodal generative foundation model Florence-GIT achieves new state-of-the-art results across 12 image/video VL tasks, including the first human-parity result on TextCaps. GIT achieves 88.79% ImageNet-1k accuracy using a generative scheme. See a teaser here.
Selected Publications
My current research mainly focuses on multimodal generation and understanding. Please check Google Scholar for a more complete and up-to-date publication list.
arXiv preprints
Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang, "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
[PDF][Code][Project page]
Chenglei Si*, Yanzhe Zhang*, Zhengyuan Yang, Ruibo Liu, Diyi Yang, "Design2Code: How Far Are We From Automating Front-End Engineering?"
[PDF][Code][Project page]
Zhengyuan Yang*, Linjie Li*, Kevin Lin*, Jianfeng Wang*, Chung-Ching Lin*, Zicheng Liu, Lijuan Wang, "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
[PDF][Acknowledgments] (Exploratory work cataloguing use of GPT-4V)
An Yan*, Zhengyuan Yang*, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang, "GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation."
[PDF][Code]
Kevin Lin*, Faisal Ahmed*, Linjie Li*, Chung-Ching Lin*, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang, "MM-VID: Advancing Video Understanding with GPT-4V(ision)."
[PDF][Project page]
Hanjia Lyu*, Jinfa Huang*, Daoan Zhang*, Yongsheng Yu*, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo, "GPT-4V(ision) as a Social Media Analysis Engine."
[PDF][Code]
Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou, "COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training."
[PDF][Project page]
Kevin Lin*, Zhengyuan Yang*, Linjie Li, Jianfeng Wang, Lijuan Wang, "DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design."
[PDF][Project page]
2024
Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang, "Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation," The Thirty-eighth Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Dec 2024.
[PDF][Code][Project page]
Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang, "Interfacing Foundation Models' Embeddings," The Thirty-eighth Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Dec 2024.
[PDF][Code]
Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou, "VideoGUI: A Benchmark for GUI Automation from Instructional Videos," The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS DB Track), Vancouver, BC, Dec 2024. (Spotlight Presentation)
[PDF][Project page]
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang, "List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs," the 1st Conference on Language Modeling (COLM), Philadelphia, PA, October 2024.
[PDF][Code][Data]
Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, "Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation," The 18th European Conference on Computer Vision (ECCV), Milano, Italy, Sept 2024.
[PDF][Code][Project page][Video]
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang, "GRiT: A Generative Region-to-text Transformer for Object Understanding," The 18th European Conference on Computer Vision (ECCV), Milano, Italy, Sept 2024.
[PDF][Code]
Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang, "IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation," The 18th European Conference on Computer Vision (ECCV), Milano, Italy, Sept 2024.
[PDF][Code][Project page]
Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo, "Openleaf: Open-Domain Interleaved Image-Text Generation and Evaluation," ACM Multimedia Conference, Brave New Ideas track (ACMMM), Melbourne, Australia, October 2024. (Oral Presentation)
[PDF]
Weihao Yu*, Zhengyuan Yang*, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang, "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities," The 41st International Conference on Machine Learning (ICML), Vienna, Austria, July 2024.
[PDF][Code][Leaderboard]
Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan, "StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis," The 41st International Conference on Machine Learning (ICML), Vienna, Austria, July 2024.
[PDF][Code]
Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo, "Bring Metric Functions into Diffusion Models," The 33rd International Joint Conference on Artificial Intelligence (IJCAI), Jeju, August 2024.
[PDF]
Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal, "Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation," CVPR Workshop on the Evaluation of Generative Foundation Models, Seattle, WA, June 2024.
[PDF][Project page]
Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024. (Highlight Presentation)
[PDF][Project page]
Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "DisCo: Disentangled Control for Referring Human Dance Generation in Real World," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024.
[PDF][Code][Project page]
Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, Zicheng Liu, "Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024.
Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, Bo Li, Lijuan Wang, "MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 2024. (Highlight Presentation)
[PDF][Project page]
2023
Chunyuan Li*, Zhe Gan*, Zhengyuan Yang*, Jianwei Yang*, Linjie Li*, Lijuan Wang, Jianfeng Gao, "Multimodal Foundation Models: From Specialists to General-Purpose Assistants," Foundations and Trends in Computer Graphics and Vision, 2023. (A survey book on multimodal foundation models)
[PDF]
Zhengyuan Yang*, Linjie Li*, Jianfeng Wang*, Kevin Lin*, Ehsan Azarnasab*, Faisal Ahmed*, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action."
[PDF][Code][Project page]
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo, "PromptCap: Prompt-Guided Task-Aware Image Captioning," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023.
[PDF][Code]
Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "Equivariant Similarity for Vision-Language Foundation Models," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023. (Oral Presentation)
[PDF][Code]
Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "ReCo: Region-Controlled Text-to-Image Generation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, June 2023.
[PDF][Code]
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan, "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 2023. (Oral Presentation)
[PDF][Project page]
Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan, "Learning 3D Photography Videos via Self-supervised Diffusion on Single Images," The 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, August 2023.
[PDF]
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang, "Prompting GPT-3 To Be Reliable," The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 2023.
[PDF][Code]
Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang, "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
[PDF][Code]
2022
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang, "GIT: A Generative Image-to-text Transformer for Vision and Language," Transactions on Machine Learning Research (TMLR), 2022.
[PDF][Code]
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu and Lijuan Wang, "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 2022. (Oral Presentation)
[PDF][Code]
Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu and Lijuan Wang, "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning."
[PDF]
Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu and Lijuan Wang, "Scaling Up Vision-Language Pre-training for Image Captioning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, June 2022.
[PDF]
Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo, "Cross-modal Contrastive Distillation for Instructional Activity Anticipation," International Conference on Pattern Recognition (ICPR), Montreal, Quebec, Canada, August 2022. (Oral Presentation)
[PDF]
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu and Lijuan Wang, "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA," The 36th AAAI Conference on Artificial Intelligence (AAAI), February 2022. (Oral Presentation)
[PDF][Code]
PhD Thesis
Zhengyuan Yang, "Visual Grounding: Building Cross-Modal Visual-Text Alignment," University of Rochester. (ACM SIGMM Award for Outstanding Ph.D. Thesis) [PDF]
2021
Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo, "SAT: 2D Semantics Assisted Training for 3D Visual Grounding," International Conference on Computer Vision (ICCV), Oct 2021. (Oral Presentation)
[PDF][Code]
Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li, "TransVG: End-to-End Visual Grounding with Transformers," International Conference on Computer Vision (ICCV), Oct 2021.
[PDF][Code]
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo, "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. (Oral Presentation)
[PDF][Code]
Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu, "Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
[PDF][Code]
2020
Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo, "Improving One-stage Visual Grounding by Recursive Sub-query Construction," European Conference on Computer Vision (ECCV), Glasgow, UK, August 2020.
[PDF][Code]
Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo, "Dynamic Context-guided Capsule Network for Multimodal Machine Translation," ACM Multimedia Conference (ACMMM), Seattle, WA, October 2020. (Oral Presentation)
[PDF][Code]
Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo, "A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation," Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, July 2020.
[PDF][Code]
Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, Jiebo Luo, "Grounding-Tracking-Integration," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT).
[PDF]
2019 and Earlier
Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo, "A Fast and Accurate One-Stage Approach to Visual Grounding," International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. (Oral Presentation)
[PDF][Code]
Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo, "Weakly Supervised Body Part Parsing with Pose based Part Priors," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020.
[PDF][Demo]
Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo, "Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020.
[PDF]
Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo, "Attentive Relational Networks for Mapping Images to Scene Graphs," Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June 2019.
[PDF]
Zhengyuan Yang, Yixuan Zhang, Jiebo Luo, "Human-Centered Emotion Recognition in Animated GIFs with Facial Landmarks," International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019.
[PDF][Data]
Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT).
[PDF][Data]
Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Visual Attention on Skeleton Images," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation)
[PDF]
Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo, "End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation, Best Industry Related Paper Award (BIRPA))
[PDF][Demo]
Zhengyuan Yang, Wendi Cross, Jiebo Luo, "Personalized Pose Estimation for Body Language Understanding," International Conference on Image Processing (ICIP), Beijing, China, September 2017. (Oral Presentation)
Professional Experience
Principal Researcher,
Microsoft, Redmond, WA
June 2021 - Current.
Research on multimodal understanding and generation.
Research Intern,
Microsoft, Redmond, WA
May - Aug 2020. Advisors: Yijuan Lu, Jianfeng Wang, Xi Yin.
Project: Text-aware pre-training for Text-VQA and Text-Caption.
Research Intern,
Tencent AI Lab, Bellevue, WA
Jan - Apr 2019. Advisors: Boqing Gong, Liwei Wang.
Project: Visual Grounding with Natural Language Queries.
Research Intern,
SAIC Innovation Center, San Jose, CA
Jun - Aug 2017. Advisor: Jerry Yu.
Project: Steering Angle Control with End-to-end Neural Networks.