[2023/03]   We build MM-REACT, a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.
[2023/03]   IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) special issue on "AI-Generated Content for Multimedia." Submission deadline: July 1st, 2023.
[2023/02]   ReCo is our new text-to-image model that allows precise region control via the input text query; accepted to CVPR 2023. See a teaser here.
[2022/05]   The new multimodal generative foundation model Florence-GIT achieves new state-of-the-art results across 12 image/video VL tasks, including the first human-parity result on TextCaps. GIT achieves 88.79% ImageNet-1k accuracy using a generative scheme. See a teaser here.
[2022/01]   I will serve as an Associate Editor for IEEE TCSVT.
[2021/09]   Can GPT-3 benefit multimodal tasks? We provide an empirical study of GPT-3 for knowledge-based VQA, named PICa. (Selected as Oral in AAAI 2022)
[2021/07]   Two papers accepted to ICCV 2021 (The SAT paper was selected as Oral).
We propose DisCo for referring human dance generation, producing human dance images/videos with good faithfulness, generalizability, and compositionality.
EqBen explores the concept of equivariance in VLMs, focusing specifically on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks.
GRiT is a general object understanding framework that detects objects and describes them in any style of text it was trained with, e.g., class names, object attributes, actions, counts, etc.
Florence-GIT is our new multimodal generative foundation model. GIT shows a strong capability of describing entities in the wild, such as scene texts, logos, landmarks, characters, etc.
Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "DisCo: Disentangled Control for Referring Human Dance Generation in Real World." 2023.
[PDF]
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action." 2023.
[PDF]
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo, "PromptCap: Prompt-Guided Task-Aware Image Captioning," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023.
[PDF][Code]
Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "Equivariant Similarity for Vision-Language Foundation Models," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023.
[PDF][Code]
Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "ReCo: Region-Controlled Text-to-Image Generation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, June 2023.
[PDF]
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan, "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 2023. (Oral Presentation)
[PDF]
Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan, "Learning 3D Photography Videos via Self-supervised Diffusion on Single Images," The 32nd International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, August 2023.
[PDF]
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang, "Prompting GPT-3 To Be Reliable," The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 2023.
[PDF][Code]
Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang, "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
[PDF][Code]
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang, "GIT: A Generative Image-to-text Transformer for Vision and Language," Transactions on Machine Learning Research (TMLR), 2022.
[PDF][Code]
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu and Lijuan Wang, "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 2022. (Oral Presentation)
[PDF][Code]
Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu and Lijuan Wang, "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning," 2021.
[PDF]
Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu and Lijuan Wang, "Scaling Up Vision-Language Pre-training for Image Captioning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, June 2022.
[PDF]
Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo, "Cross-modal Contrastive Distillation for Instructional Activity Anticipation," International Conference on Pattern Recognition (ICPR), Montreal, Quebec, Canada, August 2022. (Oral Presentation)
[PDF]
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu and Lijuan Wang, "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA," The 36th AAAI Conference on Artificial Intelligence (AAAI), February 2022. (Oral Presentation)
[PDF]
Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo, "SAT: 2D Semantics Assisted Training for 3D Visual Grounding," International Conference on Computer Vision (ICCV), Oct 2021. (Oral Presentation)
[PDF]
Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li, "TransVG: End-to-End Visual Grounding with Transformers," International Conference on Computer Vision (ICCV), Oct 2021.
[PDF]
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo, "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. (Oral Presentation)
[PDF]
Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu, "Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
[PDF]
Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo, "Improving One-stage Visual Grounding by Recursive Sub-query Construction," European Conference on Computer Vision (ECCV), Glasgow, UK, August 2020.
[PDF][Code]
Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo, "Dynamic Context-guided Capsule Network for Multimodal Machine Translation," ACM Multimedia Conference (ACMMM), Seattle, WA, October 2020. (Oral Presentation)
[PDF][Code]
Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo, "A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation," Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, July 2020.
[PDF]
Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo, "Grounding-Tracking-Integration," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT).
[PDF]
Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo, "A Fast and Accurate One-Stage Approach to Visual Grounding," International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. (Oral Presentation)
[PDF][Code]
Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo, "Weakly Supervised Body Part Parsing with Pose based Part Priors," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020.
[PDF][Demo]
Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo, "Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2020.
[PDF]
Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo, "Attentive Relational Networks for Mapping Images to Scene Graphs," Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June 2019.
[PDF]
Zhengyuan Yang, Yixuan Zhang, Jiebo Luo, "Human-Centered Emotion Recognition in Animated GIFs with Facial Landmarks," International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019.
[PDF][Data]
Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT).
[PDF][Data]
Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Visual Attention on Skeleton Images," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation)
[PDF]
Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo, "End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation)
Best Industry Related Paper Award (BIRPA).
[PDF][Demo]
Zhengyuan Yang, Wendi Cross, Jiebo Luo, "Personalized pose estimation for body language understanding," International Conference on Image Processing (ICIP), Beijing, China, September 2017. (Oral Presentation)