Zhengyuan Yang

I am currently a Senior Researcher at Microsoft. I received my Ph.D. in Computer Science from the University of Rochester, advised by Prof. Jiebo Luo, and my bachelor's degree from the University of Science and Technology of China. I have received the ACM SIGMM Award for Outstanding Ph.D. Thesis, the Twitch Research Fellowship, and the ICPR 2018 Best Industry Related Paper Award. My research interests lie at the intersection of computer vision and natural language processing, including multimodal vision-language understanding and generation.


Email  /  CV  /  Github  /  Google Scholar  /  LinkedIn  /  Name Pronunciation

  • [2023/12]   I will serve as an Area Chair for ACMMM 2024 and an Exhibits and Demos Chair for ICME 2024. Demo paper submissions are welcome!
  • [2023/11]   How would it be if LMMs could interact with smartphones as humans do? Check out GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [Article]
  • [2023/11]   How could LMMs contribute to social good? Check out GPT-4V(ision) as A Social Media Analysis Engine.
  • [2023/10]   How might LMMs revolutionize the understanding of video and streaming content? Check out MM-Vid: Advancing Video Understanding with GPT-4V(ision).
  • [2023/10]   How well can image generation models assist visual design? Check out DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.
  • [2023/10]   How can LMM-based agents achieve human-like multimodal iterative exploration? Check out our initial study on a generative agent, named Idea2Img, focusing on automatic image design and generation. Thanks for the great video!
  • [2023/09]   What is the current state of large multimodal models (LMMs), and what are promising future directions? Please check out The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
  • [2023/09]   Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants. [Slides] [YouTube] [Bilibili]
  • [2023/08]   MM-Vet is a benchmark that evaluates Large Multimodal Models' integrated VL capabilities. [MM-Vet Leaderboard]
  • [2023/07]   Two papers accepted to ICCV 2023: (1) PromptCap, prompt-controlled visual captioning; (2) EqBen, a new diagnostic VLM benchmark.
  • [2023/06]   I will serve as a SPC member for AAAI 2024.
  • [2023/06]   Check out our CVPR 2023 Tutorial on "Recent Advances in Vision Foundation Models". Slides and recordings are available.
  • [2023/03]   We build MM-REACT, a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.
  • [2023/03]   IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) special issue on "AI-Generated Content for Multimedia." Submission deadline: July 1st, 2023.
  • [2023/02]   ReCo is our new text-to-image model that enables precise region control through input text queries, accepted to CVPR 2023. See a teaser here.
  • [2023/01]   Prompting GPT-3 To Be Reliable accepted to ICLR 2023.
  • [2022/10]   My Ph.D. thesis "Visual Grounding: Building Cross-Modal Visual-Text Alignment" wins the 2022 ACM SIGMM Award for Outstanding Ph.D. Thesis.
  • [2022/10]   I am selected as one of Outstanding Reviewers for ECCV 2022.
  • [2022/07]   UniTAB accepted to ECCV 2022 as an Oral presentation.
  • [2022/07]   I will serve as a SPC member for AAAI 2023.
  • [2022/06]   Check out our CVPR 2022 Tutorial on "Recent Advances in Vision-and-Language Pre-training". Slides and recordings are available.
  • [2022/05]   The new multimodal generative foundation model Florence-GIT achieves new SOTA results across 12 image/video VL tasks, including the first human parity on TextCaps. GIT achieves 88.79% ImageNet-1k accuracy using a generative scheme. See a teaser here.
  • [2022/01]   I will serve as an Associate Editor for IEEE TCSVT.

    Research

    My current research mainly focuses on vision+language understanding and generation. Representative works are highlighted.

  • MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
    Weihao Yu*, Zhengyuan Yang*, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
    Technical report
    [PDF] [MM-Vet Leaderboard] [Code] [Demo] [Bibtex]

    MM-Vet is a benchmark that evaluates Large Multimodal Models' integrated vision-language (VL) capabilities.

  • DisCo: Disentangled Control for Referring Human Dance Generation in Real World
    Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
    Technical report
    [PDF] [Project Page] [Code] [Demo] [Bibtex]

    We propose DisCo for referring human dance generation, producing human dance images/videos with good faithfulness, generalizability, and compositionality.

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
    Zhengyuan Yang*, Linjie Li*, Jianfeng Wang*, Kevin Lin*, Ehsan Azarnasab*, Faisal Ahmed*, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
    Technical report
    [PDF] [Project Page] [Code] [Demo] [Bibtex]

    MM-REACT is a system paradigm that integrates LLMs with a pool of vision experts to achieve multimodal reasoning and action.

  • PromptCap: Prompt-Guided Task-Aware Image Captioning
    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
    ICCV 2023.
    [PDF] [Project Page] [Code] [Bibtex]

    PromptCap takes a natural language prompt to control which visual content the captioning model describes.

  • Equivariant Similarity for Vision-Language Foundation Models
    Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
    ICCV 2023.
    [PDF] [Code] [Benchmark] [Bibtex]

    EqBen explores the concept of equivariance in VLMs, focusing specifically on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks.

  • Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
    Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
    Technical report
    [PDF] [Project Page] [Code] [Demo] [Bibtex]

    LayoutBench evaluates layout-guided image generation models with out-of-distribution (OOD) layouts in four skills: number, position, size, and shape.

  • ReCo: Region-Controlled Text-to-Image Generation
    Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
    CVPR 2023.
    [PDF] [Code] [Video] [Bibtex]

    ReCo is our new text-to-image model that enables precise region control through input text queries.

  • Prompting GPT-3 To Be Reliable
    Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang
    ICLR 2023.
    [PDF] [Code] [Tweet] [Bibtex]

    We establish simple and effective prompts that demonstrate GPT-3's reliability in four facets: generalizability, fairness, calibration, and factuality.

  • UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang
    ECCV 2022. (Oral Presentation)
    [PDF] [Code] [Poster] [Video] [Bibtex]

    We propose UniTAB, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture.

  • GRiT: A Generative Region-to-text Transformer for Object Understanding
    Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
    Technical report
    [PDF] [Code] [Bibtex]

    GRiT is a general object understanding framework that detects objects and describes them with any style of text it was trained on, e.g., class names, object attributes, actions, counts, etc.

  • TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
    Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang
    TPAMI 2023.
    [PDF] [Code] [Bibtex]

    Adapting Vision Transformer (ViT) for visual grounding.

  • GIT: A Generative Image-to-text Transformer for Vision and Language
    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
    TMLR 2022.
    [PDF] [Code] [Bibtex]

    Florence-GIT is our new multimodal generative foundation model. GIT shows a strong capability of describing entities in the wild, such as scene text, logos, landmarks, characters, etc.

  • Cross-modal Contrastive Distillation for Instructional Activity Anticipation
    Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo
    ICPR 2022.
    [PDF] [Video] [Bibtex]

    We propose cross-modal contrastive distillation (CCD), which facilitates distilling the teacher's information to the student in a different modality.

  • UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
    Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang
    Technical report
    [PDF] [Bibtex]

    A single unified transformer (UFO), capable of processing either unimodal or multimodal inputs, for VL representation learning.

  • Scaling Up Vision-Language Pre-training for Image Captioning
    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
    CVPR 2022.
    [PDF] [Code] [Bibtex]

    The first empirical study on the scaling behavior of VLP for image captioning.

  • An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
    AAAI 2022. (Oral Presentation)
    [PDF] [Code] [Bibtex] [Benchmarks]

    Can GPT-3 benefit multimodal tasks? We provide an empirical study of GPT-3 for knowledge-based VQA, named PICa.

    #1 on the OKVQA leaderboard (Sept. 2021).

  • SAT: 2D Semantics Assisted Training for 3D Visual Grounding
    Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
    ICCV 2021. (Oral Presentation)
    [PDF] [Code] [Video] [Bibtex] [Benchmarks]

    Boosting 3D visual grounding by using training-time 2D semantics.

    #1 in the ReferIt3D CVPR 2021 challenge.

  • TransVG: End-to-End Visual Grounding with Transformers
    Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
    ICCV 2021.
    [PDF] [Code] [Bibtex]

    A transformer-based framework for visual grounding.

  • TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
    Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
    CVPR 2021. (Oral Presentation)
    [PDF] [Code] [Poster] [Video] [Bibtex]

    We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.

    #1 in the TextCaps CVPR 2021 challenge.

    Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
    Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu
    CVPR 2021.
    [PDF] [Code] [Bibtex]

    A weakly supervised visual grounding method that removes the need for object detection at test time.

  • Grounding-Tracking-Integration
    Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, Jiebo Luo
    IEEE T-CSVT.
    [PDF] [Annotations] [Demo1] [Demo2] [Bibtex]

    A simple yet effective modular framework for tracking by natural language specification.

  • Improving One-stage Visual Grounding by Recursive Sub-query Construction
    Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
    ECCV 2020.
    [PDF] [Code] [Slides] [Video] [Bibtex]

    Improving one-stage visual grounding by addressing previous weaknesses in modeling long and complex queries.

  • Dynamic Context-guided Capsule Network for Multimodal Machine Translation
    Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo
    ACMMM 2020. (Oral Presentation)
    [PDF] [Code] [Bibtex]

    We propose a novel Dynamic Context-guided Capsule Network (DCCN) for multimodal machine translation.

  • A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
    Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo
    ACL 2020.
    [PDF] [Bibtex]

    Multi-modal neural machine translation (NMT) with fine-grained cross-modality semantic correspondence.

  • A Fast and Accurate One-Stage Approach to Visual Grounding
    Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo
    ICCV 2019. (Oral Presentation) (187/4303=4.3%)
    [PDF] [Code] [Slides] [Poster] [Bibtex]

    A simple, fast, and accurate one-stage approach to visual grounding: 10 times faster and 7-20% higher in accuracy.

  • Weakly Supervised Body Part Parsing with Pose based Part Priors
    Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo
    ICPR 2020.
    [PDF] [Demo] [Poster] [Slides] [Video] [Bibtex]

    Weakly-supervised body part parsing that achieves results comparable to the fully-supervised method with the same backbone.

  • Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation
    Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo
    ICPR 2020.
    [PDF] [Poster] [Slides] [Video] [Bibtex]

    A pose-based framework for body language recognition and emotion interpretation.

  • Attentive Relational Networks for Mapping Images to Scene Graphs
    Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo
    CVPR 2019.
    [PDF] [Bibtex]

    A novel Attentive Relational Network for scene graph generation.

  • Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences
    Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo
    ICPR 2018; IEEE T-CSVT
    [PDF] [UCF-Motion-Joints] [Bibtex]

    A CNN-based approach for skeleton-based action recognition. SOTA on both clean 3D joints and noisy 2D estimated keypoints.

  • Human-Centered Emotion Recognition in Animated GIFs with Facial Landmarks
    Zhengyuan Yang, Yixuan Zhang, Jiebo Luo
    ICME 2019.
    [PDF] [Data] [Bibtex]

    Focusing on human faces to improve emotion recognition.

    End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception
    Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo
    ICPR 2018. Best Industry Related Paper Award (BIRPA) (1/1258=0.08%)
    [PDF] [Demo] [Bibtex]

    Building a prototype that controls the self-driving car's steering angle and speed. Check out the demo that we recorded in the vehicle!

    Internship

    Microsoft, Redmond, WA
    May - Aug 2020. Advisor: Yijuan Lu, Jianfeng Wang, Xi Yin.
    Project: Text-aware pre-training for Text-VQA and Text-Caption.

    Tencent AI Lab, Bellevue, WA
    Jan - Apr 2019. Advisor: Boqing Gong, Liwei Wang.
    Project: Visual Grounding with Natural Language Queries.

    SnapChat, Venice, CA
    May - Aug 2018. Advisor: Yuncheng Li, Linjie Yang, Ning Zhang.
    Project: Weakly Supervised Human Part Parsing.

    SAIC Innovation Center, San Jose, CA
    Jun - Aug 2017. Advisor: Jerry Yu.
    Project: Steering Angle Control with End-to-end Neural Networks.

    Awards

  • 2022 ACM SIGMM Award for Outstanding Ph.D. Thesis
  • Winner of CVPR 2021 TextCaps Challenge
  • Winner of CVPR 2021 ReferIt3D Challenge
  • Twitch Research Fellowship
  • Best Industry Related Paper Award at ICPR 2018
    Publications
  • Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "DisCo: Disentangled Control for Referring Human Dance Generation in Real World." 2023. [PDF]
  • Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action." 2023. [PDF]
  • Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo, "PromptCap: Prompt-Guided Task-Aware Image Captioning," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023. [PDF][Code]
  • Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang, "Equivariant Similarity for Vision-Language Foundation Models," International Conference on Computer Vision (ICCV), Paris, France, Oct 2023. [PDF][Code]
  • Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang, "ReCo: Region-Controlled Text-to-Image Generation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, June 2023. [PDF]
  • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal, "Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation," 2023. [PDF][Project page]
  • Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan, "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 2023. (Oral Presentation) [PDF]
  • Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan, "Learning 3D Photography Videos via Self-supervised Diffusion on Single Images," The 32nd International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, August 2023. [PDF]
  • Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang, "GRiT: A Generative Region-to-text Transformer for Object Understanding," 2022. [PDF][Code]
  • Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang, "Prompting GPT-3 To Be Reliable," The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 2023. [PDF][Code]
  • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang, "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023. [PDF][Code]
  • Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang, "GIT: A Generative Image-to-text Transformer for Vision and Language," Transactions on Machine Learning Research (TMLR), 2022. [PDF][Code]
  • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu and Lijuan Wang, "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling," European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 2022. (Oral Presentation) [PDF][Code]
  • Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu and Lijuan Wang, "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning," 2021. [PDF]
  • Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu and Lijuan Wang, "Scaling Up Vision-Language Pre-training for Image Captioning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, June 2022. [PDF]
  • Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo, "Cross-modal Contrastive Distillation for Instructional Activity Anticipation," International Conference on Pattern Recognition (ICPR), Montreal, Quebec, Canada, August 2022. (Oral Presentation). [PDF]
  • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu and Lijuan Wang, "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA," The 36th AAAI Conference on Artificial Intelligence (AAAI), February 2022. (Oral Presentation) [PDF]
  • Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo, "SAT: 2D Semantics Assisted Training for 3D Visual Grounding," International Conference on Computer Vision (ICCV), Oct 2021. (Oral Presentation) [PDF]
  • Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li, "TransVG: End-to-End Visual Grounding with Transformers," International Conference on Computer Vision (ICCV), Oct 2021. [PDF]
  • Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo, "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. (Oral Presentation) [PDF]
  • Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu, "Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation," Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. [PDF]
  • Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo, "Improving One-stage Visual Grounding by Recursive Sub-query Construction," European Conference on Computer Vision (ECCV), Glasgow, UK, August 2020. [PDF][Code]
  • Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo, "Dynamic Context-guided Capsule Network for Multimodal Machine Translation," ACM Multimedia Conference (ACMMM), Seattle, WA, October 2020. (Oral Presentation) [PDF][Code]
  • Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo, "A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation," Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, July 2020. [PDF]
  • Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, Jiebo Luo, "Grounding-Tracking-Integration," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). [PDF]
  • Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo, "A Fast and Accurate One-Stage Approach to Visual Grounding," International Conference on Computer Vision (ICCV), Seoul, South Korea, October 2019. (Oral Presentation) [PDF][Code]
  • Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo, "Weakly Supervised Body Part Parsing with Pose based Part Priors," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2021. [PDF] [Demo]
  • Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo, "Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation," International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2021. [PDF]
  • Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo, "Attentive Relational Networks for Mapping Images to Scene Graphs," Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, June 2019. [PDF]
  • Zhengyuan Yang, Yixuan Zhang, Jiebo Luo, "Human-Centered Emotion Recognition in Animated GIFs with Facial Landmarks," International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019. [PDF] [Data]
  • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). [PDF] [Data]
  • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo, "Action Recognition with Visual Attention on Skeleton Images," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation). [PDF]
  • Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo, "End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions," International Conference on Pattern Recognition (ICPR), Beijing, China, August 2018. (Oral Presentation) Best Industry Related Paper Award (BIRPA). [PDF] [Demo]
  • Zhengyuan Yang, Wendi Cross, Jiebo Luo, "Personalized pose estimation for body language understanding," International Conference on Image Processing (ICIP), Beijing, China, September 2017. (Oral Presentation)
    Service

  • Outstanding Reviewer: ECCV 2022, CVPR 2021
  • Senior Program Committee (SPC): 37th and 38th AAAI Conference on Artificial Intelligence (AAAI-23, AAAI-24)
  • Associate Editor: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
  • Journal Reviewer: TPAMI, IJCV, TIP, TMM, TCybernetics, TCSVT, Pattern Recognition, Neurocomputing, TBioCAS, IEEE Access.
  • Conference Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, ACL, EMNLP, AAAI, ACCV, WACV, ICME, ICIP.

  • © 2023 Zhengyuan Yang. All rights reserved.
    Template borrowed from Jon Barron. Thanks!