Mingfei Han @MBZUAI

I am a postdoctoral researcher at Mohamed Bin Zayed University of Artificial Intelligence, advised by Prof. Ivan Laptev. I obtained my Ph.D. degree from the University of Technology Sydney, advised by Prof. Xiaojun Chang. I also worked closely with Heng Wang, Linjie Yang, and Xiaojie Jin on various video-language projects at ByteDance Seed. Before moving to UTS, I spent two wonderful years at Monash University. Prior to my candidature, I was a visiting student at MMLab, SIAT, Chinese Academy of Sciences, where I was fortunate to work with Prof. Yu Qiao and Prof. Yali Wang.
I received my Master's degree from the University of Chinese Academy of Sciences (UCAS) and my Bachelor's degree from Nankai University (NKU) with graduate honours.

Recent Activities

IROS 2025! MALMM: Extendable and Effective multi-agent framework for robotics manipulation.
🌟GRAIL Challenge & Workshop🌟-CVPR 2025: Benchmarking generalization in robotics manipulation with a held-out test set and an offline real-world setup.
🌟SMM Challenge🌟-EAI Workshop CVPR 2025: Benchmarking the capability of performing long-sequence complex tasks through social interactions.
🌟CVPR 2025🌟 RoomTour3D: Diverse and Scalable video-instruction(-action) data for embodied navigation (Vision-and-Language Navigation). An ongoing effort, with 200k instructions released so far. We achieve SOTA on SOON and REVERIE with this newly introduced data.
🌟ICLR 2025🌟 Shot2Story: Manually annotated Multi-Shot Video Understanding Suite. Single-shot captions, Multi-shot summaries, Video question-answering pairs.
ECCV 2024 Oral LongVLM: Efficient long-video frame encoding for large video language models.
CVPR 2024 PMV-400: Portrait-mode videos rock social media! We have developed the first video dataset dedicated to research on this emerging video format.

Research interest

My research lies at the interface of computer vision and robotics. I currently investigate large vision–language models, summarising videos and analysing their hallucination behaviour, further extending their use beyond embodied agents. Recent work spans video–language understanding, with a focus on long video understanding, video grounding tasks such as Referring Video Object Segmentation, alongside vision–language navigation and manipulation for robots. Prior projects addressed individual and group activity recognition and video object detection under both fully and weakly supervised regimes, while my master’s thesis centred on moving-object detection and multi-object tracking.

Publications and preprints

RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang and Ivan Laptev
Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Web-video based video-instruction(-action) training data. Effective, Automatic and Scalable!
project page / paper / slides / code / Annotations / Video Frames / Models / bibtex

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao and Heng Wang
International Conference on Learning Representations (ICLR), 2025
We present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and diverse question-answering pairs. 43K human annotations (with per-shot annotated visual and audio captions) + 90K GPTV annotations.
project page / paper / slides / demo / code / data / video / bibtex

MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation
Harsh Singh, Rocktim Jyoti Das, Mingfei Han, Preslav Nakov, Ivan Laptev
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
A novel multi-agent framework to perform zero-shot object manipulation. Effective and Extendable!
project page / pdf / bibtex

LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang and Bohan Zhuang
European Conference on Computer Vision (ECCV), 2024 (Oral)
code / pdf / bibtex

Video Recognition in Portrait Mode
Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang and Heng Wang
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
We have developed the first dataset dedicated to portrait-mode videos, with a focus on research into this emerging video format.
project page / paper / data / bibtex

Mask Propagation for Efficient Video Semantic Segmentation
Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang and Bohan Zhuang
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
code / pdf / bibtex

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang and Yu Qiao
International Conference on Computer Vision (ICCV), 2023
project page / pdf / poster / bibtex

Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang and Yu Qiao
Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
project page / arXiv / slides / poster / presentation / bibtex

Progressive Frame-Proposal Mining for Weakly Supervised Video Object Detection
Mingfei Han, Yali Wang, Mingjie Li, Xiaojun Chang, Yi Yang and Yu Qiao
IEEE Transactions on Image Processing (TIP), 2021
IEEE / bibtex

Mining Inter-Video Proposal Relations for Video Object Detection
Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
European Conference on Computer Vision (ECCV), 2020
ECVA / code / bibtex

Object tracking in satellite videos by improved correlation filters with motion estimations
Shiyu Xuan, Shengyang Li, Mingfei Han, Xue Wan, Gui-song Xia
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2019
IEEE / code / bibtex

MMVG-INF-Etrol@ TRECVID 2019: Activities in Extended Video
Xiaojun Chang, Wenhe Liu, Po-Yao Huang, Changlin Li, Fengda Zhu, Mingfei Han, et al.
First Prize in the TRECVID Activities in Extended Video (ActEV) challenge, 2019
NIST / bibtex

Talks

Institute of Computing Technology, "Towards Generalization in Robotics Navigation and Manipulation", upcoming, July 2025
KCL - ISSA technical exchange event, "Shot2Story: A Multi-shot Understanding Approach for Archive Videos", upcoming, June 2025
CVPR 2025 - EAI Workshop "SMM Challenge: Social Mobile Manipulation", June 2025
Multimodal Minds "Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos", Mar 2025
3DCVer in Chinese, "RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation", Mar 2025
China Society of Image and Graphics - Guangdong Branch in Chinese, "CSIG-Guangdong CVPR Papers sharing - Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition", May 2022
Jishi Live in Chinese, with my friend Xiangtao Kong, who works on Low-level Vision and Super-Resolution, "CAS-SIAT CVPR Papers sharing - Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition", with recording here, April 2022
ML and VL Seminar at Monash University, "Mining Inter-Video Proposal Relations for Video Object Detection", November 2020

Academic service

Reviewer for journals: TPAMI, IJCV, TIP, TCSVT, TNNLS, TMM, KBS, TOMM, PR.

Reviewer for conferences: CVPR, ICCV, ECCV, ICLR, ICML, NeurIPS, 3DV, ACM MM, ACCV.

Contact

You are very welcome to contact me regarding my research. I typically respond within a few days.
I can be contacted directly at mhannku030 [at] gmail.com