Mingfei Han @MBZUAI

I am a postdoctoral researcher at Mohamed Bin Zayed University of Artificial Intelligence. I obtained my Ph.D. degree from University of Technology Sydney, advised by Prof. Xiaojun Chang. I also worked closely with Heng Wang, Linjie Yang, and Xiaojie Jin on various video-language projects at Bytedance Seed. Before moving to UTS, I spent a wonderful two years at Monash University. Prior to my candidature, I was a visiting student at MMLab, SIAT, Chinese Academy of Sciences, where I was fortunate to work with Prof. Yu Qiao, and Prof. Yali Wang.
I received my Master's degree from the University of Chinese Academy of Sciences (UCAS) and my Bachelor's degree from Nankai University (NKU) with graduate honours.

I welcome collaborations with both industry and academia. There are also openings for self-motivated, active-thinking MSc and PhD students (relevant background and publications are a plus) in the groups at MBZUAI and USTC.

Recent Activities

AAAI 2026: Two papers are accepted on zero-shot image inpainting and referring audio-visual segmentation. Congratulations to my collaborators!
IROS 2025: We won the first place awards in Intern Robotics Challenge Track#2 (Vision-and-Language Navigation in Physical Environments) and RoboSense Challenge Track#2 (Social Navigation)!
NeurIPS 2025: Our WorldWeaver and PhyBlock are accepted to NeurIPS 2025!
IROS 2025! MALMM: Extendable and Effective multi-agent framework for robotics manipulation.
🌟GRAIL Challenge & Workshop🌟-CVPR 2025: Benchmarking generalization in robotics manipulation with held testing set and offline real-world setup.
🌟SMM Challenge🌟-EAI Workshop CVPR 2025: Benchmarking the capability of performing long-sequence complex tasks through social interactions.
🌟CVPR 2025🌟 RoomTour3D: Diverse and Scalable video-instruction(-action) data for embodied navigation (Vision-and-Language Navigation). Ongoing effort with current 200k instructions released. We achieve SOTA on SOON and REVERIE with this newly introduced data.
🌟ICLR 2025🌟 Shot2Story: Train a rich video summarizer and test it on multi-shot videos! Just release our manually annotated Multi-Shot Video Understanding Suite. Single-shot captions, Mult-shot summaries, Video question-answering pairs: 134K multishot videos covering over 548k video shots; Detailed text summaries with over 6M words! SOTA video QA performance with the derived rich text summaries.
ECCV 2024 Oral LongVLM: Efficient long-video frame encoding for large video language models.

Research interest

My research lies at the interface of computer vision and robotics. I currently investigate large vision–language models, summarising videos and analysing their hallucination behaviour, further extending their use beyond embodied agents. Recent work spans video–language understanding, with a focus on long video understanding, video grounding tasks such as Referring Video Object Segmentation, alongside vision–language navigation and manipulation for robots. Prior projects addressed individual and group activity recognition and video object detection under both fully and weakly supervised regimes, while my master’s thesis centred on moving-object detection and multi-object tracking.

Publications and preprints

Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
TBR, 2025
Implicit geometric prior from YouTube RoomTour videos at low cost, powered by our scalable data curation pipeline and unified geometry perception architecture for strong VLN performance across multiple benchmarks and zero-shot setup.

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
TBR, 2025
A proactive VLM agent with evidence-aligned and trasnparent decision making process. Our Thinking-QwenVL provides timely, accurate streaming decisions and achieves state-of-the-art performance on diverse online and offline video benchmarks.

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation
TBR, 2025
Progress awareness matters! By seeing remaining subtasks and planning toward spatial subgoals, SPR enables progress-aware robotic manipulation with autonomous error recovery, achieving state-of-the-art out-of-distribution robustness.

Beyond Next Frames: World Models as Structured Planners for Robotic Manipulation
Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang
TBR, 2025
New SOTA on SimplerEnv-WidowX! Our SP-VLA simply converts your world models into a structured & middle-level robotic planner by predicting manipulation-relevant frames, unifying semantic planning with precise short-horizon control.

Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection
Mingfei Han, Haihong Hao, Jinxing Zhou, Zhihui Li, Yuhui Zheng, Xueqing Deng, Linjie Yang and Xiaojun Chang
arxiv, 2025
Boosted by free self-reflection signals, our approach largely mitigates vision-language hallucinations, while also achieving improvements over general instruction-following benchmarks!
paper

RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang and Ivan Laptev
Conference on Computer Vision and Pattern Recognition 2025 (CVPR), 2025
Web-video based video-instruction(-action) training data. Effective, Automatic and Scalable!
project page / paper / slides / code / Annotations / Video Frames / Models / bibtex

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao and Heng Wang
International Conference on Learning Representations (ICLR), 2025
We present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and diverse question-answering pairs. 43K human annotations (with per-shot annotated visual and audio captions) + 90K GPTV annotations.
project page / paper / slides / demo / code / data / video / bibtex

MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation
Harsh Singh, Rocktim Jyoti Das, Mingfei Han, Preslav Nakov, Ivan Laptev
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
A novel multi-agent framework to perform zero-shot object manipulation. Effective and Extendable!
project page / pdf / bibtex

LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang and Bohan Zhuang
European Conference on Computer Vision (ECCV), 2024 (Oral)
code / pdf / bibtex

Video Recognition in Portrait Mode
Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang and Heng Wang
We have developed the first dataset dedicated to portrait mode videos and focus on the research of this emerging video format. CVPR 2024
project page / paper / data / bibtex

Mask Propagation for Efficient Video Semantic Segmentation
Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang and Bohan Zhuang
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
code / pdf / bibtex

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang and Yu Qiao
International Conference on Computer Vision (ICCV), 2023
project page / pdf / poster / bibtex

Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Mingfei Han, David Junhao Zhang, Yali Wang, Ruiyan, Lina Yao, Xiaojun Chang and Yu Qiao
Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
project page / arXiv / slides / poster / presentation / bibtex

Progressive Frame-Proposal Mining for Weakly Supervised Video Object Detection
Mingfei Han, Yali Wang, Mingjie Li, Xiaojun Chang, Yi Yang and Yu Qiao
IEEE Transactions on Image Processing (TIP), 2021
IEEE / bibtex

Mining Inter-Video Relations for Video Object Detection
Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
European Conference on Computer Vision (ECCV), 2020
ECVA / code / bibtex

Object tracking in satellite videos by improved correlation filters with motion estimations
Shiyu Xuan, Shengyang Li, Mingfei Han, Xue Wan, Gui-song Xia
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2019
IEEE / code / bibtex

MMVG-INF-Etrol@ TRECVID 2019: Activities in Extended Video
Xiaojun Chang, Wenhe Liu, Po-Yao Huang, Changlin Li, Fengda Zhu, Mingfei Han, et al.
First Prize on Trecvid Activities in Extended Video (ActEV) challenge, 2019
NIST / bibtex

Talks

Institute of Computing Technology "Towards Generalization in Robotics Navigation and Manipulation", July 2025
KCL - ISSA technical exchange event "Shot2Story: A Multi-shot Understanding Approach for Archive Videos", June 2025
CVPR 2025 - EAI Workshop "SMM Challenge: Social Mobile Manipulation", June 2025
Multimodal Minds "Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos", Mar 2025
3DCVer in Chinese, "RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation", Mar 2025
China Society of Image and Graphics - Guangdong Branch in Chinese, "CSIG-Guangdong CVPR Papers sharing - Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition", May 2022
Jishi Live in Chinese, with my firend Xiangtao Kong who's on Low-level Vision and Super-Resolution, "CAS-SIAT CVPR Papers sharing - Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition", with recording here, April 2022
ML and VL Seminar at Monash University, "Mining Inter-Video Proposal Relations for Video Object Detection", November 2020

Academic service

Reviewer for journals: TPAMI, IJCV, TIP, TCSVT, TNNLS, TMM, KBS, TOMM, PR.

Reviewer for conferences: CVPR, ICCV, ECCV, ICLR, ICML, NeurIPS, 3DV, ACM MM, ACCV.

Contact

You are very welcome to contact me regarding my research. I typically respond within a few days.
I can be contacted directly at mhannku030 [at] gmail.com