Mingfei Han

Ph.D. student, University of Technology Sydney

I am a final-year Ph.D. student at the University of Technology Sydney, advised by Prof. Xiaojun Chang. I also work closely with Heng Wang, Linjie Yang, and Xiaojie Jin on various video-language projects at ByteDance. Before moving to UTS, I spent two wonderful years at Monash University. Prior to my candidature, I was a visiting student at MMLab, SIAT, Chinese Academy of Sciences, where I was fortunate to work with Prof. Yu Qiao and Prof. Yali Wang.
I received my Master's degree from the University of Chinese Academy of Sciences (UCAS) and my Bachelor's degree from Nankai University (NKU) with graduate honours.

Recent Activities

  • 🌟🌟 I am currently on the job market for a research position. 🌟🌟
  • RoomTour3D: Automatic, scalable and cheap! Diverse and scalable video-instruction data for embodied navigation (VLN). This is an ongoing effort, with 200k instructions released so far. We achieve SOTA on SOON and REVERIE with this newly introduced data.
  • Shot2Story-134K: Manual 43K + GPTV 90K. 134K multi-shot videos covering over 548k video shots, with detailed text summaries totaling over 6M words! We have released this new video description dataset. With the assistance of an LLM, our method achieves SOTA performance on zero-shot MSRVTT-QA.
  • ECCV 2024 LongVLM: Efficient long-video frame encoding for large video language models.
  • CVPR 2024 PMV-400: Portrait-mode videos rock social media! We have developed the first video dataset dedicated to research on this emerging video format.
  • ICCV 2023 HTML: One paper on language-referring video object segmentation (RVOS) was accepted. Performance is largely boosted with no additional cost during inference!
  • NeurIPS 2023: One paper on efficient video segmentation was accepted.

Research interests

My research interests lie in computer vision and machine learning. Currently, I am focusing on large vision-language models and their applications in robotics. I have worked on video-language downstream tasks related to object and event prediction in videos, such as Referring-VOS and video grounding. Previously, I worked on individual and group activity recognition, and on video object detection with full and limited supervision. For my Master's thesis, I worked on moving object detection and tracking.

Publications and preprints

RoomTour3D: A Geometry-Aware Video-Instruction Data for Embodied Navigation
Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang and Ivan Laptev
Web-video-based video-instruction training data. Automatic, scalable and cheap! 2024
project page / code (T.B.R.) / Annotations / Video Frames
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang and Heng Wang
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries. 43K human annotations + 90K GPTV annotations. 2023
project page / paper / demo / code / data / video / bibtex
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang and Bohan Zhuang
European Conference on Computer Vision (ECCV), 2024
code (T.B.R.) / pdf / bibtex
Video Recognition in Portrait Mode
Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang and Heng Wang
We have developed the first dataset dedicated to portrait-mode videos to support research on this emerging video format. CVPR 2024
project page / paper / data / bibtex
Mask Propagation for Efficient Video Semantic Segmentation
Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang and Bohan Zhuang
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
code / pdf / bibtex
HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang and Yu Qiao
International Conference on Computer Vision (ICCV), 2023
project page / pdf / poster / bibtex
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Mingfei Han, David Junhao Zhang, Yali Wang, Ruiyan, Lina Yao, Xiaojun Chang and Yu Qiao
Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
project page / arXiv / slides / poster / presentation / bibtex
Progressive Frame-Proposal Mining for Weakly Supervised Video Object Detection
Mingfei Han, Yali Wang, Mingjie Li, Xiaojun Chang, Yi Yang and Yu Qiao
IEEE Transactions on Image Processing (TIP), 2021
Mining Inter-Video Relations for Video Object Detection
Mingfei Han, Yali Wang, Xiaojun Chang and Yu Qiao
European Conference on Computer Vision (ECCV), 2020
ECVA / code / bibtex
Object tracking in satellite videos by improved correlation filters with motion estimations
Shiyu Xuan, Shengyang Li, Mingfei Han, Xue Wan and Gui-song Xia
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2019
IEEE / code / bibtex
MMVG-INF-Etrol@ TRECVID 2019: Activities in Extended Video
Xiaojun Chang, Wenhe Liu, Po-Yao Huang, Changlin Li, Fengda Zhu, Mingfei Han, et al.
First Prize in the TRECVID Activities in Extended Video (ActEV) challenge, 2019
NIST / bibtex



You are very welcome to contact me regarding my research. I typically respond within a few days.
I can be contacted directly at mhannku030 [at] gmail.com