Video Recognition in Portrait Mode

1 ReLER, AAII, University of Technology Sydney; 2 Bytedance Inc.; 3 Data61, CSIRO; 4 Department of Computer Vision, MBZUAI

We have developed PortraitMode-400, the first dataset dedicated to portrait mode video recognition, and focus our research on this emerging video format.

Abstract

The creation of new datasets often presents new challenges for video recognition and can inspire novel ideas while addressing these challenges. While existing datasets mainly comprise landscape mode videos, our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format. With the growing popularity of smartphones and social media applications, recognizing portrait mode videos is becoming increasingly important.

To this end, we have developed the first dataset dedicated to portrait mode video recognition, namely PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a data-driven manner, comprising 400 fine-grained categories, and rigorous quality assurance was implemented to ensure the accuracy of human annotations.

In addition to the new dataset, we conducted a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats.

Furthermore, we designed extensive experiments to explore key aspects of portrait mode video recognition, including the choice of data augmentation and evaluation procedure. Building on the insights from our experimental results and the introduction of PortraitMode-400, our paper aims to inspire further research efforts in this emerging research area.



PortraitMode-400

While existing video datasets are mostly built on landscape mode videos, portrait mode videos have become increasingly popular on major social media applications. The shift from landscape mode to portrait mode does more than change the aspect ratio of the videos. It has significant implications for the types of content that are created and for the spatial bias inherent in the data.

Portrait mode videos also bring distinct challenges for video recognition. For example, they tend to focus more on the subject (typically humans) with much less background context, and they include more egocentric content. In addition, they contain a lot of verbal communication that is essential to understanding the video content. There is a pressing need for portrait mode video datasets to explore these new research problems.

To facilitate research on portrait mode videos, we introduce the first dataset dedicated to portrait mode video recognition, named PortraitMode-400. Some demo videos are shown above.



Portrait Mode vs. Landscape Mode

Question: How well does a model trained on landscape mode videos perform on portrait mode videos, and vice versa?

Answer: We investigate this question by constructing a subset from the Kinetics-700 dataset for a rigorous comparison and visualize classification heatmaps to reveal the differences in spatial bias resulting from the change in video format.
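As a rough illustration of how a dataset could be partitioned by video format for such a comparison, the sketch below groups videos by frame shape. It is only an assumption about one possible first step: the page does not describe how the Kinetics-700 subset was actually built, and the function names and the `sizes` mapping (video id to width and height in pixels) are hypothetical.

```python
from typing import Dict, List, Tuple


def orientation(width: int, height: int) -> str:
    """Label a video by frame shape: taller than wide counts as portrait."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height > width:
        return "portrait"
    if width > height:
        return "landscape"
    return "square"


def split_by_orientation(sizes: Dict[str, Tuple[int, int]]) -> Dict[str, List[str]]:
    """Group video ids by orientation.

    `sizes` maps a (hypothetical) video id to its (width, height) in pixels.
    """
    groups: Dict[str, List[str]] = {"portrait": [], "landscape": [], "square": []}
    for video_id, (width, height) in sizes.items():
        groups[orientation(width, height)].append(video_id)
    return groups


if __name__ == "__main__":
    # Made-up metadata, for illustration only.
    example = {"vid_a": (1080, 1920), "vid_b": (1920, 1080), "vid_c": (720, 720)}
    print(split_by_orientation(example))
```

A fair cross-format comparison would likely also need to control for category coverage and sample counts in each group, which this sketch does not attempt.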



Optimal Protocols

Question: What are the optimal training and testing protocols for portrait mode video recognition?

Answer: We examine various components of state-of-the-art deep learning systems, such as data augmentation and evaluation cropping strategies. Some of our findings contradict the current standard practices for landscape mode videos, highlighting the need for further research in the domain of portrait mode videos.
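To make "evaluation cropping strategies" concrete, the sketch below computes crop boxes for one common multi-crop scheme: several square crops spaced evenly along the longer side of the frame. For a landscape video this gives left, center, and right crops; for a portrait video the same rule gives top, center, and bottom crops. This is an illustrative assumption, not the protocol recommended by the paper, and the page does not state which strategy performs best. The function name and its parameters are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def multi_crop_boxes(width: int, height: int, num_crops: int = 3) -> List[Box]:
    """Return square crop boxes evenly spaced along the longer side of the frame."""
    if width <= 0 or height <= 0 or num_crops < 1:
        raise ValueError("width, height, and num_crops must be positive")

    side = min(width, height)            # crop size equals the shorter side
    slack = max(width, height) - side    # room to slide along the longer side

    if num_crops == 1:
        offsets = [slack // 2]           # single center crop
    else:
        offsets = [round(i * slack / (num_crops - 1)) for i in range(num_crops)]

    boxes: List[Box] = []
    for off in offsets:
        if height >= width:
            # Portrait (or square): slide vertically.
            boxes.append((0, off, side, off + side))
        else:
            # Landscape: slide horizontally.
            boxes.append((off, 0, off + side, side))
    return boxes


if __name__ == "__main__":
    print(multi_crop_boxes(1080, 1920))  # portrait frame
    print(multi_crop_boxes(1920, 1080))  # landscape frame
```

In a portrait frame the three crops overlap much less in content than they typically do after a landscape video is resized, which is one reason the choice of evaluation crops may matter more for this format.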

BibTeX


        @misc{han2023pmv,
          title={Video Recognition in Portrait Mode}, 
          author={Mingfei Han and Linjie Yang and Xiaojie Jin and Jiashi Feng and Xiaojun Chang and Heng Wang},
          year={2023},
          eprint={2312.13746},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }