HTML: Hybrid Temporal-scale Multimodal Learning framework for Referring Video Object Segmentation


Referring Video Object Segmentation (RVOS) aims to segment an object instance from a given video according to a textual description of that object. However, in the open world, object descriptions are often diverse in content and flexible in length.

This leads to the key difficulty in RVOS: descriptions of different objects correspond to different temporal scales in the video, which most existing approaches ignore by sampling frames with a single stride. To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework, which effectively aligns linguistic and visual features to discover core object semantics in the video by learning multimodal interaction hierarchically across different temporal scales.
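As a rough illustration of what sampling at several temporal scales can look like, the sketch below draws the same number of frames around a center frame using several strides. The function name `sample_hybrid_scales`, its parameters, and the clamping at video boundaries are assumptions made for this example; they are not taken from the HTML implementation.

```python
from typing import Dict, List, Sequence


def sample_hybrid_scales(
    num_frames: int, center: int, window: int, strides: Sequence[int]
) -> Dict[int, List[int]]:
    """Return, for each stride, `window` frame indices around `center`.

    Hypothetical helper: a small stride gives a short, dense clip, while a
    large stride gives a long, sparse clip over the same center frame.
    """
    half = window // 2
    clips: Dict[int, List[int]] = {}
    for stride in strides:
        offsets = range(-half, window - half)
        # Clamp indices so clips near the video boundaries stay valid.
        clips[stride] = [
            min(max(center + offset * stride, 0), num_frames - 1)
            for offset in offsets
        ]
    return clips


if __name__ == "__main__":
    for stride, indices in sample_hybrid_scales(20, 10, 4, (1, 2, 4)).items():
        print(stride, indices)
```

A single-stride baseline corresponds to passing one value in `strides`; a hybrid-scale setup would feed each returned clip to the model as a separate temporal scale.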

More specifically, we introduce a novel inter-scale multimodal perception module, in which the language queries dynamically interact with visual features across temporal scales. By passing video context among different scales, it effectively reduces confusion between complex objects. Finally, we conduct extensive experiments on widely used benchmarks, including Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences, and our HTML achieves state-of-the-art performance on all of these datasets.
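To make the idea of queries interacting with visual features across scales more concrete, here is a minimal sketch in plain Python. It assumes a generic dot-product cross-attention with a residual update, applied to one scale after another so that the queries carry context from earlier scales to later ones. This is an illustrative stand-in, not the module from the paper: the names `cross_attend` and `inter_scale_perception`, the ordering of scales, and the absence of learned projections are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], feats: List[Vector]) -> List[Vector]:
    """Update each query with an attention-weighted sum of visual features."""
    dim = len(feats[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in feats
        ]
        weights = softmax(scores)
        context = [
            sum(w * f[k] for w, f in zip(weights, feats)) for k in range(dim)
        ]
        # Residual update: the query keeps what it gathered previously.
        updated.append([a + c for a, c in zip(q, context)])
    return updated


def inter_scale_perception(
    queries: List[Vector], feats_per_scale: List[List[Vector]]
) -> List[Vector]:
    """Let the same queries visit each temporal scale in turn (assumed order)."""
    for feats in feats_per_scale:
        queries = cross_attend(queries, feats)
    return queries
```

Because the queries are updated in place from scale to scale, whatever they collect at one temporal scale influences how they attend at the next one.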


  • Our method with ResNet-50 achieves 57.8 in J&F, surpassing the recent SOTA with ResNet-101.
  • Our HTML boosts the baseline model without additional modules or computation during inference.
  • Our HTML can significantly benefit from diversified text descriptions, as shown in Figure 1.

Figure 1. J&F comparison using language descriptions of different lengths.


    Open-world descriptions vary in length and contain rich semantics about the referred object, e.g., where it is, how it moves, and which objects it interacts with. Such diversified texts correspond to video snippets at various temporal scales.

    For example, the language query in Figure 2(a) is "a tennis ball". Such a short description corresponds to the ball located in a small region in the middle two frames. If the single-scale baseline samples four frames as input, it fails to segment the referred object. This is because it overlooks the dog in the center of all four frames, while lacking a detailed understanding of the middle two frames.

    Conversely, the language query in Figure 2(b) is "a sheep top second right moves down and comes out of the circle". Such a long description corresponds to a particular sheep in the group, which moves across frames. If the single-scale baseline samples two frames as input, it fails to segment the referred object. This is because it is misled by the subtle movement of the sheep group in only two frames, without understanding how each sheep moves across adjacent frames.


Figure 2. Referring descriptions of different lengths. (a) The description is simple, containing only the category name. (b) The description is complex, covering the movement and position of the object. The single-scale baseline (e.g., four frames in (a) and two frames in (b)) fails to segment the referred object, while our hybrid-scale HTML succeeds.

Experiment Results

    On Ref-Youtube-VOS, our approach achieves 58.5 in J&F (%) with ResNet-50, as shown in Table 1, surpassing the recent SOTA method ReferFormer with the same backbone by 2.2 points. Moreover, it surpasses all other SOTA methods that use the larger ResNet-101 on all evaluation metrics, which demonstrates the superiority of our method. When equipped with larger backbones, our method still shows a clear advantage, with accuracy gaps of 1.2 points for ResNet-101 and 1.0 points for Swin-L. We also evaluate our method with the well-known Video Swin Transformers. With the Video-Swin-Tiny backbone, our method surpasses the SOTA method with the same backbone by 1.8 points. With larger Video Swin Transformers (Small and Base models), our method still achieves SOTA performance, which shows the generality of our method.

Table 1. Comparison with the SOTA methods on Ref-YTB-VOS. Please refer to the reference IDs in the paper.
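For readers unfamiliar with the metric reported above, the sketch below shows how a J&F score is commonly formed in video object segmentation benchmarks: J is the region similarity (intersection over union of the predicted and ground-truth masks), F is a contour accuracy score, and J&F is their mean. Only J is computed here; F is taken as an input because its boundary matching is more involved. The function names and the nested-list mask format are assumptions for illustration, and scoring two empty masks as 1.0 is one common convention rather than something stated in this page.

```python
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Compute J: intersection over union of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_score: float, f_score: float) -> float:
    """Average the region similarity J and the contour accuracy F."""
    return (j_score + f_score) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(region_similarity(pred, gt))
```

In practice, both scores are averaged over all annotated frames and objects before being reported as percentages.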


      title={HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation},
      author={Han, Mingfei and Wang, Yali and Li, Zhihui and Yao, Lina and Chang, Xiaojun and Qiao, Yu},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},