Video summarization, Compositional video understanding, & Tracking everything | Multimodal Weekly 63

In the 63rd session of Multimodal Weekly, we had three exciting presentations on video summarization, compositional video understanding, and tracking everything.

✅ Siyuan Li from ETH Zurich discussed his work MASA, a novel method for robust instance association learning that can match any object within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations (a rough illustrative sketch follows the links below).
Follow Siyuan: https://siyuanliii.github.io/
MASA: https://matchinganything.github.io/
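For a concrete picture of the self-supervised association idea, here is a minimal, hypothetical sketch: instance embeddings extracted from two augmented views of the same image are pulled together by a contrastive loss, so no tracking labels are needed. The tensors and the info_nce helper below are illustrative assumptions, not MASA's actual code (in MASA, the instances would come from SAM masks pooled over backbone features).

```python
# Illustrative sketch only (assumed names and shapes), not the official MASA implementation.
# Two augmented views of one image contain the same instances, so instance i in view 1
# should match instance i in view 2; a contrastive (InfoNCE-style) loss enforces this.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Row i of `anchor` should be most similar to row i of `positive`."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(anchor.size(0))         # instance i matches instance i
    return F.cross_entropy(logits, targets)

# Stand-ins for instance embeddings; in practice these would be pooled from SAM masks
# applied to two different augmentations of the same unlabeled image.
num_instances, dim = 8, 256
view1 = torch.randn(num_instances, dim, requires_grad=True)
view2 = view1.detach() + 0.1 * torch.randn(num_instances, dim)

loss = info_nce(view1, view2)
loss.backward()
print(f"contrastive association loss: {loss.item():.4f}")
```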

✅ Jaewon Son from Sungkyunkwan University discussed CSTA (CNN-based SpatioTemporal Attention), a video summarization method that stacks the per-frame features of a single video into an image-like representation and applies a 2D CNN to these features to predict an importance score for each frame (a minimal illustrative sketch follows the links below).
Follow Jaewon: https://scholar.google.com/citations?...
CSTA: https://github.com/thswodnjs3/CSTA?ta...
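As a rough illustration of the CSTA idea, the sketch below stacks per-frame features into a (T, D) image-like tensor, lets a small 2D CNN produce an attention map of the same shape, and scores each frame from the attended features. The CstaLikeScorer module, its layer sizes, and the feature dimension are assumptions made for this example, not the official implementation.

```python
# Hypothetical CSTA-style scorer (assumed architecture), not the official code.
import torch
import torch.nn as nn

class CstaLikeScorer(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        # 2D CNN acting as spatiotemporal attention over the (T, D) "feature image"
        self.attention_cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.scorer = nn.Linear(feat_dim, 1)  # per-frame importance score

    def forward(self, frame_features):            # frame_features: (T, D)
        x = frame_features[None, None]             # -> (1, 1, T, D) image-like tensor
        attn = self.attention_cnn(x)               # -> (1, 1, T, D) attention map
        attended = (x * attn)[0, 0]                # -> (T, D) attended features
        return self.scorer(attended).squeeze(-1)   # -> (T,) importance scores

scores = CstaLikeScorer()(torch.randn(120, 1024))  # e.g., 120 frames of 1024-d features
print(scores.shape)  # torch.Size([120])
```

The appeal discussed in the talk is efficiency: a convolutional attention map is much cheaper to compute than full spatiotemporal self-attention over all frames.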
​​​​​​​​​
✅ Jinwoo Ahn from Hanyang University discussed his paper on compositional video understanding. The proposed method includes a new Transformer architecture capable of learning spatiotemporal graphs and a compositional learning method that learns disentangled features for each semantic unit (a rough graph-construction sketch follows the links below).
Follow Jinwoo: https://github.com/wasabipretzel
Compositional Video Understanding: https://github.com/hy0y/st-gt
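To make the spatiotemporal-graph idea concrete, here is a rough, hypothetical sketch of how such a graph could be assembled from per-frame object detections; the build_st_graph function and its data layout are assumptions for illustration, not the paper's code.

```python
# Illustrative spatiotemporal graph construction (assumed data layout), not the paper's code.
from itertools import combinations

def build_st_graph(detections):
    """detections: list over frames; each frame is a list of (track_id, label) pairs.
    Nodes are (frame_idx, track_id); spatial edges connect objects co-occurring in a
    frame; temporal edges connect the same track across consecutive frames."""
    nodes, spatial_edges, temporal_edges = [], [], []
    for t, frame in enumerate(detections):
        for track_id, label in frame:
            nodes.append(((t, track_id), label))
        # spatial edges: every pair of objects appearing together in frame t
        for (id_a, _), (id_b, _) in combinations(frame, 2):
            spatial_edges.append(((t, id_a), (t, id_b)))
        # temporal edges: same track id present in frame t-1 and frame t
        if t > 0:
            prev_ids = {tid for tid, _ in detections[t - 1]}
            for track_id, _ in frame:
                if track_id in prev_ids:
                    temporal_edges.append(((t - 1, track_id), (t, track_id)))
    return nodes, spatial_edges, temporal_edges

# Toy example: a hand and a cup tracked over two frames.
print(build_st_graph([[(0, "hand"), (1, "cup")], [(0, "hand"), (1, "cup")]]))
```

A Transformer can then attend over these nodes along the spatial and temporal edges, which is the role of the spatiotemporal graph Transformer in the presented framework.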

Timestamps:
00:07 Introduction
03:24 Siyuan starts
03:50 Motivation and problem
05:35 Scaling up
08:08 Existing self-supervision signal
09:43 Training images and real-world images
10:21 Key idea
10:58 Method overview
12:40 Training pipeline
14:05 Inference pipeline
14:55 Experiments
16:07 Qualitative results
17:45 Comparison with VOS-based methods
20:00 Evaluation - standard tracking metrics and the TETA metric
21:25 Quick chat with Siyuan
24:38 Jaewon starts
25:20 Task - video summarization
25:54 Overall workflow - models are trained to predict the importance score for each frame
26:55 Preliminary - temporal attention
27:36 Preliminary - spatial attention
28:20 Problem - spatiotemporal attention is inefficient
29:17 Goal - video summarization models considering spatiotemporal attention and efficiency
29:28 Approach - CNN attention (treating the video as an image)
30:18 Motivation - CNN learns position in the image
31:13 Motivation - CNN reduces costs for attention models
32:02 Overview - CSTA: CNN-based spatiotemporal attention
32:45 Architecture - Embedding process
33:10 Architecture - Prediction process
33:53 Architecture - Attention module
34:15 Architecture - Mixing module
34:58 Experiment - Validating the CNN as the attention mechanism
37:03 Experiment - Performance comparison
38:15 Experiment - Computation analysis
39:39 Conclusion - Summary and contribution
41:50 Jinwoo starts
42:15 Introduction - Research problem
43:12 Introduction - Main ideas
44:14 Introduction - Task description
45:08 Compositional learning framework - Spatiotemporal graph construction
45:49 Compositional learning framework - Spatiotemporal graph Transformer
46:35 Compositional learning framework - Object-oriented video encoder
47:35 Compositional learning framework - Embedding disentangling module
48:25 Experimental results - Complex action recognition
49:55 Q&A with all speakers

Join the Multimodal Minds community to receive an invite for future webinars: / discord
