Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering

Computer Vision & Multimodal AI
arXiv: 2508.21773v1
Authors

Nattapong Kurpukdee, Adrian G. Bors

Abstract

We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent complex and rich spatio-temporal media, widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have focused only on supervised continual learning, relying on knowledge of labels and task boundaries, even though labeled data is costly and often impractical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises additional challenges due to the extra computational and memory requirements of processing videos compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features, extracted by unsupervised video transformer networks, as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters in order to capture new knowledge when learning a succession of tasks. We leverage transfer learning from previous tasks as an initial state for knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, namely UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
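To make the novelty-detection idea concrete, here is a minimal sketch of the kind of KDE-based test the abstract describes. Everything in it is an illustrative assumption rather than the paper's actual setup: the toy random "embeddings", the feature dimension, the Gaussian bandwidth, the 5th-percentile threshold, and the use of scikit-learn's `KernelDensity` as a stand-in density estimator.

```python
# Minimal sketch of KDE-based novelty detection over deep video features.
# The feature dimension, bandwidth, and threshold below are illustrative
# assumptions, not the paper's actual hyper-parameters.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Stand-ins for deep video embeddings: one memory buffer of past-task
# features, and one batch of incoming new-task features.
memory_features = rng.normal(size=(500, 64))
new_features = rng.normal(loc=2.0, size=(32, 64))  # shifted => likely novel

# Fit a Gaussian KDE on the stored features (non-parametric density model).
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(memory_features)

# Score new samples under the old density; low log-likelihood signals novelty.
log_density = kde.score_samples(new_features)
threshold = np.percentile(kde.score_samples(memory_features), 5)  # assumed
is_novel = log_density < threshold

if is_novel.mean() > 0.5:
    print("Batch flagged as novel: expand memory with a new cluster.")
```

In this toy version, a batch whose samples mostly fall below the density threshold triggers cluster expansion, mirroring (at a very coarse level) how the paper uses novelty detection to decide when new knowledge must be captured.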

Paper Summary

Problem
The main problem this paper addresses is unsupervised video continual learning, where a model must learn a succession of tasks without any labels or task boundaries. This is challenging because the model must balance stability (preserving past knowledge) with plasticity (learning new information) while receiving no supervisory signal about when one task ends and the next begins.
Key Innovation
The key innovation of this paper is a non-parametric deep embedded clustering approach for unsupervised video continual learning. The approach uses kernel density estimation (KDE) over deep embedded video features as a non-parametric representation of the data, and mean-shift to extract clusters of video data. Memory buffers store video features from previous tasks to mitigate catastrophic forgetting, and a novelty detection criterion dynamically expands the set of clusters when unfamiliar data arrives. A sketch of the clustering step follows below.
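The following sketch illustrates the mean-shift clustering step on embedded features. The toy three-mode data, the bandwidth estimation via `estimate_bandwidth`, and the use of scikit-learn's `MeanShift` are assumptions for demonstration only, not the paper's implementation; the point is that mean-shift finds clusters as modes of a KDE, so the number of clusters is not fixed in advance.

```python
# Illustrative sketch: extracting clusters from embedded video features
# with mean-shift. The toy data and bandwidth choice are assumptions.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(1)

# Toy stand-in for deep video embeddings drawn from three unlabeled modes.
features = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(100, 16)) for c in (0.0, 2.0, 4.0)
])

# Estimate a bandwidth from the data, then run mean-shift: clusters emerge
# as modes of the underlying density, with no preset cluster count.
bandwidth = estimate_bandwidth(features, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(features)

print(f"Discovered {len(ms.cluster_centers_)} clusters without labels.")
```

This mode-seeking behavior is what makes mean-shift a natural fit for continual learning: as the KDE representation absorbs new data, new density modes can appear and be adopted as new clusters.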
Practical Impact
This research has significant practical impact because it addresses a long-standing challenge in machine learning: unsupervised learning of complex data like videos. The proposed approach can be applied to various real-world applications, such as video surveillance, action recognition, and anomaly detection. By enabling models to learn from unlabeled data, this research can reduce the need for labeled data and make machine learning more accessible and efficient.
Analogy / Intuitive Explanation
Imagine a person trying to learn a new language without any guidance or feedback. They would need to balance remembering the grammar and vocabulary they've already learned with learning new words and phrases. Similarly, the model in this paper must balance stability and plasticity to learn from a succession of tasks without any labels or task boundaries. The proposed approach uses a dynamic clustering method to group similar video data together, allowing the model to learn from new information while preserving past knowledge.
Paper Information

Categories: cs.CV cs.AI cs.LG
arXiv ID: 2508.21773v1
