In this section, we’ll give a brief introduction to video data preprocessing by fine-tuning a VideoMAE model on a video classification task. We’ll use a lightweight model and a fairly small dataset for this demonstration, so the code runs end-to-end on any consumer-grade GPU, including the 16 GB T4 GPU provided in the Google Colab free tier.

Note: The training code for video classification is adapted from the Transformers Video Classification Guide. See that guide if you want further details.

Before you begin, make sure you have all the necessary libraries installed:

pip install -q transformers[torch] accelerate evaluate datasets git+https://github.com/facebookresearch/pytorchvideo.git

We will use torchvision and PyTorchVideo to process and prepare the video data.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

from huggingface_hub import notebook_login

notebook_login()

Load UCF101 dataset

Let’s start by loading a subset of the UCF-101 dataset. UCF-101 is a video dataset from the University of Central Florida that contains 13,320 videos across 101 action categories.

from huggingface_hub import hf_hub_download

hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")

After the subset has been downloaded, you need to extract the compressed archive:

import tarfile

with tarfile.open(file_path) as t:
    t.extractall(".")
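
If your Python version supports tar extraction filters (added in Python 3.12 and backported to some earlier patch releases), you may prefer to pass `filter="data"` so that archive members cannot be written outside the target directory. The following is a minimal sketch of that idea; the helper name `extract_archive` is our own and is not part of any library.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract a tar archive, using the safer "data" filter when available.

    `extract_archive` is a hypothetical helper written for this example.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Blocks members that would land outside target_dir, as well as
            # device files and other special entries.
            archive.extractall(target_dir, filter="data")
        else:
            # Older Python versions: plain extraction, as in the snippet above.
            archive.extractall(target_dir)
    return target_dir
```

With the variables from the download step, the call would be `extract_archive(file_path, ".")`.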

At a high level, the dataset is organized like so:

UCF101_subset/
    train/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    val/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    test/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...

You can then count the total number of videos across the three splits.

import pathlib

dataset_root_path = pathlib.Path("UCF101_subset")
video_count_train = len(list(dataset_root_path.glob("train/*/*.avi")))
video_count_val = len(list(dataset_root_path.glob("val/*/*.avi")))
video_count_test = len(list(dataset_root_path.glob("test/*/*.avi")))
video_total = video_count_train + video_count_val + video_count_test
print(f"Total videos: {video_total}")

# Total videos: 405
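
To see how that total is distributed, the same glob pattern can be grouped by split and by class folder. This is a minimal sketch that assumes the class-per-folder layout and the `.avi` extension used above; `count_videos` is a hypothetical helper name, not a library function.

```python
from collections import Counter
from pathlib import Path


def count_videos(root, splits=("train", "val", "test"), pattern="*.avi"):
    """Return {split: Counter({class_name: n_videos})} for a class-per-folder layout."""
    root = Path(root)
    summary = {}
    for split in splits:
        counts = Counter()
        # Each video sits at <root>/<split>/<class_name>/<file>.
        for video_path in (root / split).glob(f"*/{pattern}"):
            counts[video_path.parent.name] += 1
        summary[split] = counts
    return summary


root = Path("UCF101_subset")
if root.exists():  # only runs once the archive has been extracted
    for split, counts in count_videos(root).items():
        print(f"{split}: {sum(counts.values())} videos in {len(counts)} classes")
```

A breakdown like this makes it easy to spot an empty class folder or an unbalanced split before training starts.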

This subset contains 10 video classes, whereas the original dataset has 101.
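
One way to check the class count yourself is to read the class names from the folder names and map them to integer ids, which a classification model will need later anyway. The sketch below assumes every class has a folder under `train/`; `build_label_maps` is a hypothetical helper name introduced for this example.

```python
from pathlib import Path


def build_label_maps(root, split="train"):
    """Map class-folder names to integer ids (and back), in alphabetical order."""
    split_dir = Path(root) / split
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    label2id = {name: idx for idx, name in enumerate(class_names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


root = Path("UCF101_subset")
if (root / "train").exists():  # only runs once the archive has been extracted
    label2id, id2label = build_label_maps(root)
    # For the subset described above, this should report 10 classes.
    print(f"{len(label2id)} classes: {list(label2id)}")
```

Sorting the names keeps the mapping stable across runs, so the ids stored in a trained model's configuration stay meaningful.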