In this section, we give a brief introduction to video data preprocessing by fine-tuning a VideoMAE model for a video classification task. We use a lightweight model and a fairly small dataset, so the code runs end-to-end on a consumer-grade GPU, including the 16 GB T4 GPU provided in the Google Colab free tier.
Note: The training code for video classification is adapted from the Transformers Video Classification Guide. Refer to that guide if you want further details.
Before you begin, make sure you have all the necessary libraries installed:
pip install -q transformers[torch] accelerate evaluate datasets git+https://github.com/facebookresearch/pytorchvideo.git
We will use torchvision and PyTorchVideo to process and prepare the video data.
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
from huggingface_hub import notebook_login
notebook_login()
Let’s start by downloading a subset of the UCF-101 dataset. UCF-101 is a video dataset from the University of Central Florida that contains 13,320 videos across 101 action categories.
from huggingface_hub import hf_hub_download
hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")
After the subset has been downloaded, you need to extract the compressed archive:
import tarfile
with tarfile.open(file_path) as t:
    t.extractall(".")
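Recent Python releases let `extractall` take a `filter` argument that rejects unsafe archive members such as absolute paths or `..` components. The sketch below wraps the extraction step in a hypothetical helper, `extract_archive` (a name introduced here for illustration, not part of the original guide), and falls back to plain extraction on interpreters that lack the filter feature.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest="."):
    """Extract a tar archive into dest and return the destination path."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Interpreters with extraction filters: block absolute paths,
            # parent-directory escapes, and special files.
            archive.extractall(dest, filter="data")
        else:
            # Older interpreters have no filter argument; only use this
            # branch with archives you trust.
            archive.extractall(dest)
    return dest
```

Calling `extract_archive(file_path)` with the `file_path` returned by `hf_hub_download` should produce the same `UCF101_subset` folder as the snippet above.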
At a high level, the dataset is organized like so:
UCF101_subset/
    train/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    val/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    test/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
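Because every class lives in its own folder, the folder names can double as labels. The following is a minimal sketch, assuming the layout shown above, of how integer label mappings could be derived from the `train` split. The helper name `build_label_maps` is hypothetical and chosen only for this illustration.

```python
from pathlib import Path


def build_label_maps(root, split="train"):
    """Map class folder names under root/split to integer ids and back."""
    split_dir = Path(root) / split
    # Sort the names so the ids are stable across runs and machines.
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    label2id = {name: i for i, name in enumerate(class_names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


# Only runs once the archive has been extracted next to this script.
root = Path("UCF101_subset")
if (root / "train").is_dir():
    label2id, id2label = build_label_maps(root)
    print(f"{len(label2id)} classes: {list(label2id)}")
```

Sorting before enumerating keeps the mapping deterministic, which matters when the same ids are reused at evaluation or inference time.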
You can then count the total number of videos.
import pathlib
# Root folder created by extracting the archive
dataset_root_path = pathlib.Path("UCF101_subset")
# Each split contains one folder per class, with .avi files inside
video_count_train = len(list(dataset_root_path.glob("train/*/*.avi")))
video_count_val = len(list(dataset_root_path.glob("val/*/*.avi")))
video_count_test = len(list(dataset_root_path.glob("test/*/*.avi")))
video_total = video_count_train + video_count_val + video_count_test
print(f"Total videos: {video_total}")
# Total videos: 405
This subset contains 10 classes of videos, whereas the original dataset has 101 classes.
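To check the class count yourself, you can tally videos per class folder in each split. This is a small sketch that assumes the directory layout and `.avi` extension used above. `count_videos_by_class` is a hypothetical helper written for this example.

```python
from collections import Counter
from pathlib import Path


def count_videos_by_class(root, split, pattern="*.avi"):
    """Return a Counter of {class_name: number_of_videos} for one split."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        # Missing split (for example, archive not extracted yet).
        return Counter()
    # The parent folder of each video file is its class name.
    return Counter(p.parent.name for p in split_dir.glob(f"*/{pattern}"))


for split in ("train", "val", "test"):
    counts = count_videos_by_class("UCF101_subset", split)
    print(f"{split}: {len(counts)} classes, {sum(counts.values())} videos")
```

The per-class counts also make it easy to spot class imbalance before training.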