

Pre-extracted feature vectors are available for every video in the dataset. They can be accessed with the EGO4D CLI. Please consult the table below for the appropriate --dataset option.
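As a hedged sketch of what such a download command might look like (the exact flag names and the output directory here are assumptions; check the Ego4D CLI documentation for the authoritative invocation, and note that access requires credentials obtained through the Ego4D license process):

```shell
# Download the SlowFast feature vectors for the dataset with the Ego4D CLI.
# "~/ego4d_data" is a placeholder output directory.
ego4d --output_directory="~/ego4d_data" --datasets slowfast8x8_r101_k400
```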


The table below lists the features pre-extracted from Ego4D. All features are extracted from the canonical videos, which are 30 FPS.

Window Size and Stride are in frames.

| Feature Type | Dataset(s) Trained On | Model Arch | Window Size | Stride | Model Weights Location |
| --- | --- | --- | --- | --- | --- |
| slowfast8x8_r101_k400 | Kinetics 400 | SlowFast 8x8 (R101 backbone) | 32 | 16 | TorchHub path: facebookresearch/pytorchvideo/slowfast_r101 |
| omnivore_video | Kinetics 400 / ImageNet-1K | Omnivore (Swin-B); video head | 32 | 6 | |
| omnivore_image (WIP) | Kinetics 400 / ImageNet-1K | Omnivore (Swin-B); image head | 1 | 6 | |
| Audio | N/A | N/A; planned | | | |

Features are extracted in a moving-window fashion. At each extraction point the model sees the next Window Size (W) frames (i.e., at frame i the model sees frames [i, i + W)). The first window starts at frame 0, and each subsequent window is offset by the stride until the end of the video is reached.

There is a boundary condition where the last window would extend past the end of the video. In this case, the extraction point is backed up so that the window still covers exactly W frames of the video. This occurs when the number of frames in the canonical video minus W is not divisible by the stride.

Example Window Stride

Suppose a video has 39 frames, with W = 32 and stride = 16. The extraction windows (in frame numbers) are:

  • [0, 31]
  • [7, 38], which is "backed up" from [16, 47] so the last window fits within the video
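The schedule above can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of the Ego4D codebase):

```python
# Sketch of the moving-window extraction schedule described above.
def extraction_windows(num_frames, window_size=32, stride=16):
    """Return (start, end) frame pairs (end inclusive), with the last
    window backed up so it stays inside the video."""
    windows = []
    start = 0
    while start + window_size <= num_frames:
        windows.append((start, start + window_size - 1))
        start += stride
    # Boundary condition: if the final stride would overshoot the video,
    # back the last window up so it still covers exactly window_size frames.
    if windows and windows[-1][1] != num_frames - 1:
        windows.append((num_frames - window_size, num_frames - 1))
    return windows

print(extraction_windows(39))  # [(0, 31), (7, 38)]
```

For the 39-frame example this reproduces the two windows listed above; when (num_frames − W) is divisible by the stride, no backed-up window is appended.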