Benchmarks Overview

Episodic Memory

The Episodic Memory task aims to make past video queryable: given a query, it requires localizing where the answer can be seen within the user’s past video.

Hands and Objects

The Hands and Objects benchmark aims to understand the camera wearer’s present activity in terms of interactions with objects.


Forecasting

Forecasting movements and interactions requires comprehending the camera wearer’s intention.

Audio-Visual Diarization

The Audio-Visual Diarization tasks involve localizing and tracking the participants, detecting each speaker's activity, and transcribing all speech content.

Social Interactions

The Social benchmark focuses on multimodal understanding of conversational interactions via attention and speech.