The Episodic Memory task aims to make past video queryable: given a query, it requires localizing where the answer can be seen within the user’s past video.
Hands and Objects
Hands and Objects aims to understand the camera wearer’s present activity in terms of interactions with objects.
Forecasting movements and interactions requires comprehending the camera wearer’s intention.
The Audio-Visual Diarization tasks involve localizing and tracking the participants, detecting each speaker’s activity, and transcribing all speech content.
The Social benchmark focuses on multimodal understanding of conversational interactions via attention and speech.