Egocentric 4D Perception (EGO4D)

Benchmarks

Episodic Memory

Where is my X?
Egocentric video gives a recording of a wearer's daily life, and can be used to augment human memory on demand. Such a system might be able to remind a user where they left their keys, whether they added salt to a recipe, or which events they attended.

Querying Memory

There are three tasks within this benchmark, based on the input type used to query the memory: a visual query (e.g. find the location of the keys given an image of them), a textual query ("How many cups of sugar did I add?"), and a moment query (find all instances of "When did I play with the dog?").

Construction of Queries

For the language queries, a set of templates was designed, which annotators used to write questions for the task. Examples include "What is the state of object X?" and "Where is object X after event Y?". These were then rewritten for variety.
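As a rough illustration of template-based query writing, the sketch below fills the two example templates quoted above. The template keys, placeholder names and example values are hypothetical and are not taken from the dataset's annotation schema.

```python
# Hypothetical templates modelled on the two examples quoted above;
# the keys and placeholder names are illustrative only.
TEMPLATES = {
    "object_state": "What is the state of {obj}?",
    "object_location_after_event": "Where is {obj} after {event}?",
}


def fill_template(name: str, **slots: str) -> str:
    """Return the template called `name` with its placeholders filled in."""
    return TEMPLATES[name].format(**slots)


question = fill_template(
    "object_location_after_event", obj="the kettle", event="I made tea"
)
print(question)
```

In the process described above, an annotator would then rewrite such a question in freer wording.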

Recalling Lives

Given the broad nature of this benchmark, no particular subset of activities was targeted for this task, which makes the benchmark realistic and challenging.

Hand + Object Interaction

How do objects change during interactions?
Going beyond action recognition, this benchmark follows when, where and how an object is changed during an interaction, which is only possible from a first-person viewpoint.

Changes of State

We capture annotations of objects as they transform temporally, spatially and semantically; for example, an onion might be minced. These are represented by three different tasks in the benchmark: Point-of-no-return Temporal Localisation, Active Object Detection and State-Change Classification.

Pre/Post Conditions

Each annotation has been labelled with prior states (i.e. the pre-condition) and posterior states (the post-condition), as well as the point of no return (PNR) at which the state change is triggered.
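One way to picture such an annotation is as three ordered timestamps. The sketch below is an assumption for illustration: the class, field names and the absolute-error function are hypothetical, and are not the dataset's actual schema or the benchmark's official metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    """One hypothetical state-change annotation, in seconds from clip start."""

    pre_time: float   # object is still in its prior state
    pnr_time: float   # point of no return: the change is triggered
    post_time: float  # object is in its posterior state

    def __post_init__(self) -> None:
        # The three moments must occur in temporal order.
        if not (self.pre_time <= self.pnr_time <= self.post_time):
            raise ValueError("expected pre_time <= pnr_time <= post_time")


def pnr_error(annotation: StateChange, predicted_pnr: float) -> float:
    """Absolute temporal error of a predicted PNR, in seconds."""
    return abs(predicted_pnr - annotation.pnr_time)


# Made-up example: an onion being minced.
mincing = StateChange(pre_time=2.0, pnr_time=3.5, post_time=6.0)
print(pnr_error(mincing, predicted_pnr=4.0))  # 0.5
```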

World of Interactions

The data for this challenge has been selected from activities with a high level of hand-object interactions such as knitting, carpentry, and baking.

Social Interactions

Who is attending to whom?
An egocentric video provides a unique lens for studying social interactions because it captures utterances and nonverbal cues from each participant’s unique view and enables embodied approaches to social understanding.

More than Conversation

The Social benchmark extends the Audio-Visual Diarization benchmark towards understanding the conversations of a social group over a longer period of time for specific tasks.

Talking and Listening

This benchmark includes two different tasks focused on when a person is Looking at Me and when a person is Talking to Me.

Unique Interactions

The data within the Social Interaction task was collected specifically with this task in mind, using multi-user scenarios such as social deduction games, eating/drinking and playing basketball.

Audio-Visual Diarization

Who said what, and when?
Conversations are egocentric in nature, and a human-in-the-loop AI requires skills such as localizing a speaker and transcribing speech content.

Looking for Conversation

This benchmark contains 2 different tasks focused on visual data: localizing and tracking of the speakers in the visual field of view. Note that identities are anonymized to match consortium guidelines.

Hearing the Words

The benchmark also includes 2 tasks for the audio modality: diarization (the temporal extent of the sentences spoken) and the transcription of the conversation.
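A diarization output answers "who said what, and when". The sketch below shows one possible representation as a list of labelled time segments and sums the speaking time per speaker. The `Segment` type, the speaker labels and the example utterances are invented for illustration and do not reflect the dataset's annotation format.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """One hypothetical diarized utterance."""

    speaker: str  # anonymised speaker label
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcription of the utterance


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the duration of all segments for each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up conversation.
segments = [
    Segment("person_1", 0.0, 2.5, "Can you pass the tent pegs?"),
    Segment("person_2", 2.5, 4.0, "Sure, here you go."),
    Segment("person_1", 4.0, 5.0, "Thanks."),
]
print(speaking_time(segments))
```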

Much Ado About Talking

With this task focused on conversations, scenarios were chosen which included multiple participants interacting together, such as eating, playing games or setting up tents.

Forecasting

Predicting the future is a critical skill for AI systems to provide timely assistance for users. With a myriad of long-form, unscripted videos, Ego4D provides an interesting challenge for different forecasting tasks.

Where Will I Move?

Two tasks consider the future motion of the user's hands and feet. Models should predict where the camera wearer will go within the scene and the future location of the wearer's hands.

What Will Happen Next?

Two tasks consider short-term and long-term future anticipation. In the short term, algorithms should predict the next object interaction that will take place, along with a countdown to it. In the long term, they should predict the next possible sequence of actions.

Data for Prophets

The data for this challenge has been selected from a diverse set of activities containing many human-object interactions and movements such as brick making, cooking or carpentry.

QUESTIONS / ANSWERS

What to cite referencing this effort?

If you use the dataset or annotations, or draw inspiration from this work, cite the arXiv paper:

K. Grauman et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video. arXiv preprint arXiv:2110.07058, 2021.

How can I download the dataset?

Ego4D is now publicly available. To obtain the dataset or any annotations, you must first review our license agreement and accept the terms. Go here to review and execute this agreement; you will be emailed a set of AWS access credentials when your license agreement is approved, which will take 48 hours. You can review a draft of the licenses before signing here. In the meantime, you can check out the data overview and sample notebooks here to get familiar with the dataset, and download the CLI and dataloaders to get set up in advance.

What metadata is available?

For each video, we provide information about the collecting partner/university, date of recording and recording equipment, as well as video parts when the video is made up of smaller chunks. Information about the availability of IMU and audio, and whether videos have been redacted, is also included. Overviews of the metadata and annotations can be found in our docs.
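To show how such per-video metadata might be used, the sketch below filters a small list of records for videos that have audio and have not been redacted. The `VideoMeta` class, its field names and the placeholder records are all hypothetical; consult the docs for the real metadata schema.

```python
from dataclasses import dataclass


@dataclass
class VideoMeta:
    """Hypothetical per-video metadata record (field names are assumptions)."""

    video_id: str
    partner: str       # collecting partner/university
    device: str        # recording equipment
    has_audio: bool
    has_imu: bool
    is_redacted: bool


def usable_for_audio(videos: list[VideoMeta]) -> list[str]:
    """Return ids of videos that have audio and have not been redacted."""
    return [v.video_id for v in videos if v.has_audio and not v.is_redacted]


# Placeholder records, not real dataset entries.
videos = [
    VideoMeta("vid_001", "partner_a", "device_x", True, False, False),
    VideoMeta("vid_002", "partner_b", "device_y", False, True, False),
    VideoMeta("vid_003", "partner_a", "device_x", True, True, True),
]
print(usable_for_audio(videos))
```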

Who collected this data?

The data was collected from 923 participants. Around 70% of them volunteered to self-identify their demographics (age, gender, country of residence, and occupation), and we showcase the resulting distributions.

Does the data contain identifying information of individuals?

The collecting partner holds consent forms and/or release forms for all videos. The data contains faces and other identifying information only when consent has been collected from participants. For the majority of videos, data has been de-identified pre-release. Refer to our privacy statement and the arXiv paper (Sec 3.4 and appendix C) for details of our privacy and de-identification pipeline.

What coverage of scenarios do you have?

A sample visualisation of our scenarios is below. The outer circle shows the 14 most common scenarios (70% of the data). The Wordle shows scenarios in the remaining 30%. The inner circle is color coded by the contributing partner (see map marker above).

Do you offer pre-extracted features?

Yes. We provide precomputed Action video features for the full dataset, and plan other features. You can find details and download these features from here.

What equipment, resolution and frame rate are available?

This depends on equipment. To avoid models overfitting to a single capture device, seven different head-mounted cameras were deployed across the dataset: GoPro, Vuzix Blade, Pupil Labs, ZShades, ORDRO EP6, iVue Rincon 1080, and Weeview. We release all footage using the native resolution, but also offer a standardised frame-rate version of 30fps for ease of use. All benchmark results use the standardised version.
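With a fixed 30fps version, converting between timestamps and frame indices is simple arithmetic. The sketch below assumes zero-based frame indexing starting at time 0; the function names are illustrative and not part of any Ego4D tooling.

```python
import math

STANDARD_FPS = 30  # frame rate of the standardised version described above


def frame_index(timestamp_s: float, fps: int = STANDARD_FPS) -> int:
    """Zero-based index of the frame being shown at `timestamp_s` seconds."""
    # The small epsilon guards against floating-point products that land
    # just below a whole number (e.g. 74.99999999 instead of 75).
    return math.floor(timestamp_s * fps + 1e-9)


def frame_time(index: int, fps: int = STANDARD_FPS) -> float:
    """Timestamp in seconds at which frame `index` starts."""
    return index / fps


print(frame_index(2.5))  # 75
print(frame_time(75))    # 2.5
```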

How can I participate in the benchmarks?

The first round of challenges is open now. Please see the challenge documentation here.
Results will be announced at the Joint 1st Ego4D Workshop (in conjunction with 10th EPIC Workshop) alongside CVPR 2022.

EGO4D Team

Carnegie Mellon University, Pittsburgh, U.S.

Kris Kitani (PI)
Xingyu Liu
Qichen Fu
Sean Crane
Xuhua Huang
Xindi Wu

Carnegie Mellon University Africa, Rwanda

Abrham Gebreselasie

King Abdullah University of Science and Technology, KSA

Bernard Ghanem (PI)
Chen Zhao
Mengmeng Xu
Merey Ramazanova

University of Minnesota, U.S.

Hyun Soo Park (PI)
Jayant Sharma
Tien Do
Zachary Chavis

International Institute of Information Technology, Hyderabad, India

C. V. Jawahar (PI)
Raghava Modhugu
Siddhant Bansal

Indiana University Bloomington, U.S.

David Crandall (PI)
Yuchen Wang
Weslie Khoo

University of Pennsylvania, U.S.

Jianbo Shi (PI)

University of Catania, Italy

Giovanni Maria Farinella (PI)
Antonino Furnari

University of Tokyo, Japan

Yoichi Sato (PI)
Takuma Yagi
Takumi Nishiyasu
Yifei Huang
Yusuke Sugano
Zhenqiang Li

Facebook AI Research, International

Kristen Grauman (PI)
Jitendra Malik (PI)
Dhruv Batra
Eugene Byrne
Vincent Cartillier
Morrie Doulaty
Akshay Erapalli
Christian Fuegen
Rohit Girdhar
Jackson Hamburger
Tal Hassner
James Hillis, FRL
Vamsi Krishna Ithapu, FRL
Hao Jiang
Hanbyul Joo
Jachym Kolar
Satwik Kottur
Devansh Kukreja
Anurag Kumar, FRL
Federico Landini
Chao Li, FRL
Miguel Martin
Tullie Murrell
Tushar Nagarajan
Christoph Feichtenhofer
Karttikeya Mangalam
Richard Newcombe, FRL
Santhosh Kumar Ramakrishnan
Leda Sari, FRL
Kiran Somasundaram, FRL
Lorenzo Torresani
Minh Vo, FRL
Andrew Westbury
Mingfei Yan, FRL

University of Bristol, UK

Dima Damen (PI)
Michael Wray
Will Price
Jonathan Munro
Adriano Fragomeni

National University of Singapore, Singapore

Mike Zheng Shou (PI)
Haizhou Li (Co-PI)
Eric Z. Xu
Ruijie Tao
Yunyi Zhu

Georgia Institute of Technology, U.S.

Jim Rehg (PI)
Miao Liu
Fiona Ryan
Audrey Southerland
Wenqi Jia

Universidad de los Andes, Colombia

Pablo Arbelaez (PI)
Cristina Gonzalez
Paola Ruiz Puentes

Massachusetts Institute of Technology, U.S.

Aude Oliva (PI)
Antonio Torralba (PI)

Others

Ilija Radosavovic, UC Berkeley
