Does the workshop have any proceedings?
No. We will accept only extended abstracts, which will not be published as part of any proceedings.
Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Intel Labs | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | pyannote | Validation Report | Code |
2nd Place | diart | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | AVATAR-Google | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Thereisnospoon | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | IVUL | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | VideoIntern | Validation Report | Code |
2nd Place | University of Wisconsin-Madison | Validation Report | Forthcoming |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | VideoIntern | Validation Report | Code |
2nd Place | University of Wisconsin-Madison | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | PKU-WICT-MIPL | Forthcoming | Code |
2nd Place | KeioEgo | Validation Report | Forthcoming |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | University of Texas at Austin & Meta AI | Validation Report | Forthcoming |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Autonomous Systems | Validation Report | Forthcoming |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | VideoIntern | Validation Report | Code |
2nd Place | HVRL | Forthcoming | Forthcoming |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Video Intern | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | VideoIntern | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Red Panda@IMAGINE | Validation Report | Code |
2nd Place | EgoMotion-COMPASS | Validation Report | Code |

Place | Team | Validation Report | Code |
---|---|---|---|
1st Place | Red Panda@IMAGINE | Validation Report | Code |
2nd Place | EgoMotion-COMPASS | Validation Report | Code |
3rd Place | University of Texas at Austin & Meta AI | Validation Report | Forthcoming |
You are invited to submit extended abstracts to the second edition of the International Ego4D Workshop, which will be held alongside ECCV 2022 in Tel Aviv.

These abstracts represent existing or ongoing work and will not be published as part of any proceedings. We welcome all work within the egocentric domain; it is not necessary to use the Ego4D dataset in your work. We expect a submission to address one or more of the following topics (this is a non-exhaustive list):

Extended abstracts should be 2-4 pages long, including figures and tables but excluding references. We invite submissions of ongoing or already published work, as well as reports on demonstrations and prototypes. The 2nd International Ego4D Workshop gives authors the opportunity to present their work to the egocentric community and to invite discussion and feedback. Accepted work will be presented either as an oral presentation (virtual or in-person) or as a poster presentation. The review will be single-blind, so there is no need to anonymize your work; submissions should otherwise follow the format of ECCV submissions (information can be found here). Accepted abstracts will not be published as part of any proceedings, so they can be uploaded to arXiv etc., and links will be provided on the workshop's webpage. Submissions will be managed through the Ego4D@ECCV2022 CMT website.
Event | Date |
---|---|
Challenge Deadline | 18 September 2022 |
Challenge Report Deadline | 25 September 2022 |
Extended Abstract Deadline | 30 September 2022 |
Notification to Authors | 7 October 2022 |
Workshop Date | 24 October 2022 |
Invited Talk -- Digitizing Touch (Lihi Zelnik-Manor)

Imagine being able to touch virtual objects, interact physically with computer games, or feel items located elsewhere on the globe. The applications of such haptic technology would be diverse and broad. Interestingly, while excellent visual and auditory feedback devices exist, cutaneous feedback devices are still in their infancy. In this talk I will give a brief introduction to the world of haptic feedback devices and the challenges it poses. I will then present HUGO, a device designed through a human-centered process that triggers the mechanoreceptors in our skin, enabling people to experience the touch of digitized surfaces "in the wild". This talk is likely to leave us with many open questions that require research to answer.
Invited Talk -- “Mind Reading”: Self-supervised decoding of visual data from brain activity (Michal Irani)

Can we reconstruct the natural images and videos that a person saw directly from their fMRI brain recordings? This is a particularly difficult problem, given the very few “paired” training examples available (images/videos with their corresponding fMRI recordings). In this talk I will show how such image/video reconstruction can be performed, despite the few training examples, by exploiting self-supervised training on many “unpaired” data, i.e., images and videos without any fMRI recordings. I will further show how large-scale image classification (to more than 1000 classes!) can be performed on sparse fMRI data.
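As a concrete illustration of the paired-plus-unpaired training idea described in this abstract, here is a minimal sketch of one way such a setup could be wired together. The linear encoder/decoder, voxel count, image size, and loss weighting are placeholder assumptions for the sketch, not the speaker's actual models.

```python
# Sketch only: an encoder E (image -> fMRI) and decoder D (fMRI -> image) trained
# jointly with (a) a supervised loss on the scarce paired examples and (b) a
# self-supervised round-trip loss D(E(x)) ~ x on many unpaired images, which
# require no fMRI recordings. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_VOXELS = 4000        # assumed voxel count, for illustration only
IMG_DIM = 3 * 64 * 64  # assumed 64x64 RGB images

encoder = nn.Sequential(nn.Flatten(), nn.Linear(IMG_DIM, N_VOXELS))                  # image -> fMRI
decoder = nn.Sequential(nn.Linear(N_VOXELS, IMG_DIM), nn.Unflatten(1, (3, 64, 64)))  # fMRI -> image
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def training_step(paired_img, paired_fmri, unpaired_img, w_unpaired=1.0):
    # Supervised term: use the few (image, fMRI) pairs in both directions.
    loss_sup = F.mse_loss(encoder(paired_img), paired_fmri) + \
               F.mse_loss(decoder(paired_fmri), paired_img)
    # Self-supervised term: unpaired images only have to survive the
    # image -> fMRI -> image round trip, so no recordings are needed.
    loss_self = F.mse_loss(decoder(encoder(unpaired_img)), unpaired_img)
    loss = loss_sup + w_unpaired * loss_self
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```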
"This talk discusses recent research for self-supervised learning from video. I first present a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. Then I will present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients, a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. Finally, I will talk about a simple extension of MAE to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. We observe that masked pre-training can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our studies suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge. "
All times are given in Tel Aviv local time (GMT+3).
Start Time | End Time | Title | Speaker |
---|---|---|---|
9:00 AM | 9:15 AM | Welcome Remarks | Giovanni Maria Farinella |
9:15 AM | 9:45 AM | Invited Talk -- Digitizing Touch | Lihi Zelnik-Manor |
9:45 AM | 10:00 AM | Ego4D Challenge Results: Insights from the Winning Approaches across Five Benchmarks | Rohit Girdhar |
10:00 AM | 10:30 AM | Break | |
10:30 AM | 11:00 AM | Invited Talk -- Masked Video Representation Learning | Christoph Feichtenhofer |
11:00 AM | 11:15 AM | Episodic Memory for Egocentric Perception: Sharing the Leading Approaches to Moments, Visual and Natural Language Queries | Satwik Kottur |
11:15 AM | 11:30 AM | Hand + Object Interactions: Examining Leading Approaches to Temporal Localization, Active Object Detection and State-Change Classification | Siddhant Bansal |
11:30 AM | 11:45 AM | Forecasting Activities in First-Person Videos: What Works for Action, Object Interaction, and Hand Position Anticipation | Antonino Furnari |
11:45 AM | 12:00 PM | Understanding Interaction: Sharing Insights from Social Understanding and AV Diarization in the Context of Egocentric Videos | Mike Z. Shou |
12:00 PM | 12:15 PM | Winners Felicitation and Certificates | Andrew Westbury |
12:15 PM | 12:20 PM | Spotlight Talk -- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition | Victor Escorcia |
12:20 PM | 12:25 PM | Spotlight Talk -- UnrealEgo: A New Dataset for Robust Egocentric 3D Human Motion Capture | Hiroyasu Akada |
12:25 PM | 12:30 PM | Spotlight Talk -- EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices | Siwei Zhang |
12:30 PM | 12:35 PM | Spotlight Talk -- OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos | Victor Escorcia |
12:35 PM | 12:40 PM | Spotlight Talk -- Students taught by multimodal teachers are superior action recognizers (VIRTUAL) | Gorjan Radevski |
12:40 PM | 12:45 PM | Spotlight Talk -- Event-based Stereo Depth Estimation from Ego-motion using Ray Density Fusion (VIRTUAL) | Suman Ghosh |
12:45 PM | 12:50 PM | Spotlight Talk -- Hand and Object Detection in Egocentric Videos with Color Local Features and Random Forest (VIRTUAL) | María Elena Buemi |
12:50 PM | 12:55 PM | Spotlight Talk -- Egocentric Activity Recognition and Localization on a 3D Map | James Rehg |
1:00 PM | 2:00 PM | Lunch Break | |
2:00 PM | 2:30 PM | Invited Talk -- “Mind Reading”: Self-supervised decoding of visual data from brain activity | Michal Irani |
2:30 PM | 3:30 PM | Project Aria: Devices and Machine Perception Services Supporting Academic Research in Egocentric Perception | Prince Gupta |
3:30 PM | 4:30 PM | Break, poster session, and invited papers/posters | |
4:30 PM | 5:00 PM | Invited Talk | João Carreira |
5:00 PM | 5:30 PM | Ego4D Battle Royale: Who Knows the Dataset Best? | Devansh Kukreja |
5:30 PM | 6:00 PM | Invited Talk | Abhinav Gupta |
6:00 PM | 6:10 PM | Closing Remarks | Rohit Girdhar |
The Ego4D challenges are open; please see the challenge documentation here.