Annotation Guidelines
This page primarily covers the annotation guidelines used for each task. Please also see the annotations page for the specific formats and the benchmark tasks pages for more detail on the tasks themselves. For the most comprehensive introduction, please read the paper here.
Numbers quoted below are those available at the time of writing this documentation; authoritative information is found in our arXiv paper.
Devices:
Scenario breakdown:
Annotations tl;dr
Task | Output | Volume |
---|---|---|
Pre-annotations | ||
Narrations | Dense written sentence narrations in English & a summary of the whole video clip | Full Dataset |
Episodic Memory (EM) | ||
Natural Language Queries | N free-form natural language queries per video (N=length of video in minutes) selected from a list of query templates + temporal response window from which answers can be deduced | ~240h |
Moments | Temporal localizations of high level events in a long video clip from a provided taxonomy | ~300h |
Visual Object Queries | For N=3 query objects (freely chosen and named by the annotator) such that each appears at least twice at separate times in a single video, annotations include: (1) response track: bounding boxes over time for one continuous occurrence of the query object; (2) query frame: a frame that does not contain the query object, sometime after the response track but before any subsequent occurrence of the object; (3) visual crop: bounding box of a single frame from another occurrence of the same object elsewhere in the video (before or after the originally marked instance) | ~403h |
Forecasting + Hands & Objects (FHO) | ||
1 Critical Frames | Pre-condition (PRE), CONTACT, point of no return (PNR), and post-condition (POST) frames for each narrated action in a video | ~120h |
2 Pre-condition | Bounding boxes and roles for hands (right/left) and objects (objects of change and tools) for each frame from CONTACT to PRE | |
3 Post-condition | Bounding boxes and roles for hands and objects for each frame from CONTACT to POST | |
Audio-Visual Diarization & Social (AVS) | ||
AV0: Automated Face & Head Detection | Automated overlaid bounding boxes for faces in video clips | 50h |
AV1: Face & Head Tracks Correction | Manually adjusted overlaid bounding boxes for faces in video clips | |
AV2: Speaker Labeling and AV anchor extraction | Anonymous Person IDs for each Face Track in video clip | |
AV3: Speech Segmentation (Per Speaker) | Temporal segments for voice activity for the camera wearer and for each Person ID | |
AV4: Transcription | Video clip audio transcriptions | |
AV5: Correcting Speech Transcriptions | Corrected Speech Transcription annotations matching voice activity segments and Person IDs from AV2 | |
S1: Camera-Wearer Attention | Temporal segments in which a person is looking at the camera wearer | |
S2: Speech Target Classification | Temporal segments in which a person is talking to the camera wearer |
Narrations
Objective: Annotator provides dense written sentence narrations in English on a first-person video clip of length 10-30 minutes + a summary of the whole video.
Motivation: Understand what data is available and which data to push through which annotation phases. Provide a starting point for forming a taxonomy of labels for actions and objects.
Tags:
There are four flags that annotators use in the sentence boxes:
- `#unsure` to denote they are unsure about a specific statement
- `#summary` to denote they are giving the overall video summary
- `#C` to denote the sentence is an action done by the camera wearer (the person who recorded the video while wearing a camera on their head)
- `#O` to denote that the sentence is an action done by someone other than the camera wearer

Note that every sentence will have either `#C` or `#O`. Only some sentences (or none) may have `#unsure`. Only one sentence for the entire video clip will have `#summary`.
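To make these tag conventions concrete, the sketch below parses a narration sentence into its tags and text. It assumes the tags appear verbatim as tokens in the sentence; the function and its output fields are illustrative and not part of the official Ego4D tooling.

```python
# Hypothetical helper for splitting a narration sentence into tags and text.
# Assumes #C, #O, #summary, and #unsure appear as whitespace-separated tokens.
TAGS = {"#C", "#O", "#summary", "#unsure"}

def parse_narration(sentence: str) -> dict:
    tokens = sentence.split()
    tags = {t for t in tokens if t in TAGS}
    text = " ".join(t for t in tokens if t not in TAGS)
    return {
        "is_camera_wearer": "#C" in tags,  # every sentence carries #C or #O
        "is_summary": "#summary" in tags,  # at most one sentence per video clip
        "is_unsure": "#unsure" in tags,
        "text": text,
    }

print(parse_narration("#C C is chopping a tomato"))
print(parse_narration("#summary C fixed their breakfast, ate it, then got dressed and left the house."))
```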
Annotation task:
# | Step | Sub-step | Example |
---|---|---|---|
1 | Narrate the Complete Video with Temporal Sentences | Watch the video from the beginning until something new occurs. At that time, pause the video, mark the temporal window for which the sentence applies, then "narrate" what you see in the video by typing in a simple sentence into the free-form text input. Next, resume watching the video. Once you recognize an action to narrate, immediately pause again and repeat. | [set the start time as the point when the person has the knife and the tomato, and the end time as the point when the person has finished chopping, then type]: "C is chopping a tomato" into the text input. ("C" refers to the camera wearer). |
2 | Provide a Summary of the Entire Video | As needed, watch the entire video on fast forward to recall its content. Provide a short text summary (1-3 sentences) of the entire video. This summary should convey the main setting(s) of the video clip (e.g., an apartment, a restaurant, a shop, etc.) as well as an overview of what happened. | "#summary C fixed their breakfast, ate it, then got dressed and left the house." |
Annotated video examples:
Annotation Stats
Total hours narrated: 3670
Unique scenarios: 51
Benchmark Annotations
Target | # | Benchmark task | Research Goal |
---|---|---|---|
Places | 1 | Episodic Memory | Allow a user to ask free-form, natural language questions, with the answer brought back after analyzing past video (When was the last time I changed the batteries in the smoke detector?). |
Objects | 2 | Forecasting | To intelligently deliver notifications to a user, an AR system must understand how an action or piece of information may impact the future state of the world. |
3 | Hands-Object interaction | AR applications, e.g. providing users instructions in their egocentric real-world view to accomplish tasks (e.g., cooking a recipe). | |
People | 4 | Audio-visual Diarization | To effectively aid people in daily life scenarios, augmented reality must be able to detect and track sounds, responding to users' queries or information needs. |
5 | Social interactions | Recognize people's interactions, their roles, and their attention within collaborative and competitive scenarios within a range of social interactions captured in the Ego4D data. |
Episodic Memory
Motivation: Augment human memory through a personal semantic video index for an always-on wearable camera.
Objective: Given long first-person video, localize answers for queries about objects and events from first-person experience
Who did I sit by at the party? Where are my keys? When did I change the batteries? How often did I read to my child last week? Did I leave the window open?...
Query types (annotation sub-tasks):
a. Natural language queries (response = temporal)
b. Moments queries (response = temporal)
c. Visual/object queries (response = temporal+spatial)
Natural Language Queries
Objective: Create and annotate N (N=length of video in minutes) interesting questions and their corresponding answers for the given video.
Annotation Task:
# | Step | Sub-step | Example |
---|---|---|---|
0 | Annotator watches video | ||
1 | Asks free-form natural language query at end of video, selecting from list of query templates. | - Select an interesting query template & template category. - Paraphrase question in the past tense. | - Template: "What X is Y?" - Template Category: "Objects" - Paraphrased query: "What color shirt did the person performing on the road wear?" |
Using "free-form" text, fill the query slots (X, Y, ...) in the template to form a meaningful question equivalent to the paraphrase. | First free-form query slot: "color" Second free-form query slot: "the shirt of the person performing on the road" | ||
Pick the closest verb for each of the slots in the respective drop-down menus | - Paraphrased query: What instrument was the musician playing? - First verb drop-down selection: "[VERB NOT APPLICABLE]" Second verb drop-down selection: "play" | ||
2 | Identifies the temporal response window from which answer can be deduced | Seek in the video to the temporal window where the response to the natural language query can be deduced visually. Specify query to have only one valid, contiguous temporal window response. | |
3 | Repeat this process N times (N = length of video in minutes), creating N diverse language queries |
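As a rough illustration, a single natural language query annotation can be thought of as a record like the one below. The field names and the numeric response window are assumptions chosen for readability, not the released schema.

```python
# Hypothetical NLQ annotation record (field names and time values are illustrative).
nlq_annotation = {
    "template": "What X is Y?",
    "template_category": "Objects",
    "paraphrase": "What color shirt did the person performing on the road wear?",
    "slots": {
        "X": "color",
        "Y": "the shirt of the person performing on the road",
    },
    # One verb drop-down selection per slot; "[VERB NOT APPLICABLE]" when a slot has no verb.
    "slot_verbs": ["[VERB NOT APPLICABLE]", "wear"],
    # Single valid, contiguous temporal window (in seconds) from which the answer can be deduced.
    "response_window": {"start_sec": 412.0, "end_sec": 425.5},
}
```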
Annotation Stats:
Total hours annotated: ~240 (x2; one for each vendor)
Distribution over question types:
- Scenario breakdown:
Moments
Objective: Localize high level events in a long video clip -- marking any instance of provided activity categories with a temporal window and the activity's name.
Motivation: Learn to detect activities or "moments" and their temporal extent in the video. In the context of episodic memory, the implicit query from a user would be "When is the last time I did X?", and the response from the system would be to show the time window where activity X was last seen.
Annotation Task:
# | Step | Sub-step | Example |
---|---|---|---|
1 | Review the Taxonomy | ||
2 | Annotate the Video | 1. Play the video until you observe an activity, then pause. 2. Draw a temporal window around the time span where the activity occurs. 3. Select from the dropdown list the name for that activity. 4. Play the video from the start of the previous activity, repeat steps 1-3. |
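To make the implicit query "When is the last time I did X?" concrete, the sketch below retrieves the most recent matching moment before a query time from a list of temporal windows. The tuple layout and helper function are hypothetical, not part of the Ego4D codebase.

```python
from typing import Optional, Tuple

# Hypothetical moment annotations: (activity_label, start_sec, end_sec).
moments = [
    ("use_phone", 12.0, 25.0),
    ("wash_dishes", 40.0, 95.0),
    ("use_phone", 130.0, 150.0),
]

def last_time_i_did(label: str, query_time_sec: float) -> Optional[Tuple[str, float, float]]:
    """Return the most recent window with this label that ends before the query time."""
    candidates = [m for m in moments if m[0] == label and m[2] <= query_time_sec]
    return max(candidates, key=lambda m: m[2]) if candidates else None

print(last_time_i_did("use_phone", 200.0))  # -> ("use_phone", 130.0, 150.0)
```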
Annotation Stats:
- Total hours annotated: ~328 (x3 annotators)
Visual Object Queries
Objective: Localize past instances of a given object that appears at least twice in different parts of the video.
Motivation: Support an object search application for video in which a user asks at time T "where did I last see X?", and the system scans back in the video history starting at query frame T, finds the most recent instance of X, and outlines it in a short track.
Annotation Task:
# | Step | Sub-step | Example |
---|---|---|---|
1 | Identify query objects | Preview the entire video. Identify a set of N=3 interesting objects to label as queries (= objects that appear at least twice at distinct non-contiguous parts of the video clip) | |
2 | Select a response track | - Select one occurrence of the query object. - Mark the query object with a bounding box over time, from the frame the object enters the field of view until it leaves the field of view, for that object occurrence. | |
3 | Select a query frame | - Select a frame that does not contain the query object, sometime far after that object occurrence, but before any subsequent occurrence of the object. - Mark the time point with a large bounding box. | |
4 | Select a visual crop | - Find another occurrence of the same object elsewhere in the video (before or after the originally marked instance from Step 2). - Draw a bounding box in one frame around that object. | |
5 | Name the object using the free text box | ||
6 | Repeat Steps 1-5 three times for the same video clip and different objects |
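Putting the three components together, one visual object query annotation could be represented roughly as below; the structure, field names, and values are assumptions for illustration, not the released format.

```python
# Hypothetical visual object query annotation (illustrative field names and values).
visual_query = {
    "object_name": "blue mug",            # free-text name chosen by the annotator
    "query_frame": 5400,                  # a frame that does NOT contain the object
    "response_track": [                   # boxes over one continuous occurrence of the object
        {"frame": 3100, "x": 220, "y": 140, "w": 60, "h": 80},
        {"frame": 3101, "x": 222, "y": 141, "w": 60, "h": 80},
        # ... one box per frame while the object is in the field of view
    ],
    "visual_crop": {                      # a single box from another occurrence of the same object
        "frame": 9800, "x": 510, "y": 300, "w": 55, "h": 72,
    },
}
```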
Annotation Stats:
- Total hours annotated: ~432
- Scenario breakdown:
Forecasting + Hands & Objects (FHO)
Objective: Recognize object state changes temporally and spatially (HO); predict these interactions spatially and temporally before they happen (F).
Motivation: Understanding and anticipating human-object interactions.
Annotation Stats:
- Labeled videos: 1,074
- Labeled clips: 1,672
- Labeled hours: 116.274
- Number of scenarios: 53
- Number of universities: 7
- Number of participants: 397
- Number of interactions: 91,002
- Number rejected: 18,839
- Number with state change: 70,718

Scenario distribution:
Stage 1 - Critical Frames
Objective: Annotator watches an egocentric video and marks the pre-condition (PRE), CONTACT, point of no return (PNR), and post-condition (POST) frames.
Annotation Task:
# | Step | Sub-step | Example |
---|---|---|---|
1 | Read the narrated action to be labeled | 1. Reject videos that do not contain hand-object interactions 2. Reject videos that do not contain the narrated action | Example: "C glides hand planer along the wood" |
2 | Select the verb corresponding to the narration | - If an appropriate verb is not available, select OTHER from the dropdown and type in the verb in the text box. | |
3 | *Select the state change type present in the video | - Select from one of 8 options from the dropdown | |
4 | *Mark the CONTACT (only if present), PRE and POST frames. | - Find the CONTACT frame - Pause the video - Select the "Contact Frame" from the dropdown - Repeat the same protocol for PRE and POST frames. |
PRE, CONTACT, PNR, POST examples:
a. Example: "light blowtorch"
b. Example: "put down wood" (object already in hands, no CONTACT frame)
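A single Stage 1 annotation can be summarized roughly as the record below; the field names and frame numbers are hypothetical, and the state change value is only a placeholder for one of the 8 dropdown options.

```python
# Hypothetical Stage 1 (critical frames) annotation for one narrated action.
critical_frames = {
    "narration": "C glides hand planer along the wood",
    "verb": "glide",                # "OTHER" plus free text if the verb is not in the dropdown
    "state_change_type": "...",     # one of the 8 dropdown options (not listed here)
    "pre_frame": 1520,
    "contact_frame": 1555,          # None when the object is already in the hands (no CONTACT frame)
    "pnr_frame": 1580,              # point of no return
    "post_frame": 1640,
}
```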
Stage 2 - Pre-condition
Objective: Label bounding boxes and roles for hands (right/left) and objects (objects of change and tools).
Annotation Task:
Note: clips annotated in the previous stage play in reverse, from the CONTACT frame to the PRE frame.
# | Step | Sub-step | Example |
---|---|---|---|
1 | Read the narrated action to be labeled | Example: "C straightens the cloth" | |
2 | Label the contact frame (first frame shown) | Label right and left hands (if visible), by correcting the existing bounding box or adding a new one. | |
| | | Label the object(s) of change: - Draw the bounding box - Mark the object as Object of change - Select the name of the object from the list provided - Select instance ID (for multiple objects of the same type) - Repeat for each object of change | |
| | | Label the tool (if present): - Draw the bounding box - Mark the object as Tool - Select the name of the tool from the list provided - Select instance ID (for multiple objects of the same type) | |
3 | Label the remaining frames | Go to the next frame - Adjust the hand boxes - Adjust the object of change box - Adjust the tool box (if present) - Repeat for the remaining frames |
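For reference, the boxes and roles produced in Stages 2 and 3 might be stored per frame in a form like the following; this is a sketch with assumed field names, not the official format.

```python
# Hypothetical per-frame box annotation for Stages 2/3 (assumed field names and values).
frame_annotation = {
    "frame": 1550,
    "boxes": [
        {"role": "left_hand",        "x": 100, "y": 210, "w": 90, "h": 85},
        {"role": "right_hand",       "x": 330, "y": 205, "w": 95, "h": 90},
        {"role": "object_of_change", "label": "cloth", "instance_id": 1,
         "x": 180, "y": 260, "w": 240, "h": 120},
        # {"role": "tool", ...}      # only when a tool is present
    ],
}
```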
Stage 3 - Post-condition
Objective: Label bounding boxes and roles for hands and objects (from Contact to Post frame).
Annotation Task (Note: clips annotated in Stage 1 play from the CONTACT frame to the POST frame):
# | Step | Sub-step | Example |
---|---|---|---|
1 | Read the narrated action to be labeled | Example: "C straightens the cloth" | |
2 | Check the contact frame (first frame shown) | Contact frame will already be labeled with: - Left hand (if visible) - Right hand (if visible) - Active object - Tool (if applicable) | |
3 | Label the remaining frames | - Go to the next frame - Adjust (or add) the hand boxes - Adjust the object of change box - Adjust the tool box (if present) - Repeat for the remaining frames |
Audio-Visual Diarization & Social (AVS)
Objective:
AV: Locate each speaker spatially and temporally, segment and transcribe the speech content (in a given video), assign each speaker an anonymous label.
S: predict the following social cues:
Who is talking to the camera wearer at each time segment
Who is looking at the camera wearer at each time segment
Motivation: Understand conversational behavior from the naturalistic egocentric perspective; capture low-level detection, segmentation, and tracking attributes of people's interactions in a scene, as well as higher-level (intent/emotion-driven) attributes that drive social and group conversations in the real world.
AV Step 0: Automated Face & Head Detection
A face detection algorithm is run on the given input video to detect all the faces. The resulting bounding boxes are then overlaid on the input video.
AV Step 1: Face & Head Tracks Correction
Objective: Ensure there is a correct face bounding box around every face visible in the video.
Annotation Task:
# | Step | Sub-step |
---|---|---|
1 | For each frame in the video, identify all subjects in the frame and check to see if they have bounding boxes. | 1. Subject has a bounding box (bbox): a. Bbox is PASSING → Move on to the next subject in the frame. b. Bbox is FAILING → Adjust/re-draw the bbox (making sure the right face track is selected) 2. Subject doesn't have a bbox → Create a new bounding box and either assign it a new track or merge it into an existing face track. 3. Bbox does not capture a face → Delete the bbox. |
Examples:
Passing Bbox | |
Failing Bbox | |
Missing Bbox | |
Bbox to be deleted |
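Conceptually, the output of Steps 0-1 is a set of face tracks, each a sequence of per-frame boxes. A minimal sketch with assumed field names is shown below.

```python
# Hypothetical face track produced by AV Steps 0-1 (assumed field names and values).
face_track = {
    "track_id": 7,
    "boxes": [
        {"frame": 1200, "x": 410, "y": 95, "w": 64, "h": 78},
        {"frame": 1201, "x": 412, "y": 96, "w": 64, "h": 78},
        # ... one corrected box per frame in which the face is visible
    ],
}
```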
AV Step 2: Speaker Labeling and AV anchor extraction
Objective: Assign each Face Track (from Step 1) a 'Person ID' (one for each new subject who interacts with the camera wearer or is present in view for 500+ frames).
Annotation Task:
# | Step | Sub-step |
---|---|---|
1 | Identify the 'Next Track' and go to the first frame of this track. | 1. Toggle On the 'Out-of-Frame' Track List 2. Select the next Track from the list 3. Click 'First Key Frame' |
2 | Assign this Track a unique 'Person ID' (e.g., Person 1, Person 2, etc.) | 1. Use the drop-down menu to select a Person ID 2. Each time this person appears in the video, assign their Track # to their designated Person ID |
3 | Repeat steps 1-2 until all tracks have Person IDs assigned. |
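The sketch below illustrates the labeling criterion above (a subject receives a Person ID if they interact with the camera wearer or appear for 500+ frames); the data layout, field names, and helper are hypothetical.

```python
# Hypothetical helper: decide which face tracks receive a Person ID in AV Step 2.
def needs_person_id(track: dict) -> bool:
    # Label a subject if they interact with the camera wearer
    # or are present in view for 500+ frames.
    return track.get("interacts_with_wearer", False) or len(track["boxes"]) >= 500

# Example mapping from face track IDs to anonymous Person IDs,
# built up as the annotator assigns each track.
track_to_person = {7: "Person 1", 12: "Person 1", 15: "Person 2"}
```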
AV Step 3: Speech Segmentation (Per Speaker)
Objective: Label voice activity for all subjects in the video.
Annotation:
# | Step | Sub-step |
---|---|---|
1 | Label voice activity for the camera wearer first and then for each Person ID. | 1. Annotate the video using the time segment tool. 2. Start an annotation when a person makes a sound (speech, coughing, sigh, any utterance). 3. Stop an annotation when a person stops making sounds. 4. Do not stop an annotation if a person starts making sound again within 1 second after they stopped. 5. Label the segment according to the Person ID displayed in the bounding box around their head. 6. Repeat the process for all sounds made by the people in the video. |
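The 1-second rule in sub-step 4 effectively merges nearby sounds from the same person into one segment; a small sketch of that merging logic (not the annotation tool itself) is shown below.

```python
def merge_voice_segments(segments, max_gap_sec=1.0):
    """Merge a person's (start_sec, end_sec) voice segments separated by gaps within max_gap_sec."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_sec:
            # The person resumed making sound within 1 second: extend the current segment.
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

# Two utterances 0.5 s apart become one segment; a 3 s pause starts a new one.
print(merge_voice_segments([(10.0, 12.0), (12.5, 14.0), (17.0, 18.0)]))
# -> [(10.0, 14.0), (17.0, 18.0)]
```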
AV Step 4: Transcription
Objective: Transcribe voice activity for all subjects in the video.
AV Step 5: Correcting Speech Transcriptions [WIP]
Objective: Correct the Speech Transcription annotations from Step 4.
Annotation Task:
# | Step | Sub-step |
---|---|---|
0 | Pre-load the annotation tool. | The task begins by pre-loading the following: - Output of AV Step 3 (Speech Segmentation per Person ID) - Output of AV Step 4 (Human transcriptions) - Automatic transcriptions from ASR algorithms. |
1 | For each human transcription chunk, identify the corresponding Person IDs with voice activity on. | For each person with active voice activity: - Listen to the video - If the person's speech matches the content in the transcription chunk, copy this speech content from the transcript into a new dialog box/tag that corresponds to the person. |
2 | Repeat Step 1 for the machine generated transcription chunks |
Examples:
< To Be Uploaded >
Social Step 1: Camera-Wearer Attention
Objective: Annotate temporal segments in which a person is looking at the camera wearer.
Annotation Task:
# | Step | Sub-step |
---|---|---|
1 | Watch the video and find the time when someone is looking at the camera wearer | |
2 | Annotate the time segment using the time segment tool: | 1. Start an annotation when a person starts to look at the camera wearer. 2. Stop an annotation when a person stops looking at the camera wearer. 3. Label the segment according to the Person ID displayed in the bounding box around their head. 4. Repeat the process for all cases in the video. |
Social Step 2: Speech Target Classification
Objective: Given the already-annotated AV voice activity segments, annotate the speech segments in which a person is talking to the camera wearer.
Annotation Task:
# | Step | Sub-step |
---|---|---|
1 | Watch the video with AV voice segmentation results (start-end time, person ID) | |
2 | Annotate segments where someone is talking to the camera wearer. Repeat the process for all cases in the video. | 1. Identify a segment in which someone is talking to the camera wearer. 2. Click the time segment to see the voice activity annotation information in the left sidebar. 3. Click the drop-down box below "Target of Speech." 4. In the drop-down menu, select "Camera-Wearer" if the speech is directed only toward the camera wearer. 5. Choose "Camera-Wearer and others" if the speech segment is directed toward multiple people including the camera wearer (e.g., talking to multiple audience members). 6. Repeat the process for all relevant segments. |
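The resulting Social annotation for one segment might look roughly like the record below; the field names are hypothetical, and only the two "Camera-Wearer" target values come from the task description above.

```python
# Hypothetical speech target annotation for one voice activity segment.
speech_target_annotation = {
    "person_id": "Person 2",
    "start_sec": 84.2,
    "end_sec": 89.7,
    # "Camera-Wearer" when the speech is directed only at the camera wearer,
    # "Camera-Wearer and others" when it is directed at multiple people including them.
    "target_of_speech": "Camera-Wearer",
}
```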