
Annotation Guidelines

This page gives context on the annotation guidelines used in each task. Please also see annotations for the specific formats and benchmark tasks for more detail on the tasks themselves. For the most comprehensive introduction, please read the paper here.

note

Numbers quoted below are those available at the time of writing this documentation; the authoritative numbers are in our arXiv paper.

Devices:

Scenario breakdown:

Annotations tl;dr

Task | Output | Volume

Pre-annotations
Narrations | Dense written sentence narrations in English & a summary of the whole video clip | Full Dataset

Episodic Memory (EM)
Natural Language Queries | N free-form natural language queries per video (N = length of video in minutes), selected from a list of query templates, plus the temporal response window from which each answer can be deduced | ~240h
Moments | Temporal localizations of high-level events in a long video clip, drawn from a provided taxonomy | ~300h
Visual Object Queries | For N=3 query objects (freely chosen and named by the annotator) such that each appears at least twice at separate times in a single video, annotations include: (1) response track: bounding boxes over time for one continuous occurrence of the query object; (2) query frame: a frame that does not contain the query object, sometime after the response track but before any subsequent occurrence of the object; (3) visual crop: a bounding box in a single frame from another occurrence of the same object elsewhere in the video (before or after the originally marked instance) | ~403h

Forecasting + Hands & Objects (FHO)
1 Critical Frames | Pre-condition (PRE), CONTACT, point of no return (PNR), and post-condition (POST) frames for each narrated action in a video | ~120h
2 Pre-condition | Bounding boxes and roles for hands (right/left) and objects (objects of change and tools) for each frame from CONTACT to PRE
3 Post-condition | Bounding boxes and roles for hands and objects for each frame from CONTACT to POST

Audio-Visual Diarization & Social (AVS)
AV0: Automated Face & Head Detection | Automated overlaid bounding boxes for faces in video clips | 50h
AV1: Face & Head Tracks Correction | Manually adjusted overlaid bounding boxes for faces in video clips
AV2: Speaker Labeling and AV Anchor Extraction | Anonymous Person IDs for each face track in a video clip
AV3: Speech Segmentation (Per Speaker) | Temporal segments of voice activity for the camera wearer and for each Person ID
AV4: Transcription | Video clip audio transcriptions
AV5: Correcting Speech Transcriptions | Corrected speech transcription annotations matching voice activity segments and Person IDs from AV2
S1: Camera-Wearer Attention | Temporal segments in which a person is looking at the camera wearer
S2: Speech Target Classification | Temporal segments in which a person is talking to the camera wearer

Narrations

Objective: The annotator provides dense written sentence narrations in English for a first-person video clip of 10-30 minutes in length, plus a summary of the whole video.

Motivation: Understand what data is available and which data to push through which annotation phases. Provide a starting point for forming a taxonomy of labels for actions and objects.

Tags:

There are four tags that annotators use in the sentence boxes:

  • #unsure to denote they are unsure about a specific statement
  • #summary to denote they are summarizing the overall video
  • #C to denote the sentence is an action done by the camera wearer (the person who recorded the video while wearing a camera on their head)
  • #O to denote that the sentence is an action done by someone other than the camera wearer

Note that every sentence will have either #C or #O. Only some sentences (or none) may have #unsure. Only one sentence for the entire video clip will have #summary.
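For illustration, here is a minimal Python sketch (not part of the official Ego4D tooling; the field names are hypothetical) showing how a narration line carrying these tags could be split into its subject, text, and flags:

```python
import re

TAG_PATTERN = r"#(C|O|unsure|summary)\b"

def parse_narration(line: str) -> dict:
    """Hypothetical helper: split a raw narration line into tags and text.

    Per the guidelines above, every sentence carries #C or #O, may carry
    #unsure, and exactly one sentence per clip carries #summary.
    """
    flags = set(re.findall(TAG_PATTERN, line))
    text = re.sub(TAG_PATTERN, "", line).strip()
    return {
        # Defaults to "other" if no #C tag is present.
        "subject": "camera_wearer" if "C" in flags else "other",
        "is_summary": "summary" in flags,
        "is_unsure": "unsure" in flags,
        "text": text,
    }

print(parse_narration("#C C is chopping a tomato #unsure"))
# {'subject': 'camera_wearer', 'is_summary': False, 'is_unsure': True, 'text': 'C is chopping a tomato'}
```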

Annotation task:

Step 1: Narrate the Complete Video with Temporal Sentences
- Watch the video from the beginning until something new occurs.
- At that time, pause the video, mark the temporal window for which the sentence applies, then "narrate" what you see in the video by typing a simple sentence into the free-form text input.
- Next, resume watching the video. Once you recognize an action to narrate, immediately pause again and repeat.
Example: [set the start time as the point when the person has the knife and the tomato, and the end time as the point when the person has finished chopping, then type] "C is chopping a tomato" into the text input. ("C" refers to the camera wearer.)

Step 2: Provide a Summary of the Entire Video
- As needed, watch the entire video on fast forward to recall its content.
- Provide a short summary in text about the contents of the entire video (1-3 sentences).
- This summary should convey the main setting(s) of the video clip (e.g., an apartment, a restaurant, a shop, etc.) as well as an overview of what happened.
Example: "#summary C fixed their breakfast, ate it, then got dressed and left the house."

Annotated video examples:

Annotation Stats

  • Total hours narrated: 3670

  • Unique scenarios: 51

Benchmark Annotations

Target | # | Benchmark task | Research goal
Places | 1 | Episodic Memory | Allow a user to ask free-form, natural language questions, with the answer brought back after analyzing past video (e.g., "When was the last time I changed the batteries in the smoke detector?").
Objects | 2 | Forecasting | To intelligently deliver notifications to a user, an AR system must understand how an action or piece of information may impact the future state of the world.
Objects | 3 | Hands-Object Interaction | AR applications, e.g., providing users instructions in their egocentric real-world view to accomplish tasks (e.g., cooking a recipe).
People | 4 | Audio-Visual Diarization | To effectively aid people in daily life scenarios, augmented reality must be able to detect and track sounds, responding to users' queries or information needs.
People | 5 | Social Interactions | Recognize people's interactions, their roles, and their attention within collaborative and competitive scenarios, across the range of social interactions captured in the Ego4D data.

Episodic Memory

Motivation: Augment human memory through a personal semantic video index for an always-on wearable camera.

Objective: Given long first-person video, localize answers for queries about objects and events from first-person experience

Who did I sit by at the party? Where are my keys? When did I change the batteries? How often did I read to my child last week? Did I leave the window open?...

Query types (annotation sub-tasks):

a. Natural language queries (response = temporal)

b. Moments queries (response = temporal)

c. Visual/object queries (response = temporal+spatial)
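The three query types differ in the shape of their response. A minimal sketch of the two response shapes, using hypothetical Python types rather than the released annotation schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TemporalWindow:
    """Response for natural language queries (a) and moments queries (b)."""
    start_sec: float
    end_sec: float

@dataclass
class Box:
    """A single bounding box in one frame."""
    frame: int
    x: float
    y: float
    width: float
    height: float

@dataclass
class ResponseTrack:
    """Response for visual object queries (c): temporal + spatial, i.e. a
    bounding box in every frame of one continuous object occurrence."""
    boxes: List[Box] = field(default_factory=list)
```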

Natural Language Queries

Objective: Create and annotate N (N=length of video in minutes) interesting questions and their corresponding answers for the given video.

Annotation Task:

Step 0: The annotator watches the video.

Step 1: Ask a free-form natural language query at the end of the video, selecting from a list of query templates.
- Select an interesting query template and template category, then paraphrase the question in the past tense.
  Example: Template: "What X is Y?"; Template category: "Objects"; Paraphrased query: "What color shirt did the person performing on the road wear?"
- Using free-form text, fill the query slots (X, Y, ...) in the template to form a meaningful question equivalent to the paraphrase.
  Example: First free-form query slot: "color"; Second free-form query slot: "the shirt of the person performing on the road"
- Pick the closest verb for each of the slots in the respective drop-down menus.
  Example: Paraphrased query: "What instrument was the musician playing?"; First verb drop-down selection: "[VERB NOT APPLICABLE]"; Second verb drop-down selection: "play"

Step 2: Identify the temporal response window from which the answer can be deduced.
- Seek in the video to the temporal window where the response to the natural language query can be deduced visually.
- Specify the query such that it has only one valid, contiguous temporal window as its response.

Step 3: Repeat this process to create N diverse language queries (N = length of video in minutes).
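Putting the steps above together, one such query annotation could be represented roughly as follows. This is a hypothetical sketch: the field names, slot fills, and times are illustrative and do not reflect the released JSON schema.

```python
# Hypothetical representation of a single natural language query annotation.
nlq_annotation = {
    "template": "What X is Y?",
    "template_category": "Objects",
    "paraphrased_query": "What instrument was the musician playing?",
    "slots": {"X": "instrument", "Y": "the musician playing"},          # illustrative fills
    "verbs": {"X": "[VERB NOT APPLICABLE]", "Y": "play"},               # drop-down selections
    "response_window_sec": (125.0, 131.5),  # made-up times; must be one contiguous window
}
```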

Annotation Stats:

  • Total hours annotated: ~240 (x2; one for each vendor)

  • Distribution over question types:

  • Scenario breakdown:

Moments

Objective: Localize high-level events in a long video clip, marking any instance of the provided activity categories with a temporal window and the activity's name.

Motivation: Learn to detect activities or "moments" and their temporal extent in the video. In the context of episodic memory, the implicit query from a user would be "When is the last time I did X?", and the response from the system would be to show the time window where activity X was last seen.
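As a sketch of how such annotations could serve that query (hypothetical code, not the benchmark evaluation protocol; the labels and times are made up): given a list of labeled temporal windows, return the most recent window for activity X that ends before the query time.

```python
from typing import List, Optional, Tuple

# Each moment annotation: (activity_label, start_sec, end_sec).
Moment = Tuple[str, float, float]

def last_time_i_did(moments: List[Moment], activity: str,
                    query_time_sec: float) -> Optional[Moment]:
    """Return the most recent occurrence of `activity` ending before the query time."""
    candidates = [m for m in moments
                  if m[0] == activity and m[2] <= query_time_sec]
    return max(candidates, key=lambda m: m[2], default=None)

moments = [("use_phone", 10.0, 25.0),
           ("wash_dishes", 60.0, 180.0),
           ("use_phone", 300.0, 320.0)]
print(last_time_i_did(moments, "use_phone", 1000.0))  # ('use_phone', 300.0, 320.0)
```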

Annotation Task:

Step 1: Review the taxonomy.

Step 2: Annotate the video.
1. Play the video until you observe an activity, then pause.
2. Draw a temporal window around the time span where the activity occurs.
3. Select the name for that activity from the dropdown list.
4. Play the video from the start of the previous activity, then repeat steps 1-3.

Annotation Stats:

  • Total hours annotated: ~328 (x3 annotators)

Visual Object Queries

Objective: Localize past instances of a given object that appears at least twice in different parts of the video.

Motivation: Support an object search application for video in which a user asks at time T "where did I last see X?", and the system scans back in the video history starting at query frame T, finds the most recent instance of X, and outlines it in a short track.

Annotation Task:

Step 1: Identify query objects.
- Preview the entire video.
- Identify a set of N=3 interesting objects to label as queries (objects that appear at least twice, at distinct non-contiguous parts of the video clip).

Step 2: Select a response track.
- Select one occurrence of the query object.
- Mark the query object with a bounding box over time, from the frame the object enters the field of view until it leaves the field of view, for that object occurrence.

Step 3: Select a query frame.
- Select a frame that does not contain the query object, sometime far after that object occurrence, but before any subsequent occurrence of the object.
- Mark the time point with a large bounding box.

Step 4: Select a visual crop.
- Find another occurrence of the same object elsewhere in the video (before or after the originally marked instance from Step 2).
- Draw a bounding box around that object in one frame.

Step 5: Name the object using the free text box.

Step 6: Repeat Steps 1-5 three times for the same video clip, each time with a different object.
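For illustration, one visual query annotation produced by the steps above could look roughly like this. The field names and frame indices are hypothetical, not the released schema; the final assertion only checks the constraint stated in Step 3, that the query frame comes after the response track.

```python
# Hypothetical shape of one visual query annotation (illustrative values).
vq_annotation = {
    "object_title": "hand planer",                 # Step 5: free-text object name
    "response_track": [                            # Step 2: boxes over one occurrence
        {"frame": 1200, "x": 340, "y": 210, "w": 120, "h": 80},
        {"frame": 1201, "x": 342, "y": 212, "w": 120, "h": 80},
    ],
    "query_frame": 2400,                           # Step 3: frame without the object
    "visual_crop": {"frame": 5100, "x": 500, "y": 300, "w": 90, "h": 60},  # Step 4
}

# Consistency check implied by the guidelines: the query frame lies after
# the end of the response track.
assert vq_annotation["query_frame"] > max(
    box["frame"] for box in vq_annotation["response_track"])
```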

Annotation Stats:

  • Total hours annotated: ~432

  • Scenario breakdown:

Forecasting + Hands & Objects (FHO)

Objective: Recognize object state changes temporally and spatially (HO); predict these interactions spatially and temporally before they happen (F).

Motivation: Understanding and anticipating human-object interactions.

Annotation Stats:

  • Labeled videos: 1,074
  • Labeled clips: 1,672
  • Labeled hours: 116.274
  • Number of scenarios: 53
  • Number of universities: 7
  • Number of participants: 397
  • Number of interactions: 91,002
  • Number rejected: 18,839
  • Number with state change: 70,718

Scenario distribution:

Stage 1 - Critical Frames

Objective: The annotator watches an egocentric video and marks the pre-condition (PRE), CONTACT, point of no return (PNR), and post-condition (POST) frames.

Annotation Task:

Step 1: Read the narrated action to be labeled.
- Reject videos that do not contain hand-object interactions.
- Reject videos that do not contain the narrated action.
Example: "C glides hand planer along the wood"

Step 2: Select the verb corresponding to the narration.
- If an appropriate verb is not available, select OTHER from the dropdown and type the verb into the text box.

Step 3*: Select the state change type present in the video.
- Select one of 8 options from the dropdown.

Step 4*: Mark the CONTACT (only if present), PRE, and POST frames.
- Find the CONTACT frame.
- Pause the video.
- Select "Contact Frame" from the dropdown.
- Repeat the same protocol for the PRE and POST frames.

PRE, CONTACT, PNR, POST examples:

a. Example: "light blowtorch"

b. Example: "put down wood" (object already in hands, no CONTACT frame)
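A minimal sketch of the temporal ordering implied by the stages above: PRE precedes CONTACT (when present), and both precede PNR and POST. This is an assumption drawn from the stage descriptions, not official validation code, and the frame numbers are made up.

```python
from typing import Optional

def check_critical_frames(pre: int, post: int, contact: Optional[int] = None,
                          pnr: Optional[int] = None) -> bool:
    """Check that the marked frames are strictly increasing in time:
    PRE, then CONTACT (if present), then PNR (if present), then POST."""
    frames = [pre] + [f for f in (contact, pnr) if f is not None] + [post]
    return all(a < b for a, b in zip(frames, frames[1:]))

# "put down wood": the object is already in the hands, so there is no CONTACT frame.
print(check_critical_frames(pre=1500, pnr=1540, post=1580))  # True
```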

Stage 2 - Pre-condition

Objective: Label bounding boxes and roles for hands (right/left) and objects (objects of change and tools).

Annotation Task:

Note: clips annotated in the previous stage play in reverse, from the CONTACT frame back to the PRE frame.

Step 1: Read the narrated action to be labeled.
Example: "C straightens the cloth"

Step 2: Label the contact frame (the first frame shown).
- Label the right and left hands (if visible), by correcting the existing bounding box or adding a new one.
- Label the object(s) of change:
  - Draw the bounding box.
  - Mark the object as "Object of change".
  - Select the name of the object from the list provided.
  - Select an instance ID (for multiple objects of the same type).
  - Repeat for each object of change.
- Label the tool (if present):
  - Draw the bounding box.
  - Mark the object as "Tool".
  - Select the name of the tool from the list provided.
  - Select an instance ID (for multiple objects of the same type).

Step 3: Label the remaining frames.
- Go to the next frame.
- Adjust the hand boxes.
- Adjust the object of change box.
- Adjust the tool box (if present).
- Repeat for the remaining frames.
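For illustration, one labeled frame from Stages 2 and 3 could be represented roughly as below. The field names and coordinates are hypothetical, not the released schema; the structure just mirrors the roles described above (left/right hands, objects of change with instance IDs, and tools).

```python
# Hypothetical structure for one labeled frame (illustrative values).
labeled_frame = {
    "frame": 1520,
    "hands": {
        "left":  {"x": 100, "y": 400, "w": 90, "h": 90},   # omitted when not visible
        "right": {"x": 520, "y": 410, "w": 95, "h": 95},
    },
    "objects_of_change": [
        {"name": "cloth", "instance_id": 1,
         "box": {"x": 300, "y": 350, "w": 200, "h": 120}},
    ],
    "tools": [],  # empty when no tool is involved in the action
}
```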

Stage 3 - Post-condition

Objective: Label bounding boxes and roles for hands and objects (from Contact to Post frame).

Annotation Task: [Note]: clips annotated from Stage 1 play from CONTACT to POST frame:

Step 1: Read the narrated action to be labeled.
Example: "C straightens the cloth"

Step 2: Check the contact frame (the first frame shown). It will already be labeled with:
- Left hand (if visible)
- Right hand (if visible)
- Active object
- Tool (if applicable)

Step 3: Label the remaining frames.
- Go to the next frame.
- Adjust (or add) the hand boxes.
- Adjust the object of change box.
- Adjust the tool box (if present).
- Repeat for the remaining frames.

Audio-Visual Diarization & Social (AVS)

Objective:

  • AV: Locate each speaker spatially and temporally, segment and transcribe the speech content (in a given video), assign each speaker an anonymous label.

  • S: predict the following social cues:

    • Who is talking to the camera wearer at each time segment

    • Who is looking at the camera wearer at each time segment

Motivation: Understand conversational behavior from the naturalistic egocentric perspective; capture low-level detection, segmentation, and tracking attributes of people's interactions in a scene, as well as higher-level (intent/emotion-driven) attributes that drive social and group conversations in the real world.

AV Step 0: Automated Face & Head Detection

A face detection algorithm is run on the given input video to detect all the faces. The resulting bounding boxes are then overlaid on the input video.

AV Step 1: Face & Head Tracks Correction

Objective: Ensure there is a correct face bounding box around every face visible in the video.

Annotation Task:

Step 1: For each frame in the video, identify all subjects in the frame and check whether they have bounding boxes.
1. Subject has a bounding box (bbox):
   a. Bbox is PASSING → move on to the next subject in the frame.
   b. Bbox is FAILING → adjust/re-draw the bbox (making sure the right face track is selected).
2. Subject does not have a bbox → create a new bounding box and either assign it a new track or merge it into an existing face track.
3. Bbox does not capture a face → delete the bbox.

Examples:

Passing Bbox
Failing Bbox
Missing Bbox
Bbox to be deleted

AV Step 2: Speaker Labeling and AV anchor extraction

Objective: Assign each face track (from Step 1) a 'Person ID' (for each new subject who interacts with the camera wearer or is present in the video for 500+ frames).

Annotation Task:

Step 1: Identify the 'Next Track' and go to the first frame of this track.
1. Toggle on the 'Out-of-Frame' Track List.
2. Select the next track from the list.
3. Click 'First Key Frame'.

Step 2: Assign this track a unique 'Person ID' (e.g., Person 1, Person 2, etc.).
1. Use the drop-down menu to select a Person ID.
2. Each time this person appears in the video, assign their track # to their designated Person ID.

Step 3: Repeat the previous steps until all tracks have Person IDs assigned.
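The net effect of this step is a many-to-one mapping from face tracks to anonymous Person IDs: a person who leaves and re-enters the frame gets a new track but keeps the same ID. A minimal sketch with hypothetical track and ID names:

```python
# Hypothetical mapping produced by AV Step 2 (illustrative names).
track_to_person = {
    "track_01": "Person 1",
    "track_02": "Person 2",
    "track_05": "Person 1",   # Person 1 re-enters the frame later in the clip
}

def tracks_for(person_id: str) -> list:
    """Return all face tracks assigned to a given Person ID."""
    return [track for track, person in track_to_person.items() if person == person_id]

print(tracks_for("Person 1"))  # ['track_01', 'track_05']
```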


AV Step 3: Speech Segmentation (Per Speaker)

Objective: Label voice activity for all subjects in the video.

Annotation Task:

Step 1: Label voice activity for the camera wearer first, and then for each Person ID.
1. Annotate the video using the time segment tool.
2. Start an annotation when a person makes a sound (speech, coughing, a sigh, any utterance).
3. Stop an annotation when the person stops making sounds.
4. Do not stop an annotation if the person starts making sound again within 1 second after they stopped (see the sketch after this list).
5. Label the segment according to the Person ID displayed in the bounding box around the person's head.
6. Repeat the process for all sounds made by the people in the video.
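Sub-step 4 implies a merging rule for voice-activity segments. A minimal Python sketch (hypothetical helper, applied per Person ID; not part of the annotation tool) that merges segments separated by a silent gap shorter than one second:

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of voice activity for one person

def merge_segments(segments: List[Segment], max_gap_sec: float = 1.0) -> List[Segment]:
    """Merge segments whose silent gap is shorter than `max_gap_sec`,
    mirroring sub-step 4 above (do not break the annotation for pauses under 1 s)."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap_sec:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_segments([(0.0, 2.0), (2.4, 5.0), (7.0, 8.0)]))
# [(0.0, 5.0), (7.0, 8.0)]
```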


AV Step 4: Transcription

Objective: Transcribe voice activity for all subjects in the video.

AV Step 5: Correcting Speech Transcriptions [WIP]

Objective: Correct the speech transcription annotations from Step 4.

Annotation Task:

Step 0: Pre-load the annotation tool. The task begins with the following pre-loaded:
- Output of AV Step 3 (speech segmentation per Person ID)
- Output of AV Step 4 (human transcriptions)
- Automatic transcriptions from ASR algorithms

Step 1: For each human transcription chunk, identify the corresponding Person IDs with voice activity on. For each person with active voice activity:
- Listen to the video.
- If the person's speech matches the content in the transcription chunk, copy this speech content from the transcript into a new dialog box/tag that corresponds to the person.

Step 2: Repeat Step 1 for the machine-generated transcription chunks.
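Step 1 amounts to matching each transcription chunk against the voice-activity segments from AV Step 3 by time overlap. A hypothetical sketch (names and times illustrative, not the released schema or tooling):

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def overlapping_speakers(chunk: Segment,
                         voice_activity: Dict[str, List[Segment]]) -> List[str]:
    """Return the Person IDs whose voice-activity segments overlap a
    transcription chunk in time; these are the candidates for the chunk."""
    start, end = chunk
    return [person for person, segments in voice_activity.items()
            if any(s < end and e > start for s, e in segments)]

voice_activity = {"Camera Wearer": [(0.0, 4.0)], "Person 1": [(3.5, 9.0)]}
print(overlapping_speakers((3.0, 5.0), voice_activity))
# ['Camera Wearer', 'Person 1']
```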

Examples:

< To Be Uploaded >

Social Step 1: Camera-Wearer Attention

Objective: Annotate temporal segments in which a person is looking at the camera wearer.

Annotation Task:

Step 1: Watch the video and find the times when someone is looking at the camera wearer.

Step 2: Annotate the time segment using the time segment tool.
1. Start an annotation when a person starts to look at the camera wearer.
2. Stop an annotation when the person stops looking at the camera wearer.
3. Label the segment according to the Person ID displayed in the bounding box around the person's head.
4. Repeat the process for all cases in the video.


Social Step 2: Speech Target Classification

Objective: Given the already-annotated AV voice activity segmentation, annotate the speech segments in which a person is talking to the camera wearer.

Annotation Task:

Step 1: Watch the video with the AV voice segmentation results (start-end time, Person ID).

Step 2: Annotate segments where someone is talking to the camera wearer. Repeat the process for all cases in the video.
1. Identify a segment in which someone is talking to the camera wearer.
2. Click the time segment; the voice activity annotation information appears in the left side bar.
3. Click the drop-down box below "Target of Speech".
4. In the drop-down menu, select "Camera-Wearer" if the speech is directed only toward the camera wearer.
5. Choose "Camera-Wearer and others" if the speech segment is directed toward multiple people including the camera wearer (e.g., talking to multiple audience members).
6. Repeat the process for all relevant segments.