Annotation Schemas
Once you download the annotations with the CLI, you'll have a set of JSON files. Their schemas are listed here for quick reference; see the annotation guidelines and benchmark tasks for more information on what the fields represent.
Metadata - ego4d.json schema
- date (string): Date of generation.
- version (string): Dataset specific version.
- description (string)
- videos (array)
  - Items (object)
    - video_uid (string): The unique, primary video id.
    - duration_sec (number)
    - scenarios (array)
      - Items (string)
    - video_metadata (object)
      - fps (number)
      - num_frames (integer): The number of frames in the video stream.
      - video_codec (string)
      - display_resolution_width (['integer', 'null'])
      - display_resolution_height (['integer', 'null'])
      - sample_resolution_width (['integer', 'null'])
      - sample_resolution_height (['integer', 'null'])
      - mp4_duration_sec (number)
      - video_start_sec (number): The start time of the video stream (>= 0 for sync offset).
      - video_duration_sec (number): The duration of the video stream (<= container duration).
      - audio_start_sec (['null', 'number']): The start time of the audio stream (>= 0 for sync offset).
      - audio_duration_sec (['null', 'number']): The duration of the audio stream (<= container duration).
      - video_start_pts (integer)
      - video_duration_pts (integer)
      - video_base_numerator (integer)
      - video_base_denominator (integer)
      - audio_start_pts (['integer', 'null'])
      - audio_duration_pts (['integer', 'null'])
      - audio_base_numerator (['integer', 'null'])
      - audio_base_denominator (['integer', 'null'])
    - split_em (['null', 'string']): Split (train/test/val) for Episodic Memory benchmark tasks (per video).
    - split_av (['null', 'string']): Split (train/test/val) for AV benchmark tasks (per video).
    - split_fho (['null', 'string']): FHO splits are clip dependent - specified for video only where consistent (or multi).
    - s3_path (string): Path on AWS share - for reference, download via the CLI.
    - origin_video_id (string): A university assigned id (no standardization across universities).
    - video_source (string): The origin university that collected the data.
    - device (['null', 'string'])
    - physical_setting_name (['null', 'string']): The physical setting if a 3D scan exists.
    - fb_participant_id (['integer', 'null']): A sequentially assigned participant id - entirely unrelated to FB.
    - is_stereo (boolean): Is the video stereoscopic.
    - has_imu (boolean)
    - has_gaze (boolean)
    - imu_s3_path (['null', 'string'])
    - imu_manifold_path (['null', 'string'])
    - gaze_s3_path (['null', 'string'])
    - gaze_manifold_path (['null', 'string'])
    - video_components (array)
      - Items (object)
        - video_component_uid (string)
        - video_uid (string)
        - component_idx (integer)
        - redacted (boolean)
        - canonical_video_start_sec (number)
        - canonical_video_end_sec (number)
        - canonical_video_start_frame (integer)
        - canonical_video_end_frame (integer)
        - video_metadata (object)
          - fps (number)
          - num_frames (integer)
          - video_codec (string)
          - display_resolution_width (integer)
          - display_resolution_height (integer)
          - sample_resolution_width (integer)
          - sample_resolution_height (integer)
          - mp4_duration_sec (number)
          - video_start_sec (['null', 'number'])
          - video_duration_sec (['null', 'number'])
          - audio_start_sec (['null', 'number'])
          - audio_duration_sec (['null', 'number'])
          - video_start_pts (integer)
          - video_duration_pts (['integer', 'null'])
          - video_base_numerator (integer)
          - video_base_denominator (integer)
          - audio_start_pts (['integer', 'null'])
          - audio_duration_pts (['integer', 'null'])
          - audio_base_numerator (['integer', 'null'])
          - audio_base_denominator (['integer', 'null'])
    - concurrent_sets
    - has_redacted_regions (boolean)
    - redacted_intervals (array)
      - Items (object)
        - start_sec (number)
        - end_sec (number)
        - start_frame (integer)
        - end_frame (integer)
    - gaps (null)
- concurrent_video_sets (array)
  - Items (object)
    - concurrent_video_set_id (integer)
    - valid (boolean)
    - videos (array)
      - Items (object)
        - concurrent_video_set_id (integer)
        - video_uid (string)
        - video_start_offset_sec (number)
- physical_settings (array)
  - Items (object)
    - name (string)
    - fb_physical_setting_id (integer)
    - source (string)
    - s3_path (string)
- clips (array)
  - Items (object)
    - clip_uid (string)
    - video_uid (string)
    - video_start_sec (number)
    - video_end_sec (number)
    - video_start_frame (integer)
    - video_end_frame (integer)
    - clip_metadata (object)
      - fps (number)
      - num_frames (integer)
      - video_codec (string)
      - display_resolution_width (integer)
      - display_resolution_height (integer)
      - sample_resolution_width (integer)
      - sample_resolution_height (integer)
      - mp4_duration_sec (number)
      - video_start_sec (null)
      - video_duration_sec (number)
      - audio_start_sec (null)
      - audio_duration_sec (['null', 'number'])
      - video_start_pts (integer)
      - video_duration_pts (integer)
      - video_base_numerator (integer)
      - video_base_denominator (integer)
      - audio_start_pts (['integer', 'null'])
      - audio_duration_pts (['integer', 'null'])
      - audio_base_numerator (['integer', 'null'])
      - audio_base_denominator (['integer', 'null'])
    - s3_path (string)
    - manifold_path (string)
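As a rough illustration of how the manifest fields relate, the sketch below indexes videos by video_uid, filters on the nullable split_em field, and groups the top-level clips by their parent video. The sample dict and the helper names (index_videos, videos_in_split, clips_by_video) are invented for this example; a real manifest would be read with json.load from wherever the CLI saved ego4d.json.

```python
from typing import Any, Dict, List

# Hand-written stand-in for ego4d.json; all values are invented.
manifest: Dict[str, Any] = {
    "videos": [
        {"video_uid": "vid-a", "duration_sec": 120.0, "split_em": "train", "has_gaze": False},
        {"video_uid": "vid-b", "duration_sec": 60.5, "split_em": None, "has_gaze": True},
        {"video_uid": "vid-c", "duration_sec": 30.0, "split_em": "val", "has_gaze": False},
    ],
    "clips": [
        {"clip_uid": "clip-1", "video_uid": "vid-a", "video_start_sec": 0.0, "video_end_sec": 40.0},
        {"clip_uid": "clip-2", "video_uid": "vid-a", "video_start_sec": 40.0, "video_end_sec": 90.0},
    ],
}


def index_videos(manifest: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map video_uid -> video entry for constant-time lookup."""
    return {video["video_uid"]: video for video in manifest["videos"]}


def videos_in_split(manifest: Dict[str, Any], split: str) -> List[str]:
    """Return video_uids whose split_em equals `split` (split_em may be null)."""
    return [v["video_uid"] for v in manifest["videos"] if v.get("split_em") == split]


def clips_by_video(manifest: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group top-level clip_uids under the video_uid they were cut from."""
    grouped: Dict[str, List[str]] = {}
    for clip in manifest.get("clips", []):
        grouped.setdefault(clip["video_uid"], []).append(clip["clip_uid"])
    return grouped
```

The same pattern should apply to split_av and split_fho, since they share the nullable string type.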
Audio-Visual Diarization - av_<set>.json
- date (string)
- version (string)
- description (string)
- videos (array)
  - Items (object)
    - video_uid (string)
    - split (string)
    - clips (array)
      - Items (object)
        - clip_uid (string)
        - source_clip_uid (string)
        - video_uid (string)
        - video_start_sec (number)
        - video_end_sec (number)
        - video_start_frame (integer)
        - video_end_frame (integer)
        - clip_start_sec (integer)
        - clip_end_sec (number)
        - clip_start_frame (integer)
        - clip_end_frame (integer)
        - valid (boolean)
        - camera_wearer (object)
          - person_id (string)
          - camera_wearer (boolean)
          - tracking_paths (array)
          - voice_segments (array)
            - Items (object)
              - start_time (number)
              - end_time (number)
              - start_frame (integer)
              - end_frame (integer)
              - video_start_time (number)
              - video_end_time (number)
              - video_start_frame (integer)
              - video_end_frame (integer)
              - person (string)
        - persons (array)
          - Items (object)
            - person_id (string)
            - camera_wearer (boolean)
            - tracking_paths (array)
              - Items (object)
                - track_id (string)
                - track (array)
                  - Items (object)
                    - x (number)
                    - y (number)
                    - width (number)
                    - height (number)
                    - frame (integer)
                    - video_frame (integer)
                    - clip_frame (null)
                - suspect (boolean)
                - unmapped_frames_count (integer)
                - unmapped_frames (array)
                  - Items (integer)
            - voice_segments (array)
              - Items (object)
                - start_time (number)
                - end_time (number)
                - start_frame (integer)
                - end_frame (integer)
                - video_start_time (number)
                - video_end_time (number)
                - video_start_frame (integer)
                - video_end_frame (integer)
                - person (string)
        - missing_voice_segments (array)
        - transcriptions (array)
          - Items (object)
            - transcription (string)
            - start_time_sec (number)
            - end_time_sec (number)
            - person_id (string)
            - video_start_time (number)
            - video_start_frame (integer)
            - video_end_time (number)
            - video_end_frame (integer)
        - social_segments_talking (array)
          - Items (object)
            - start_time (number)
            - end_time (number)
            - start_frame (integer)
            - end_frame (integer)
            - video_start_time (number)
            - video_end_time (number)
            - video_start_frame (integer)
            - video_end_frame (integer)
            - person (string)
            - target (['null', 'string'])
            - is_at_me (boolean)
        - social_segments_looking (array)
          - Items (object)
            - start_time (number)
            - end_time (number)
            - start_frame (integer)
            - end_frame (integer)
            - video_start_time (number)
            - video_end_time (number)
            - video_start_frame (integer)
            - video_end_frame (integer)
            - person (string)
            - target (null)
            - is_at_me (boolean)
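To show how the nested persons and voice_segments arrays might be consumed, here is a minimal sketch that sums speaking time per person within one clip. The sample clip is invented and trimmed to the fields used, speaking_time_by_person is a hypothetical helper, and treating start_time and end_time as seconds is an assumption since the schema does not state units.

```python
from collections import defaultdict
from typing import Any, Dict

# Invented clip entry shaped like videos[].clips[] in av_<set>.json.
clip: Dict[str, Any] = {
    "clip_uid": "clip-1",
    "valid": True,
    "persons": [
        {"person_id": "0", "camera_wearer": True, "tracking_paths": [],
         "voice_segments": [{"start_time": 0.0, "end_time": 2.5, "person": "0"}]},
        {"person_id": "1", "camera_wearer": False, "tracking_paths": [],
         "voice_segments": [{"start_time": 3.0, "end_time": 4.0, "person": "1"},
                            {"start_time": 6.0, "end_time": 7.5, "person": "1"}]},
    ],
}


def speaking_time_by_person(clip: Dict[str, Any]) -> Dict[str, float]:
    """Sum (end_time - start_time) over each person's voice_segments."""
    totals: Dict[str, float] = defaultdict(float)
    for person in clip.get("persons", []):
        for segment in person.get("voice_segments", []):
            totals[person["person_id"]] += segment["end_time"] - segment["start_time"]
    return dict(totals)
```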
Forecasting Hands & Objects Master File - fho_main.json schema
- version (string)
- date (string)
- description (string)
- metadata (string)
- videos (array)
  - Items (object)
    - annotated_intervals (array)
      - Items (object)
        - clip_id (string)
        - clip_uid (['null', 'string'])
        - start_sec (number)
        - end_sec (number)
        - clip_parent_start_sec (number)
        - clip_parent_end_sec (number)
        - narrated_actions (array)
          - Items (object)
            - warnings (array)
            - uid (['null', 'string'])
            - start_sec (number)
            - end_sec (number)
            - start_frame (integer)
            - end_frame (integer)
            - is_valid_action (boolean)
            - is_partial (boolean)
            - clip_start_sec (number)
            - clip_end_sec (number)
            - clip_start_frame (integer)
            - clip_end_frame (integer)
            - narration_timestamp_sec (number)
            - clip_narration_timestamp_sec (number)
            - narration_text (string)
            - narration_annotation_uid (string)
            - structured_verb (['null', 'string'])
            - freeform_verb (['null', 'string'])
            - state_transition (['null', 'string'])
            - critical_frames
            - clip_critical_frames
            - frames
            - is_rejected (boolean)
            - is_invalid_annotation (boolean)
            - reject_reason (['null', 'string'])
            - stage (['null', 'string'])
        - start_frame (integer)
        - end_frame (integer)
        - clip_parent_start_frame (integer)
        - clip_parent_end_frame (integer)
        - redacted (boolean)
    - video_metadata (object)
      - video_start_pts (integer)
      - video_base_numerator (integer)
      - video_base_denominator (integer)
      - duration_sec (number)
      - num_frames (integer)
      - fps (number)
      - width (integer)
      - height (integer)
      - rotation (null)
    - video_uid (string)
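The three-level nesting (videos, annotated_intervals, narrated_actions) can be walked with a small generator. The sketch below is only one plausible way to pick usable actions: combining is_valid_action, is_rejected and is_invalid_annotation this way is an assumption of this example, not a documented rule, and the sample data is invented.

```python
from typing import Any, Dict, Iterator, Tuple

# Invented, trimmed-down stand-in for fho_main.json.
fho_main: Dict[str, Any] = {"videos": [{
    "video_uid": "vid-a",
    "annotated_intervals": [{
        "clip_uid": "clip-1",
        "narrated_actions": [
            {"narration_text": "picks a knife", "is_valid_action": True,
             "is_rejected": False, "is_invalid_annotation": False},
            {"narration_text": "looks around", "is_valid_action": False,
             "is_rejected": False, "is_invalid_annotation": False},
            {"narration_text": "cuts an onion", "is_valid_action": True,
             "is_rejected": True, "is_invalid_annotation": False},
        ],
    }],
}]}


def usable_actions(data: Dict[str, Any]) -> Iterator[Tuple[str, Any, str]]:
    """Yield (video_uid, clip_uid, narration_text) for actions that pass all flags.

    Assumption: an action is usable when it is valid, not rejected, and its
    annotation is not marked invalid.
    """
    for video in data["videos"]:
        for interval in video["annotated_intervals"]:
            for action in interval["narrated_actions"]:
                if (action["is_valid_action"]
                        and not action["is_rejected"]
                        and not action["is_invalid_annotation"]):
                    yield video["video_uid"], interval["clip_uid"], action["narration_text"]
```

Note that clip_uid is nullable at the interval level, so callers may want to skip intervals where it is None.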
Forecasting Hands & Objects - fho_hands_<set>.json schema
- version (string)
- date (string)
- description (string)
- manifest (string)
- split (string)
- clips (array)
  - Items (object)
    - clip_id (integer)
    - clip_uid (string)
    - video_uid (string)
    - frames (array)
      - Items (object)
        - action_start_sec (number)
        - action_end_sec (number)
        - action_start_frame (integer)
        - action_end_frame (integer)
        - action_clip_start_sec (number)
        - action_clip_end_sec (number)
        - action_clip_start_frame (integer)
        - action_clip_end_frame (integer)
        - pre_45 (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - right_hand (array)
                - Items (number)
              - left_hand (array)
                - Items (number)
        - pre_30 (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - right_hand (array)
                - Items (number)
              - left_hand (array)
                - Items (number)
        - pre_15 (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - right_hand (array)
                - Items (number)
              - left_hand (array)
                - Items (number)
        - post_frame (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - left_hand (array)
                - Items (number)
              - right_hand (array)
                - Items (number)
        - pre_frame (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - right_hand (array)
                - Items (number)
              - left_hand (array)
                - Items (number)
        - pnr_frame (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - right_hand (array)
                - Items (number)
              - left_hand (array)
                - Items (number)
        - contact_frame (object)
          - frame (integer)
          - clip_frame (integer)
          - boxes (array)
            - Items (object)
              - left_hand (array)
                - Items (number)
              - right_hand (array)
                - Items (number)
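Because every key frame (pre_45, pre_30, pre_15, pre_frame, contact_frame, pnr_frame, post_frame) shares the same shape, the boxes can be gathered with one loop. The sketch below does this for a single entry of frames; the sample values and the hand_boxes helper are invented, the keys are treated as optional as a precaution, and the coordinate convention of the number arrays is not given in the schema, so they are passed through untouched.

```python
from typing import Any, Dict, List, Tuple

FRAME_KEYS = ("pre_45", "pre_30", "pre_15", "pre_frame",
              "contact_frame", "pnr_frame", "post_frame")

# Invented entry shaped like clips[].frames[] in fho_hands_<set>.json.
action: Dict[str, Any] = {
    "pre_45": {"frame": 100, "clip_frame": 10,
               "boxes": [{"left_hand": [1.0, 2.0, 3.0, 4.0]},
                         {"right_hand": [5.0, 6.0, 7.0, 8.0]}]},
    "pnr_frame": {"frame": 145, "clip_frame": 55, "boxes": []},
}


def hand_boxes(action: Dict[str, Any]) -> List[Tuple[str, int, str, List[float]]]:
    """Return (frame_key, frame, hand, box) rows for every annotated hand box."""
    rows = []
    for key in FRAME_KEYS:
        entry = action.get(key)
        if not entry:
            continue  # key frame missing from this action
        for box in entry.get("boxes", []):
            for hand in ("left_hand", "right_hand"):
                if hand in box:
                    rows.append((key, entry["frame"], hand, box[hand]))
    return rows
```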
Long-Term Action Anticipation Taxonomy - fho_lta_taxonomy.json schema
- verbs (array)
  - Items (string)
- nouns (array)
  - Items (string)
Long-Term Action Anticipation - fho_lta_<set>.json schema
- version (string)
- date (string)
- description (string)
- split (string)
- clips (array)
  - Items (object)
    - video_uid (string)
    - clip_uid (string)
    - clip_parent_start_sec (number)
    - clip_parent_end_sec (number)
    - clip_parent_start_frame (integer)
    - clip_parent_end_frame (integer)
    - interval_start_frame (integer)
    - interval_end_frame (integer)
    - interval_start_sec (number)
    - interval_end_sec (number)
    - verb (string)
    - noun (string)
    - action_clip_start_sec (number)
    - action_clip_end_sec (number)
    - action_clip_start_frame (integer)
    - action_clip_end_frame (integer)
    - clip_id (integer)
    - action_idx (integer)
    - verb_label (integer)
    - noun_label (integer)
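The integer verb_label and noun_label fields look like indices into the verbs and nouns arrays of fho_lta_taxonomy.json. That relationship is an assumption of this sketch (the schema does not state it), so it is worth checking against the verb and noun strings on real entries. All sample values and helper names below are invented.

```python
from typing import Any, Dict, List, Tuple

# Invented stand-ins for fho_lta_taxonomy.json and fho_lta_<set>.json.
taxonomy: Dict[str, List[str]] = {
    "verbs": ["cut", "take", "put"],
    "nouns": ["knife", "onion", "bowl"],
}
lta: Dict[str, Any] = {"split": "train", "clips": [
    {"clip_uid": "clip-1", "action_idx": 1, "verb_label": 0, "noun_label": 1},
    {"clip_uid": "clip-1", "action_idx": 0, "verb_label": 1, "noun_label": 0},
    {"clip_uid": "clip-2", "action_idx": 0, "verb_label": 2, "noun_label": 2},
]}


def decode(entry: Dict[str, Any], taxonomy: Dict[str, List[str]]) -> Tuple[str, str]:
    """Look up label names, ASSUMING labels are list indices into the taxonomy."""
    return taxonomy["verbs"][entry["verb_label"]], taxonomy["nouns"][entry["noun_label"]]


def action_sequences(lta: Dict[str, Any],
                     taxonomy: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group entries by clip_uid and order them by action_idx."""
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for entry in lta["clips"]:
        grouped.setdefault(entry["clip_uid"], []).append(entry)
    return {
        uid: [decode(e, taxonomy) for e in sorted(entries, key=lambda e: e["action_idx"])]
        for uid, entries in grouped.items()
    }
```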
Object State Change Classification (Point of No Return) - fho_oscc-pnr_<set>.json schema
- version (string)
- date (string)
- description (string)
- split (string)
- clips (array)
  - Items (object)
    - clip_uid (['null', 'string'])
    - clip_id (string)
    - unique_id (string)
    - video_uid (string)
    - clip_start_sec (number)
    - clip_end_sec (number)
    - parent_start_sec (number)
    - parent_end_sec (number)
    - clip_start_frame (integer)
    - clip_end_frame (integer)
    - parent_start_frame (integer)
    - parent_end_frame (integer)
    - state_change (boolean)
    - clip_pnr_frame (integer)
    - parent_pnr_frame (integer)
    - pnr_frame (null)
State Change Object Detection - fho_scod_<set>.json schema
- version (string)
- date (string)
- description (string)
- split (string)
- clips (array)
  - Items (object)
    - video_uid (string)
    - clip_id (string)
    - clip_uid (string)
    - clip_parent_start_sec (number)
    - clip_parent_end_sec (number)
    - clip_parent_start_frame (integer)
    - clip_parent_end_frame (integer)
    - pre_frame (object)
      - frame_number (integer)
      - clip_frame_number (integer)
      - width (integer)
      - height (integer)
      - bbox (array)
        - Items (object)
          - object_type (string)
          - structured_noun (['null', 'string'])
          - instance_number (['integer', 'null'])
          - bbox (object)
            - x (number)
            - y (number)
            - width (number)
            - height (number)
    - pnr_frame (object)
      - frame_number (integer)
      - clip_frame_number (integer)
      - width (integer)
      - height (integer)
      - bbox (array)
        - Items (object)
          - object_type (string)
          - structured_noun (['null', 'string'])
          - instance_number (['integer', 'null'])
          - bbox (object)
            - x (number)
            - y (number)
            - width (number)
            - height (number)
    - post_frame (object)
      - frame_number (integer)
      - clip_frame_number (integer)
      - width (integer)
      - height (integer)
      - bbox (array)
        - Items (object)
          - object_type (string)
          - structured_noun (['null', 'string'])
          - instance_number (['integer', 'null'])
          - bbox (object)
            - x (number)
            - y (number)
            - width (number)
            - height (number)
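Note that bbox appears at two levels: the frame holds an array named bbox, and each item holds an object named bbox with x, y, width and height. The sketch below unpacks both levels and converts each box to corner form. It assumes x and y mark the top-left corner in pixels, which the schema does not confirm; the sample frame, its object_type value and the helper names are invented.

```python
from typing import Any, Dict, List, Tuple

# Invented pre_frame entry shaped like clips[].pre_frame in fho_scod_<set>.json.
pre_frame: Dict[str, Any] = {
    "frame_number": 120, "clip_frame_number": 20, "width": 1920, "height": 1080,
    "bbox": [
        {"object_type": "example_object", "structured_noun": "cup", "instance_number": 1,
         "bbox": {"x": 10.0, "y": 20.0, "width": 30.0, "height": 40.0}},
    ],
}


def to_corners(box: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Convert x/y/width/height to (x1, y1, x2, y2), ASSUMING x, y is the top-left."""
    return (box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"])


def boxes_for_frame(frame: Dict[str, Any]) -> List[Tuple[str, Tuple[float, float, float, float]]]:
    """Return (object_type, corner_box) for every annotated object in a frame."""
    return [(item["object_type"], to_corners(item["bbox"])) for item in frame.get("bbox", [])]
```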
Short Term Action Anticipation - fho_sta_<set>.json schema
- info (object)
  - description (string)
  - version (string)
  - split (string)
  - include_annotations (boolean)
  - video_metadata (object)
    - <video_uid> (object)
      - frame_width (integer)
      - frame_height (integer)
      - fps (number)
  - year (string)
  - date_created (string)
- annotations (array)
  - Items (object)
    - uid (string)
    - video_id (string)
    - frame (integer)
    - clip_id (integer)
    - clip_uid (string)
    - clip_frame (integer)
    - objects (array)
      - Items (object)
        - box (array)
          - Items (number)
        - verb_category_id (integer)
        - noun_category_id (integer)
        - time_to_contact (number)
- noun_categories (array)
  - Items (object)
    - id (integer)
    - name (string)
- verb_categories (array)
  - Items (object)
    - id (integer)
    - name (string)
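The verb_category_id and noun_category_id fields presumably refer to the id values in verb_categories and noun_categories. Under that assumption, a minimal sketch of resolving them to names looks like this; the sample file contents and the helper names are invented.

```python
from typing import Any, Dict, List, Tuple

# Invented, trimmed-down stand-in for fho_sta_<set>.json.
sta: Dict[str, Any] = {
    "annotations": [
        {"uid": "a1", "video_id": "vid-a", "frame": 300,
         "objects": [{"box": [10.0, 20.0, 50.0, 80.0], "verb_category_id": 1,
                      "noun_category_id": 0, "time_to_contact": 0.8}]},
    ],
    "noun_categories": [{"id": 0, "name": "cup"}],
    "verb_categories": [{"id": 0, "name": "take"}, {"id": 1, "name": "put"}],
}


def category_names(categories: List[Dict[str, Any]]) -> Dict[int, str]:
    """Build an id -> name lookup from a *_categories array."""
    return {category["id"]: category["name"] for category in categories}


def describe(sta: Dict[str, Any]) -> List[Tuple[str, str, str, float]]:
    """Return (annotation uid, verb name, noun name, time_to_contact) per object."""
    verbs = category_names(sta["verb_categories"])
    nouns = category_names(sta["noun_categories"])
    rows = []
    for annotation in sta["annotations"]:
        for obj in annotation.get("objects", []):
            rows.append((annotation["uid"], verbs[obj["verb_category_id"]],
                         nouns[obj["noun_category_id"]], obj["time_to_contact"]))
    return rows
```

Looking names up by id rather than by list position avoids relying on the category arrays being sorted.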
Moments Queries - moments_<set>.json schema
- version (string): Dataset specific version.
- date (string): Date of generation.
- description (string)
- manifest (string): Top-level ego4d manifest JSON.
- videos (array)
  - Items (object)
    - video_uid (string)
    - split (string)
    - clips (array)
      - Items (object)
        - clip_uid (string): The clip_uid of the exported clip.
        - video_start_sec (number): Annotation start time relative to the canonical video.
        - video_end_sec (number): Annotation end time relative to the canonical video.
        - video_start_frame (integer): Annotation start frame relative to the canonical video.
        - video_end_frame (integer): Annotation end frame relative to the canonical video.
        - clip_start_sec (integer): Annotation start time relative to the canonical clip.
        - clip_end_sec (number): Annotation end time relative to the canonical clip.
        - clip_start_frame (integer): Annotation start frame relative to the canonical clip.
        - clip_end_frame (integer): Annotation end frame relative to the canonical clip.
        - source_clip_uid (string)
        - annotations (array)
          - Items (object)
            - annotator_uid (string)
            - labels (array)
              - Items (object)
                - start_time (number): Canonical clip label start time.
                - end_time (number): Canonical clip label end time.
                - label (string): Moments label class.
                - video_start_time (number)
                - video_end_time (number)
                - video_start_frame (integer)
                - video_end_frame (integer)
                - primary (boolean): Primary label used for Moments baseline task.
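Since only labels with primary set to true are used for the Moments baseline task, a common first step is to filter on that flag. The sketch below collects primary labels per clip; the sample data, the label names and the primary_labels helper are invented for illustration.

```python
from typing import Any, Dict, List, Tuple

# Invented, trimmed-down stand-in for moments_<set>.json.
moments: Dict[str, Any] = {"videos": [{
    "video_uid": "vid-a", "split": "train",
    "clips": [{
        "clip_uid": "clip-1",
        "annotations": [{
            "annotator_uid": "annotator-1",
            "labels": [
                {"label": "example_label_a", "start_time": 1.0, "end_time": 4.0, "primary": True},
                {"label": "example_label_b", "start_time": 2.0, "end_time": 3.0, "primary": False},
            ],
        }],
    }],
}]}


def primary_labels(data: Dict[str, Any]) -> Dict[str, List[Tuple[str, float, float]]]:
    """Map clip_uid -> [(label, start_time, end_time)] for primary labels only."""
    result: Dict[str, List[Tuple[str, float, float]]] = {}
    for video in data["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for label in annotation["labels"]:
                    if label["primary"]:
                        result.setdefault(clip["clip_uid"], []).append(
                            (label["label"], label["start_time"], label["end_time"]))
    return result
```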
Narrations - narrations.json
- <video_uid> (object)
  - narration_pass_1 (object)
    - narrations (array)
      - Items (object)
        - timestamp_sec (number)
        - timestamp_frame (integer)
        - _unmapped_timestamp_sec (number)
        - narration_text (string)
        - annotation_uid (string)
    - summaries (array)
      - Items (object)
        - start_sec (number)
        - end_sec (number)
        - summary_text (string)
        - annotation_uid (string)
  - narration_pass_2 (object)
    - narrations (array)
      - Items (object)
        - timestamp_sec (number)
        - timestamp_frame (integer)
        - _unmapped_timestamp_sec (number)
        - narration_text (string)
        - annotation_uid (string)
    - summaries (array)
      - Items (object)
        - start_sec (number)
        - end_sec (number)
        - summary_text (string)
        - annotation_uid (string)
  - status (string)
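Unlike the other files, narrations.json is keyed directly by video_uid, and each video carries two narration passes with the same shape. A minimal sketch for merging both passes into one timeline follows; the sample entry (including the status value) is invented, all_narrations is a hypothetical helper, and either pass is treated as optional as a precaution.

```python
from typing import Any, Dict, List, Tuple

# Invented stand-in for narrations.json, keyed by video_uid.
narrations: Dict[str, Any] = {
    "vid-a": {
        "narration_pass_1": {
            "narrations": [
                {"timestamp_sec": 1.5, "timestamp_frame": 45, "narration_text": "picks up a cup"},
                {"timestamp_sec": 4.0, "timestamp_frame": 120, "narration_text": "pours water"},
            ],
            "summaries": [],
        },
        "narration_pass_2": {
            "narrations": [
                {"timestamp_sec": 0.5, "timestamp_frame": 15, "narration_text": "walks to the sink"},
            ],
            "summaries": [],
        },
        "status": "example_status",
    },
}


def all_narrations(entry: Dict[str, Any]) -> List[Tuple[float, str, str]]:
    """Merge both passes into (timestamp_sec, pass_name, text) rows sorted by time."""
    rows = []
    for pass_name in ("narration_pass_1", "narration_pass_2"):
        for narration in entry.get(pass_name, {}).get("narrations", []):
            rows.append((narration["timestamp_sec"], pass_name, narration["narration_text"]))
    return sorted(rows)
```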
Natural Language Queries - nlq_<set>.json schema
- version (string)
- date (string)
- description (string)
- manifest (string)
- videos (array)
  - Items (object)
    - video_uid (string)
    - clips (array)
      - Items (object)
        - clip_uid (string)
        - video_start_sec (number)
        - video_end_sec (number)
        - video_start_frame (integer)
        - video_end_frame (integer)
        - clip_start_sec (integer)
        - clip_end_sec (number)
        - clip_start_frame (integer)
        - clip_end_frame (integer)
        - source_clip_uid (string)
        - annotations (array)
          - Items (object)
            - language_queries (array)
              - Items (object)
                - clip_start_sec (number)
                - clip_end_sec (number)
                - video_start_sec (number)
                - video_end_sec (number)
                - video_start_frame (integer)
                - video_end_frame (integer)
                - template (['null', 'string'])
                - query (['null', 'string'])
                - slot_x (['null', 'string'])
                - verb_x (['null', 'string'])
                - slot_y (['null', 'string'])
                - verb_y (string)
                - raw_tags (array)
                  - Items (['null', 'string'])
            - annotation_uid (string)
    - split (string)
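Queries sit four levels deep (videos, clips, annotations, language_queries), and query itself is nullable, so a flattening pass that skips empty queries is a typical first step. The sketch below is one way to do that; the sample data and the flatten_queries helper are invented.

```python
from typing import Any, Dict, List

# Invented, trimmed-down stand-in for nlq_<set>.json.
nlq: Dict[str, Any] = {"videos": [{
    "video_uid": "vid-a", "split": "train",
    "clips": [{
        "clip_uid": "clip-1",
        "annotations": [{
            "annotation_uid": "ann-1",
            "language_queries": [
                {"clip_start_sec": 3.0, "clip_end_sec": 7.5, "query": "Where did I put the cup?"},
                {"clip_start_sec": 9.0, "clip_end_sec": 9.5, "query": None},
            ],
        }],
    }],
}]}


def flatten_queries(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one flat row per non-empty query, with its clip-relative window."""
    rows = []
    for video in data["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for item in annotation["language_queries"]:
                    if not item.get("query"):
                        continue  # query is nullable in the schema
                    rows.append({
                        "clip_uid": clip["clip_uid"],
                        "annotation_uid": annotation["annotation_uid"],
                        "query": item["query"],
                        "window": (item["clip_start_sec"], item["clip_end_sec"]),
                    })
    return rows
```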
Visual Queries - vq_<set>.json schema
- version (string)
- date (string)
- description (string)
- manifest (string)
- videos (array)
  - Items (object)
    - video_uid (string)
    - split (string)
    - clips (array)
      - Items (object)
        - clip_uid (string)
        - video_start_sec (number)
        - video_end_sec (number)
        - video_start_frame (integer)
        - video_end_frame (integer)
        - clip_start_sec (integer)
        - clip_end_sec (number)
        - clip_start_frame (integer)
        - clip_end_frame (integer)
        - clip_fps (number)
        - annotation_complete (boolean)
        - source_clip_uid (string)
        - annotations (array)
          - Items (object)
            - query_sets (object)
              - 1 (object)
                - is_valid (boolean)
                - errors (array)
                  - Items (string)
                - warnings (array)
                  - Items (string)
                - query_frame (integer)
                - query_video_frame (integer)
                - response_track (array)
                  - Items (object)
                    - frame_number (integer)
                    - x (number)
                    - y (number)
                    - width (number)
                    - height (number)
                    - rotation (number)
                    - original_width (integer)
                    - original_height (integer)
                    - video_frame_number (integer)
                - object_title (string)
                - visual_crop (object)
                  - frame_number (integer)
                  - x (number)
                  - y (number)
                  - width (number)
                  - height (number)
                  - rotation (number)
                  - original_width (integer)
                  - original_height (integer)
                  - video_frame_number (integer)
              - 2 (object)
                - is_valid (boolean)
                - errors (array)
                  - Items (string)
                - warnings (array)
                  - Items (string)
                - query_frame (integer)
                - query_video_frame (['integer', 'null'])
                - response_track (array)
                  - Items (object)
                    - frame_number (integer)
                    - x (number)
                    - y (number)
                    - width (number)
                    - height (number)
                    - rotation (number)
                    - original_width (integer)
                    - original_height (integer)
                    - video_frame_number (integer)
                - object_title (string)
                - visual_crop (object)
                  - frame_number (integer)
                  - x (number)
                  - y (number)
                  - width (number)
                  - height (number)
                  - rotation (number)
                  - original_width (integer)
                  - original_height (integer)
                  - video_frame_number (integer)
              - 3 (object)
                - is_valid (boolean)
                - errors (array)
                  - Items (string)
                - warnings (array)
                  - Items (string)
                - query_frame (integer)
                - query_video_frame (['integer', 'null'])
                - response_track (array)
                  - Items (object)
                    - frame_number (integer)
                    - x (number)
                    - y (number)
                    - width (number)
                    - height (number)
                    - rotation (number)
                    - original_width (integer)
                    - original_height (integer)
                    - video_frame_number (integer)
                - object_title (string)
                - visual_crop (object)
                  - frame_number (integer)
                  - x (number)
                  - y (number)
                  - width (number)
                  - height (number)
                  - rotation (number)
                  - original_width (integer)
                  - original_height (integer)
                  - video_frame_number (integer)
              - 4 (object)
                - is_valid (boolean)
                - errors (array)
                - warnings (array)
                  - Items (string)
                - query_frame (integer)
                - query_video_frame (integer)
                - response_track (array)
                  - Items (object)
                    - frame_number (integer)
                    - x (number)
                    - y (number)
                    - width (number)
                    - height (number)
                    - rotation (integer)
                    - original_width (integer)
                    - original_height (integer)
                    - video_frame_number (integer)
                - object_title (string)
                - visual_crop (object)
                  - frame_number (integer)
                  - x (number)
                  - y (number)
                  - width (number)
                  - height (number)
                  - rotation (integer)
                  - original_width (integer)
                  - original_height (integer)
                  - video_frame_number (integer)
            - warnings (array)
              - Items (string)
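Each annotation holds a query_sets object whose keys are the strings "1" through "4" once loaded from JSON. The sketch below iterates whatever sets are present, skips those with is_valid false or an empty response_track, and reports the frame span of each response. The sample annotation and the response_spans helper are invented for this example.

```python
from typing import Any, Dict, Tuple

# Invented annotation shaped like clips[].annotations[] in vq_<set>.json.
annotation: Dict[str, Any] = {
    "query_sets": {
        "1": {"is_valid": True, "errors": [], "warnings": [], "query_frame": 40,
              "object_title": "mug",
              "response_track": [{"frame_number": 10}, {"frame_number": 11},
                                 {"frame_number": 12}]},
        "2": {"is_valid": False, "errors": ["example error"], "warnings": [],
              "query_frame": 80, "object_title": "keys", "response_track": []},
    },
    "warnings": [],
}


def response_spans(annotation: Dict[str, Any]) -> Dict[str, Tuple[str, int, int]]:
    """Map query set id -> (object_title, first frame, last frame) for valid sets."""
    spans = {}
    for set_id, query_set in annotation["query_sets"].items():
        if not query_set["is_valid"] or not query_set["response_track"]:
            continue
        frames = [box["frame_number"] for box in query_set["response_track"]]
        spans[set_id] = (query_set["object_title"], min(frames), max(frames))
    return spans
```

Iterating over the keys rather than hard-coding "1" through "4" keeps the helper working if a clip has fewer query sets.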