Annotation Schemas
Once you download the annotations with the CLI, you'll have a set of JSON files. Their schemas are listed here for quick reference; see the annotation guidelines and the benchmark task pages for more information on what the fields represent.
Metadata - ego4d.json schema
date(string): Date of generation.
version(string): Dataset specific version.
description(string)
videos(array)- Items (object)
video_uid(string): The unique, primary video id.
duration_sec(number)
scenarios(array)- Items (string)
video_metadata(object)
  fps(number)
  num_frames(integer): The number of frames in the video stream.
  video_codec(string)
  display_resolution_width(['integer', 'null'])
  display_resolution_height(['integer', 'null'])
  sample_resolution_width(['integer', 'null'])
  sample_resolution_height(['integer', 'null'])
  mp4_duration_sec(number)
  video_start_sec(number): The start time of the video stream (>= 0 for sync offset).
  video_duration_sec(number): The duration of the video stream (<= container duration).
  audio_start_sec(['null', 'number']): The start time of the audio stream (>= 0 for sync offset).
  audio_duration_sec(['null', 'number']): The duration of the audio stream (<= container duration).
  video_start_pts(integer)
  video_duration_pts(integer)
  video_base_numerator(integer)
  video_base_denominator(integer)
  audio_start_pts(['integer', 'null'])
  audio_duration_pts(['integer', 'null'])
  audio_base_numerator(['integer', 'null'])
  audio_base_denominator(['integer', 'null'])
split_em(['null', 'string']): Split (train/test/val) for Episodic Memory benchmark tasks (per video).
split_av(['null', 'string']): Split (train/test/val) for AV benchmark tasks (per video).
split_fho(['null', 'string']): FHO splits are clip dependent - specified for video only where consistent (or multi).
s3_path(string): Path on AWS share - for reference, download via the CLI.
origin_video_id(string): A university assigned id (no standardization across universities).
video_source(string): The origin university that collected the data.
device(['null', 'string'])
physical_setting_name(['null', 'string']): The physical setting if a 3d scan exists.
fb_participant_id(['integer', 'null']): A sequentially assigned participant id - entirely unrelated to FB.
is_stereo(boolean): Whether the video is stereoscopic.
has_imu(boolean)
has_gaze(boolean)
imu_s3_path(['null', 'string'])
imu_manifold_path(['null', 'string'])
gaze_s3_path(['null', 'string'])
gaze_manifold_path(['null', 'string'])
video_components(array)- Items (object)
video_component_uid(string)
video_uid(string)
component_idx(integer)
redacted(boolean)
canonical_video_start_sec(number)
canonical_video_end_sec(number)
canonical_video_start_frame(integer)
canonical_video_end_frame(integer)
video_metadata(object)
  fps(number)
  num_frames(integer)
  video_codec(string)
  display_resolution_width(integer)
  display_resolution_height(integer)
  sample_resolution_width(integer)
  sample_resolution_height(integer)
  mp4_duration_sec(number)
  video_start_sec(['null', 'number'])
  video_duration_sec(['null', 'number'])
  audio_start_sec(['null', 'number'])
  audio_duration_sec(['null', 'number'])
  video_start_pts(integer)
  video_duration_pts(['integer', 'null'])
  video_base_numerator(integer)
  video_base_denominator(integer)
  audio_start_pts(['integer', 'null'])
  audio_duration_pts(['integer', 'null'])
  audio_base_numerator(['integer', 'null'])
  audio_base_denominator(['integer', 'null'])
- Items (object)
concurrent_sets
has_redacted_regions(boolean)
redacted_intervals(array)- Items (object)
start_sec(number)
end_sec(number)
start_frame(integer)
end_frame(integer)
- Items (object)
gaps(null)
- Items (object)
concurrent_video_sets(array)- Items (object)
concurrent_video_set_id(integer)
valid(boolean)
videos(array)- Items (object)
concurrent_video_set_id(integer)
video_uid(string)
video_start_offset_sec(number)
- Items (object)
- Items (object)
physical_settings(array)- Items (object)
name(string)
fb_physical_setting_id(integer)
source(string)
s3_path(string)
- Items (object)
clips(array)- Items (object)
clip_uid(string)
video_uid(string)
video_start_sec(number)
video_end_sec(number)
video_start_frame(integer)
video_end_frame(integer)
clip_metadata(object)
  fps(number)
  num_frames(integer)
  video_codec(string)
  display_resolution_width(integer)
  display_resolution_height(integer)
  sample_resolution_width(integer)
  sample_resolution_height(integer)
  mp4_duration_sec(number)
  video_start_sec(null)
  video_duration_sec(number)
  audio_start_sec(null)
  audio_duration_sec(['null', 'number'])
  video_start_pts(integer)
  video_duration_pts(integer)
  video_base_numerator(integer)
  video_base_denominator(integer)
  audio_start_pts(['integer', 'null'])
  audio_duration_pts(['integer', 'null'])
  audio_base_numerator(['integer', 'null'])
  audio_base_denominator(['integer', 'null'])
s3_path(string)
manifold_path(string)
- Items (object)
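The per-video split fields above can be used to partition the dataset. The following is a minimal sketch, assuming `ego4d.json` has already been parsed (for example with `json.load`) into a dictionary shaped like the schema; `videos_by_split` and the sample data are hypothetical and not part of the Ego4D tooling.

```python
from collections import defaultdict


def videos_by_split(metadata, split_key="split_em"):
    """Group video_uid values by a split field (split_em, split_av or split_fho)."""
    groups = defaultdict(list)
    for video in metadata["videos"]:
        # Split fields are nullable in the schema, so None can appear as a key.
        groups[video.get(split_key)].append(video["video_uid"])
    return dict(groups)


# Hand-written stand-in for the parsed contents of ego4d.json.
sample_metadata = {
    "videos": [
        {"video_uid": "vid-a", "split_em": "train"},
        {"video_uid": "vid-b", "split_em": None},
        {"video_uid": "vid-c", "split_em": "train"},
    ]
}
print(videos_by_split(sample_metadata))
```

Videos whose split is `null` end up under the `None` key, which makes it easy to see how many videos are not assigned to a given benchmark.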
Audio-Visual Diarization - av_<set>.json
date(string)
version(string)
description(string)
videos(array)- Items (object)
video_uid(string)
split(string)
clips(array)- Items (object)
clip_uid(string)
source_clip_uid(string)
video_uid(string)
video_start_sec(number)
video_end_sec(number)
video_start_frame(integer)
video_end_frame(integer)
clip_start_sec(integer)
clip_end_sec(number)
clip_start_frame(integer)
clip_end_frame(integer)
valid(boolean)
camera_wearer(object)
  person_id(string)
  camera_wearer(boolean)
  tracking_paths(array)
  voice_segments(array)- Items (object)
  start_time(number)
  end_time(number)
  start_frame(integer)
  end_frame(integer)
  video_start_time(number)
  video_end_time(number)
  video_start_frame(integer)
  video_end_frame(integer)
  person(string)
  - Items (object)
persons(array)- Items (object)
person_id(string)
camera_wearer(boolean)
tracking_paths(array)- Items (object)
track_id(string)
track(array)- Items (object)
x(number)
y(number)
width(number)
height(number)
frame(integer)
video_frame(integer)
clip_frame(null)
- Items (object)
suspect(boolean)
unmapped_frames_count(integer)
unmapped_frames(array)- Items (integer)
- Items (object)
voice_segments(array)- Items (object)
start_time(number)
end_time(number)
start_frame(integer)
end_frame(integer)
video_start_time(number)
video_end_time(number)
video_start_frame(integer)
video_end_frame(integer)
person(string)
- Items (object)
- Items (object)
missing_voice_segments(array)
transcriptions(array)- Items (object)
transcription(string)
start_time_sec(number)
end_time_sec(number)
person_id(string)
video_start_time(number)
video_start_frame(integer)
video_end_time(number)
video_end_frame(integer)
- Items (object)
social_segments_talking(array)- Items (object)
start_time(number)
end_time(number)
start_frame(integer)
end_frame(integer)
video_start_time(number)
video_end_time(number)
video_start_frame(integer)
video_end_frame(integer)
person(string)
target(['null', 'string'])
is_at_me(boolean)
- Items (object)
social_segments_looking(array)- Items (object)
start_time(number)
end_time(number)
start_frame(integer)
end_frame(integer)
video_start_time(number)
video_end_time(number)
video_start_frame(integer)
video_end_frame(integer)
person(string)
target(null)
is_at_me(boolean)
- Items (object)
- Items (object)
- Items (object)
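As an illustration of how the nested `persons` and `voice_segments` arrays fit together, here is a minimal sketch that sums speaking time per person for one clip. It assumes `start_time` and `end_time` are in seconds relative to the clip; `speaking_time_by_person` and the sample clip are hypothetical.

```python
from collections import defaultdict


def speaking_time_by_person(clip):
    """Sum voice segment durations per person_id for one clip object."""
    totals = defaultdict(float)
    for person in clip.get("persons", []):
        for segment in person.get("voice_segments", []):
            totals[person["person_id"]] += segment["end_time"] - segment["start_time"]
    return dict(totals)


# Hand-written stand-in for one entry of videos[].clips[] in av_<set>.json.
sample_clip = {
    "clip_uid": "clip-1",
    "persons": [
        {
            "person_id": "1",
            "camera_wearer": False,
            "voice_segments": [
                {"start_time": 1.0, "end_time": 3.5, "person": "1"},
                {"start_time": 4.0, "end_time": 5.0, "person": "1"},
            ],
        },
        {"person_id": "2", "camera_wearer": False, "voice_segments": []},
    ],
}
print(speaking_time_by_person(sample_clip))
```

People with no voice segments are simply absent from the result.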
Forecasting Hands & Objects Master File - fho_main.json schema
version(string)
date(string)
description(string)
metadata(string)
videos(array)- Items (object)
annotated_intervals(array)- Items (object)
clip_id(string)
clip_uid(['null', 'string'])
start_sec(number)
end_sec(number)
clip_parent_start_sec(number)
clip_parent_end_sec(number)
narrated_actions(array)- Items (object)
warnings(array)
uid(['null', 'string'])
start_sec(number)
end_sec(number)
start_frame(integer)
end_frame(integer)
is_valid_action(boolean)
is_partial(boolean)
clip_start_sec(number)
clip_end_sec(number)
clip_start_frame(integer)
clip_end_frame(integer)
narration_timestamp_sec(number)
clip_narration_timestamp_sec(number)
narration_text(string)
narration_annotation_uid(string)
structured_verb(['null', 'string'])
freeform_verb(['null', 'string'])
state_transition(['null', 'string'])
critical_frames
clip_critical_frames
frames
is_rejected(boolean)
is_invalid_annotation(boolean)
reject_reason(['null', 'string'])
stage(['null', 'string'])
- Items (object)
start_frame(integer)
end_frame(integer)
clip_parent_start_frame(integer)
clip_parent_end_frame(integer)
redacted(boolean)
- Items (object)
video_metadata(object)
  video_start_pts(integer)
  video_base_numerator(integer)
  video_base_denominator(integer)
  duration_sec(number)
  num_frames(integer)
  fps(number)
  width(integer)
  height(integer)
  rotation(null)
video_uid(string)
- Items (object)
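The master file nests actions three levels deep (videos, annotated intervals, narrated actions). A minimal sketch of walking that structure follows. Treating an action as usable when `is_valid_action` is true and `is_rejected` is false is an assumption made for illustration, not a rule stated by the schema; `valid_narrated_actions` is a hypothetical helper.

```python
def valid_narrated_actions(fho_main):
    """Collect (video_uid, clip_uid, narration_text) for actions that pass the filter."""
    rows = []
    for video in fho_main["videos"]:
        for interval in video.get("annotated_intervals", []):
            for action in interval.get("narrated_actions", []):
                # Assumed filter: keep valid, non-rejected actions only.
                if action.get("is_valid_action") and not action.get("is_rejected"):
                    rows.append(
                        (video["video_uid"], interval.get("clip_uid"), action["narration_text"])
                    )
    return rows


# Hand-written stand-in for the parsed contents of fho_main.json.
sample_fho_main = {
    "videos": [
        {
            "video_uid": "vid-a",
            "annotated_intervals": [
                {
                    "clip_uid": "clip-1",
                    "narrated_actions": [
                        {"narration_text": "C picks up a cup", "is_valid_action": True, "is_rejected": False},
                        {"narration_text": "C looks around", "is_valid_action": False, "is_rejected": False},
                        {"narration_text": "C drops a cup", "is_valid_action": True, "is_rejected": True},
                    ],
                }
            ],
        }
    ]
}
print(valid_narrated_actions(sample_fho_main))
```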
Forecasting Hands & Objects - fho_hands_<set>.json schema
version(string)
date(string)
description(string)
manifest(string)
split(string)
clips(array)- Items (object)
clip_id(integer)
clip_uid(string)
video_uid(string)
frames(array)- Items (object)
action_start_sec(number)
action_end_sec(number)
action_start_frame(integer)
action_end_frame(integer)
action_clip_start_sec(number)
action_clip_end_sec(number)
action_clip_start_frame(integer)
action_clip_end_frame(integer)
pre_45(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  right_hand(array)- Items (number)
  left_hand(array)- Items (number)
  - Items (object)
pre_30(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  right_hand(array)- Items (number)
  left_hand(array)- Items (number)
  - Items (object)
pre_15(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  right_hand(array)- Items (number)
  left_hand(array)- Items (number)
  - Items (object)
post_frame(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  left_hand(array)- Items (number)
  right_hand(array)- Items (number)
  - Items (object)
pre_frame(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  right_hand(array)- Items (number)
  left_hand(array)- Items (number)
  - Items (object)
pnr_frame(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  right_hand(array)- Items (number)
  left_hand(array)- Items (number)
  - Items (object)
contact_frame(object)
  frame(integer)
  clip_frame(integer)
  boxes(array)- Items (object)
  left_hand(array)- Items (number)
  right_hand(array)- Items (number)
  - Items (object)
- Items (object)
- Items (object)
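Each entry of `frames` carries up to seven key frames, each with its own `boxes` array. The sketch below gathers the hand boxes per key frame. It assumes every item of `boxes` is an object holding `left_hand` and/or `right_hand`, and that a key frame may be missing from an entry; `hand_boxes` is a hypothetical helper and the meaning of the four numbers in each box is not interpreted here.

```python
KEY_FRAMES = (
    "pre_45", "pre_30", "pre_15", "pre_frame",
    "contact_frame", "pnr_frame", "post_frame",
)


def hand_boxes(frame_annotation):
    """Map each present key frame to its frame number and merged hand boxes."""
    result = {}
    for key in KEY_FRAMES:
        entry = frame_annotation.get(key)
        if not entry:
            continue  # Key frame not annotated for this action.
        hands = {}
        for box in entry.get("boxes", []):
            hands.update(box)  # Each item holds left_hand and/or right_hand.
        result[key] = {"frame": entry["frame"], "hands": hands}
    return result


# Hand-written stand-in for one entry of clips[].frames[].
sample_frame_annotation = {
    "action_start_sec": 0.0,
    "pre_frame": {
        "frame": 10,
        "clip_frame": 4,
        "boxes": [
            {"right_hand": [1.0, 2.0, 3.0, 4.0]},
            {"left_hand": [5.0, 6.0, 7.0, 8.0]},
        ],
    },
}
print(hand_boxes(sample_frame_annotation))
```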
Long-Term Action Anticipation Taxonomy - fho_lta_taxonomy.json schema
verbs(array)- Items (string)
nouns(array)- Items (string)
Long-Term Action Anticipation - fho_lta_<set>.json schema
version(string)
date(string)
description(string)
split(string)
clips(array)- Items (object)
video_uid(string)
clip_uid(string)
clip_parent_start_sec(number)
clip_parent_end_sec(number)
clip_parent_start_frame(integer)
clip_parent_end_frame(integer)
interval_start_frame(integer)
interval_end_frame(integer)
interval_start_sec(number)
interval_end_sec(number)
verb(string)
noun(string)
action_clip_start_sec(number)
action_clip_end_sec(number)
action_clip_start_frame(integer)
action_clip_end_frame(integer)
clip_id(integer)
action_idx(integer)
verb_label(integer)
noun_label(integer)
- Items (object)
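For anticipation it is usually convenient to view the flat `clips` array as one ordered action sequence per clip. A minimal sketch follows, assuming that each entry of `clips` describes a single action and that `action_idx` orders actions within a `clip_uid`; both are readings of the field names rather than documented guarantees, and `action_sequences` is a hypothetical helper.

```python
from collections import defaultdict


def action_sequences(lta):
    """Return {clip_uid: [(verb, noun), ...]} ordered by action_idx."""
    by_clip = defaultdict(list)
    for action in lta["clips"]:
        by_clip[action["clip_uid"]].append(action)
    return {
        clip_uid: [(a["verb"], a["noun"]) for a in sorted(actions, key=lambda a: a["action_idx"])]
        for clip_uid, actions in by_clip.items()
    }


# Hand-written stand-in for the parsed contents of fho_lta_<set>.json.
sample_lta = {
    "clips": [
        {"clip_uid": "c1", "action_idx": 1, "verb": "wash", "noun": "cup"},
        {"clip_uid": "c1", "action_idx": 0, "verb": "take", "noun": "cup"},
        {"clip_uid": "c2", "action_idx": 0, "verb": "open", "noun": "door"},
    ]
}
print(action_sequences(sample_lta))
```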
Object State Change Classification (Point of No Return) - fho_oscc-pnr_<set>.json schema
version(string)
date(string)
description(string)
split(string)
clips(array)- Items (object)
clip_uid(['null', 'string'])
clip_id(string)
unique_id(string)
video_uid(string)
clip_start_sec(number)
clip_end_sec(number)
parent_start_sec(number)
parent_end_sec(number)
clip_start_frame(integer)
clip_end_frame(integer)
parent_start_frame(integer)
parent_end_frame(integer)
state_change(boolean)
clip_pnr_frame(integer)
parent_pnr_frame(integer)
pnr_frame(null)
State Change Object Detection - fho_scod_<set>.json schema
version(string)
date(string)
description(string)
split(string)
clips(array)- Items (object)
video_uid(string)
clip_id(string)
clip_uid(string)
clip_parent_start_sec(number)
clip_parent_end_sec(number)
clip_parent_start_frame(integer)
clip_parent_end_frame(integer)
pre_frame(object)
  frame_number(integer)
  clip_frame_number(integer)
  width(integer)
  height(integer)
  bbox(array)- Items (object)
  object_type(string)
  structured_noun(['null', 'string'])
  instance_number(['integer', 'null'])
  bbox(object)
    x(number)
    y(number)
    width(number)
    height(number)
  - Items (object)
pnr_frame(object)
  frame_number(integer)
  clip_frame_number(integer)
  width(integer)
  height(integer)
  bbox(array)- Items (object)
  object_type(string)
  structured_noun(['null', 'string'])
  instance_number(['integer', 'null'])
  bbox(object)
    x(number)
    y(number)
    width(number)
    height(number)
  - Items (object)
post_frame(object)
  frame_number(integer)
  clip_frame_number(integer)
  width(integer)
  height(integer)
  bbox(array)- Items (object)
  object_type(string)
  structured_noun(['null', 'string'])
  instance_number(['integer', 'null'])
  bbox(object)
    x(number)
    y(number)
    width(number)
    height(number)
  - Items (object)
- Items (object)
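Many detection libraries expect corner coordinates rather than `x`, `y`, `width`, `height`. The sketch below converts the boxes of one frame object (`pre_frame`, `pnr_frame` or `post_frame`). It assumes `x` and `y` are the top-left corner in pixels, which the schema does not state; `frame_boxes_xyxy` is a hypothetical helper.

```python
def frame_boxes_xyxy(frame):
    """Return [(object_type, [x1, y1, x2, y2]), ...] for one frame object."""
    converted = []
    for obj in frame.get("bbox", []):
        box = obj["bbox"]
        # Assumption: (x, y) is the top-left corner of the box.
        converted.append(
            (
                obj["object_type"],
                [box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]],
            )
        )
    return converted


# Hand-written stand-in for clips[].pnr_frame in fho_scod_<set>.json.
sample_frame = {
    "frame_number": 120,
    "width": 1920,
    "height": 1080,
    "bbox": [
        {
            "object_type": "object_of_change",
            "structured_noun": None,
            "instance_number": 1,
            "bbox": {"x": 10.0, "y": 20.0, "width": 30.0, "height": 40.0},
        }
    ],
}
print(frame_boxes_xyxy(sample_frame))
```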
Short Term Action Anticipation - fho_sta_<set>.json schema
info(object)
  description(string)
  version(string)
  split(string)
  include_annotations(boolean)
  video_metadata(object)
    <video_uid>(object)
      frame_width(integer)
      frame_height(integer)
      fps(number)
  year(string)
  date_created(string)
annotations(array)- Items (object)
uid(string)
video_id(string)
frame(integer)
clip_id(integer)
clip_uid(string)
clip_frame(integer)
objects(array)- Items (object)
box(array)- Items (number)
verb_category_id(integer)
noun_category_id(integer)
time_to_contact(number)
- Items (object)
- Items (object)
noun_categories(array)- Items (object)
id(integer)
name(string)
- Items (object)
verb_categories(array)- Items (object)
id(integer)
name(string)
- Items (object)
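The category tables at the end of the file can be joined back onto the annotated objects. A minimal sketch follows, assuming `noun_category_id` and `verb_category_id` refer to the `id` values in `noun_categories` and `verb_categories`; `decode_objects` is a hypothetical helper.

```python
def decode_objects(sta):
    """Resolve category ids to names for every annotated object."""
    nouns = {c["id"]: c["name"] for c in sta["noun_categories"]}
    verbs = {c["id"]: c["name"] for c in sta["verb_categories"]}
    rows = []
    for annotation in sta["annotations"]:
        for obj in annotation.get("objects", []):
            rows.append(
                {
                    "uid": annotation["uid"],
                    "noun": nouns[obj["noun_category_id"]],
                    "verb": verbs[obj["verb_category_id"]],
                    "time_to_contact": obj["time_to_contact"],
                }
            )
    return rows


# Hand-written stand-in for the parsed contents of fho_sta_<set>.json.
sample_sta = {
    "annotations": [
        {
            "uid": "ann-1",
            "objects": [
                {"box": [0.0, 0.0, 5.0, 5.0], "verb_category_id": 1, "noun_category_id": 0, "time_to_contact": 0.5}
            ],
        }
    ],
    "noun_categories": [{"id": 0, "name": "cup"}],
    "verb_categories": [{"id": 1, "name": "take"}],
}
print(decode_objects(sample_sta))
```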
Moments Queries - moments_<set>.json schema
version(string): Dataset specific version.
date(string): Date of generation.
description(string)
manifest(string): Top level ego4d manifest json.
videos(array)- Items (object)
video_uid(string)
split(string)
clips(array)- Items (object)
clip_uid(string): The exported clip clip_uid.
video_start_sec(number): Annotation start time relative to the canonical video.
video_end_sec(number): Annotation end time relative to the canonical video.
video_start_frame(integer): Annotation start frame relative to the canonical video.
video_end_frame(integer): Annotation end frame relative to the canonical video.
clip_start_sec(integer): Annotation start time relative to the canonical clip.
clip_end_sec(number): Annotation end time relative to the canonical clip.
clip_start_frame(integer): Annotation start frame relative to the canonical clip.
clip_end_frame(integer): Annotation end frame relative to the canonical clip.
source_clip_uid(string)
annotations(array)- Items (object)
annotator_uid(string)
labels(array)- Items (object)
start_time(number): Canonical clip label start time.
end_time(number): Canonical clip label end time.
label(string): Moments label class.
video_start_time(number)
video_end_time(number)
video_start_frame(integer)
video_end_frame(integer)
primary(boolean): Primary label used for Moments baseline task.
- Items (object)
- Items (object)
- Items (object)
- Items (object)
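Since `primary` marks the labels used for the Moments baseline task, a common first step is to pull out just those labels. The following is a minimal sketch over a parsed `moments_<set>.json`; `primary_moments` and the sample data are hypothetical.

```python
def primary_moments(moments):
    """Collect (clip_uid, label, start_time, end_time) for primary labels only."""
    rows = []
    for video in moments["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for label in annotation["labels"]:
                    if label["primary"]:
                        rows.append(
                            (clip["clip_uid"], label["label"], label["start_time"], label["end_time"])
                        )
    return rows


# Hand-written stand-in for the parsed contents of moments_<set>.json.
sample_moments = {
    "videos": [
        {
            "video_uid": "vid-a",
            "split": "train",
            "clips": [
                {
                    "clip_uid": "clip-1",
                    "annotations": [
                        {
                            "annotator_uid": "annotator-1",
                            "labels": [
                                {"label": "wash_dishes", "start_time": 2.0, "end_time": 9.5, "primary": True},
                                {"label": "use_phone", "start_time": 3.0, "end_time": 4.0, "primary": False},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}
print(primary_moments(sample_moments))
```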
Narrations - narrations.json
<video_uid>(object)
  narration_pass_1(object)
    narrations(array)- Items (object)
    timestamp_sec(number)
    timestamp_frame(integer)
    _unmapped_timestamp_sec(number)
    narration_text(string)
    annotation_uid(string)
    - Items (object)
    summaries(array)- Items (object)
    start_sec(number)
    end_sec(number)
    summary_text(string)
    annotation_uid(string)
    - Items (object)
  narration_pass_2(object)
    narrations(array)- Items (object)
    timestamp_sec(number)
    timestamp_frame(integer)
    _unmapped_timestamp_sec(number)
    narration_text(string)
    annotation_uid(string)
    - Items (object)
    summaries(array)- Items (object)
    start_sec(number)
    end_sec(number)
    summary_text(string)
    annotation_uid(string)
    - Items (object)
  status(string)
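Unlike the other files, `narrations.json` is keyed by `video_uid` at the top level instead of holding a `videos` array. The sketch below merges both narration passes for one video into a single timeline. It assumes either pass may be absent for a given video; `narrations_for_video` is a hypothetical helper.

```python
def narrations_for_video(narrations, video_uid):
    """Return (timestamp_sec, pass_name, narration_text) tuples sorted by time."""
    entry = narrations[video_uid]
    merged = []
    for pass_name in ("narration_pass_1", "narration_pass_2"):
        # Assumption: a pass can be missing or null for some videos.
        narration_pass = entry.get(pass_name) or {}
        for item in narration_pass.get("narrations", []):
            merged.append((item["timestamp_sec"], pass_name, item["narration_text"]))
    return sorted(merged)


# Hand-written stand-in for the parsed contents of narrations.json.
sample_narrations = {
    "vid-a": {
        "narration_pass_1": {
            "narrations": [{"timestamp_sec": 2.0, "narration_text": "C opens a door"}],
            "summaries": [],
        },
        "narration_pass_2": {
            "narrations": [{"timestamp_sec": 1.0, "narration_text": "C walks to a door"}],
            "summaries": [],
        },
        "status": "complete",
    }
}
print(narrations_for_video(sample_narrations, "vid-a"))
```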
Natural Language Queries - nlq_<set>.json schema
version(string)
date(string)
description(string)
manifest(string)
videos(array)- Items (object)
video_uid(string)
clips(array)- Items (object)
clip_uid(string)
video_start_sec(number)
video_end_sec(number)
video_start_frame(integer)
video_end_frame(integer)
clip_start_sec(integer)
clip_end_sec(number)
clip_start_frame(integer)
clip_end_frame(integer)
source_clip_uid(string)
annotations(array)- Items (object)
language_queries(array)- Items (object)
clip_start_sec(number)
clip_end_sec(number)
video_start_sec(number)
video_end_sec(number)
video_start_frame(integer)
video_end_frame(integer)
template(['null', 'string'])
query(['null', 'string'])
slot_x(['null', 'string'])
verb_x(['null', 'string'])
slot_y(['null', 'string'])
verb_y(string)
raw_tags(array)- Items (['null', 'string'])
- Items (object)
annotation_uid(string)
- Items (object)
- Items (object)
split(string)
- Items (object)
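Because `query` is nullable, code that flattens the file into one row per query should skip empty entries. A minimal sketch, in which `flatten_queries` is a hypothetical helper and the choice to drop queries whose text is `null` is an assumption made for illustration:

```python
def flatten_queries(nlq):
    """Return one dict per non-null language query with its clip-relative window."""
    rows = []
    for video in nlq["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for query in annotation["language_queries"]:
                    if query.get("query") is None:
                        continue  # Nullable in the schema; skipped here by choice.
                    rows.append(
                        {
                            "clip_uid": clip["clip_uid"],
                            "query": query["query"],
                            "clip_start_sec": query["clip_start_sec"],
                            "clip_end_sec": query["clip_end_sec"],
                        }
                    )
    return rows


# Hand-written stand-in for the parsed contents of nlq_<set>.json.
sample_nlq = {
    "videos": [
        {
            "video_uid": "vid-a",
            "split": "train",
            "clips": [
                {
                    "clip_uid": "clip-1",
                    "annotations": [
                        {
                            "annotation_uid": "ann-1",
                            "language_queries": [
                                {"query": "Where did I put the cup?", "clip_start_sec": 3.0, "clip_end_sec": 7.0},
                                {"query": None, "clip_start_sec": 8.0, "clip_end_sec": 9.0},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}
print(flatten_queries(sample_nlq))
```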
Visual Queries - vq_<set>.json schema
version(string)
date(string)
description(string)
manifest(string)
videos(array)- Items (object)
video_uid(string)
split(string)
clips(array)- Items (object)
clip_uid(string)
video_start_sec(number)
video_end_sec(number)
video_start_frame(integer)
video_end_frame(integer)
clip_start_sec(integer)
clip_end_sec(number)
clip_start_frame(integer)
clip_end_frame(integer)
clip_fps(number)
annotation_complete(boolean)
source_clip_uid(string)
annotations(array)- Items (object)
query_sets(object)
  1(object)
    is_valid(boolean)
    errors(array)- Items (string)
    warnings(array)- Items (string)
    query_frame(integer)
    query_video_frame(integer)
    response_track(array)- Items (object)
    frame_number(integer)
    x(number)
    y(number)
    width(number)
    height(number)
    rotation(number)
    original_width(integer)
    original_height(integer)
    video_frame_number(integer)
    - Items (object)
    object_title(string)
    visual_crop(object)
      frame_number(integer)
      x(number)
      y(number)
      width(number)
      height(number)
      rotation(number)
      original_width(integer)
      original_height(integer)
      video_frame_number(integer)
  2(object)
    is_valid(boolean)
    errors(array)- Items (string)
    warnings(array)- Items (string)
    query_frame(integer)
    query_video_frame(['integer', 'null'])
    response_track(array)- Items (object)
    frame_number(integer)
    x(number)
    y(number)
    width(number)
    height(number)
    rotation(number)
    original_width(integer)
    original_height(integer)
    video_frame_number(integer)
    - Items (object)
    object_title(string)
    visual_crop(object)
      frame_number(integer)
      x(number)
      y(number)
      width(number)
      height(number)
      rotation(number)
      original_width(integer)
      original_height(integer)
      video_frame_number(integer)
  3(object)
    is_valid(boolean)
    errors(array)- Items (string)
    warnings(array)- Items (string)
    query_frame(integer)
    query_video_frame(['integer', 'null'])
    response_track(array)- Items (object)
    frame_number(integer)
    x(number)
    y(number)
    width(number)
    height(number)
    rotation(number)
    original_width(integer)
    original_height(integer)
    video_frame_number(integer)
    - Items (object)
    object_title(string)
    visual_crop(object)
      frame_number(integer)
      x(number)
      y(number)
      width(number)
      height(number)
      rotation(number)
      original_width(integer)
      original_height(integer)
      video_frame_number(integer)
  4(object)
    is_valid(boolean)
    errors(array)
    warnings(array)- Items (string)
    query_frame(integer)
    query_video_frame(integer)
    response_track(array)- Items (object)
    frame_number(integer)
    x(number)
    y(number)
    width(number)
    height(number)
    rotation(integer)
    original_width(integer)
    original_height(integer)
    video_frame_number(integer)
    - Items (object)
    object_title(string)
    visual_crop(object)
      frame_number(integer)
      x(number)
      y(number)
      width(number)
      height(number)
      rotation(integer)
      original_width(integer)
      original_height(integer)
      video_frame_number(integer)
warnings(array)- Items (string)
- Items (object)
- Items (object)
- Items (object)
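`query_sets` is an object keyed by set number rather than an array, so it has to be iterated as a dictionary. The sketch below lists the valid query sets of a parsed `vq_<set>.json` with the first and last frame of each response track. It assumes the keys are the strings "1" to "4" after JSON parsing and that invalid sets may omit the remaining fields; `valid_query_sets` is a hypothetical helper.

```python
def valid_query_sets(vq):
    """Return (clip_uid, set_id, object_title, first_frame, last_frame) per valid set."""
    rows = []
    for video in vq["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for set_id, query_set in annotation["query_sets"].items():
                    if not query_set["is_valid"]:
                        continue
                    frames = [box["frame_number"] for box in query_set.get("response_track", [])]
                    rows.append(
                        (
                            clip["clip_uid"],
                            set_id,
                            query_set.get("object_title"),
                            min(frames) if frames else None,
                            max(frames) if frames else None,
                        )
                    )
    return rows


# Hand-written stand-in for the parsed contents of vq_<set>.json.
sample_vq = {
    "videos": [
        {
            "video_uid": "vid-a",
            "clips": [
                {
                    "clip_uid": "clip-1",
                    "annotations": [
                        {
                            "query_sets": {
                                "1": {
                                    "is_valid": True,
                                    "object_title": "cup",
                                    "response_track": [
                                        {"frame_number": 12},
                                        {"frame_number": 10},
                                        {"frame_number": 11},
                                    ],
                                },
                                "2": {"is_valid": False},
                            }
                        }
                    ],
                }
            ],
        }
    ]
}
print(valid_query_sets(sample_vq))
```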