Annotation Schemas

Once you download the annotations with the CLI, you'll have a set of JSON files. Their schemas are listed here for quick reference; see the annotation guidelines and benchmark tasks for more information on what the fields represent.

Metadata - ego4d.json schema
  • date (string): Date of generation.
  • version (string): Dataset specific version.
  • description (string)
  • videos (array)
    • Items (object)
      • video_uid (string): The unique, primary video id.
      • duration_sec (number)
      • scenarios (array)
        • Items (string)
      • video_metadata (object)
        • fps (number)
        • num_frames (integer): The number of frames in the video stream.
        • video_codec (string)
        • display_resolution_width (['integer', 'null'])
        • display_resolution_height (['integer', 'null'])
        • sample_resolution_width (['integer', 'null'])
        • sample_resolution_height (['integer', 'null'])
        • mp4_duration_sec (number)
        • video_start_sec (number): The start time of the video stream (>= 0 for sync offset).
        • video_duration_sec (number): The duration of the video stream (<= container duration).
        • audio_start_sec (['null', 'number']): The start time of the audio stream (>= 0 for sync offset).
        • audio_duration_sec (['null', 'number']): The duration of the audio stream (<= container duration).
        • video_start_pts (integer)
        • video_duration_pts (integer)
        • video_base_numerator (integer)
        • video_base_denominator (integer)
        • audio_start_pts (['integer', 'null'])
        • audio_duration_pts (['integer', 'null'])
        • audio_base_numerator (['integer', 'null'])
        • audio_base_denominator (['integer', 'null'])
      • split_em (['null', 'string']): Split (train/test/val) for Episodic Memory benchmark tasks (per video).
      • split_av (['null', 'string']): Split (train/test/val) for AV benchmark tasks (per video).
      • split_fho (['null', 'string']): FHO splits are clip dependent - specified for video only where consistent (or multi).
      • s3_path (string): Path on AWS share - for reference, download via the CLI.
      • origin_video_id (string): A university assigned id (no standardization across universities).
      • video_source (string): The origin university that collected the data.
      • device (['null', 'string'])
      • physical_setting_name (['null', 'string']): The physical setting if a 3d scan exists.
      • fb_participant_id (['integer', 'null']): A sequentially assigned participant id - entirely unrelated to FB.
      • is_stereo (boolean): Is the video stereoscopic.
      • has_imu (boolean)
      • has_gaze (boolean)
      • imu_s3_path (['null', 'string'])
      • imu_manifold_path (['null', 'string'])
      • gaze_s3_path (['null', 'string'])
      • gaze_manifold_path (['null', 'string'])
      • video_components (array)
        • Items (object)
          • video_component_uid (string)
          • video_uid (string)
          • component_idx (integer)
          • redacted (boolean)
          • canonical_video_start_sec (number)
          • canonical_video_end_sec (number)
          • canonical_video_start_frame (integer)
          • canonical_video_end_frame (integer)
          • video_metadata (object)
            • fps (number)
            • num_frames (integer)
            • video_codec (string)
            • display_resolution_width (integer)
            • display_resolution_height (integer)
            • sample_resolution_width (integer)
            • sample_resolution_height (integer)
            • mp4_duration_sec (number)
            • video_start_sec (['null', 'number'])
            • video_duration_sec (['null', 'number'])
            • audio_start_sec (['null', 'number'])
            • audio_duration_sec (['null', 'number'])
            • video_start_pts (integer)
            • video_duration_pts (['integer', 'null'])
            • video_base_numerator (integer)
            • video_base_denominator (integer)
            • audio_start_pts (['integer', 'null'])
            • audio_duration_pts (['integer', 'null'])
            • audio_base_numerator (['integer', 'null'])
            • audio_base_denominator (['integer', 'null'])
      • concurrent_sets
      • has_redacted_regions (boolean)
      • redacted_intervals (array)
        • Items (object)
          • start_sec (number)
          • end_sec (number)
          • start_frame (integer)
          • end_frame (integer)
      • gaps (null)
  • concurrent_video_sets (array)
    • Items (object)
      • concurrent_video_set_id (integer)
      • valid (boolean)
      • videos (array)
        • Items (object)
          • concurrent_video_set_id (integer)
          • video_uid (string)
          • video_start_offset_sec (number)
  • physical_settings (array)
    • Items (object)
      • name (string)
      • fb_physical_setting_id (integer)
      • source (string)
      • s3_path (string)
  • clips (array)
    • Items (object)
      • clip_uid (string)
      • video_uid (string)
      • video_start_sec (number)
      • video_end_sec (number)
      • video_start_frame (integer)
      • video_end_frame (integer)
      • clip_metadata (object)
        • fps (number)
        • num_frames (integer)
        • video_codec (string)
        • display_resolution_width (integer)
        • display_resolution_height (integer)
        • sample_resolution_width (integer)
        • sample_resolution_height (integer)
        • mp4_duration_sec (number)
        • video_start_sec (null)
        • video_duration_sec (number)
        • audio_start_sec (null)
        • audio_duration_sec (['null', 'number'])
        • video_start_pts (integer)
        • video_duration_pts (integer)
        • video_base_numerator (integer)
        • video_base_denominator (integer)
        • audio_start_pts (['integer', 'null'])
        • audio_duration_pts (['integer', 'null'])
        • audio_base_numerator (['integer', 'null'])
        • audio_base_denominator (['integer', 'null'])
      • s3_path (string)
      • manifold_path (string)
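
As a rough illustration of how the top-level videos array might be consumed, the sketch below groups video_uid values by split_em and reads the nullable display resolution. The helper names and the sample record are hypothetical and not part of any Ego4D tooling; only the field names come from the schema above.

```python
import json
from collections import defaultdict


def load_metadata(path):
    """Load ego4d.json from a local path (the path is up to the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def videos_by_em_split(metadata):
    """Group video_uid values by split_em; unassigned videos land under None."""
    groups = defaultdict(list)
    for video in metadata["videos"]:
        groups[video.get("split_em")].append(video["video_uid"])
    return dict(groups)


def display_resolution(video):
    """Return (width, height), or None when either nullable field is null."""
    meta = video["video_metadata"]
    width = meta.get("display_resolution_width")
    height = meta.get("display_resolution_height")
    if width is None or height is None:
        return None
    return (width, height)


# Hypothetical, heavily trimmed record for illustration only.
sample = {
    "videos": [
        {"video_uid": "vid-a", "split_em": "train",
         "video_metadata": {"display_resolution_width": 1920,
                            "display_resolution_height": 1080}},
        {"video_uid": "vid-b", "split_em": None,
         "video_metadata": {"display_resolution_width": None,
                            "display_resolution_height": None}},
    ]
}
groups = videos_by_em_split(sample)
```

Fields typed as ['integer', 'null'] or ['null', 'string'] can hold null, so checking for None before using them avoids surprises downstream.
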
Audio-Visual Diarization - av_<set>.json schema
  • date (string)
  • version (string)
  • description (string)
  • videos (array)
    • Items (object)
      • video_uid (string)
      • split (string)
      • clips (array)
        • Items (object)
          • clip_uid (string)
          • source_clip_uid (string)
          • video_uid (string)
          • video_start_sec (number)
          • video_end_sec (number)
          • video_start_frame (integer)
          • video_end_frame (integer)
          • clip_start_sec (integer)
          • clip_end_sec (number)
          • clip_start_frame (integer)
          • clip_end_frame (integer)
          • valid (boolean)
          • camera_wearer (object)
            • person_id (string)
            • camera_wearer (boolean)
            • tracking_paths (array)
            • voice_segments (array)
              • Items (object)
                • start_time (number)
                • end_time (number)
                • start_frame (integer)
                • end_frame (integer)
                • video_start_time (number)
                • video_end_time (number)
                • video_start_frame (integer)
                • video_end_frame (integer)
                • person (string)
          • persons (array)
            • Items (object)
              • person_id (string)
              • camera_wearer (boolean)
              • tracking_paths (array)
                • Items (object)
                  • track_id (string)
                  • track (array)
                    • Items (object)
                      • x (number)
                      • y (number)
                      • width (number)
                      • height (number)
                      • frame (integer)
                      • video_frame (integer)
                      • clip_frame (null)
                  • suspect (boolean)
                  • unmapped_frames_count (integer)
                  • unmapped_frames (array)
                    • Items (integer)
              • voice_segments (array)
                • Items (object)
                  • start_time (number)
                  • end_time (number)
                  • start_frame (integer)
                  • end_frame (integer)
                  • video_start_time (number)
                  • video_end_time (number)
                  • video_start_frame (integer)
                  • video_end_frame (integer)
                  • person (string)
          • missing_voice_segments (array)
          • transcriptions (array)
            • Items (object)
              • transcription (string)
              • start_time_sec (number)
              • end_time_sec (number)
              • person_id (string)
              • video_start_time (number)
              • video_start_frame (integer)
              • video_end_time (number)
              • video_end_frame (integer)
          • social_segments_talking (array)
            • Items (object)
              • start_time (number)
              • end_time (number)
              • start_frame (integer)
              • end_frame (integer)
              • video_start_time (number)
              • video_end_time (number)
              • video_start_frame (integer)
              • video_end_frame (integer)
              • person (string)
              • target (['null', 'string'])
              • is_at_me (boolean)
          • social_segments_looking (array)
            • Items (object)
              • start_time (number)
              • end_time (number)
              • start_frame (integer)
              • end_frame (integer)
              • video_start_time (number)
              • video_end_time (number)
              • video_start_frame (integer)
              • video_end_frame (integer)
              • person (string)
              • target (null)
              • is_at_me (boolean)
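
To show how the nested persons and voice_segments arrays fit together, here is a minimal sketch that sums the voiced duration per person_id within one clip. It assumes start_time and end_time are expressed in seconds and only looks at the persons array, not the separate camera_wearer object; the function name and the sample clip are hypothetical.

```python
def speaking_time_per_person(clip):
    """Sum (end_time - start_time) over each person's voice_segments."""
    totals = {}
    for person in clip.get("persons", []):
        person_id = person["person_id"]
        # Register the person even if no voice segments are listed.
        totals.setdefault(person_id, 0.0)
        for segment in person.get("voice_segments", []):
            totals[person_id] += segment["end_time"] - segment["start_time"]
    return totals


# Hypothetical, trimmed clip entry for illustration only.
sample_clip = {
    "clip_uid": "clip-1",
    "persons": [
        {"person_id": "1", "camera_wearer": False,
         "voice_segments": [
             {"start_time": 0.0, "end_time": 2.5},
             {"start_time": 4.0, "end_time": 5.0},
         ]},
        {"person_id": "2", "camera_wearer": False, "voice_segments": []},
    ],
}
totals = speaking_time_per_person(sample_clip)
```
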
Forecasting Hands & Objects Master File - fho_main.json schema
  • version (string)
  • date (string)
  • description (string)
  • metadata (string)
  • videos (array)
    • Items (object)
      • annotated_intervals (array)
        • Items (object)
          • clip_id (string)
          • clip_uid (['null', 'string'])
          • start_sec (number)
          • end_sec (number)
          • clip_parent_start_sec (number)
          • clip_parent_end_sec (number)
          • narrated_actions (array)
            • Items (object)
              • warnings (array)
              • uid (['null', 'string'])
              • start_sec (number)
              • end_sec (number)
              • start_frame (integer)
              • end_frame (integer)
              • is_valid_action (boolean)
              • is_partial (boolean)
              • clip_start_sec (number)
              • clip_end_sec (number)
              • clip_start_frame (integer)
              • clip_end_frame (integer)
              • narration_timestamp_sec (number)
              • clip_narration_timestamp_sec (number)
              • narration_text (string)
              • narration_annotation_uid (string)
              • structured_verb (['null', 'string'])
              • freeform_verb (['null', 'string'])
              • state_transition (['null', 'string'])
              • critical_frames
              • clip_critical_frames
              • frames
              • is_rejected (boolean)
              • is_invalid_annotation (boolean)
              • reject_reason (['null', 'string'])
              • stage (['null', 'string'])
          • start_frame (integer)
          • end_frame (integer)
          • clip_parent_start_frame (integer)
          • clip_parent_end_frame (integer)
          • redacted (boolean)
      • video_metadata (object)
        • video_start_pts (integer)
        • video_base_numerator (integer)
        • video_base_denominator (integer)
        • duration_sec (number)
        • num_frames (integer)
        • fps (number)
        • width (integer)
        • height (integer)
        • rotation (null)
      • video_uid (string)
Forecasting Hands & Objects - fho_hands_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • manifest (string)
  • split (string)
  • clips (array)
    • Items (object)
      • clip_id (integer)
      • clip_uid (string)
      • video_uid (string)
      • frames (array)
        • Items (object)
          • action_start_sec (number)
          • action_end_sec (number)
          • action_start_frame (integer)
          • action_end_frame (integer)
          • action_clip_start_sec (number)
          • action_clip_end_sec (number)
          • action_clip_start_frame (integer)
          • action_clip_end_frame (integer)
          • pre_45 (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • right_hand (array)
                  • Items (number)
                • left_hand (array)
                  • Items (number)
          • pre_30 (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • right_hand (array)
                  • Items (number)
                • left_hand (array)
                  • Items (number)
          • pre_15 (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • right_hand (array)
                  • Items (number)
                • left_hand (array)
                  • Items (number)
          • post_frame (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • left_hand (array)
                  • Items (number)
                • right_hand (array)
                  • Items (number)
          • pre_frame (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • right_hand (array)
                  • Items (number)
                • left_hand (array)
                  • Items (number)
          • pnr_frame (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • right_hand (array)
                  • Items (number)
                • left_hand (array)
                  • Items (number)
          • contact_frame (object)
            • frame (integer)
            • clip_frame (integer)
            • boxes (array)
              • Items (object)
                • left_hand (array)
                  • Items (number)
                • right_hand (array)
                  • Items (number)
Long-Term Action Anticipation Taxonomy - fho_lta_taxonomy.json schema
  • verbs (array)
    • Items (string)
  • nouns (array)
    • Items (string)
Long-Term Action Anticipation - fho_lta_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • split (string)
  • clips (array)
    • Items (object)
      • video_uid (string)
      • clip_uid (string)
      • clip_parent_start_sec (number)
      • clip_parent_end_sec (number)
      • clip_parent_start_frame (integer)
      • clip_parent_end_frame (integer)
      • interval_start_frame (integer)
      • interval_end_frame (integer)
      • interval_start_sec (number)
      • interval_end_sec (number)
      • verb (string)
      • noun (string)
      • action_clip_start_sec (number)
      • action_clip_end_sec (number)
      • action_clip_start_frame (integer)
      • action_clip_end_frame (integer)
      • clip_id (integer)
      • action_idx (integer)
      • verb_label (integer)
      • noun_label (integer)
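
The flat clips array above can be regrouped into per-clip action sequences. The sketch below assumes each item describes one action and that action_idx orders the actions within a clip_uid; neither assumption is stated in the schema, and the function name and sample values are hypothetical.

```python
from collections import defaultdict


def action_sequences(lta):
    """Return {clip_uid: [(verb, noun), ...]} ordered by action_idx."""
    by_clip = defaultdict(list)
    for entry in lta["clips"]:
        by_clip[entry["clip_uid"]].append(entry)
    return {
        clip_uid: [
            (e["verb"], e["noun"])
            for e in sorted(entries, key=lambda e: e["action_idx"])
        ]
        for clip_uid, entries in by_clip.items()
    }


# Hypothetical, trimmed entries for illustration only.
sample_lta = {
    "clips": [
        {"clip_uid": "c1", "action_idx": 1, "verb": "cut", "noun": "onion"},
        {"clip_uid": "c1", "action_idx": 0, "verb": "take", "noun": "knife"},
        {"clip_uid": "c2", "action_idx": 0, "verb": "open", "noun": "door"},
    ]
}
sequences = action_sequences(sample_lta)
```
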
Object State Change Classification (Point of No Return) - fho_oscc-pnr_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • split (string)
  • clips (array)
    • Items (object)
      • clip_uid (['null', 'string'])
      • clip_id (string)
      • unique_id (string)
      • video_uid (string)
      • clip_start_sec (number)
      • clip_end_sec (number)
      • parent_start_sec (number)
      • parent_end_sec (number)
      • clip_start_frame (integer)
      • clip_end_frame (integer)
      • parent_start_frame (integer)
      • parent_end_frame (integer)
      • state_change (boolean)
      • clip_pnr_frame (integer)
      • parent_pnr_frame (integer)
      • pnr_frame (null)
State Change Object Detection - fho_scod_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • split (string)
  • clips (array)
    • Items (object)
      • video_uid (string)
      • clip_id (string)
      • clip_uid (string)
      • clip_parent_start_sec (number)
      • clip_parent_end_sec (number)
      • clip_parent_start_frame (integer)
      • clip_parent_end_frame (integer)
      • pre_frame (object)
        • frame_number (integer)
        • clip_frame_number (integer)
        • width (integer)
        • height (integer)
        • bbox (array)
          • Items (object)
            • object_type (string)
            • structured_noun (['null', 'string'])
            • instance_number (['integer', 'null'])
            • bbox (object)
              • x (number)
              • y (number)
              • width (number)
              • height (number)
      • pnr_frame (object)
        • frame_number (integer)
        • clip_frame_number (integer)
        • width (integer)
        • height (integer)
        • bbox (array)
          • Items (object)
            • object_type (string)
            • structured_noun (['null', 'string'])
            • instance_number (['integer', 'null'])
            • bbox (object)
              • x (number)
              • y (number)
              • width (number)
              • height (number)
      • post_frame (object)
        • frame_number (integer)
        • clip_frame_number (integer)
        • width (integer)
        • height (integer)
        • bbox (array)
          • Items (object)
            • object_type (string)
            • structured_noun (['null', 'string'])
            • instance_number (['integer', 'null'])
            • bbox (object)
              • x (number)
              • y (number)
              • width (number)
              • height (number)
Short Term Action Anticipation - fho_sta_<set>.json schema
  • info (object)
    • description (string)
    • version (string)
    • split (string)
    • include_annotations (boolean)
    • video_metadata (object)
      • <video_uid> (object)
        • frame_width (integer)
        • frame_height (integer)
        • fps (number)
    • year (string)
    • date_created (string)
  • annotations (array)
    • Items (object)
      • uid (string)
      • video_id (string)
      • frame (integer)
      • clip_id (integer)
      • clip_uid (string)
      • clip_frame (integer)
      • objects (array)
        • Items (object)
          • box (array)
            • Items (number)
          • verb_category_id (integer)
          • noun_category_id (integer)
          • time_to_contact (number)
  • noun_categories (array)
    • Items (object)
      • id (integer)
      • name (string)
  • verb_categories (array)
    • Items (object)
      • id (integer)
      • name (string)
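
Because objects reference categories by integer id, a lookup table is usually the first thing to build. The following sketch, with hypothetical helper names and made-up category names, resolves verb_category_id and noun_category_id against the verb_categories and noun_categories arrays from the schema above.

```python
def category_names(sta):
    """Build id -> name lookups from noun_categories and verb_categories."""
    nouns = {c["id"]: c["name"] for c in sta["noun_categories"]}
    verbs = {c["id"]: c["name"] for c in sta["verb_categories"]}
    return nouns, verbs


def decode_objects(annotation, nouns, verbs):
    """Return (verb, noun, time_to_contact) for each object in an annotation."""
    return [
        (verbs[obj["verb_category_id"]],
         nouns[obj["noun_category_id"]],
         obj["time_to_contact"])
        # Tolerate annotations that carry no objects array.
        for obj in annotation.get("objects", [])
    ]


# Hypothetical, trimmed file contents for illustration only.
sample_sta = {
    "annotations": [
        {"uid": "a1", "clip_uid": "c1", "frame": 90,
         "objects": [{"box": [10.0, 20.0, 110.0, 220.0],
                      "verb_category_id": 1, "noun_category_id": 0,
                      "time_to_contact": 0.5}]},
    ],
    "noun_categories": [{"id": 0, "name": "cup"}],
    "verb_categories": [{"id": 0, "name": "put"}, {"id": 1, "name": "take"}],
}
nouns, verbs = category_names(sample_sta)
decoded = decode_objects(sample_sta["annotations"][0], nouns, verbs)
```
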
Moments Queries - moments_<set>.json schema
  • version (string): Dataset specific version.
  • date (string): Date of generation.
  • description (string)
  • manifest (string): Top level ego4d manifest json.
  • videos (array)
    • Items (object)
      • video_uid (string)
      • split (string)
      • clips (array)
        • Items (object)
          • clip_uid (string): The exported clip clip_uid.
          • video_start_sec (number): Annotation start time relative to the canonical video.
          • video_end_sec (number): Annotation end time relative to the canonical video.
          • video_start_frame (integer): Annotation start frame relative to the canonical video.
          • video_end_frame (integer): Annotation end frame relative to the canonical video.
          • clip_start_sec (integer): Annotation start time relative to the canonical clip.
          • clip_end_sec (number): Annotation end time relative to the canonical clip.
          • clip_start_frame (integer): Annotation start frame relative to the canonical clip.
          • clip_end_frame (integer): Annotation end frame relative to the canonical clip.
          • source_clip_uid (string)
          • annotations (array)
            • Items (object)
              • annotator_uid (string)
              • labels (array)
                • Items (object)
                  • start_time (number): Canonical clip label start time.
                  • end_time (number): Canonical clip label end time.
                  • label (string): Moments label class.
                  • video_start_time (number)
                  • video_end_time (number)
                  • video_start_frame (integer)
                  • video_end_frame (integer)
                  • primary (boolean): Primary label used for Moments baseline task.
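
One way to walk the videos, clips, annotations and labels nesting is to flatten it into rows. The sketch below keeps only labels whose primary flag is true; the function name and the sample values are hypothetical, and the field names are taken from the schema above.

```python
def primary_labels(moments):
    """Return (clip_uid, label, start_time, end_time) for primary labels."""
    rows = []
    for video in moments["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for label in annotation["labels"]:
                    if label.get("primary"):
                        rows.append((clip["clip_uid"], label["label"],
                                     label["start_time"], label["end_time"]))
    return rows


# Hypothetical, trimmed file contents for illustration only.
sample_moments = {
    "videos": [
        {"video_uid": "v1", "split": "train", "clips": [
            {"clip_uid": "c1", "annotations": [
                {"annotator_uid": "u1", "labels": [
                    {"label": "wash_dishes", "start_time": 3.0,
                     "end_time": 9.5, "primary": True},
                    {"label": "use_phone", "start_time": 1.0,
                     "end_time": 2.0, "primary": False},
                ]},
            ]},
        ]},
    ]
}
rows = primary_labels(sample_moments)
```
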
Narrations - narrations.json schema
  • <video_uid> (object)
    • narration_pass_1 (object)
      • narrations (array)
        • Items (object)
          • timestamp_sec (number)
          • timestamp_frame (integer)
          • _unmapped_timestamp_sec (number)
          • narration_text (string)
          • annotation_uid (string)
      • summaries (array)
        • Items (object)
          • start_sec (number)
          • end_sec (number)
          • summary_text (string)
          • annotation_uid (string)
    • narration_pass_2 (object)
      • narrations (array)
        • Items (object)
          • timestamp_sec (number)
          • timestamp_frame (integer)
          • _unmapped_timestamp_sec (number)
          • narration_text (string)
          • annotation_uid (string)
      • summaries (array)
        • Items (object)
          • start_sec (number)
          • end_sec (number)
          • summary_text (string)
          • annotation_uid (string)
    • status (string)
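
Unlike the other files, narrations.json is keyed by video_uid at the top level. A minimal sketch of merging both narration passes for one video into a single timeline follows; the function name and sample text are hypothetical, and it assumes a pass may be missing or null for some videos.

```python
def narrations_for_video(narrations, video_uid):
    """Merge both passes into (timestamp_sec, pass_name, text), time-sorted."""
    entry = narrations[video_uid]
    merged = []
    for pass_name in ("narration_pass_1", "narration_pass_2"):
        # Treat a missing or null pass as empty (an assumption, see above).
        pass_data = entry.get(pass_name) or {}
        for item in pass_data.get("narrations", []):
            merged.append(
                (item["timestamp_sec"], pass_name, item["narration_text"])
            )
    return sorted(merged)


# Hypothetical, trimmed file contents for illustration only.
sample_narrations = {
    "v1": {
        "narration_pass_1": {"narrations": [
            {"timestamp_sec": 5.0, "narration_text": "C opens the door"},
        ], "summaries": []},
        "narration_pass_2": {"narrations": [
            {"timestamp_sec": 2.0, "narration_text": "C walks to the door"},
        ], "summaries": []},
        "status": "complete",
    },
    "v2": {"status": "redacted"},
}
timeline = narrations_for_video(sample_narrations, "v1")
```
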
Natural Language Queries - nlq_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • manifest (string)
  • videos (array)
    • Items (object)
      • video_uid (string)
      • clips (array)
        • Items (object)
          • clip_uid (string)
          • video_start_sec (number)
          • video_end_sec (number)
          • video_start_frame (integer)
          • video_end_frame (integer)
          • clip_start_sec (integer)
          • clip_end_sec (number)
          • clip_start_frame (integer)
          • clip_end_frame (integer)
          • source_clip_uid (string)
          • annotations (array)
            • Items (object)
              • language_queries (array)
                • Items (object)
                  • clip_start_sec (number)
                  • clip_end_sec (number)
                  • video_start_sec (number)
                  • video_end_sec (number)
                  • video_start_frame (integer)
                  • video_end_frame (integer)
                  • template (['null', 'string'])
                  • query (['null', 'string'])
                  • slot_x (['null', 'string'])
                  • verb_x (['null', 'string'])
                  • slot_y (['null', 'string'])
                  • verb_y (string)
                  • raw_tags (array)
                    • Items (['null', 'string'])
              • annotation_uid (string)
      • split (string)
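
Since query and several related fields are nullable, code that flattens language_queries probably needs to skip incomplete entries. The sketch below does that and keeps the clip-relative response window; the function name and sample values are hypothetical.

```python
def flatten_queries(nlq):
    """Return one dict per language query that has a non-null query string."""
    rows = []
    for video in nlq["videos"]:
        for clip in video["clips"]:
            for annotation in clip["annotations"]:
                for item in annotation["language_queries"]:
                    if item.get("query") is None:
                        continue  # query is typed ['null', 'string']
                    rows.append({
                        "video_uid": video["video_uid"],
                        "clip_uid": clip["clip_uid"],
                        "query": item["query"],
                        "window": (item["clip_start_sec"],
                                   item["clip_end_sec"]),
                    })
    return rows


# Hypothetical, trimmed file contents for illustration only.
sample_nlq = {
    "videos": [
        {"video_uid": "v1", "split": "val", "clips": [
            {"clip_uid": "c1", "annotations": [
                {"annotation_uid": "a1", "language_queries": [
                    {"query": "Where did I put the keys?",
                     "clip_start_sec": 12.0, "clip_end_sec": 15.5},
                    {"query": None,
                     "clip_start_sec": 0.0, "clip_end_sec": 0.0},
                ]},
            ]},
        ]},
    ]
}
queries = flatten_queries(sample_nlq)
```
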
Visual Queries - vq_<set>.json schema
  • version (string)
  • date (string)
  • description (string)
  • manifest (string)
  • videos (array)
    • Items (object)
      • video_uid (string)
      • split (string)
      • clips (array)
        • Items (object)
          • clip_uid (string)
          • video_start_sec (number)
          • video_end_sec (number)
          • video_start_frame (integer)
          • video_end_frame (integer)
          • clip_start_sec (integer)
          • clip_end_sec (number)
          • clip_start_frame (integer)
          • clip_end_frame (integer)
          • clip_fps (number)
          • annotation_complete (boolean)
          • source_clip_uid (string)
          • annotations (array)
            • Items (object)
              • query_sets (object)
                • 1 (object)
                  • is_valid (boolean)
                  • errors (array)
                    • Items (string)
                  • warnings (array)
                    • Items (string)
                  • query_frame (integer)
                  • query_video_frame (integer)
                  • response_track (array)
                    • Items (object)
                      • frame_number (integer)
                      • x (number)
                      • y (number)
                      • width (number)
                      • height (number)
                      • rotation (number)
                      • original_width (integer)
                      • original_height (integer)
                      • video_frame_number (integer)
                  • object_title (string)
                  • visual_crop (object)
                    • frame_number (integer)
                    • x (number)
                    • y (number)
                    • width (number)
                    • height (number)
                    • rotation (number)
                    • original_width (integer)
                    • original_height (integer)
                    • video_frame_number (integer)
                • 2 (object)
                  • is_valid (boolean)
                  • errors (array)
                    • Items (string)
                  • warnings (array)
                    • Items (string)
                  • query_frame (integer)
                  • query_video_frame (['integer', 'null'])
                  • response_track (array)
                    • Items (object)
                      • frame_number (integer)
                      • x (number)
                      • y (number)
                      • width (number)
                      • height (number)
                      • rotation (number)
                      • original_width (integer)
                      • original_height (integer)
                      • video_frame_number (integer)
                  • object_title (string)
                  • visual_crop (object)
                    • frame_number (integer)
                    • x (number)
                    • y (number)
                    • width (number)
                    • height (number)
                    • rotation (number)
                    • original_width (integer)
                    • original_height (integer)
                    • video_frame_number (integer)
                • 3 (object)
                  • is_valid (boolean)
                  • errors (array)
                    • Items (string)
                  • warnings (array)
                    • Items (string)
                  • query_frame (integer)
                  • query_video_frame (['integer', 'null'])
                  • response_track (array)
                    • Items (object)
                      • frame_number (integer)
                      • x (number)
                      • y (number)
                      • width (number)
                      • height (number)
                      • rotation (number)
                      • original_width (integer)
                      • original_height (integer)
                      • video_frame_number (integer)
                  • object_title (string)
                  • visual_crop (object)
                    • frame_number (integer)
                    • x (number)
                    • y (number)
                    • width (number)
                    • height (number)
                    • rotation (number)
                    • original_width (integer)
                    • original_height (integer)
                    • video_frame_number (integer)
                • 4 (object)
                  • is_valid (boolean)
                  • errors (array)
                  • warnings (array)
                    • Items (string)
                  • query_frame (integer)
                  • query_video_frame (integer)
                  • response_track (array)
                    • Items (object)
                      • frame_number (integer)
                      • x (number)
                      • y (number)
                      • width (number)
                      • height (number)
                      • rotation (integer)
                      • original_width (integer)
                      • original_height (integer)
                      • video_frame_number (integer)
                  • object_title (string)
                  • visual_crop (object)
                    • frame_number (integer)
                    • x (number)
                    • y (number)
                    • width (number)
                    • height (number)
                    • rotation (integer)
                    • original_width (integer)
                    • original_height (integer)
                    • video_frame_number (integer)
              • warnings (array)
                • Items (string)
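
The query_sets object is keyed by the numerals shown above, which arrive as strings once the JSON is parsed. A minimal sketch for collecting the valid query sets of one annotation, along with the frame span covered by each response_track, could look like this; the function name and sample values are hypothetical.

```python
def valid_query_sets(annotation):
    """Summarize each query set whose is_valid flag is true."""
    summary = {}
    for key, query_set in annotation["query_sets"].items():
        if not query_set.get("is_valid"):
            continue
        frames = [box["frame_number"]
                  for box in query_set.get("response_track", [])]
        summary[key] = {
            "object_title": query_set["object_title"],
            "query_frame": query_set["query_frame"],
            # First and last frame of the response track, if it has boxes.
            "response_frames": (min(frames), max(frames)) if frames else None,
        }
    return summary


# Hypothetical, trimmed annotation for illustration only.
sample_annotation = {
    "query_sets": {
        "1": {"is_valid": True, "object_title": "mug", "query_frame": 120,
              "response_track": [{"frame_number": 40},
                                 {"frame_number": 41},
                                 {"frame_number": 42}]},
        "2": {"is_valid": False, "object_title": "keys", "query_frame": 300,
              "response_track": []},
    }
}
summary = valid_query_sets(sample_annotation)
```
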