AV Diarization
Benchmark Repo: https://github.com/EGO4D/audio-visual
Motivation
People communicate using spoken language, making the capture of conversational content in business meetings and social settings a problem of great scientific and practical interest. While diarization has long been a standard problem in the speech recognition community, Ego4D brings in two new aspects: (1) the simultaneous capture of video and audio, and (2) the egocentric perspective of a participant in the conversation.
Task Definition
The Audio-Visual Diarization (AVD) benchmark is composed of four tasks (an illustrative sketch of their outputs follows the list):
Localization and tracking of the participants in the field of view: a bounding box is annotated around each participant's face.
Active speaker detection, where each tracked speaker is assigned an anonymous label, including the camera-wearer, who never appears in the visual field of view.
Diarization of each speaker’s speech activity, where we provide time segments corresponding to each speaker's voice activity in a clip.
Transcription of each speaker's speech content (only English speakers are considered in this version).
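To make the expected outputs of these four tasks concrete, here is a minimal sketch that models them as plain Python data structures. The class and field names are illustrative assumptions chosen for exposition (they mirror the annotation schema below), not an official submission format.

```python
from dataclasses import dataclass

@dataclass
class FaceBox:
    """Task 1: a face bounding box for one tracked participant in one frame."""
    person_id: str   # anonymous participant label
    frame: int       # clip-level frame index
    x: float         # top-left corner and size of the face box, in pixels
    y: float
    width: float
    height: float

@dataclass
class SpeechSegment:
    """Tasks 2-3: a voice-activity segment attributed to an anonymous speaker
    (including the camera-wearer, who has no face track)."""
    person_id: str
    start_time: float  # seconds, relative to the clip
    end_time: float

@dataclass
class TranscribedSegment:
    """Task 4: the transcribed content of one speech segment (English only)."""
    person_id: str
    start_time: float
    end_time: float
    transcription: str
```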
Annotation Schema
Audio-Visual Diarization - av_<set>.json
date (string)
version (string)
description (string)
videos (array of object)
    video_uid (string)
    split (string)
    clips (array of object)
        clip_uid (string)
        source_clip_uid (string)
        video_uid (string)
        video_start_sec (number)
        video_end_sec (number)
        video_start_frame (integer)
        video_end_frame (integer)
        clip_start_sec (integer)
        clip_end_sec (number)
        clip_start_frame (integer)
        clip_end_frame (integer)
        valid (boolean)
        camera_wearer (object)
            person_id (string)
            camera_wearer (boolean)
            tracking_paths (array)
            voice_segments (array of object)
                start_time (number)
                end_time (number)
                start_frame (integer)
                end_frame (integer)
                video_start_time (number)
                video_end_time (number)
                video_start_frame (integer)
                video_end_frame (integer)
                person (string)
        persons (array of object)
            person_id (string)
            camera_wearer (boolean)
            tracking_paths (array of object)
                track_id (string)
                track (array of object)
                    x (number)
                    y (number)
                    width (number)
                    height (number)
                    frame (integer)
                    video_frame (integer)
                    clip_frame (null)
                suspect (boolean)
                unmapped_frames_count (integer)
                unmapped_frames (array of integer)
            voice_segments (array of object)
                start_time (number)
                end_time (number)
                start_frame (integer)
                end_frame (integer)
                video_start_time (number)
                video_end_time (number)
                video_start_frame (integer)
                video_end_frame (integer)
                person (string)
        missing_voice_segments (array)
        transcriptions (array of object)
            transcription (string)
            start_time_sec (number)
            end_time_sec (number)
            person_id (string)
            video_start_time (number)
            video_start_frame (integer)
            video_end_time (number)
            video_end_frame (integer)
        social_segments_talking (array of object)
            start_time (number)
            end_time (number)
            start_frame (integer)
            end_frame (integer)
            video_start_time (number)
            video_end_time (number)
            video_start_frame (integer)
            video_end_frame (integer)
            person (string)
            target (string or null)
            is_at_me (boolean)
        social_segments_looking (array of object)
            start_time (number)
            end_time (number)
            start_frame (integer)
            end_frame (integer)
            video_start_time (number)
            video_end_time (number)
            video_start_frame (integer)
            video_end_frame (integer)
            person (string)
            target (null)
            is_at_me (boolean)
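For orientation, the snippet below walks the fields listed above. It is only a sketch: the path av_val.json is an assumed local copy of the validation annotations, and the loop is not part of the official Ego4D tooling.

```python
import json

# Load a local copy of the AVD annotation file ("av_val.json" is assumed here).
with open("av_val.json") as f:
    annotations = json.load(f)

print(annotations["version"], annotations["description"])

for video in annotations["videos"]:
    for clip in video["clips"]:
        print(f'clip {clip["clip_uid"]} ({clip["clip_start_sec"]}s - {clip["clip_end_sec"]}s)')

        # Per-participant face tracks and voice activity.
        # The camera-wearer's own annotations live under clip["camera_wearer"],
        # which has the same person_id / voice_segments layout as entries in "persons".
        for person in clip["persons"]:
            pid = person["person_id"]
            n_boxes = sum(len(path["track"]) for path in person["tracking_paths"])
            n_segments = len(person["voice_segments"])
            print(f'  person {pid}: camera_wearer={person["camera_wearer"]}, '
                  f'{n_boxes} face boxes, {n_segments} voice segments')

        # Transcribed speech with speaker attribution.
        for utt in clip["transcriptions"]:
            print(f'  [{utt["start_time_sec"]:.1f}-{utt["end_time_sec"]:.1f}] '
                  f'{utt["person_id"]}: {utt["transcription"]}')
```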