The object masks are provided in JSON files as RLE encoded strings. We use pycocotools
to encode/decode these masks.
The JSON file contains a dictionary which is organized as follows:
sequences:
- width: <int> # image_dims
height: <int>
id: <int>
seq_name: <str>
dataset: <str> # LaSOT, BDD, ArgoVerse, HACS, ...
fps: <int>
all_image_paths:
- <str>
...
annotated_image_paths:
- <str>
...
neg_category_ids:
- <int>
...
not_exhaustive_category_ids:
- <int>
...
track_category_ids:
... # see below for details
segmentations:
... # see below for details
categories:
- id: <int>
name: <str>
synset: <str>
def: <str>
synonyms:
- <str>
...
...
...
split: <str> # train/val/test
categories
: List of all object categories in the dataset. We use the same category IDs as the LVIS dataset.sequences
: List of all video sequences in the datasets. Each list entry is a dictionary with basic attributes e.g. image size, video ID, etc, and also the mask annotations for the object tracks in this video.split
: which split (train/val/test) the annotations belong to.
The track_category_ids
is a dict which conveys the category ID for each object track in that sequence:
track_category_ids:
track_id: category_id
...
The segmentations
field is a list with one entry per annotated video frame. Each list element is a dict with track IDs as keys and encoded masks and other attributes as values. In the example below, '1' and '2' are the track IDs
segmentations:
- 1: # first frame
rle: <str>
is_gt: <bool>
score: <float>
bbox: # only present in `first_frame_annotations` file
- x coord <int>
- y coord <int>
- width <int>
- height <int>
point: # only present in `first_frame_annotations` file
- x coord <int>
- y coord <int>
2:
- 1: ... # second frame
2: ...
- 1: ... # third frame
2: ...
For the training set, we adopted a semi-automated workflow for annotating temporally dense object masks. The is_gt
field conveys whether the given mask was annotated automatically or by a human annotator. The score
field conveys the confidence for an automatically annotated mask.
-
For the val and test sets, all annotations were done by humans, so the
is_gt
andscore
fields can be ignored. -
For the
first_frame_annotations
files (useful for exemplar-guided tasks), there are two additional fieldsbbox
andpoint
which convey the bounding box and a random point on the object for the first frame in which it occurs. -
In the
segmentations
andtrack_category_ids
fields, track IDs are encoded as strings (the JSON file format enforces that dict keys must be strings). Remember to cast them as int when parsing the annotations.
For evaluating your predicted results, the code expects a single JSON file with the same format as the ground-truth format explained above. For the exemplar-guided and open-world tasks, the track_category_ids
field is still required for parsing the prediction file, but the value is irrelevant i.e. you can assign any class ID to the tracks.