Add an operator for receiving video metadata #5630

Open
1 task done
treasan opened this issue Sep 10, 2024 · 5 comments
Labels: enhancement (New feature or request), Video (Video related feature/question)


treasan commented Sep 10, 2024

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Should have (e.g. Adoption is possible, but the performance shortcomings make the solution inferior).

Please provide a clear description of problem this feature solves

The sample rate (fps) of videos may vary, and hence the time period that a fixed number of frames represents also varies. Having access to the fps, the duration, or even the concrete timestep of each frame is crucial in many tasks where the actual duration in seconds matters more than the number of frames. For example, I am decoding raw video bytes from a web dataset using the experimental video decoder, and I am forced to fall back on other libraries that can give me this kind of information from the raw video bytes (specifically, PyTorch's VideoReader API).
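To make the problem concrete, here is a minimal sketch of the arithmetic. The frame count and frame rates are made-up illustrative values, and `clip_duration_seconds` is a hypothetical helper, not part of DALI. It assumes a constant frame rate within each video.

```python
from fractions import Fraction


def clip_duration_seconds(num_frames: int, fps: Fraction) -> Fraction:
    """Time span covered by `num_frames` consecutive frames at a constant `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return Fraction(num_frames) / fps


# Illustrative values only: the same 16 frames cover different time spans
# depending on the frame rate of the source video.
for fps in (Fraction(24), Fraction(30000, 1001), Fraction(60)):
    duration = clip_duration_seconds(16, fps)
    print(f"{float(fps):6.2f} fps -> {float(duration):.3f} s")
```

A fixed-length frame window is therefore not a fixed-length time window unless the fps is known.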

Feature Description

As a user I want to be able to extract information about the sample rate of a video alongside its decoded frames.

Describe your ideal solution

A new DALI operator that extracts the desired metadata from raw video bytes.
An example video decoding pipeline reading from a webdataset (raw video bytes could also come from an external source):

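from nvidia.dali import pipeline_def, fn

@pipeline_def
def pipeline(tar_paths):
    # Read the raw (still encoded) video bytes from the webdataset archives.
    raw_video = fn.readers.webdataset(paths=tar_paths, ...)
    # Proposed operator (does not exist yet): peek metadata from the raw bytes.
    duration, fps = fn.get_video_metadata(...)
    video = fn.experimental.decoders.video(raw_video)
    return video, duration, fps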

Describe any alternatives you have considered

No response

Additional context

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

treasan added the enhancement label Sep 10, 2024
JanuszL assigned awolant and unassigned szkarpinski Sep 10, 2024

Contributor

JanuszL commented Sep 10, 2024

Hi @treasan,

Thank you for reaching out. Yes, that sounds like a good feature to add. Let us add this to our ToDo list.
Could you also tell me how you want to use this data further? To drive transformations or to feed the model?

Author

treasan commented Sep 10, 2024

Hey @JanuszL

I am training a model that expects video snippets of a certain duration (in seconds). Furthermore, it expects a timestep for each frame, which is used for a temporal positional encoding.
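As a rough sketch of what the per-frame timesteps could look like once the fps is known: `frame_timesteps` below is a hypothetical helper (not a DALI function) and assumes a constant frame rate, which does not hold for variable-frame-rate videos.

```python
def frame_timesteps(start_frame: int, num_frames: int, fps: float) -> list[float]:
    """Timestamp in seconds of each frame in a snippet.

    Assumes a constant frame rate; frame `i` of the video is shown at `i / fps`.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [(start_frame + i) / fps for i in range(num_frames)]


# Illustrative values: a 4-frame snippet starting at frame 8 of a 4 fps video.
timesteps = frame_timesteps(start_frame=8, num_frames=4, fps=4.0)
```

These values are what would be passed to the temporal positional encoding alongside the decoded frames.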

Contributor

JanuszL commented Sep 10, 2024

Thank you for the clarification. In this case, I think it would be best to return this data directly from the video decoder (at least the timesteps for each frame), and/or to extend the decoder so that it decodes a given duration rather than a given number of frames.
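Decoding by duration boils down to converting a time span into a frame count per video. The sketch below only illustrates that conversion; `frames_for_duration` is a hypothetical helper, not an existing decoder option, and rounding to the nearest frame is an assumption (a real implementation might round up or down instead).

```python
def frames_for_duration(duration_s: float, fps: float) -> int:
    """Number of frames that best approximates `duration_s` seconds at `fps`.

    Assumes a constant frame rate and rounds to the nearest whole frame.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return round(duration_s * fps)


# Illustrative values: the same 3-second request maps to different frame counts.
frames_25 = frames_for_duration(3.0, 25.0)
frames_ntsc = frames_for_duration(3.0, 30000 / 1001)
```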

awolant added the Video label Sep 10, 2024

Contributor

awolant commented Sep 10, 2024

Hello @treasan

Thanks for creating the issue. To better understand the requirement, I wanted to ask: does your use case expect the samples to have the same number of frames, or does the number of frames vary per sample? If it varies, is that due to variable frame rates in the videos, a variable duration in seconds, or both? And if it varies, what are the expected type and shape of the output in your desired framework?

Author

treasan commented Sep 10, 2024

Please have a look at another issue/question I have submitted: #5626. I explain my pipeline there in more detail.

tl;dr:

  1. DALI pipeline: Loading raw video bytes from webdataset
  2. Python function: Peek duration and fps metadata from the raw video bytes and filter out unwanted videos beforehand (e.g. ones that are too short)
  3. DALI pipeline: Get raw video bytes, duration, fps from external source --> decode video --> return decoded video, duration, fps
  4. Python function: Cut out multiple consecutive snippets of a certain duration (e.g. 3 secs) from the respective videos based on the fps/duration metadata. These snippets constitute one training sample. They get batched and fed to the model alongside their timesteps, which are also calculated from the fps/duration metadata.
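Step 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code from my pipeline: `consecutive_snippets` is a hypothetical helper, the frame rate is assumed constant, and the snippets are taken back to back from the start of the video.

```python
def consecutive_snippets(
    total_frames: int, fps: float, snippet_s: float, num_snippets: int
) -> list[tuple[int, int]]:
    """Return half-open (start, stop) frame ranges for back-to-back snippets.

    Each snippet covers roughly `snippet_s` seconds. Raises ValueError if the
    video does not contain enough frames for all requested snippets.
    """
    frames_per_snippet = round(snippet_s * fps)
    if frames_per_snippet <= 0:
        raise ValueError("snippet would contain no frames")
    if frames_per_snippet * num_snippets > total_frames:
        raise ValueError("video too short for the requested snippets")
    return [
        (i * frames_per_snippet, (i + 1) * frames_per_snippet)
        for i in range(num_snippets)
    ]


# Illustrative values: three 3-second snippets from a 100-frame video at 10 fps.
ranges = consecutive_snippets(total_frames=100, fps=10.0, snippet_s=3.0, num_snippets=3)
```

Each `(start, stop)` pair can then be used to slice the decoded frame array along its time axis, e.g. `video[start:stop]`.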

So the optimal solution for my use case would be a DALI operator that peeks this metadata from the raw video bytes, since I could then filter unwanted videos out before the decoding step (more efficient). This might be similar to the peek_image_shape operator, which gives certain information about an encoded image.
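The filter-before-decode idea can be sketched like this. Everything here is hypothetical: `VideoMeta`, `filter_before_decoding`, and the `peek` callable are placeholders for whatever ends up reading the container header (another library today, possibly a DALI operator later).

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class VideoMeta(NamedTuple):
    """Hypothetical metadata record peeked from the encoded video."""
    duration_s: float
    fps: float


def filter_before_decoding(
    samples: Iterable[bytes],
    peek: Callable[[bytes], VideoMeta],
    min_duration_s: float,
) -> Iterator[tuple[bytes, VideoMeta]]:
    """Yield only the encoded videos that are long enough to be worth decoding."""
    for raw in samples:
        meta = peek(raw)  # cheap: reads metadata only, no frame decoding
        if meta.duration_s >= min_duration_s:
            yield raw, meta


# Stand-in for a real metadata reader, for demonstration only: it pretends
# the duration in seconds equals the number of bytes.
def fake_peek(raw: bytes) -> VideoMeta:
    return VideoMeta(duration_s=float(len(raw)), fps=30.0)


kept = list(filter_before_decoding([b"ab", b"abcdef"], fake_peek, min_duration_s=3.0))
```

Only the samples that pass the filter would be handed to the decoder, so too-short videos never incur decoding cost.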


4 participants