MixedBackend support for InputOperator #4603

Merged: 9 commits merged into NVIDIA:main from the mxd_input branch on Jan 25, 2023

Conversation

@szalpal (Member) commented Jan 23, 2023

Category:

New feature (non-breaking change)

Description:

So far, we have never had a MixedBackend input operator: none of ExternalSourceCpu, ExternalSourceGpu and VideoInputCpu had to inherit from InputOperator<MixedBackend>. With VideoInputGpu the situation changes, since this operator is actually a mixed operator. Therefore, support for MixedBackend operators had to be added to InputOperator.

The mixed input operator has a somewhat different meaning than the other two. Both the CPU and GPU variants use a uniform device throughout execution, while a Mixed operator changes the device along the way. In other words, the following pattern occurs:

Operator backend | Input backend | Output backend
CPU              | CPU           | CPU
Mixed            | CPU           | GPU
GPU              | GPU           | GPU

Let's take the example of an operator that subclasses InputOperator:

  1. MyOp<CPUBackend>. This case is already handled by InputOperator. Since the operator takes a CPU input and returns a CPU output, the "intermediate" data (i.e. the data acquired through the InputOperator API ForwardCurrentData) also has a CPUBackend.
  2. MyOp<GPUBackend>. This case is also already handled. Since the operator takes a GPU input and returns a GPU output, the intermediate data has to be on the GPU.
  3. MyOp<MixedBackend>. This case is what the PR targets (see the sketch after this list).
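
Below is a hypothetical sketch of the backend mapping implied by the three cases above; the IntermediateBackend helper and its name are made up for illustration and do not come from the PR:

  // Hypothetical illustration only - this helper does not exist in DALI.
  // CPU and GPU input operators hand out intermediate data in their own backend,
  // while a Mixed input operator keeps CPU data at the top of its queue.
  template <typename OpBackend>
  struct IntermediateBackend {
    using type = OpBackend;  // CPUBackend -> CPUBackend, GPUBackend -> GPUBackend
  };

  template <>
  struct IntermediateBackend<MixedBackend> {
    using type = CPUBackend;  // a Mixed operator consumes CPU input
  };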

The input operator grabs the actual input from the top of the queue (a CachingList in the implementation). Since a mixed operator has a CPU input and returns a GPU output, we do not know which backend the operator needs the data in to proceed. It might be CPU data (which I believe will be the more common case), e.g. when the operator needs to conduct CPU-based Huffman decoding and then continue with JPEG decoding on the GPU. However, the operator might not need CPU data at all and only be a Mixed operator for some other reason - in that case, it would like ForwardCurrentData to return a tensor that already has a GPU backend.

For these reasons, there are two overloads of ForwardCurrentData: one takes a CPU target tensor and one takes a GPU target tensor. The former should be the more common case; the latter is a convenience function that makes the code a lot simpler. ForwardCurrentData<Mixed>(Tensor<CPU>) behaves the same way as its CPU sibling - copy or swap, depending on whether the memory is pageable or pinned. In the case of ForwardCurrentData<Mixed>(Tensor<GPU>), the data at the top of the queue is always CPU data (since a Mixed operator has a CPU input), so InputOperator needs to copy it to the GPU.
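
For reference, these two overload declarations (also quoted verbatim later in the review discussion) look like this:

  // CPU target: the queued CPU data is copied or swapped into `target`, using the supplied thread pool.
  void DLL_PUBLIC ForwardCurrentData(TensorList<CPUBackend> &target, ThreadPool &tp);
  // GPU target: the queued CPU data is copied to the GPU on the given stream.
  void DLL_PUBLIC ForwardCurrentData(TensorList<GPUBackend> &target, cudaStream_t stream = nullptr);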

For the CPU overload, the operator has to create its own ThreadPool, since Mixed operators are not supplied with one automatically.
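
As a rough sketch of that last point - the ThreadPool constructor arguments and the "num_threads"/"device_id" spec arguments below are assumptions for illustration, not code from this PR:

  // Sketch only: a Mixed input operator owning its own ThreadPool.
  // DALI headers for InputOperator, OpSpec and ThreadPool are omitted (their paths are an assumption),
  // as are SetupImpl/RunImpl and the rest of the operator boilerplate.
  #include <memory>

  class MyMixedInput : public InputOperator<MixedBackend> {
   public:
    explicit MyMixedInput(const OpSpec &spec)
        : InputOperator<MixedBackend>(spec),
          tp_(std::make_unique<ThreadPool>(spec.GetArgument<int>("num_threads"),
                                           spec.GetArgument<int>("device_id"),
                                           /*set_affinity=*/false, "MyMixedInput")) {}

   private:
    // Used with the CPU-target ForwardCurrentData overload, which expects a thread pool.
    std::unique_ptr<ThreadPool> tp_;
  };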

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@szalpal (Member, Author) commented Jan 23, 2023

!build

@dali-automaton (Collaborator):

CI MESSAGE: [7086603]: BUILD STARTED

@szalpal (Member, Author) commented Jan 23, 2023

!build

@dali-automaton (Collaborator):

CI MESSAGE: [7088060]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [7088060]: BUILD PASSED

@szalpal szalpal marked this pull request as ready for review January 24, 2023 07:15
@szalpal szalpal marked this pull request as draft January 24, 2023 07:19
@szalpal szalpal marked this pull request as ready for review January 24, 2023 08:55

 private:
  void PutTogetherDaliGraph() {
    pipeline_->AddOperator(OpSpec("PassthroughInput")
Contributor:

Ideally this PassthroughInput would be defined in a _test.* file so that it doesn't belong to DALI library

  }
  // if the output is pinned and the input is not, the data needs to be copied
  if (target.is_pinned() && !tensor_list_elm.front()->is_pinned()) {
    const auto &shapes = tensor_list_elm.front()->shape();
Contributor:

nitpick: auto &elem = tensor_list_elm.front(); (you use it several times in this scope)

Comment on lines +160 to +161
target.CopySample(sample_id, *tensor_list_elm.front(), sample_id,
                  AccessOrder::host());
Contributor:

Suggested change:
- target.CopySample(sample_id, *tensor_list_elm.front(), sample_id,
-                   AccessOrder::host());
+ target.CopySample(sample_id, *tensor_list_elm.front(), sample_id);

leave the order to be figured out from the arguments?

Signed-off-by: szalpal <mszolucha@nvidia.com> (6 commits)
@klecki (Contributor) left a comment:

Looks good, mostly nitpicks about test scope and the misleading naming of the test operator.

DALI_SCHEMA(PassthroughInput)
    .DocStr(
        R"code(
The operator that is a passthrough operator and also an input operator. Used for test only.
Contributor:

To be a Passthrough operator it must contain a PassThrough or SamplewisePassThrough declaration in the schema. This one doesn't, so I would name it something along the lines of TestInputOperator or similar.

@szalpal (Member, Author):

Renamed to IdentityInput

Comment on lines 70 to 72
this->ForwardCurrentData(intermediate,
                         std::is_same_v<Backend, CPUBackend> ? ws.GetThreadPool() : *tp_);
out.Copy(intermediate, ws.stream());
Contributor:

Could we check, for the CPUBackend, if we can just ForwardCurrentData to the output without copying it?

@szalpal (Member, Author):

I'm not sure what you have in mind, but let me explain what happens here.

The point of this test is to check both overloads:

  void DLL_PUBLIC ForwardCurrentData(TensorList<CPUBackend> &target, ThreadPool &tp);
  void DLL_PUBLIC ForwardCurrentData(TensorList<GPUBackend> &target, cudaStream_t stream = nullptr);

inside an InputOperator<MixedBackend> operator. We don't care about InputOperator<CPUBackend> and InputOperator<GPUBackend>, because those two are covered by the ExternalSource tests.

Therefore we have a cpu_input flag, specified and switched in tests, which picks the proper overload:

  void RunCpuInput(Workspace &ws) {
    auto &out = ws.Output<OutBackend>(0);
    TensorList<CPUBackend> intermediate;
    this->ForwardCurrentData(intermediate,  //  ForwardCurrentData(TensorList<CPUBackend>, ThreadPool &)
                             std::is_same_v<Backend, CPUBackend> ? ws.GetThreadPool() : *tp_);
    out.Copy(intermediate, ws.stream());
  }


  void RunGpuInput(Workspace &ws) {
    auto &out = ws.Output<OutBackend>(0);
    this->ForwardCurrentData(out, ws.stream());  // ForwardCurrentData(TensorList<GPUBackend> &, cudaStream_t);
  }

This way both of them are tested inside a Mixed operator, so we're happy :)

Contributor:

Ok, as long as we don't test the CPU and GPU variants, it doesn't make sense to do so. In that case, IMO it is not necessary to implement this operator as templated over Backend; it should just be implemented for MixedBackend.
I can see that we only do DALI_REGISTER_OPERATOR(IdentityInput, IdentityInput<MixedBackend>, Mixed);, but that could just be DALI_REGISTER_OPERATOR(IdentityInput, IdentityInputMixed, Mixed);

Comment on lines 45 to 48
{2, 2, true, false, true},
{3, 3, true, false, true},
{2, 2, true, true, false},
{3, 3, true, true, false},
Contributor:

Suggested change:
- {2, 2, true, false, true},
- {3, 3, true, false, true},
- {2, 2, true, true, false},
- {3, 3, true, true, false},
+ {2, 2, true, false, true},
+ {3, 3, true, false, false},
+ {2, 2, true, true, false},
+ {3, 3, true, true, true},

Should we make every other usage of cpu_input different for the particular executor type?

@szalpal (Member, Author):

Done

Signed-off-by: szalpal <mszolucha@nvidia.com> (2 commits)
@szalpal (Member, Author) commented Jan 24, 2023

!build

@dali-automaton (Collaborator):

CI MESSAGE: [7097931]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [7097931]: BUILD PASSED

@szalpal szalpal merged commit 11cc181 into NVIDIA:main Jan 25, 2023
aderylo pushed a commit to zpp-dali-2022/DALI that referenced this pull request Mar 17, 2023
* Add In/Out type logic to the InputOperator.
* Add ForwardCurrentData specializations for MixedBackend.
* Refactor SetExternalInput routines and add MixedBackend handling.
* Add IdentityInput test-only operator.
* Add InputOperator<MixedBackend> tests.


Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
@szalpal szalpal deleted the mxd_input branch February 9, 2024 00:24