
Add ScatterGatherCPU and rework Copy op to batch processing #3266

Merged

klecki merged 2 commits into NVIDIA:main on Aug 25, 2021

Conversation

@klecki (Contributor) commented on Aug 20, 2021

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

  • Introduce ScatterGatherCPU for parity with ScatterGatherGPU (rough usage sketch below)
  • Rework the Copy operator to use batch processing via ScatterGather
  • Adjust ElementExtract to the changes in ScatterGather
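
A rough usage sketch of the new CPU path (the AddCopy call is assumed to mirror ScatterGatherGPU, and the execution-engine variable is a placeholder, not the merged operator code):

// Batch all per-sample copies, then execute them in one Run call.
kernels::ScatterGatherCPU scatter_gather(kMaxSizePerBlock);
for (int i = 0; i < batch_size; i++) {
  scatter_gather.AddCopy(output[i].raw_mutable_data(),
                         input[i].raw_data(),
                         input[i].nbytes());
}
scatter_gather.Run(exec_engine);  // splits the work into blocks and runs them in the thread pool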

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Additional information

  • Affected modules and functionalities:
    ScatterGather, Copy, ElementExtract

  • Key points relevant for the review:
    Block size for CPU copy?

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2258

* ScatterGatherCPU
* Copy with batch processing
* Adjust ElementExtract to changes

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented on Aug 20, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2807037]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [2807037]: BUILD PASSED

 * @param reset - if true, calls Reset after processing is over
 */
template <typename ExecutionEngine>
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
Contributor

Suggested change
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
void Run(ExecutionEngine &exec_engine, bool reset = true) {

Not needed on an inline function, I guess?

Contributor Author

done

/**
 * @brief Reserves GPU memory for the description of the blocks.
 */
void ReserveGPUBlocks();

size_t max_size_per_block_ = kDefaultBlockSize;
std::vector<CopyRange> blocks_;
kernels::memory::KernelUniquePtr<CopyRange> blocks_dev_;
Contributor

When you're at it - perhaps use DeviceBuffer<CopyRange>? Then you could allocate and copy it with simple blocks_dev_.from_host(blocks_); - it would handle buffer growth and all other stuff.
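
A minimal sketch of what the suggested change could look like (the header path is an assumption; the from_host call is taken from the comment above):

#include "dali/core/dev_buffer.h"  // assumed location of DeviceBuffer

// member: replaces kernels::memory::KernelUniquePtr<CopyRange>
DeviceBuffer<CopyRange> blocks_dev_;

// ...wherever the block descriptors are uploaded to the GPU:
blocks_dev_.from_host(blocks_);  // grows the buffer as needed and copies host-to-device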

Contributor Author

done

Comment on lines 184 to 188
Coalesce();
for (auto &r : ranges_) {
  exec_engine.AddWork([=](int thread_id) { std::memcpy(r.dst, r.src, r.size); }, r.size);
}
exec_engine.RunAll();
Contributor

Doing it this way is counterproductive. If you really want to leverage parallelism, then after coalescing you should split the buffers into suitably sized blocks; otherwise coalescing decreases parallelism.
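
For illustration, a self-contained sketch of the splitting this comment calls for; CopyRange, MakeBlocks, and the exact signature are stand-ins, not necessarily the PR's code:

#include <algorithm>
#include <cstddef>
#include <vector>

struct CopyRange {
  char *dst;
  const char *src;
  size_t size;
};

// Split already-coalesced ranges back into blocks of at most max_block bytes,
// so the thread pool gets enough independent work items.
std::vector<CopyRange> MakeBlocks(const std::vector<CopyRange> &ranges, size_t max_block) {
  std::vector<CopyRange> blocks;
  for (const auto &r : ranges) {
    for (size_t off = 0; off < r.size; off += max_block) {
      size_t sz = std::min(max_block, r.size - off);
      blocks.push_back({r.dst + off, r.src + off, sz});
    }
  }
  return blocks;
}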

Contributor Author

I missed the make-blocks step and thought that Coalesce already splits the ranges again.

Contributor Author

Now it can split either by a target number of blocks or by block size.

Comment on lines +81 to +85
AllocType alloc =
    std::is_same<TypeParam, ScatterGatherCPU>::value ? AllocType::Host : AllocType::GPU;

auto in_ptr = kernels::memory::alloc_unique<char>(alloc, in.size());
auto out_ptr = kernels::memory::alloc_unique<char>(alloc, out.size());
Contributor

I think that so much of this test is guarded with if-else that we'd be better off having two tests - one for GPU, one for CPU. Note that kernels::memory is slated for removal and there will be no run-time selection of memory kind!

Contributor Author

Moved to generic functions.
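
One possible shape for those generic functions (the names and parameters here are illustrative, not the merged test code):

// Shared test body, templated on the scatter-gather implementation and an
// allocation callable, so there is no run-time selection of memory kind.
template <typename ScatterGatherImpl, typename AllocFn>
void FillAndCopy(ScatterGatherImpl &sg, AllocFn &&alloc, size_t nbytes) {
  auto in_ptr = alloc(nbytes);   // host memory for CPU, device memory for GPU
  auto out_ptr = alloc(nbytes);
  sg.AddCopy(out_ptr.get(), in_ptr.get(), nbytes);
  // ...run with the backend-appropriate engine/stream and verify the output
}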

kernels::ScatterGatherGPU> scatter_gather_;
// 1 MB per block for CPU, 256 kB per block for GPU
static constexpr size_t kMaxSizePerBlock =
    std::is_same<Backend, CPUBackend>::value ? 1 << 20 : 1 << 18;
Contributor

@mzient commented Aug 22, 2021

Why would we like to have larger blocks on CPU? Have you benchmarked it or is it just guessing?

Contributor Author

Just guessing. I think I'll go with at least number of threads * 3 tasks, and skip the thread pool entirely for small buffers.
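
A sketch of that sizing heuristic (the function name and rounding details are assumptions):

#include <algorithm>
#include <cstddef>

// Aim for at least num_threads * 3 blocks, but never exceed the per-block cap.
size_t PickBlockSize(size_t total_size, int num_threads, size_t max_block) {
  size_t target_blocks = static_cast<size_t>(num_threads) * 3;
  size_t by_count = (total_size + target_blocks - 1) / target_blocks;  // ceiling division
  return std::min(std::max<size_t>(by_count, 1), max_block);
}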

template <typename ExecutionEngine>
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
  Coalesce();
  for (auto &r : ranges_) {
Contributor

Optimization opportunity: if the total size is small enough, it's better to use a sequential execution engine and avoid the overhead of synchronization.
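
Roughly what that could look like inside Run; the threshold and the inline fallback below are assumptions, not the merged code:

// If the total size is tiny, copy inline on the calling thread and skip the
// thread-pool scheduling and synchronization overhead (threshold is a guess).
size_t total = 0;
for (auto &r : ranges_)
  total += r.size;
if (total < kSmallCopyThreshold) {  // e.g. a few tens of kilobytes
  for (auto &r : ranges_)
    std::memcpy(r.dst, r.src, r.size);
  if (reset)
    Reset();
  return;
}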

Contributor Author

Done

                              layout=layout)
    if dev == "gpu":
        input = input.gpu()
    output = fn.copy(input)
Contributor

Why do we even have such an operator!?

Contributor Author

Don't ask me.

@mzient self-assigned this on Aug 22, 2021
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented on Aug 23, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2823621]: BUILD STARTED

@klecki changed the title from "Add ScatterGatherCPU and rework Copy operator" to "Add ScatterGatherCPU and remove contiguous acces from Copy" on Aug 23, 2021
@klecki changed the title from "Add ScatterGatherCPU and remove contiguous acces from Copy" back to "Add ScatterGatherCPU and rework Copy operator" on Aug 23, 2021
@dali-automaton (Collaborator)

CI MESSAGE: [2823621]: BUILD PASSED

@klecki changed the title from "Add ScatterGatherCPU and rework Copy operator" to "Add ScatterGatherCPU and rework Copy op to batch processing" on Aug 25, 2021
@klecki merged commit edc148d into NVIDIA:main on Aug 25, 2021