
Add ScatterGatherCPU and rework Copy op to batch processing #3266

Merged

klecki merged 2 commits into NVIDIA:main on Aug 25, 2021

Conversation

@klecki (Contributor) commented on Aug 20, 2021

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

  • Introduce ScatterGatherCPU for parity with ScatterGatherGPU (rough usage sketch below)
  • Rework the Copy operator to use batch processing via ScatterGather
  • Adjust ElementExtract to the changes in ScatterGather
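
A rough usage sketch of the new CPU path (the AddCopy call is assumed to mirror ScatterGatherGPU, and the execution-engine variable is a placeholder, not the merged operator code):

// Batch all per-sample copies, then execute them in one Run call.
kernels::ScatterGatherCPU scatter_gather(kMaxSizePerBlock);
for (int i = 0; i < batch_size; i++) {
  scatter_gather.AddCopy(output[i].raw_mutable_data(),
                         input[i].raw_data(),
                         input[i].nbytes());
}
scatter_gather.Run(exec_engine);  // splits the work into blocks and runs them in the thread pool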

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Additional information

  • Affected modules and functionalities:
    ScatterGather, Copy, ElementExtract

  • Key points relevant for the review:
    Block size for CPU copy?

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2258

* ScatterGatherCPU
* Copy with batch processing
* Adjust ElementExtract to changes

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented on Aug 20, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2807037]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [2807037]: BUILD PASSED

 * @param reset - if true, calls Reset after processing is over
 */
template <typename ExecutionEngine>
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
Contributor

Suggested change
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
void Run(ExecutionEngine &exec_engine, bool reset = true) {

Not needed on an inline function, I guess?

Contributor Author

done

/**
 * @brief Reserves GPU memory for the description of the blocks.
 */
void ReserveGPUBlocks();

size_t max_size_per_block_ = kDefaultBlockSize;
std::vector<CopyRange> blocks_;
kernels::memory::KernelUniquePtr<CopyRange> blocks_dev_;
Contributor

When you're at it - perhaps use DeviceBuffer<CopyRange>? Then you could allocate and copy it with simple blocks_dev_.from_host(blocks_); - it would handle buffer growth and all other stuff.
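
A minimal sketch of what the suggested change could look like (the header path is an assumption; the from_host call is taken from the comment above):

#include "dali/core/dev_buffer.h"  // assumed location of DeviceBuffer

// member: replaces kernels::memory::KernelUniquePtr<CopyRange>
DeviceBuffer<CopyRange> blocks_dev_;

// ...wherever the block descriptors are uploaded to the GPU:
blocks_dev_.from_host(blocks_);  // grows the buffer as needed and copies host-to-device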

Contributor Author

done

Comment on lines 184 to 188
Coalesce();
for (auto &r : ranges_) {
  exec_engine.AddWork([=](int thread_id) { std::memcpy(r.dst, r.src, r.size); }, r.size);
}
exec_engine.RunAll();
Contributor

Doing it this way is counterproductive. If you really want to leverage parallelism, then after coalescing you should split the buffers into suitably sized blocks; otherwise coalescing decreases parallelism.
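
For illustration, a self-contained sketch of the splitting this comment calls for; CopyRange, MakeBlocks, and the exact signature are stand-ins, not necessarily the PR's code:

#include <algorithm>
#include <cstddef>
#include <vector>

struct CopyRange {
  char *dst;
  const char *src;
  size_t size;
};

// Split already-coalesced ranges back into blocks of at most max_block bytes,
// so the thread pool gets enough independent work items.
std::vector<CopyRange> MakeBlocks(const std::vector<CopyRange> &ranges, size_t max_block) {
  std::vector<CopyRange> blocks;
  for (const auto &r : ranges) {
    for (size_t off = 0; off < r.size; off += max_block) {
      size_t sz = std::min(max_block, r.size - off);
      blocks.push_back({r.dst + off, r.src + off, sz});
    }
  }
  return blocks;
}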

Contributor Author

I missed the make-blocks step and thought that Coalesce already splits the ranges again.

Contributor Author

Now it can split either by a target number of blocks or by block size.

Comment on lines +81 to +85
AllocType alloc =
    std::is_same<TypeParam, ScatterGatherCPU>::value ? AllocType::Host : AllocType::GPU;

auto in_ptr = kernels::memory::alloc_unique<char>(alloc, in.size());
auto out_ptr = kernels::memory::alloc_unique<char>(alloc, out.size());
Contributor

I think that so much of this test is guarded with if-else that we'd be better off having two tests - one for GPU, one for CPU. Note that kernels::memory is slated for removal and there will be no run-time selection of memory kind!

Contributor Author

Moved to generic functions.
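
One possible shape for those generic functions (the names and parameters here are illustrative, not the merged test code):

// Shared test body, templated on the scatter-gather implementation and an
// allocation callable, so there is no run-time selection of memory kind.
template <typename ScatterGatherImpl, typename AllocFn>
void FillAndCopy(ScatterGatherImpl &sg, AllocFn &&alloc, size_t nbytes) {
  auto in_ptr = alloc(nbytes);   // host memory for CPU, device memory for GPU
  auto out_ptr = alloc(nbytes);
  sg.AddCopy(out_ptr.get(), in_ptr.get(), nbytes);
  // ...run with the backend-appropriate engine/stream and verify the output
}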

kernels::ScatterGatherGPU> scatter_gather_;
// 1 MB per block for CPU, 256 kB per block for GPU
static constexpr size_t kMaxSizePerBlock =
    std::is_same<Backend, CPUBackend>::value ? 1 << 20 : 1 << 18;
Contributor

@mzient commented Aug 22, 2021

Why would we like to have larger blocks on CPU? Have you benchmarked it or is it just guessing?

Contributor Author

Just guessing. I think I'll go with at least number of threads * 3 tasks, and skip the thread pool entirely for small buffers.
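
A sketch of that sizing heuristic (the function name and rounding details are assumptions):

#include <algorithm>
#include <cstddef>

// Aim for at least num_threads * 3 blocks, but never exceed the per-block cap.
size_t PickBlockSize(size_t total_size, int num_threads, size_t max_block) {
  size_t target_blocks = static_cast<size_t>(num_threads) * 3;
  size_t by_count = (total_size + target_blocks - 1) / target_blocks;  // ceiling division
  return std::min(std::max<size_t>(by_count, 1), max_block);
}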

template <typename ExecutionEngine>
DLL_PUBLIC void Run(ExecutionEngine &exec_engine, bool reset = true) {
  Coalesce();
  for (auto &r : ranges_) {
Contributor

Optimization opportunity: if the total size is small enough, it's better to use a sequential execution engine and avoid the overhead of synchronization.
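
Roughly what that could look like inside Run; the threshold and the inline fallback below are assumptions, not the merged code:

// If the total size is tiny, copy inline on the calling thread and skip the
// thread-pool scheduling and synchronization overhead (threshold is a guess).
size_t total = 0;
for (auto &r : ranges_)
  total += r.size;
if (total < kSmallCopyThreshold) {  // e.g. a few tens of kilobytes
  for (auto &r : ranges_)
    std::memcpy(r.dst, r.src, r.size);
  if (reset)
    Reset();
  return;
}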

Contributor Author

Done

                              layout=layout)
    if dev == "gpu":
        input = input.gpu()
    output = fn.copy(input)
Contributor

Why do we even have such an operator!?

Contributor Author

Don't ask me.

@mzient self-assigned this on Aug 22, 2021
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented on Aug 23, 2021

!build

@dali-automaton (Collaborator)

CI MESSAGE: [2823621]: BUILD STARTED

@klecki changed the title from "Add ScatterGatherCPU and rework Copy operator" to "Add ScatterGatherCPU and remove contiguous acces from Copy" on Aug 23, 2021
@klecki changed the title from "Add ScatterGatherCPU and remove contiguous acces from Copy" back to "Add ScatterGatherCPU and rework Copy operator" on Aug 23, 2021
@dali-automaton (Collaborator)

CI MESSAGE: [2823621]: BUILD PASSED

@klecki changed the title from "Add ScatterGatherCPU and rework Copy operator" to "Add ScatterGatherCPU and rework Copy op to batch processing" on Aug 25, 2021
@klecki merged commit edc148d into NVIDIA:main on Aug 25, 2021