Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deferred deallocation to cuda_vm_resource. #3154

Merged
merged 4 commits into from
Jul 20, 2021

Conversation

mzient
Copy link
Contributor

@mzient mzient commented Jul 15, 2021

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

  • Factor out deferred dealloction from pool resource to a standalone file.
  • Add deferred deallocation to cuda_vm_resource.
  • Add more asserts to free_tree.
  • Improve free_tree API
  • Flush deallocations in tests, where necessary.

Additional information

  • Affected modules and functionalities:
  • Key points relevant for the review:

Checklist

Tests

  • Existing tests apply (with adjustments)
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2190

* Factor out deferred dealloction from pool resource to a standalone file.
* Add deferred deallocation to `cuda_vm_resource`.
* Add more asserts to `free_tree`.
* Improve free_tree API
* Flush deallocations in tests, where necessary.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2592782]: BUILD STARTED

Comment on lines +169 to +172
ptr = free_mapped_.get_specific_block(va, size);
assert(ptr == va);
stat_take_free(size);
mark_as_unavailable(ptr, size);
Copy link
Contributor Author

@mzient mzient Jul 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the bug - previously this code was ptr = try_get_mapped(size, alignment) - if somehow it got a block that wasn't entirely covered by va there was inconsistency in what's free and what is not and the integrity of the pool was destroyed. The bug was exposed when flush_deferred() wasn't followed by another attempt to get mapped already memory. Now this is hidden again (by the second attempt to get memory from free_mapped), but it was dangerous anyway, since the agreement of blocks from free_va and free_mapped is coincidental.

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2592782]: BUILD PASSED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@@ -28,7 +28,7 @@ namespace test {
class VMResourceTest : public ::testing::Test {
public:
void TestAlloc() {
cuda_vm_resource res;
cuda_vm_resource_base res;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: when I read cuda_vm_resource_base it reads to me as if it was some kind of base implementation that is not meant to be instantiated directly. I don't have a much better suggestion for a name, so feel free to ignore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not meant in the sense that it doesn't have a very important feature (deferred deallocation) - but it's nonetheless usable and testable.

@@ -187,6 +187,7 @@ class VMResourceTest : public ::testing::Test {
void *p1 = res.allocate(block_size); // allocate one block
void *p2 = res.allocate(block_size); // allocate another block
res.deallocate(p2, block_size); // now free the second block
res.flush_deferred();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now looking at this, is res meant to be cuda_vm_resource_base or the one with deferred allocation (cuda_vm_resource)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses the one with deferred deallocation - that's why this line is needed - otherwise we might not see the result of the deallocation and the test expectations will fail.

void synchronize(span<const dealloc_params> params) {
assert(device_ordinal_ >= 0 && "synchronize called before the resource initialization");
for (auto &p : params) {
if (p.sync_device >= 0 && p.sync_device < device_ordinal_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (p.sync_device >= 0 && p.sync_device < device_ordinal_)
if (p.sync_device >= 0 && p.sync_device != device_ordinal_)

Shouldn't it be equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should.

void *try_get_mapped(size_t size, size_t alignment) {
char *ptr = static_cast<char*>(free_mapped_.get(size, alignment));
if (ptr) {
stat_take_free(size);
free_va_.get_specific_block(ptr, ptr + size);
auto *va = free_va_.get_specific_block(ptr, size);
(void)va;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to silence a warning in release build?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the assertion will compile to nothing.

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2609268]: BUILD STARTED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2609654]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2609268]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2609654]: BUILD PASSED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2610050]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2610158]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [2610158]: BUILD PASSED

@mzient mzient merged commit 52e36da into NVIDIA:main Jul 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants