Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA VM resource bugfixes #3545

Merged
merged 4 commits into from
Dec 1, 2021
Merged

CUDA VM resource bugfixes #3545

merged 4 commits into from
Dec 1, 2021

Conversation

mzient
Copy link
Contributor

@mzient mzient commented Nov 30, 2021

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

  • Fixed exception safety bug in cuda_vm_resource.
    • bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
    • fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
    • fix: restore temporarily modified availability mask
  • Removed a trait that marked cuda_vm_resource as merge-capable.
    • bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
    • fix: remove the trait
  • Remove debug dump upon every OOM in cuda_vm_resource
  • Fix stats
    • bug: missing stat_add_free in deallocate_impl

Additional information

Affected modules and functionalities:

CUDA VM resource

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2097, DALI-2450

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3499776]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3499776]: BUILD FAILED

@JanuszL JanuszL self-assigned this Nov 30, 2021
print(std::cerr, "Could not allocate physical storage of size ", size, " on device ",
device_ordinal_);
dbg_dump(std::cerr);
free_va_.put(va, size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still should provide a message like - Attempted to allocate XXX, currently allocated YY, available ZZZ

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to change the original exception type - now I'd have to rethrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can adjust at least adjust dbg_dump(std::cerr); to print shorter and easier to digest message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for CUDABadAlloc - others are rethrown directly.

assert(static_cast<int>(free_blocks.size()) == blocks_to_map);
free_blocks.resize(blocks_to_map);
for (int i = to_unmap.size(); i < blocks_to_map; i++) {
// This can throw, but we should leave the object in a consistent state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This can throw, but we should leave the object in a consistent state
// This can throw, but should leave the object in a consistent state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant:
This can throw, but (this) should leave the object in a consistent state ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that is why we sounds strange here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter - both are fine and it's just an internal comment - not even doxygen.

res.deallocate(p1, size1);
auto &region = res.va_regions_[0];
EXPECT_EQ(region.available_blocks, 4);
EXPECT_EQ(region.mapped.find(false), 6);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EXPECT_EQ(region.mapped.find(false), 6);
EXPECT_EQ(region.mapped.find(true), 0);
EXPECT_EQ(region.mapped.find(false), 6);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@JanuszL
Copy link
Contributor

JanuszL commented Nov 30, 2021

Can you describe in the PR the bugs fixed and the solution?

* Don't mark cuda_vm_resource as merge-capable.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3501860]: BUILD STARTED

@mzient
Copy link
Contributor Author

mzient commented Nov 30, 2021

Can you describe in the PR the bugs fixed and the solution?

Described in the PR message.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3501927]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3501860]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3501927]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3501860]: BUILD PASSED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3505387]: BUILD STARTED

@mzient mzient added the important-fix Fixes an important issue in the software or development environment. label Dec 1, 2021
char *ptr = r.block_ptr<char>(block_idx);
free_mapped_.get_specific_block(ptr, block_size_);
stat_take_free(block_size_);
out.push_back({ &r, block_idx });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out is a template - it changed the meaning now.

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3505387]: BUILD PASSED

@jantonguirao jantonguirao self-assigned this Dec 1, 2021
}
assert(free_blocks.size() == static_cast<size_t>(blocks_to_map));

stat_allocate_blocks(new_blocks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to line 691?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope - if unmapping throws for any reason, the blocks will just go away.

size_t free = 0, total = 0;
CUDA_CALL(cudaFree(nullptr)); // initialize the default context
CUDA_CALL(cuMemGetInfo(&free, &total));
size_t va_size = next_pow2(total * 3 / 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could add a comment here explaining the meaning of those sizes, why it throws or doesn't when we allocate this or that. not a must, just a thought

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3507145]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3507145]: BUILD PASSED

@mzient mzient merged commit edbc608 into NVIDIA:main Dec 1, 2021
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Fixed exception safety bug in cuda_vm_resource.
        bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable
        fix: reorganize the code and apply changes to internal structures after potential exception has been thrown
        fix: restore temporarily modified availability mask
* Removed a trait that marked cuda_vm_resource as merge-capable.
        bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption
        fix: remove the trait
* Remove debug dump upon every OOM in cuda_vm_resource
* Fix stats
        bug: missing stat_add_free in deallocate_impl

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
important-fix Fixes an important issue in the software or development environment.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants