-
Notifications
You must be signed in to change notification settings - Fork 615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA VM resource bugfixes #3545
Conversation
CI MESSAGE: [3499776]: BUILD STARTED |
CI MESSAGE: [3499776]: BUILD FAILED |
print(std::cerr, "Could not allocate physical storage of size ", size, " on device ", | ||
device_ordinal_); | ||
dbg_dump(std::cerr); | ||
free_va_.put(va, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still should provide a message like - Attempted to allocate XXX, currently allocated YY, available ZZZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to change the original exception type - now I'd have to rethrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can adjust at least adjust dbg_dump(std::cerr);
to print shorter and easier to digest message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for CUDABadAlloc - others are rethrown directly.
assert(static_cast<int>(free_blocks.size()) == blocks_to_map); | ||
free_blocks.resize(blocks_to_map); | ||
for (int i = to_unmap.size(); i < blocks_to_map; i++) { | ||
// This can throw, but we should leave the object in a consistent state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This can throw, but we should leave the object in a consistent state | |
// This can throw, but should leave the object in a consistent state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant:
This can throw, but (this) should leave the object in a consistent state ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that is why we
sounds strange here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter - both are fine and it's just an internal comment - not even doxygen.
res.deallocate(p1, size1); | ||
auto ®ion = res.va_regions_[0]; | ||
EXPECT_EQ(region.available_blocks, 4); | ||
EXPECT_EQ(region.mapped.find(false), 6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EXPECT_EQ(region.mapped.find(false), 6); | |
EXPECT_EQ(region.mapped.find(true), 0); | |
EXPECT_EQ(region.mapped.find(false), 6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Can you describe in the PR the bugs fixed and the solution? |
* Don't mark cuda_vm_resource as merge-capable. Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
9f81351
to
8a476a5
Compare
CI MESSAGE: [3501860]: BUILD STARTED |
Described in the PR message. |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [3501927]: BUILD STARTED |
CI MESSAGE: [3501860]: BUILD FAILED |
CI MESSAGE: [3501927]: BUILD FAILED |
CI MESSAGE: [3501860]: BUILD PASSED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [3505387]: BUILD STARTED |
char *ptr = r.block_ptr<char>(block_idx); | ||
free_mapped_.get_specific_block(ptr, block_size_); | ||
stat_take_free(block_size_); | ||
out.push_back({ &r, block_idx }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out
is a template - it changed the meaning now.
CI MESSAGE: [3505387]: BUILD PASSED |
} | ||
assert(free_blocks.size() == static_cast<size_t>(blocks_to_map)); | ||
|
||
stat_allocate_blocks(new_blocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to line 691?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope - if unmapping
throws for any reason, the blocks will just go away.
size_t free = 0, total = 0; | ||
CUDA_CALL(cudaFree(nullptr)); // initialize the default context | ||
CUDA_CALL(cuMemGetInfo(&free, &total)); | ||
size_t va_size = next_pow2(total * 3 / 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you could add a comment here explaining the meaning of those sizes, why it throws or doesn't when we allocate this or that. not a must, just a thought
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [3507145]: BUILD STARTED |
CI MESSAGE: [3507145]: BUILD PASSED |
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
* Fixed exception safety bug in cuda_vm_resource. bug: when map_storage threw, free_va was already allocated from and blocks were marked as unavailable fix: reorganize the code and apply changes to internal structures after potential exception has been thrown fix: restore temporarily modified availability mask * Removed a trait that marked cuda_vm_resource as merge-capable. bug: async_pool attempted to free a buffer in parts, but cuda_vm_resource enlarge these to meet internal alignment, causing corruption fix: remove the trait * Remove debug dump upon every OOM in cuda_vm_resource * Fix stats bug: missing stat_add_free in deallocate_impl Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz mzient@gmail.com
Description
What happened in this PR
Additional information
Affected modules and functionalities:
CUDA VM resource
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-2097, DALI-2450