Fix large mini-batch handling in parallel external source #3768

Merged
merged 4 commits into from
Apr 1, 2022

Conversation

stiepan
Member

@stiepan stiepan commented Mar 29, 2022

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

Parallel external source passes data from workers to the main process through shared memory buffers.
If the capacity of a buffer exceeds max_int, the worker process fails to serialize the minibatch metadata (which includes the capacity) when writing it into a C-like structure (Python's struct module).
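The failure mode can be reproduced in isolation. The sketch below assumes a 4-byte signed int field (format `"=i"`); the actual format string used in `_multiproc` is not shown in this description, and `try_pack` is a hypothetical helper introduced only for illustration.

```python
import struct

INT_MAX = 2**31 - 1  # largest value that fits a 4-byte signed int


def try_pack(fmt, value):
    """Return the packed bytes, or the error text if struct cannot pack the value."""
    try:
        return struct.pack(fmt, value)
    except struct.error as err:
        return str(err)


# A capacity at the limit still fits into the int field.
ok = try_pack("=i", INT_MAX)

# One byte more (a 2 GiB buffer) no longer fits; struct.error is raised.
overflow = try_pack("=i", INT_MAX + 1)

# The same value packs without issue as an unsigned long long ("Q").
widened = try_pack("=Q", INT_MAX + 1)
```

Here `ok` and `widened` are `bytes`, while `overflow` holds the error text reported by `struct`.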

This PR:

  1. Makes the serialization error message more verbose, so that it contains the values that the worker attempted to put in the struct.
  2. Changes size-related members from int to unsigned long long int.
  3. Adds a test that checks whether the shared queue class handles such values properly.
  4. Adds a test that uses 2 GB samples in parallel external source.
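Points 1 and 2 can be sketched together as follows. This is a minimal illustration, not the code from this PR: the field names, the `META_FORMAT` layout, and the `pack_meta` helper are all assumptions made for the example.

```python
import struct

# Hypothetical metadata layout: one int identifier followed by three
# size-related fields stored as unsigned long long ("Q", 8 bytes each).
META_FIELDS = ("chunk_id", "capacity", "offset", "size")
META_FORMAT = "=iQQQ"


def pack_meta(*values):
    """Pack metadata values, reporting the offending values on failure."""
    try:
        return struct.pack(META_FORMAT, *values)
    except struct.error as err:
        described = ", ".join(
            f"{name}={value!r}" for name, value in zip(META_FIELDS, values)
        )
        raise RuntimeError(
            f"Failed to serialize metadata ({described}): {err}"
        ) from err


# A capacity above 2**31 - 1 now serializes and round-trips.
packed = pack_meta(0, 2**31 + 5, 0, 2**31)
unpacked = struct.unpack(META_FORMAT, packed)
```

With the values included in the message, a failure in a worker process points directly at the field that did not fit, rather than only at the `struct` call.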

Additional information:

Affected modules and functionalities:

  1. _multiproc module and tests.

Key points relevant for the review:

  1. Is unsigned long long int more than needed here? Would a plain unsigned int be sufficient?
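For reference when weighing the two options, the ranges can be computed from the standard sizes that Python's struct module uses for `"I"` (unsigned int) and `"Q"` (unsigned long long). `max_unsigned` is a hypothetical helper written for this comparison.

```python
import struct


def max_unsigned(fmt_char):
    """Largest value representable by an unsigned struct format (standard size)."""
    size_in_bytes = struct.calcsize("=" + fmt_char)
    return 2 ** (8 * size_in_bytes) - 1


max_uint = max_unsigned("I")   # 4 bytes
max_ull = max_unsigned("Q")    # 8 bytes

# Extra bytes needed per size-related field when choosing "Q" over "I".
extra_bytes_per_field = struct.calcsize("=Q") - struct.calcsize("=I")
```

Under these assumptions, unsigned int would cap a buffer's capacity just below 4 GiB, while unsigned long long costs 4 additional bytes per field in the metadata struct.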

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2679

…for storing capacity

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@JanuszL JanuszL self-assigned this Mar 29, 2022
if not proc.exitcode:
task_queue.close()
proc.join()
try:
Contributor

Q: Does proc have no __enter__/__exit__, so that it could be used in a with statement?

Member Author

Not really for a plain proc, but good point: this looked like a good candidate for a context manager.
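One way such a context manager could look is sketched below. It is an illustration of the idea only, not the code that ended up in the test: `closing_worker` is a hypothetical name, and the helper only assumes that the queue has `close()` and the process has `join()`.

```python
import contextlib


@contextlib.contextmanager
def closing_worker(proc, task_queue):
    """Yield the worker process; on exit, close its task queue and join it.

    Hypothetical helper: cleanup runs even if the body raises.
    """
    try:
        yield proc
    finally:
        task_queue.close()
        proc.join()
```

A test body would then read `with closing_worker(proc, task_queue): ...`, keeping the cleanup in one place instead of repeating the close and join calls.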

@mzient mzient self-assigned this Mar 29, 2022
… large batch test

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@dali-automaton
Collaborator

CI MESSAGE: [4281330]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [4281330]: BUILD PASSED

@stiepan
Member Author

stiepan commented Mar 31, 2022

!build

@dali-automaton
Collaborator

CI MESSAGE: [4284870]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [4284870]: BUILD PASSED

@stiepan stiepan merged commit ea3dc40 into NVIDIA:main Apr 1, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Add better error messaging on c-struct serialization
* Use wider type for storing capacity
* Add test for handling large sample in parallel external source

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Add better error messaging on c-struct serialization
* Use wider type for storing capacity
* Add test for handling large sample in parallel external source

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>