Synchronize CUDA stream once in operator benchmark #3525

szkarpinski · 2021-11-23T21:40:56Z

Description

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactoring (Redesign of existing code that doesn't affect functionality)
Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

The CUDA stream was synchronized after each iteration of an operator benchmark, which added overhead to the measurements, especially for small data and small batch sizes. In a real pipeline the synchronization would happen only at the end rather than after each operator, so this overhead was benchmark-specific and would not occur in practice.

This overhead was significant when the operator executed quickly: the throughput of Copy, measured on a Titan V with copy_bench.cc for batch_size=1 and 3 MiB images, dropped by 25% when the stream was synchronized after each iteration. The main source of the overhead was that the synchronization made it impossible to schedule a copy with cudaMemcpyAsync while another copy was still in progress.
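The effect described above can be illustrated with a small timeline model. This is only a sketch and not DALI code: the function name `TotalTimeUs`, the parameters `host_us` and `device_us`, and the assumption that every iteration has a fixed host-side scheduling cost and a fixed device-side execution cost are all hypothetical simplifications.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative timeline model (hypothetical, not taken from DALI).
// Each iteration costs `host_us` on the CPU to schedule the operation and
// `device_us` on the GPU to execute it. Returns the time at which the last
// operation has finished, i.e. when a final synchronization would return.
double TotalTimeUs(int iterations, double host_us, double device_us,
                   bool sync_each_iteration) {
  assert(iterations >= 0 && host_us >= 0.0 && device_us >= 0.0);
  double host_clock = 0.0;   // time at which the host is free to schedule more work
  double device_free = 0.0;  // time at which the device finishes all queued work
  for (int i = 0; i < iterations; ++i) {
    host_clock += host_us;  // the host schedules one operation
    // The device starts the operation once it has been scheduled and all
    // earlier work on the stream is done.
    device_free = std::max(device_free, host_clock) + device_us;
    if (sync_each_iteration) {
      host_clock = device_free;  // the host blocks until the device is idle
    }
  }
  return device_free;
}
```

With per-iteration synchronization the model yields `iterations * (host_us + device_us)`, because scheduling never overlaps with execution. With a single synchronization and `device_us >= host_us` it yields `host_us + iterations * device_us`. For purely illustrative values (not measurements) of `host_us = 1`, `device_us = 3` and 100 iterations, this gives 400 versus 301.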

This PR moves the stream synchronization out of the loop, synchronizing the stream only once in a benchmark, so that the measured operator performance is closer to the actual one.
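The change in loop structure can be sketched with a host-only mock. `MockStream`, `RunSyncEachIteration` and `RunSyncOnce` are hypothetical names used for illustration only. The real benchmark runs an operator and synchronizes a CUDA stream, neither of which is reproduced here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a CUDA stream that only counts events.
struct MockStream {
  int pending = 0;      // operations enqueued since the last synchronization
  int max_pending = 0;  // largest number of operations queued between two syncs
  int sync_calls = 0;   // how many times the host blocked on the stream

  // Models an asynchronous launch: returns immediately after queuing the work.
  void Enqueue() {
    ++pending;
    max_pending = std::max(max_pending, pending);
  }

  // Models a stream synchronization: waits for everything queued so far.
  void Synchronize() {
    pending = 0;
    ++sync_calls;
  }
};

// Loop shape before the change: the host blocks after every iteration.
MockStream RunSyncEachIteration(int iterations) {
  assert(iterations >= 0);
  MockStream stream;
  for (int i = 0; i < iterations; ++i) {
    stream.Enqueue();
    stream.Synchronize();
  }
  return stream;
}

// Loop shape after the change: the host blocks once, after the loop.
MockStream RunSyncOnce(int iterations) {
  assert(iterations >= 0);
  MockStream stream;
  for (int i = 0; i < iterations; ++i) {
    stream.Enqueue();
  }
  stream.Synchronize();
  return stream;
}
```

In this mock, `max_pending` is an upper bound on queue depth, since a real device drains the queue concurrently. It stays at 1 when synchronizing in every iteration, which corresponds to never being able to schedule the next operation while the previous one is still running.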

Additional information

Key points relevant for the review:

As synchronizing in each loop iteration introduces a benchmark-specific overhead, I consider it a bug that should be fixed. I am assuming that:

Google Benchmark measures and reports the total execution time of all iterations, so spending most of the time in the last iteration does not impact the results
There are no use cases in which synchronizing the stream after each iteration is desired

If any of the above is false, making synchronization behaviour configurable with some optional parameter to RunGPU would be a better solution.
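One way such an optional parameter could behave is sketched below. The helper names `ShouldSyncAfter` and `CountSyncs`, and the convention that a non-positive value disables in-loop synchronization, are assumptions made for illustration. They are not necessarily the semantics of the parameter added to RunGPU in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: synchronize after every `sync_each_n`-th iteration.
// A value <= 0 means "never synchronize inside the loop" (assumed convention).
bool ShouldSyncAfter(int64_t completed_iterations, int64_t sync_each_n) {
  assert(completed_iterations > 0);
  return sync_each_n > 0 && completed_iterations % sync_each_n == 0;
}

// Counts the synchronizations a benchmark loop with this policy would perform,
// including one final synchronization after the loop.
int64_t CountSyncs(int64_t iterations, int64_t sync_each_n) {
  assert(iterations >= 0);
  int64_t syncs = 0;
  for (int64_t i = 1; i <= iterations; ++i) {
    // The operator would run here.
    if (ShouldSyncAfter(i, sync_each_n)) {
      ++syncs;  // a stream synchronization would happen here
    }
  }
  return syncs + 1;  // final synchronization after the loop
}
```

Under these assumptions, `sync_each_n = 1` reproduces the old per-iteration behaviour, while a non-positive value leaves only the single synchronization after the loop.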

Checklist

Tests

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

CUDA stream was synchronized after each iteration in operator benchmark, which introduced an error to the measurements, especially for small data and small batch sizes. In a real pipeline the synchronization would not happen after each operation.

This commit moves the synchronization out of the loop, synchronizing the stream only once in a benchmark.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>

dali/benchmark/operator_bench.h

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>

jantonguirao · 2021-12-07T11:56:12Z

!build

dali-automaton · 2021-12-07T12:01:29Z

CI MESSAGE: [3544555]: BUILD STARTED

dali-automaton · 2021-12-07T13:08:58Z

CI MESSAGE: [3544555]: BUILD PASSED

* Synchronize CUDA stream once in operator benchmark

CUDA stream was synchronized after each iteration in operator benchmark, which introduced an error to the measurements, especially for small data and small batch sizes. In a real pipeline the synchronization would not happen after each operation.

This commit moves the synchronization out of the loop, synchronizing the stream only once in a benchmark.

Added sync_each_n parameter.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>

JanuszL assigned JanuszL and unassigned JanuszL Nov 23, 2021

JanuszL reviewed Nov 23, 2021

dali/benchmark/operator_bench.h

jantonguirao assigned szalpal, awolant and jantonguirao Nov 24, 2021

szalpal removed their assignment Nov 30, 2021

Add sync_each_n parameter

ce2e317

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>

jantonguirao approved these changes Dec 2, 2021

awolant approved these changes Dec 6, 2021

jantonguirao merged commit 8970881 into NVIDIA:main Dec 9, 2021

szkarpinski mentioned this pull request Jan 22, 2022

Fix synchronization bug in operator benchmark #3638

Merged
