Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CPU hang on multithreaded benchmarks overriding the memory manager #1849

Open
jlecavelier opened this issue Sep 6, 2024 · 2 comments

Comments

@jlecavelier
Copy link

Describe the bug
Consider a multithreaded benchmark scenario which overrides the memory manager.
In this scenario we want all the benchmark threads to synchronize at a given point in time and we use benchmark::State::threads() to fetch the number of threads that need to be synced.
This leads to a CPU hang because when the memory manager is overriden, BenchmarkRunner::DoOneRepetition() runs a final pass to gather memory statistics on a single thread while recycling a fixture that was designed for multithreaded execution.

Here is a minimal repro scenario:

#include <stdio.h>
#include <benchmark/benchmark.h>

#include <condition_variable>
#include <mutex>
#include <thread>

// Set this to 0 to remove the hang
#define MEMORY_MANAGER_OVERRIDE 1

/*
Minimal repro:
When performing a benchmark and overriding the memory manager, state.threads() becomes unreliable;
This happens because a memory benchmark test will be ran on a single thread, recycling a fixture that was created for a multi-threaded scenario
*/

using cscoped_lock = std::unique_lock<std::mutex>;

class memory_manager final : public benchmark::MemoryManager {
public:
	memory_manager() = default;
	void Start() override final {};
	void Stop(Result&) override final {};
};

class benchmark_fixture : public benchmark::Fixture {
	std::mutex m_mutex;
	std::condition_variable m_cv;
	uint32_t m_num_threads = 0;

private:
	void SetUp(const benchmark::State& _state) override final;	// Waits on every thread to enter before calling to OnBenchmarkSetup
	void TearDown(const benchmark::State& _state) override final;	// Waits on every thread to enter before calling to OnBenchmarkTearDown

private:
	void OnBenchmarkSetup() {};
	void OnBenchmarkTearDown() {};

public:
	virtual ~benchmark_fixture() = default;
};

void benchmark_fixture::SetUp(const benchmark::State& _state) {
	const uint32_t thread_count = _state.threads();

	cscoped_lock lock(m_mutex);
	++m_num_threads;

	const bool startup = m_num_threads == thread_count;
	if (startup) {
		OnBenchmarkSetup();	// The last thread to enter will perform the benchmark setup
		m_cv.notify_all();
	} else {
		m_cv.wait(lock);
	}
}

void benchmark_fixture::TearDown(const benchmark::State& _state) {
	const uint32_t thread_count = _state.threads();

	cscoped_lock lock(m_mutex);
	--m_num_threads;
	const bool teardown = m_num_threads == 0;
	if (teardown) {
		OnBenchmarkTearDown(); // The last thread to enter will perform the benchmark teardown
		m_cv.notify_all();
	} else {
		m_cv.wait(lock);
	}
}

BENCHMARK_DEFINE_F(benchmark_fixture, run)
(benchmark::State& _state) {
	constexpr uint32_t wait_time[] = { 16, 14, 12, 10, 8, 6, 4, 2, 0 };

	// Wait an arbitrary period of time to simulate work
	for (auto _ : _state) {
		const uint32_t thread_index = _state.thread_index();
		const uint32_t wait_time_ms = wait_time[thread_index / 2];
		std::this_thread::sleep_for(std::chrono::milliseconds(wait_time_ms));
	}
}
BENCHMARK_REGISTER_F(benchmark_fixture, run)->Iterations(4)->ThreadRange(1, 16);

int main(int argc, char** argv) {
	char arg0_default[] = "benchmark";
	char* args_default = arg0_default;
	if (!argv) {
		argc = 1;
		argv = &args_default;
	}
	::benchmark::Initialize(&argc, argv);
	if (::benchmark::ReportUnrecognizedArguments(argc, argv)) {
		return 1;
	}

#if MEMORY_MANAGER_OVERRIDE
	memory_manager mem_mgr;
	::benchmark::RegisterMemoryManager(&mem_mgr);
#endif

	::benchmark::RunSpecifiedBenchmarks();
	::benchmark::Shutdown();
	return 0;
}

Why are memory statistics gathered in a multithreaded scenario not following the same path as regular fixture runs?
Why are they recycling a fixture that was built to run in a multithreaded fashion to perform a single threaded test?

@dmah42
Copy link
Member

dmah42 commented Sep 9, 2024

you shouldn't need a fixture for this i believe, as the section before the benchmarked loop is only run on thread 0: https://google.github.io/benchmark/user_guide.html#multithreaded-benchmarks "it is guaranteed that none of the threads will start until all have reached the start of the benchmark loop, and all will have finished before any thread exits the benchmark loop".

i think you can hugely simplify your code.

instead of answering your questions about why the library works the way that it does, what exactly are you trying to do?

@jlecavelier
Copy link
Author

jlecavelier commented Sep 9, 2024

Yes, you are correct, the code to achieve this very specific behavior could be much more simple. This code's purpose is to provide a minimal repro of an API issue. This issue will be encountered in any scenario where you might need to rely on benchmark::State::threads to perform a sync of all of your threads at any point in time in you benchmark. Let me provide more context with a different hypothetical scenario.
Imagine that I am making a rendering framework and I want to profile some tasks that happen before and after I kick off a frame ( for instance: recording command buffers, creating view descriptors, etc, etc. ). But I do not want to profile the frame kick-off itself.
To do that I'm going to perform the work I want to profile and measure it and then I will synchronize my threads and pause the timing of my test while my frame is being processed and until I want profiling to resume.
In this scenario I will encounter the same issue as in the minimal example because benchmark::State::threads is simply not behaving as described in the documentation.

Now, regardless of any scenario, wouldn't you agree that benchmark::State::threads not behaving as documented and returning an incorrect value (because that is what is is doing) is a flaw here?
Wouldn't you agree that it is reasonable for someone who registers a benchmark to run in a multithreaded fashion, explicitly overriding the memory manager, to expect gathering of memory statistics happening as they set it up, in a multi threaded fashion, and not in serial?

Thank you for your time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants