-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pandaproxy: Fix a crash when publishing multiple protobof schema #3596
pandaproxy: Fix a crash when publishing multiple protobof schema #3596
Conversation
co_return co_await make_protobuf_schema_definition( | ||
*this, std::move(schema)); | ||
co_return co_await make_protobuf_schema_definition(*this, schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the undefined behavior this looks ok. unless I'm missing something, does this change fixing the issue point at the possibility that there is a different issue being masked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made all the CP.51 and CP.53 changes to eliminate me assuming this is a miscompile. I've probably missed some, but this appears to reliably fix the crash, for now. Very similar to the issue discussed here #3214 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can reliably create the crash:
- With the test cases.
- Manually running a curl 10 times.
- In the Materialize test suite.
This one change fixes them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically what happens is that it appears as though the schema
type()
is lost, the variant inside valid_schema_def
then assumes the first type, avro
, then a valid_protobuf
schema is effectively reinterpret_cast
as valid_avro
, and it all goes very wrong from there, usually deep in the memory allocator.
It's a Heisenbug though, it only occurs in release mode, and adding instrumentation often fixes it. It does reproduce in the debugger, but lots of stuff has been inlined, so it's hard to figure out what went wrong.
As per [cpp core guideline CP.53](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cp53-parameters-to-coroutines-should-not-be-passed-by-reference\) Signed-off-by: Ben Pope <ben@vectorized.io>
As per [cpp core guideline CP.51](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cp51-do-not-use-capturing-lambdas-that-are-coroutines) Signed-off-by: Ben Pope <ben@vectorized.io>
As per [cpp core guideline CP.51](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cp51-do-not-use-capturing-lambdas-that-are-coroutines) Signed-off-by: Ben Pope <ben@vectorized.io>
Signed-off-by: Ben Pope <ben@vectorized.io>
6633940
to
913f19d
Compare
@BenPope which test will trigger the bug? |
Yeah, pretty reliably. I usually just run a couple of iterations of this:
|
I built this PR in release mode and applied this patch nwatkins@gordon:~/src/redpanda/vbuild/release/clang$ git diff
diff --git a/src/v/pandaproxy/schema_registry/sharded_store.cc b/src/v/pandaproxy/schema_registry/sharded_store.cc
index ccb6df891..b1e85126e 100644
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -102,7 +102,7 @@ sharded_store::make_valid_schema(canonical_schema schema) {
case schema_type::avro:
co_return make_avro_schema_definition(schema.def().raw()()).value();
case schema_type::protobuf:
- co_return co_await make_protobuf_schema_definition(*this, schema);
+ co_return co_await make_protobuf_schema_definition(*this, std::move(schema));
case schema_type::json:
throw as_exception(invalid_schema_type(schema.type()));
} then I ran this a bunch of times but things seemed to be ok
is there more to get it to reproduce? |
No, that was pretty reliable for me. |
Backtrace:
Which is showing that it's destructing an Worthy of note is that the avro type is the first one in the variant inside |
I can't reproduce with clang-13. Only broken on release build with clang-12.0.1
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -105,9 +105,9 @@ sharded_store::make_valid_schema(canonical_schema schema) {
co_return co_await make_protobuf_schema_definition(
*this, std::move(schema));
case schema_type::json:
- throw as_exception(invalid_schema_type(schema.type()));
+ break;
}
- __builtin_unreachable();
+ throw as_exception(invalid_schema_type(schema.type()));
}
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -104,10 +104,9 @@ sharded_store::make_valid_schema(canonical_schema schema) {
case schema_type::protobuf:
co_return co_await make_protobuf_schema_definition(
*this, std::move(schema));
- case schema_type::json:
+ default:
throw as_exception(invalid_schema_type(schema.type()));
}
- __builtin_unreachable();
}
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -99,8 +99,9 @@ ss::future<> sharded_store::validate_schema(canonical_schema schema) {
ss::future<valid_schema>
sharded_store::make_valid_schema(canonical_schema schema) {
switch (schema.type()) {
- case schema_type::avro:
+ case schema_type::avro: {
co_return make_avro_schema_definition(schema.def().raw()()).value();
+ }
case schema_type::protobuf:
co_return co_await make_protobuf_schema_definition(
*this, std::move(schema));
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -102,8 +102,7 @@ sharded_store::make_valid_schema(canonical_schema schema) {
case schema_type::avro:
co_return make_avro_schema_definition(schema.def().raw()()).value();
case schema_type::protobuf:
- co_return co_await make_protobuf_schema_definition(
- *this, std::move(schema));
+ co_return co_await make_protobuf_schema_definition(*this, schema);
case schema_type::json:
throw as_exception(invalid_schema_type(schema.type()));
}
--- a/src/v/pandaproxy/schema_registry/sharded_store.cc
+++ b/src/v/pandaproxy/schema_registry/sharded_store.cc
@@ -96,7 +96,7 @@ ss::future<> sharded_store::validate_schema(canonical_schema schema) {
__builtin_unreachable();
}
-ss::future<valid_schema>
+__attribute__((optnone)) ss::future<valid_schema>
sharded_store::make_valid_schema(canonical_schema schema) {
switch (schema.type()) {
case schema_type::avro: |
I've failed to meaningfully reduce this. Not sure what to do next. |
The previous commits address Core CPP guideline CP.51 and CP.53, but there is still a crash when producing schema in quick succession. There seems to be a miscompilation in sharded_store::make_valid_schema. See redpanda-data#3596 for details. Copying the schema is a workaround. Fix redpanda-data#3559 Signed-off-by: Ben Pope <ben@vectorized.io>
913f19d
to
5b5b9cf
Compare
The previous commits address Core CPP guideline CP.51 and CP.53, but there is still a crash when producing schema in quick succession. There seems to be a miscompilation in sharded_store::make_valid_schema. See redpanda-data#3596 for details. Copying the schema is a workaround. Fix redpanda-data#3559 Signed-off-by: Ben Pope <ben@vectorized.io> (cherry picked from commit 5b5b9cf)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ___ ___ ┃
┃ / /\ ___ /__/\ ┃
┃ / /:/_ / /\ | |::\ ┃
┃ ___ ___ / /:/ /\ / /:/ | |:|:\ ┃
┃ /__/\ / /\ / /:/_/::\ / /:/ __|__|:|\:\ ┃
┃ \ \:\ / /:/ /__/:/__\/\:\ / /::\ /__/::::| \:\ ┃
┃ \ \:\ /:/ \ \:\ /~~/:/ /__/:/\:\ \ \:\~~\__\/ ┃
┃ \ \:\/:/ \ \:\ /:/ \__\/ \:\ \ \:\ ┃
┃ \ \::/ \ \:\/:/ \ \:\ \ \:\ ┃
┃ \__\/ \ \::/ \__\/ \ \:\ ┃
┃ \__\/ \__\/ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
shard_for(sub), | ||
_smp_opts, | ||
[marker, sub{std::move(sub)}, permanent](store& s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use after move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change this in a separate PR
Using a belt-and-braces approach to convince clang not to miscompile this function. * Introduce an artificial scope for co_await blocks * Avoid `__builtin_unreachable()` The backtrace on arm is: ``` void std::__1::__libcpp_operator_delete<void*>(void*) at /vectorized/llvm/bin/../include/c++/v1/new:245 (inlined by) void std::__1::__do_deallocate_handle_size<>(void*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:269 (inlined by) std::__1::__libcpp_deallocate(void*, unsigned long, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:285 (inlined by) std::__1::allocator<char>::deallocate(char*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/memory:874 (inlined by) std::__1::allocator_traits<std::__1::allocator<char> >::deallocate(std::__1::allocator<char>&, char*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/__memory/allocator_traits.h:280 (inlined by) ~basic_string at /vectorized/llvm/bin/../include/c++/v1/string:2237 (inlined by) ~error_info at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-058d4ead6b38020f3-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/pandaproxy/schema_registry/errors.h:25 (inlined by) ~basic_result_storage at /vectorized/include/boost/outcome/detail/basic_result_storage.hpp:113 (inlined by) pandaproxy::schema_registry::sharded_store::make_valid_schema(pandaproxy::schema_registry::typed_schema<pandaproxy::schema_registry::canonical_schema_defnition_tag>) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-058d4ead6b38020f3-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/pandaproxy/schema_registry/sharded_store.cc:110 ``` Which implies it's destructing the frame for the avro coroutine, not the protobuf one. See redpanda-data#3596 for details Signed-off-by: Ben Pope <ben@vectorized.io>
The previous commits address Core CPP guideline CP.51 and CP.53, but there is still a crash when producing schema in quick succession. There seems to be a miscompilation in sharded_store::make_valid_schema. See redpanda-data#3596 for details. Copying the schema is a workaround. Fix redpanda-data#3559 Signed-off-by: Ben Pope <ben@vectorized.io>
Using a belt-and-braces approach to convince clang not to miscompile this function. * Introduce an artificial scope for co_await blocks * Avoid `__builtin_unreachable()` The backtrace on arm is: ``` void std::__1::__libcpp_operator_delete<void*>(void*) at /vectorized/llvm/bin/../include/c++/v1/new:245 (inlined by) void std::__1::__do_deallocate_handle_size<>(void*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:269 (inlined by) std::__1::__libcpp_deallocate(void*, unsigned long, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:285 (inlined by) std::__1::allocator<char>::deallocate(char*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/memory:874 (inlined by) std::__1::allocator_traits<std::__1::allocator<char> >::deallocate(std::__1::allocator<char>&, char*, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/__memory/allocator_traits.h:280 (inlined by) ~basic_string at /vectorized/llvm/bin/../include/c++/v1/string:2237 (inlined by) ~error_info at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-058d4ead6b38020f3-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/pandaproxy/schema_registry/errors.h:25 (inlined by) ~basic_result_storage at /vectorized/include/boost/outcome/detail/basic_result_storage.hpp:113 (inlined by) pandaproxy::schema_registry::sharded_store::make_valid_schema(pandaproxy::schema_registry::typed_schema<pandaproxy::schema_registry::canonical_schema_defnition_tag>) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-058d4ead6b38020f3-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/pandaproxy/schema_registry/sharded_store.cc:110 ``` Which implies it's destructing the frame for the avro coroutine, not the protobuf one. See redpanda-data#3596 for details Signed-off-by: Ben Pope <ben@vectorized.io> (cherry picked from commit 231febc)
Cover letter
Fix #3559
This is a logical continuation of work done in #3214 and addresses a similar bug.
Address some CPP Core guidelines:
Fix the bug:
schema
to prevent a crash.Notes for reviewer
CP.51 was fixed by copying captures to the coroutine frame.
It's probably worth pointing out that this trick works because seastar awaitable are
std::suspend_never initial_suspend() noexcept
.And for posterity: But what if I told you there was an easier way, where you can have your capturing lambda be a coroutine?
Changes in force-push
std::move
s to thesub
captures.Release notes
Improvements