Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RUBY_ASSERT when available #12216

Closed

Conversation

casperisfine
Copy link
Contributor

It should be present from Ruby 2.7 onwards.

The advantage over a raw assert is that it will trigger MRI's crash report on failure, making it much easier to debug at a glance.

For context I'm trying to debug an assertion failure we've been seeing on Ruby nigthly builds:

ruby: protobuf.c:404: ObjectCache_Add: Assertion `ObjectCache_Get(key) == val' failed.

@casperisfine casperisfine requested a review from a team as a code owner March 13, 2023 12:05
@casperisfine casperisfine requested review from esorot and removed request for a team March 13, 2023 12:05
It should be present from Ruby 2.7 onwards.

The advantage over a raw `assert` is that it will trigger MRI's
crash report on failure, making it much easier to debug at
a glance.

For context I'm trying to debug an assertion failure we've been seeing
on Ruby nigthly builds:

```
ruby: protobuf.c:404: ObjectCache_Add: Assertion `ObjectCache_Get(key) == val' failed.
```
@haberman haberman added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Mar 13, 2023
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Mar 13, 2023
@haberman
Copy link
Member

Looks like this PR triggers the same assertion failure you are referring to:

Assertion Failed: ruby/ext/google/protobuf_c/protobuf.c:406:ObjectCache_Add:ObjectCache_Get(key) == val

@casperisfine
Copy link
Contributor Author

Oh, that's very interesting. I was suspecting it was because I recently added write barriers to WeakMap, but if it also fail on 2.7, then that's not it.

On our CI it failed in a bout 3% of workers, so is suspect it's caused by GC triggering at a very specifc place, I may be able to repro with GC.stress = true.

@casperisfine
Copy link
Contributor Author

I was indeed able to reproduce on first try with:

diff --git a/conformance/conformance_ruby.rb b/conformance/conformance_ruby.rb
index 0fa43391d..bf9c00d15 100755
--- a/conformance/conformance_ruby.rb
+++ b/conformance/conformance_ruby.rb
@@ -31,7 +31,10 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 require 'conformance/conformance_pb'
+p [:here]
+GC.stress = true
 require 'google/protobuf/test_messages_proto3_pb'
+GC.stress = false
 require 'google/protobuf/test_messages_proto2_pb'
 
 $test_count = 0
diff --git a/ruby/ext/google/protobuf_c/protobuf.c b/ruby/ext/google/protobuf_c/protobuf.c
index bee8ac288..bc318366a 100644
--- a/ruby/ext/google/protobuf_c/protobuf.c
+++ b/ruby/ext/google/protobuf_c/protobuf.c
@@ -403,6 +403,9 @@ void ObjectCache_Add(const void *key, VALUE val) {
 #if USE_SECONDARY_MAP
   rb_mutex_unlock(secondary_map_mutex);
 #endif
+  if (ObjectCache_Get(key) != val) {
+      rb_bug("ObjectCache_Get(key) != val");
+  }
   PBRUBY_ASSERT(ObjectCache_Get(key) == val);
 }

@casperisfine
Copy link
Contributor Author

I have a good repro loop, but I'm having a hard time figuring out why it's happening.

Clearly either val isn't added to the WeakMap, or it's ejected right after, but that can't possibly be via the finalizer since the object is still alive.

What is certain is that it happens frequently with GC.stress.

This may be a bug in Ruby. I'll try to find some time to dig deeper with @peterzhu2118

@haberman
Copy link
Member

@casperisfine
Copy link
Contributor Author

I wonder if we might need an RB_GC_GUARD(val)

That was my initial thought, but no, val isn't GCed, it's either not added to the WeakMap, or evicted:

but that can't possibly be via the finalizer since the object is still alive.

I tried nonetheless just to be sure, and it's still failing.

At this stage I suspect it might be a bug in WeakMap, as it uses two st_tables: https://github.com/ruby/ruby/blob/ac65ce16e904695ba45888d3fba641d12caf733a/weakmap.c#L396-L397, so perhaps it fails if GC triggers in the middle.

I'll try to reproduce this in a standalone script.

@casperisfine
Copy link
Contributor Author

I'll try to reproduce this in a standalone script.

Seems like I can't repro, in isolation :/

@casperisfine
Copy link
Contributor Author

I found an unrelated crash though -_-

wm = ObjectSpace::WeakMap.new
GC.stress = true
GC.auto_compact = true
obj = Object.new
1000.times do |i|
  wm[i] = obj #  [BUG] Segmentation fault at 0x0000000000000003
  raise "wat" unless wm[i].equal?(obj)
end

@casperisfine
Copy link
Contributor Author

Ok, so I have a good news and a bad news.

The good new is that I have figured out the bug.

The bad news is that it's a bug in Ruby. Here's the reproduction:

wmap = ObjectSpace::WeakMap.new

a = "A"
b = "B"

wmap[1] = a
wmap[1] = b # the table entry with 1 is still in the list of entries to clear when `a` is GCed

a = nil

GC.start

p wmap[1] # Should be `"B"`, but is `nil`

The bug is here: https://github.com/ruby/ruby/blob/f613c18912f4bf5492a8dc02ab5e3661f4ae907c/weakmap.c#L193-L194

When an object is added in a WeakMap as a value, the weak map keeps a list of all keys associated with that object, e.g.

class WeakMap
  def []=(key, value)
    @hash[key] = value
    @reverse[value] << key
  end
end

However when we store another value for that given key, that reverse list isn't updated, so when the original value is GCed, it delete the new value too.

I think I can fix it upstream, but no idea how to make this protobuf code work on versions that won't receive a backported fix.

@casperisfine
Copy link
Contributor Author

Upstream bug report: https://bugs.ruby-lang.org/issues/19531

@haberman
Copy link
Member

Excellent work on the repro. Thanks so much for tracking this down, filing the upstream bug, and working on a fix!

As a side note, it would be so nice if WeakMap was capable of making only the values weak, but not the keys. Then we could remove the code that uses a secondary map.

Any chance this could be added to WeakMap upstream?

@casperisfine
Copy link
Contributor Author

Any chance this could be added to WeakMap upstream?

We discussed this last year: https://bugs.ruby-lang.org/issues/16038, WeakKeysMap is added in 3.3, we could ask for WeakValuesMap too, but ultimately that wouldn't simplify this project code base until support for Rubies older than 3.3 is dropped, given that this project still support 2.3, that would take a decade before you could cleanup anything.

At this stage I wonder if the simplest wouldn't be to implement you own WeakMap, which I don't think would be too hard, especially given that your keys are pointer types, so a simple st_table and finalizer could do.

@haberman
Copy link
Member

At this stage I wonder if the simplest wouldn't be to implement you own WeakMap, which I don't think would be too hard

I always figured that WeakMap had privileged access to the GC, and that its efficiency depended on this. But it looks like it is just defining finalizers for every map element.

In that case, I agree we could just define our own weak map that makes only the values weak. It would probably be more efficient, because we could use plain void* as the key instead of converting it to a VALUE. And we could dispense with the secondary map entirely.

@casperisfine
Copy link
Contributor Author

It would probably be more efficient, because we could use plain void* as the key instead of converting it to a VALUE

We could also potentially avoid the finalizer by clearing the map from the free callbacks. Not sure if worth it, I'd have to review a bit how many objects actually end up in that map. If it's all of them, it would be a nice improvements, as finalizers are a bit annoying, they pin objects, and basically disable the FREE_IMMEDIATELY flag.

@casperisfine
Copy link
Contributor Author

It's not high priority for me, but I'll see if I can find some time to try that around next week.

@haberman
Copy link
Member

We could also potentially avoid the finalizer by clearing the map from the free callbacks.

I was going to say that these objects don't currently have free callbacks. But I guess that RUBY_DEFAULT_FREE is still a free callback.

Is there any guarantee about how promptly the free callbacks will be called? If the free() was delayed, it could run after a different insert had already happened, and delete the map entry inappropriately.

@haberman
Copy link
Member

btw, an alternative to st_table would be to use upb's internal hash table: https://github.com/protocolbuffers/upb/blob/main/upb/hash/int_table.h

It could make us less dependent on the internals of Ruby, if that is a possible concern.

@casperisfine
Copy link
Contributor Author

I guess that RUBY_DEFAULT_FREE is still a free callback.

Yes. And even if they didn't have one, it wouldn't be a problem, we could just define one without any worry.

Is there any guarantee about how promptly the free callbacks will be called?

If RUBY_TYPED_FREE_IMMEDIATELY is set, and the object has no finalizer defined, then it is freed as soon as it's collected.

If either the flag isn't set or a finalizer is set, then the callback is delayed to the dedicated "sweep" phase.

But either way we should be able to handle "race conditions" here, by cheking the map entry still points the GCed object.

The problem is that the free callback only receive the pointer to the C struct, not the VALUE reference, so it may be tricky and not worth it.

It could make us less dependent on the internals of Ruby, if that is a possible concern.

At this stage, ruby/st.h is public API, tons of gems depends on it, it's not gonna go away any time soon.

But either way, this is all conjecture, I'll have to dig more into the gem to figure what would be the most practical (finalizer vs free callback, ruby st_table vs upb int_table).

@haberman
Copy link
Member

The problem is that the free callback only receive the pointer to the C struct, not the VALUE reference, so it may be tricky and not worth it.

The C struct contains the pointer to the upb object which is the cache key to the weak map, so I don't think this should be a problem.

@casperisfine
Copy link
Contributor Author

Ok, so I tried implementing a custom weakmap, however It appears it's not possible, as we'd need access to rb_objspace_garbage_object_p to know the reference is scheduled to be swept.

@@ -432,7 +432,7 @@ static VALUE RepeatedField_dup(VALUE _self) {
*/
VALUE RepeatedField_to_ary(VALUE _self) {
RepeatedField* self = ruby_to_RepeatedField(_self);
int size = upb_Array_Size(self->array);
int size = self->array ? upb_Array_Size(self->array) : 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I'll submit individually, with similar fixes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you get a chance to submit a separate PR for this? I didn't see one.

What was the root cause that led you to this fix? I ask because it looks very similar to what I've been debugging in #11968

There is a similar symptom; self->array ends up being NULL. But in that case it's happening in RepeatedField_each. I've isolated it to a race condition involving the object cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I was working on it at some point but never submitted it. I think at some point I assumed this case wasn't supposed to happen and was just a fallout of another bug, but I might have been wrong not sure.

I can submit a PR if you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this case is supposed to happen, and I do think it is the fallout of another bug. I think this is probably not the right fix.

But I am curious what triggered this bug for you. I'm wondering if it's a similar root cause to #11968 (a race condition on the object cache) or something else.

@casperisfine
Copy link
Contributor Author

As for the Ruby bug, it's fixed upstream but hasn't been backported yet unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants