add/remove local ref on ObjectRef construction/finalize, scaffolding for testing reference counts #126

kleinschmidt · 2023-09-18T17:18:40Z

Low level JLL: wraps

CoreWorker::AddLocalReference
CoreWorker::RemoveLocalReference
CoreWorker::GetAllReferenceCounts

The last one returns a std::unordered_map which requires a custom wrapper for the specific type params (based on the resource request map wrap).

High level Ray API:

adds local reference on construction of ObjectRef (can be disabled by flag)
skips adding local reference when ray core worker initializes local ref to be 1 (i.e., Put and SubmitTask)
removes local reference in finalizer (cannot be disabled)
handles deepcopy appropriately (routes through constructor in order to increment ref count and install finalizer)
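
The lifecycle in the list above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyRefCounter` and `ToyObjectRef` are hypothetical names, not Ray or Ray.jl types, and a C++ destructor stands in for the Julia finalizer. It shows a local ref added on construction (unless disabled), always removed on destruction, and a copy that routes through the constructor so it gets its own count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the core worker's local reference counter.
class ToyRefCounter {
 public:
  void AddLocalReference(const std::string &id) { ++counts_[id]; }

  void RemoveLocalReference(const std::string &id) {
    auto it = counts_.find(id);
    if (it == counts_.end()) return;  // unknown ID: nothing to remove
    if (--(it->second) == 0) counts_.erase(it);
  }

  std::size_t LocalCount(const std::string &id) const {
    auto it = counts_.find(id);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::unordered_map<std::string, std::size_t> counts_;
};

// Hypothetical handle: the destructor plays the role of the finalizer.
class ToyObjectRef {
 public:
  ToyObjectRef(ToyRefCounter &counter, std::string id_hex, bool add_local_ref = true)
      : counter_(counter), id_hex_(std::move(id_hex)) {
    if (add_local_ref) counter_.AddLocalReference(id_hex_);
  }

  // Copying routes through the normal constructor, so the copy
  // increments the count and gets its own cleanup.
  ToyObjectRef(const ToyObjectRef &other)
      : ToyObjectRef(other.counter_, other.id_hex_, /*add_local_ref=*/true) {}

  ToyObjectRef &operator=(const ToyObjectRef &) = delete;

  // Removal is unconditional, matching "cannot be disabled" above.
  ~ToyObjectRef() { counter_.RemoveLocalReference(id_hex_); }

 private:
  ToyRefCounter &counter_;
  std::string id_hex_;
};
```

Constructing a `ToyObjectRef`, copying it, and letting both go out of scope should take the count for that ID from 1 to 2 and back down to 0.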

In order to get the high-level support working without too much pain, I decided to change the stored ray_jll.ObjectID (points to C++ managed memory) to the String-formatted hex ID, and overloaded getproperty to return a new instance of the ObjectID when it's needed (for passing off to Ray C++ code). I'm not totally happy with this; an alternative would be to keep the oid field and change the handling in the deep copy, but I'm a bit wary about multiple-deallocation/segfaults on finalization...

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1205510215892649

codecov · 2023-09-18T17:24:51Z

Codecov Report

Merging #126 (cef9f32) into main (9a12fed) will increase coverage by 0.40%.
Report is 2 commits behind head on main.
The diff coverage is 97.56%.

❗ Current head cef9f32 differs from pull request most recent head fe6312c. Consider uploading reports for the commit fe6312c to get more accurate results

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   95.37%   95.77%   +0.40%     
==========================================
  Files           8        9       +1     
  Lines         389      426      +37     
==========================================
+ Hits          371      408      +37     
  Misses         18       18

Flag	Coverage Δ
Ray.jl	`95.77% <97.56%> (+0.40%)`	⬆️

Files Changed	Coverage Δ
src/object_ref.jl	`90.00% <96.96%> (+4.81%)`	⬆️
src/object_store.jl	`94.28% <100.00%> (+0.95%)`	⬆️
src/runtime.jl	`97.48% <100.00%> (-0.07%)`	⬇️

... and 1 file with indirect coverage changes

kleinschmidt · 2023-09-18T18:02:53Z

@omus this isn't quite ready for review yet

omus

Took a quick look

omus · 2023-09-18T18:02:09Z

src/object_ref.jl

+        if add_local_ref
+            worker = ray_jll.GetCoreWorker()
+            ray_jll.AddLocalReference(worker, oid)
+        end


Why not just add the reference after the constructor?

omus · 2023-09-18T18:03:49Z

src/object_ref.jl

+        return finalizer(objref) do objref
+            # putting finalizer behind `@async` may not be necessary since docs
+            # suggest that you should `ccall` IO functions.  But doing it this
+            # way allows us to do things like debug logging...
+            errormonitor(@async finalize_object_ref(objref))
+            return nothing
+        end


Do you want to conditionally add the finalizer based upon add_local_ref? Also, since you can add finalizers outside of the constructor it may be nicer to separate reference tracking hooks from creating ObjectRef instances.

(gonna respond to both questions here since they're good ones and related IMO)

I think the only reason to not call AddLocalReference during construction is when you get the ID from a core worker operation that also increments the local ref; however, in that case, you still want to decrement the ref when the instance is GCed IIUC.

for instance, the object IDs created during task submission are initialized with ref count 1, but the doc string says that the frontend is still responsible for decrementing them when they go out of scope in the application:

/// Add a task that is pending execution.
///
/// The local ref count for all return refs (excluding actor creation tasks)
/// will be initialized to 1 so that the ref is considered in scope before
/// returning to the language frontend. The caller is responsible for
/// decrementing the ref count once the frontend ref has gone out of scope.
///
/// \param[in] caller_address The rpc address of the calling task.
/// \param[in] spec The spec of the pending task.
/// \param[in] max_retries Number of times this task may be retried
/// on failure.
/// \return ObjectRefs returned by this task.
std::vector<rpc::ObjectReference> AddPendingTask(const rpc::Address &caller_address,
                                                 const TaskSpecification &spec,
                                                 const std::string &call_site,
                                                 int max_retries = 0);

(it's a bit unfortunate that the CoreWorker::SubmitTask docstring does not mention this; you have to dig a bit into the task manager)

Another case: CoreWorker::Put calls AddOwnedObject and sets the add_local_ref arg to true:

Status CoreWorker::Put(const RayObject &object,
                       const std::vector<ObjectID> &contained_object_ids,
                       ObjectID *object_id) {
  *object_id = ObjectID::FromIndex(worker_context_.GetCurrentInternalTaskId(),
                                   worker_context_.GetNextPutIndex());
  reference_counter_->AddOwnedObject(*object_id,
                                     contained_object_ids,
                                     rpc_address_,
                                     CurrentCallSite(),
                                     object.GetSize(),
                                     /*is_reconstructable=*/false,
                                     /*add_local_ref=*/true,
                                     NodeID::FromBinary(rpc_address_.raylet_id()));

https://github.com/beacon-biosignals/ray/blob/4ceb62daaad05124713ff9d94ffbdad35ee19f86/src/ray/core_worker/core_worker.cc#L1110-L1122

/// Add an object that we own. The object may depend on other objects.
/// Dependencies for each ObjectID must be set at most once. The local
/// reference count for the ObjectID is set to zero, which assumes that an
/// ObjectID for it will be created in the language frontend after this call.
///
/// TODO(swang): We could avoid copying the owner_address since
/// we are the owner, but it is easier to store a copy for now, since the
/// owner ID will change for workers executing normal tasks and it is
/// possible to have leftover references after a task has finished.
///
/// \param[in] object_id The ID of the object that we own.
/// \param[in] contained_ids ObjectIDs that are contained in the object's value.
/// As long as the object_id is in scope, the inner objects should not be GC'ed.
/// \param[in] owner_address The address of the object's owner.
/// \param[in] call_site Description of the call site where the reference was created.
/// \param[in] object_size Object size if known, otherwise -1;
/// \param[in] is_reconstructable Whether the object can be reconstructed
/// through lineage re-execution.
/// \param[in] add_local_ref Whether to initialize the local ref count to 1.
/// This is used to ensure that the ref is considered in scope before the
/// corresponding ObjectRef has been returned to the language frontend.
/// \param[in] pinned_at_raylet_id The primary location for the object, if it
/// is already known. This is only used for ray.put calls.
void AddOwnedObject(const ObjectID &object_id,
                    const std::vector<ObjectID> &contained_ids,
                    const rpc::Address &owner_address,
                    const std::string &call_site,
                    const int64_t object_size,
                    bool is_reconstructable,
                    bool add_local_ref,
                    const absl::optional<NodeID> &pinned_at_raylet_id =
                        absl::optional<NodeID>()) LOCKS_EXCLUDED(mutex_);

https://github.com/beacon-biosignals/ray/blob/4ceb62daaad05124713ff9d94ffbdad35ee19f86/src/ray/core_worker/reference_count.h#L160-L163

in both those cases, we need the finalizer to remove the local ref even though we don't manually add the local ref when that object is created. I think that, in general, every instance of ObjectRef needs to "clean up" after itself like this, but not all of them need to add a local reference on initialization. So, should we move the add local ref out of the constructor? I don't think so, because we don't necessarily control every code path by which an ObjectRef might be created (i.e., copy), so for safety's sake I think it's best to default to always adding a local ref on construction.
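
A minimal sketch of the two cases, assuming a toy counter rather than Ray's real one: `ToyPut`, `ToyRef`, `Counts`, and `LocalCount` are hypothetical names. `ToyPut` models a core-worker call that pre-initializes the local count to 1; the handle then skips the add on construction but still removes on destruction. Passing `add_local_ref=true` in that situation leaves one count that is never released.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

using Counts = std::unordered_map<std::string, std::size_t>;

// Hypothetical model of a call like Put or task submission:
// the local ref count starts at 1 before the frontend sees the ID.
inline std::string ToyPut(Counts &counts, const std::string &id_hex) {
  counts[id_hex] = 1;
  return id_hex;
}

inline std::size_t LocalCount(const Counts &counts, const std::string &id_hex) {
  auto it = counts.find(id_hex);
  return it == counts.end() ? 0 : it->second;
}

// Hypothetical handle; the destructor stands in for the finalizer.
struct ToyRef {
  Counts &counts;
  std::string id_hex;

  ToyRef(Counts &c, std::string id, bool add_local_ref)
      : counts(c), id_hex(std::move(id)) {
    if (add_local_ref) ++counts[id_hex];
  }
  ToyRef(const ToyRef &) = delete;
  ToyRef &operator=(const ToyRef &) = delete;

  // Always decrement, whether or not this handle did the increment.
  ~ToyRef() {
    auto it = counts.find(id_hex);
    if (it != counts.end() && --(it->second) == 0) counts.erase(it);
  }
};
```

With `add_local_ref=false` the count for a `ToyPut` ID goes 1 then 0 across the handle's lifetime; with `add_local_ref=true` it goes 2 then 1, which in this toy model is the leak the flag is meant to avoid.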

kleinschmidt · 2023-09-19T22:28:07Z

Something's causing the eval stuff to segfault, will look into it tomorrow.

omus · 2023-09-20T20:12:24Z

ray_julia_jll/deps/wrapper.cc

+        .method("ObjectIDFromNil", []() {
+            auto id = ObjectID::Nil();
+            ObjectID id_deref = id;
+            return id_deref;
+        })


Should add a test for this

what's going on here exactly?

ObjectID::Nil returns a ref to an ObjectID (unlike all the other FromX functions). That means that when comparing two ObjectIDs we need to handle BOTH values AND refs, and that was even more annoying than I expected (was getting weird method errors even using @cxxdereference because of slightly less generic fallbacks generated by CxxWrap itself). So I just decided to punt here and return a value even though it's a bit less efficient.
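
The ref-versus-value distinction can be sketched in isolation. `ToyID` and `NilByValue` are hypothetical, and the byte pattern is arbitrary; the point is only that an accessor returning a reference hands out one shared instance, while a wrapper shaped like the lambda in the diff returns an independent copy that compares equal.

```cpp
#include <cassert>
#include <string>

// Hypothetical ID type; not Ray's ObjectID.
struct ToyID {
  std::string bytes;

  bool operator==(const ToyID &other) const { return bytes == other.bytes; }

  // Returns a reference to a single shared static instance.
  static const ToyID &Nil() {
    static const ToyID nil{std::string(4, '\xff')};
    return nil;
  }
};

// Value-returning wrapper: copies the shared instance so callers
// only ever deal with values, at the cost of one extra copy.
inline ToyID NilByValue() {
  ToyID copy = ToyID::Nil();
  return copy;
}
```

Callers of `NilByValue` get a distinct object each time, so comparison code only has to handle the value case.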

ah I see - thanks!

Still needs a test

it'll eventually be covered by the tests for the ownership registration (there weren't any other uses of it or pre-existing tests)

...and none of the other FromX have direct tests either

ray_julia_jll/src/wrappers/any.jl

ray_julia_jll/test/reference_counting.jl

src/object_ref.jl

omus · 2023-09-20T20:30:11Z

src/object_ref.jl

+    # oid::ray_jll.ObjectIDAllocated
+    oid_hex::String


Can you elaborate on why we should store the hex string instead of the oid? maybe we should store both?

yup. was getting some wacky segfaults that went away when I did it this way. I think probably they were due to double deallocation after copying. So, we could keep the ObjectID around but we'd need to special case handling of that every time we create an instance. Even on normal construction, if you construct >1 object ref from a single ObjectID instance, if you don't create a new instance then when you go to finalize the last one living you'll be accessing memory that's already been freed.

Basically it just feels safer to construct an ObjectID on demand when we need it, rather than trying to save a tiny amount of extra allocations by holding onto an instance that's managed by C++.
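
A rough sketch of the "store the hex string, build the ID on demand" idea, under the assumption of a toy ID type: `ToyObjectID`, `ToHex`, `FromHex`, and `ToyHexRef` are hypothetical and only mirror the shape of the `oid_hex` field in the diff. Because the handle owns nothing but a string, copying it cannot alias natively managed memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical ID type holding raw bytes.
struct ToyObjectID {
  std::vector<std::uint8_t> bytes;
};

inline std::string ToHex(const ToyObjectID &id) {
  static const char digits[] = "0123456789abcdef";
  std::string out;
  out.reserve(id.bytes.size() * 2);
  for (std::uint8_t b : id.bytes) {
    out.push_back(digits[b >> 4]);
    out.push_back(digits[b & 0x0f]);
  }
  return out;
}

inline int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  throw std::invalid_argument("not a hex digit");
}

inline ToyObjectID FromHex(const std::string &hex) {
  if (hex.size() % 2 != 0) throw std::invalid_argument("odd-length hex string");
  ToyObjectID id;
  for (std::size_t i = 0; i < hex.size(); i += 2) {
    id.bytes.push_back(
        static_cast<std::uint8_t>(HexValue(hex[i]) * 16 + HexValue(hex[i + 1])));
  }
  return id;
}

// The handle stores only the hex string; a fresh ID is constructed
// every time one is needed, so no two handles share an ID instance.
struct ToyHexRef {
  std::string oid_hex;
  ToyObjectID oid() const { return FromHex(oid_hex); }
};
```

The trade-off is the one described above: a small allocation per `oid()` call in exchange for never holding onto an instance whose lifetime is managed elsewhere.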

test/object_ref.jl

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

kleinschmidt · 2023-09-21T19:28:03Z

Getting segfault/deadlock weirdness on #138 (during remove local ref calls in finalizer) even though it seems like the tests all passed, which makes me think the finalizer stuff is a bit fragile unfortunately. I'm re-running the CI to see if I can get failures to occur here too or if there's something meaningful about the changes made on that branch or if it's general flakiness

kleinschmidt · 2023-09-21T21:28:17Z

The segfaults are definitely flaky/non-deterministic but seem to be happening during the test teardown process, which makes me think there's a race condition between the async objectref finalizers and the core worker cleanup stuff. The C++ stack trace is different each time which is sus. I've tried inserting a GC/yield before tearing down the core worker, which seems to have fixed the segfaults on #138 , and cherry picked that to this branch. The other thing that's probably worth trying at some point is checking whether the core worker is actually initialized in the finalizer, but I'm not sure that would actually solve the problem....
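
The "check whether the core worker is still alive in the finalizer" idea could look roughly like the sketch below. It is a toy model under stated assumptions: `ToyWorker` and its methods are hypothetical, and the real core worker may not expose such a check. The liveness test and the decrement happen under the same mutex so that a late cleanup call after shutdown becomes a no-op instead of touching torn-down state. Whether this would fix the actual race is left open, as in the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical worker that tolerates cleanup calls arriving after shutdown.
class ToyWorker {
 public:
  void AddLocalReference(const std::string &id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!shut_down_) ++counts_[id];
  }

  // Returns false and does nothing once shutdown has begun.
  bool RemoveLocalReference(const std::string &id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (shut_down_) return false;
    auto it = counts_.find(id);
    if (it != counts_.end() && --(it->second) == 0) counts_.erase(it);
    return true;
  }

  std::size_t LocalCount(const std::string &id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(id);
    return it == counts_.end() ? 0 : it->second;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    shut_down_ = true;
    counts_.clear();
  }

 private:
  std::mutex mutex_;
  bool shut_down_ = false;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

A cleanup call before `Shutdown` removes the reference as usual; one that arrives afterwards returns `false` without touching the map.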

omus · 2023-09-22T18:13:46Z

ray_julia_jll/deps/wrapper.cc

+        .method("ObjectIDFromNil", []() {
+            auto id = ObjectID::Nil();
+            ObjectID id_deref = id;
+            return id_deref;
+        })


Still needs a test

ray_julia_jll/test/reference_counting.jl

src/object_ref.jl

test/object_ref.jl

ray_julia_jll/test/reference_counting.jl

test/utils.jl

test/object_ref.jl

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

omus

Couple of minor things. Also see: #126 (comment)

omus · 2023-09-22T20:26:12Z

src/object_store.jl

+#####
+##### Reference counting
+#####
+
+"""
+    get_all_reference_counts()
+
+For testing/debugging purposes, returns a
+`Dict{ray_jll.ObjectID,Tuple{Int,Int}}` containing the reference counts for each
+object ID that the local raylet knows about.  The first count is the "local
+reference" count, and the second is the count of submitted tasks depending on
+the object.
+"""
+function get_all_reference_counts()
+    worker = ray_jll.GetCoreWorker()
+    counts_raw = ray_jll.GetAllReferenceCounts(worker)
+
+    # we need to convert this to a dict we can actually work with.  we use the
+    # hex representation of the ID so we can avoid messing with the internal
+    # ObjectID representation...
+    counts = Dict(ray_jll.Hex(k) => Tuple(Int.(ray_jll._getindex(counts_raw, k)))
+                  for k in ray_jll._keys(counts_raw))
+    return counts
+end


Should be moved to ray_julia_jll

why? we have lots of functions in Ray.jl that wrap API calls like this and provide the high-level julia interface (i.e., Ray.put, Ray.get, Ray.get_job_id, etc.). as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.

> as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.

we should probably do something about that at some point

In this particular case we mainly use this for:

local_count(oid_hex) = first(get(Ray.get_all_reference_counts(), oid_hex, 0))

Which is something I thought we'd want to reuse in the ray_julia_jll.

> something I thought we'd want to reuse in the ray_julia_jll

maybe? I don't know off the top of my head what we'd use it for, but if it does become more convenient to define there we can revisit.

if we do need it, it might be more direct to just directly define that via something like

local_count(oid_hex) = first(_getindex(GetAllReferenceCounts(GetCoreWorker()), FromHex(ObjectID, oid_hex)))
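
For comparison, the lookup that both `local_count` variants perform can be sketched against a plain map. `RefCounts`, `LocalCount`, and `SubmittedTaskCount` are hypothetical names; the pair layout (local refs, submitted-task refs) follows the docstring quoted earlier. In this toy version a missing ID is reported as zero; how the wrapped `_getindex` behaves for a missing key is not stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// (local reference count, submitted-task count), keyed by hex ID.
using RefCounts =
    std::unordered_map<std::string, std::pair<std::size_t, std::size_t>>;

// Local reference count for an ID, or 0 if the ID is not tracked.
inline std::size_t LocalCount(const RefCounts &counts, const std::string &oid_hex) {
  auto it = counts.find(oid_hex);
  return it == counts.end() ? 0 : it->second.first;
}

// Count of submitted tasks depending on the ID, or 0 if not tracked.
inline std::size_t SubmittedTaskCount(const RefCounts &counts,
                                      const std::string &oid_hex) {
  auto it = counts.find(oid_hex);
  return it == counts.end() ? 0 : it->second.second;
}
```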

kleinschmidt added 5 commits September 18, 2023 10:43

add/remove local ref in Ray.jl code

1566dae

let's try this...

f8eb36c

use lambda because of template inference failure

82a1a3d

GetAllLocalReferences with wrapped return map type

c7e9dfb

ref counting tests to exercise add/remove local reference

a3e86cb

kleinschmidt requested a review from omus September 18, 2023 17:18

kleinschmidt changed the title ~~add/remove local ref, scaffolding for testing reference counts~~ add/remove local ref on ObjectRef construction/finalize, scaffolding for testing reference counts Sep 18, 2023

kleinschmidt marked this pull request as draft September 18, 2023 18:03

omus reviewed Sep 18, 2023

kleinschmidt added 4 commits September 19, 2023 11:47

add local ref false

7d4936c

==, hash, and show for ObjectID; dont' return ref from Nil

ee5c0ae

get all reference counts in Ray.jl; specialized deepcopy

bef86fa

use hex string in objectref, finalize not async

79d391d

kleinschmidt added 2 commits September 20, 2023 14:15

restore async to finaliezr and yield in tests

a4e3869

also exercise the task return and construction counting

90ee48c

kleinschmidt marked this pull request as ready for review September 20, 2023 19:03

kleinschmidt requested a review from omus September 20, 2023 19:03

omus reviewed Sep 20, 2023

kleinschmidt mentioned this pull request Sep 20, 2023

include nested IDs in put/task args/returns #138

Merged

kleinschmidt requested a review from omus September 21, 2023 14:20

kleinschmidt and others added 4 commits September 21, 2023 12:28

Update ray_julia_jll/src/wrappers/any.jl

9f1271f

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Update ray_julia_jll/src/wrappers/any.jl

5eb670b

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Update test/object_ref.jl

3ec5f64

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Merge remote-tracking branch 'origin/main' into dfk/ownership

6dcc5e4

attempt to fix segault on tests

2cb36cf

omus reviewed Sep 22, 2023

omus reviewed Sep 22, 2023

kleinschmidt and others added 4 commits September 22, 2023 15:53

Update ray_julia_jll/test/reference_counting.jl

7b3dbeb

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Update src/object_ref.jl

a6d20c9

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Apply suggestions from code review

6eab0c4

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

the ol' test shuffle

cef9f32

kleinschmidt requested a review from omus September 22, 2023 20:13

revert un hexing

fe6312c

omus approved these changes Sep 22, 2023

kleinschmidt merged commit 6f0db4e into main Sep 22, 2023
4 checks passed

kleinschmidt deleted the dfk/ownership branch September 22, 2023 20:30

glennmoy mentioned this pull request Sep 25, 2023

Define API bounday between ray_julia_jll and Ray.jl #144

Open

This was referenced Sep 27, 2023

Support passing object references between workers #77

Closed

Add tests for having multiple tasks in flight #50

Closed
