Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add/remove local ref on ObjectRef construction/finalize, scaffolding for testing reference counts #126

Merged
merged 21 commits into from
Sep 22, 2023

Conversation

kleinschmidt
Copy link
Member

@kleinschmidt kleinschmidt commented Sep 18, 2023

Low level JLL: wraps

  • CoreWorker::AddLocalReference
  • CoreWorker::RemoveLocalReference
  • CoreWorker::GetAllReferenceCounts

The last one returns a std::unordered_map which requires a custom wrapper for the specific type params (based on the resource request map wrap).

High level Ray API:

  • adds local reference on construction of ObjectRef (can be disabled by flag)
  • skips adding local reference when ray core worker initializes local ref to be 1 (i.e., Put and SubmitTask)
  • removes local reference in finalizer (cannot be disabled)
  • handles deepcopy appropriately (routes through constructor in order to increment ref count and install finalizer)

In order to get the high-level support working without too much pain, I decided to chagne teh ray_jll.ObjectID (points to C++ managed memory) to the String-formatted hex ID, and overloaded the getproperty to return a new instance of the ObjectID when it's needed (for passing off to Ray C++ code). I'm not totally happy with this; an alternative would be ot keep the oid field and change the handling in the deep copy, but I'm a bit wary about multiple-deallocation/segfaults on finalization...


@kleinschmidt kleinschmidt changed the title add/remove local ref, scaffolding for testing reference counts add/remove local ref on ObjectRef construction/finalize, scaffolding for testing reference counts Sep 18, 2023
@codecov
Copy link

codecov bot commented Sep 18, 2023

Codecov Report

Merging #126 (cef9f32) into main (9a12fed) will increase coverage by 0.40%.
Report is 2 commits behind head on main.
The diff coverage is 97.56%.

❗ Current head cef9f32 differs from pull request most recent head fe6312c. Consider uploading reports for the commit fe6312c to get more accurate results

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   95.37%   95.77%   +0.40%     
==========================================
  Files           8        9       +1     
  Lines         389      426      +37     
==========================================
+ Hits          371      408      +37     
  Misses         18       18              
Flag Coverage Δ
Ray.jl 95.77% <97.56%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
src/object_ref.jl 90.00% <96.96%> (+4.81%) ⬆️
src/object_store.jl 94.28% <100.00%> (+0.95%) ⬆️
src/runtime.jl 97.48% <100.00%> (-0.07%) ⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kleinschmidt
Copy link
Member Author

@omus this isn't quite ready for review yet

@kleinschmidt kleinschmidt marked this pull request as draft September 18, 2023 18:03
Copy link
Member

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look

Comment on lines 8 to 11
if add_local_ref
worker = ray_jll.GetCoreWorker()
ray_jll.AddLocalReference(worker, oid)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add the reference after the constructor?

Comment on lines 13 to 19
return finalizer(objref) do objref
# putting finalizer behind `@async` may not be necessary since docs
# suggest that you should `ccall` IO functions. But doing it this
# way allows us to do things like debug logging...
errormonitor(@async finalize_object_ref(objref))
return nothing
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to conditionally add the finalizer based upon add_local_ref? Also, since you can add finalizers outside of the constructor it may nicer to separate reference tracking hooks from creating ObjectRef instances.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(gonna respond to both questions here since they're good ones and related IMO)

I think the only reason to not call AddLocalReference during construction is when you get the ID from a core worker operation that also increments the local ref; however, in that case, you still want to decrement the ref when the instance is GCed IIUC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for instance, the object IDs created during task submission are initialized with ref count 1, but the doc string says that the frontend is still responsible for decrementing them when they go out of scope in the application:

  /// Add a task that is pending execution.
  ///
  /// The local ref count for all return refs (excluding actor creation tasks)
  /// will be initialized to 1 so that the ref is considered in scope before
  /// returning to the language frontend. The caller is responsible for
  /// decrementing the ref count once the frontend ref has gone out of scope.
  ///
  /// \param[in] caller_address The rpc address of the calling task.
  /// \param[in] spec The spec of the pending task.
  /// \param[in] max_retries Number of times this task may be retried
  /// on failure.
  /// \return ObjectRefs returned by this task.
  std::vector<rpc::ObjectReference> AddPendingTask(const rpc::Address &caller_address,
                                                   const TaskSpecification &spec,
                                                   const std::string &call_site,
                                                   int max_retries = 0);

(it's a bit unfortunate that teh CoreWorker::SubmitTask docstring does not mention this, you have to dig a bit into the task manager)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another case: CoreWorker::Put calls AddOwnedReference and sets the add_local_ref arg to true:

Status CoreWorker::Put(const RayObject &object,
                       const std::vector<ObjectID> &contained_object_ids,
                       ObjectID *object_id) {
  *object_id = ObjectID::FromIndex(worker_context_.GetCurrentInternalTaskId(),
                                   worker_context_.GetNextPutIndex());
  reference_counter_->AddOwnedObject(*object_id,
                                     contained_object_ids,
                                     rpc_address_,
                                     CurrentCallSite(),
                                     object.GetSize(),
                                     /*is_reconstructable=*/false,
                                     /*add_local_ref=*/true,
                                     NodeID::FromBinary(rpc_address_.raylet_id()));

https://github.com/beacon-biosignals/ray/blob/4ceb62daaad05124713ff9d94ffbdad35ee19f86/src/ray/core_worker/core_worker.cc#L1110-L1122

  /// Add an object that we own. The object may depend on other objects.
  /// Dependencies for each ObjectID must be set at most once. The local
  /// reference count for the ObjectID is set to zero, which assumes that an
  /// ObjectID for it will be created in the language frontend after this call.
  ///
  /// TODO(swang): We could avoid copying the owner_address since
  /// we are the owner, but it is easier to store a copy for now, since the
  /// owner ID will change for workers executing normal tasks and it is
  /// possible to have leftover references after a task has finished.
  ///
  /// \param[in] object_id The ID of the object that we own.
  /// \param[in] contained_ids ObjectIDs that are contained in the object's value.
  /// As long as the object_id is in scope, the inner objects should not be GC'ed.
  /// \param[in] owner_address The address of the object's owner.
  /// \param[in] call_site Description of the call site where the reference was created.
  /// \param[in] object_size Object size if known, otherwise -1;
  /// \param[in] is_reconstructable Whether the object can be reconstructed
  /// through lineage re-execution.
  /// \param[in] add_local_ref Whether to initialize the local ref count to 1.
  /// This is used to ensure that the ref is considered in scope before the
  /// corresponding ObjectRef has been returned to the language frontend.
  /// \param[in] pinned_at_raylet_id The primary location for the object, if it
  /// is already known. This is only used for ray.put calls.
  void AddOwnedObject(const ObjectID &object_id,
                      const std::vector<ObjectID> &contained_ids,
                      const rpc::Address &owner_address,
                      const std::string &call_site,
                      const int64_t object_size,
                      bool is_reconstructable,
                      bool add_local_ref,
                      const absl::optional<NodeID> &pinned_at_raylet_id =
                          absl::optional<NodeID>()) LOCKS_EXCLUDED(mutex_);

https://github.com/beacon-biosignals/ray/blob/4ceb62daaad05124713ff9d94ffbdad35ee19f86/src/ray/core_worker/reference_count.h#L160-L163

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in both those cases, we need the finalizer to remove the local ref even though we don't manually add the local ref when that object is created. I think that, in general, every instance of ObjectRef needs to "clean up" after itself like this, but not all of them need to add a local reference on initialization. So, should we move the add local ref out of hte constructor? I don't think so, because we don't necessarily control every code path by which an ObjectRef might be created (i.e., copy), so for safety's sake I think its best to default to always adding a local ref on construction.

@kleinschmidt
Copy link
Member Author

Something's causing the eval stuff to segfault, will look into it tomorrow.

@kleinschmidt kleinschmidt marked this pull request as ready for review September 20, 2023 19:03
Comment on lines +479 to +483
.method("ObjectIDFromNil", []() {
auto id = ObjectID::Nil();
ObjectID id_deref = id;
return id_deref;
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a test for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's going on here exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectID::Nil returns a ref to an ObjectID (unlike all the otehr FromX functions). that means taht when comparing two objectIDs we need to handle BOTH values AND refs, and that was even more annoying than I epxected (was getting weird method errors even using @cxxdereference because of slightly less generic fallbacks generated by CxxWrap itself). So I just decided to punt here and return a value even though it's a bit less efficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see - thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs a test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'll eventually be covered by the tests for teh ownership registration (there weren't any other uses of it or pre-existing tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and none of the other FromX have direct tests either

ray_julia_jll/src/wrappers/any.jl Show resolved Hide resolved
ray_julia_jll/src/wrappers/any.jl Outdated Show resolved Hide resolved
ray_julia_jll/src/wrappers/any.jl Outdated Show resolved Hide resolved
ray_julia_jll/test/reference_counting.jl Show resolved Hide resolved
ray_julia_jll/test/reference_counting.jl Show resolved Hide resolved
src/object_ref.jl Outdated Show resolved Hide resolved
src/object_ref.jl Show resolved Hide resolved
Comment on lines 2 to 3
# oid::ray_jll.ObjectIDAllocated
oid_hex::String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why we should store the hex string instead of the oid? maybe we should store both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. was getting some wacky segfaults that went away when I did it this way. I think probably they were due to double deallocation after copying. So, we could keep the ObjectID around but we'd need to special case handling of that every time we create an instance. Even on normal construction, if you construct >1 object ref from a single ObjectID instance, if you don't create a new instance then when you go to finalize the last one living you'll be accessing memory that's already been freed.

Basically it just feels safer to construct an ObjectID on demand when we need it, rather than trying to save a tiny amount of extra allocations by holding onto an instance that's managed by C++.

test/object_ref.jl Outdated Show resolved Hide resolved
kleinschmidt and others added 4 commits September 21, 2023 12:28
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
@kleinschmidt
Copy link
Member Author

Getting segfault/deadlock weirdness on #138 (during remove local ref calls in finalizer) even though it seems like the tests all passed, which makes me think the finalizer stuff is a bit fragile unfortunately. I'm re-running the CI to see if I can get failures to occur here too or if there's something meaningful about the changes made on that branch or if it's general flakiness

@kleinschmidt
Copy link
Member Author

The segfaults are definitely flaky/non-deterministic but seem to be happening during the test teardown process, which makes me think there's a race condition between the async objectref finalizers and the core worker cleanup stuff. The C++ stack trace is different each time which is sus. I've tried inserting a GC/yield before tearing down the core worker, which seems to have fixed the segfaults on #138 , and cherry picked that to this branch. The other thing that's probably worth trying at some point is checking whether the core worker is actually initialized in the finalizer, but I'm not sure that would actually solve the problem....

Comment on lines +479 to +483
.method("ObjectIDFromNil", []() {
auto id = ObjectID::Nil();
ObjectID id_deref = id;
return id_deref;
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs a test

ray_julia_jll/test/reference_counting.jl Show resolved Hide resolved
ray_julia_jll/test/reference_counting.jl Show resolved Hide resolved
src/object_ref.jl Show resolved Hide resolved
src/object_ref.jl Outdated Show resolved Hide resolved
src/object_ref.jl Outdated Show resolved Hide resolved
src/object_ref.jl Outdated Show resolved Hide resolved
test/object_ref.jl Outdated Show resolved Hide resolved
ray_julia_jll/test/reference_counting.jl Show resolved Hide resolved
test/utils.jl Show resolved Hide resolved
test/object_ref.jl Outdated Show resolved Hide resolved
test/object_ref.jl Outdated Show resolved Hide resolved
kleinschmidt and others added 4 commits September 22, 2023 15:53
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Copy link
Member

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor things. Also see: #126 (comment)

Comment on lines +84 to +107
#####
##### Reference counting
#####

"""
get_all_reference_counts()

For testing/debugging purposes, returns a
`Dict{ray_jll.ObjectID,Tuple{Int,Int}}` containing the reference counts for each
object ID that the local raylet knows about. The first count is the "local
reference" count, and the second is the count of submitted tasks depending on
the object.
"""
function get_all_reference_counts()
worker = ray_jll.GetCoreWorker()
counts_raw = ray_jll.GetAllReferenceCounts(worker)

# we need to convert this to a dict we can actually work with. we use the
# hex representation of the ID so we can avoid messing with the internal
# ObjectID representation...
counts = Dict(ray_jll.Hex(k) => Tuple(Int.(ray_jll._getindex(counts_raw, k)))
for k in ray_jll._keys(counts_raw))
return counts
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be moved to ray_julia_jll

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? we have lots of functions in Ray.jl that wrap API calls like this and provide the high-level julia interface (i.e., Ray.put, Ray.get, Ray.get_job_id, etc.). as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.

we should probably do something about that at some point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case we mainly use this for:

local_count(oid_hex) = first(get(Ray.get_all_reference_counts(), oid_hex, 0))

Which is something I thought we'd want to reuse in the ray_julia_jll.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something I thought we'd want to reuse in the ray_julia_jll

maybe? I don't know off the top of my head what we'd use it for, but if it does become more convenient to define there we can revisit.

if we do need it, it might be more direct to just directly define that via something like

local_count(oid_hex) = first(_getindex(GetAllReferenceCounts(GetCoreWorker()), FromHex(ObjectID, oid_hex)))

@kleinschmidt kleinschmidt merged commit 6f0db4e into main Sep 22, 2023
4 checks passed
@kleinschmidt kleinschmidt deleted the dfk/ownership branch September 22, 2023 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants