-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add/remove local ref on ObjectRef construction/finalize, scaffolding for testing reference counts #126
Conversation
Codecov Report
@@ Coverage Diff @@
## main #126 +/- ##
==========================================
+ Coverage 95.37% 95.77% +0.40%
==========================================
Files 8 9 +1
Lines 389 426 +37
==========================================
+ Hits 371 408 +37
Misses 18 18
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@omus this isn't quite ready for review yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look
src/object_ref.jl
Outdated
if add_local_ref | ||
worker = ray_jll.GetCoreWorker() | ||
ray_jll.AddLocalReference(worker, oid) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just add the reference after the constructor?
src/object_ref.jl
Outdated
return finalizer(objref) do objref | ||
# putting finalizer behind `@async` may not be necessary since docs | ||
# suggest that you should `ccall` IO functions. But doing it this | ||
# way allows us to do things like debug logging... | ||
errormonitor(@async finalize_object_ref(objref)) | ||
return nothing | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to conditionally add the finalizer
based upon add_local_ref
? Also, since you can add finalizer
s outside of the constructor it may nicer to separate reference tracking hooks from creating ObjectRef
instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(gonna respond to both questions here since they're good ones and related IMO)
I think the only reason to not call AddLocalReference
during construction is when you get the ID from a core worker operation that also increments the local ref; however, in that case, you still want to decrement the ref when the instance is GCed IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for instance, the object IDs created during task submission are initialized with ref count 1, but the doc string says that the frontend is still responsible for decrementing them when they go out of scope in the application:
/// Add a task that is pending execution.
///
/// The local ref count for all return refs (excluding actor creation tasks)
/// will be initialized to 1 so that the ref is considered in scope before
/// returning to the language frontend. The caller is responsible for
/// decrementing the ref count once the frontend ref has gone out of scope.
///
/// \param[in] caller_address The rpc address of the calling task.
/// \param[in] spec The spec of the pending task.
/// \param[in] max_retries Number of times this task may be retried
/// on failure.
/// \return ObjectRefs returned by this task.
std::vector<rpc::ObjectReference> AddPendingTask(const rpc::Address &caller_address,
const TaskSpecification &spec,
const std::string &call_site,
int max_retries = 0);
(it's a bit unfortunate that teh CoreWorker::SubmitTask
docstring does not mention this, you have to dig a bit into the task manager)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another case: CoreWorker::Put
calls AddOwnedReference
and sets the add_local_ref
arg to true
:
Status CoreWorker::Put(const RayObject &object,
const std::vector<ObjectID> &contained_object_ids,
ObjectID *object_id) {
*object_id = ObjectID::FromIndex(worker_context_.GetCurrentInternalTaskId(),
worker_context_.GetNextPutIndex());
reference_counter_->AddOwnedObject(*object_id,
contained_object_ids,
rpc_address_,
CurrentCallSite(),
object.GetSize(),
/*is_reconstructable=*/false,
/*add_local_ref=*/true,
NodeID::FromBinary(rpc_address_.raylet_id()));
/// Add an object that we own. The object may depend on other objects.
/// Dependencies for each ObjectID must be set at most once. The local
/// reference count for the ObjectID is set to zero, which assumes that an
/// ObjectID for it will be created in the language frontend after this call.
///
/// TODO(swang): We could avoid copying the owner_address since
/// we are the owner, but it is easier to store a copy for now, since the
/// owner ID will change for workers executing normal tasks and it is
/// possible to have leftover references after a task has finished.
///
/// \param[in] object_id The ID of the object that we own.
/// \param[in] contained_ids ObjectIDs that are contained in the object's value.
/// As long as the object_id is in scope, the inner objects should not be GC'ed.
/// \param[in] owner_address The address of the object's owner.
/// \param[in] call_site Description of the call site where the reference was created.
/// \param[in] object_size Object size if known, otherwise -1;
/// \param[in] is_reconstructable Whether the object can be reconstructed
/// through lineage re-execution.
/// \param[in] add_local_ref Whether to initialize the local ref count to 1.
/// This is used to ensure that the ref is considered in scope before the
/// corresponding ObjectRef has been returned to the language frontend.
/// \param[in] pinned_at_raylet_id The primary location for the object, if it
/// is already known. This is only used for ray.put calls.
void AddOwnedObject(const ObjectID &object_id,
const std::vector<ObjectID> &contained_ids,
const rpc::Address &owner_address,
const std::string &call_site,
const int64_t object_size,
bool is_reconstructable,
bool add_local_ref,
const absl::optional<NodeID> &pinned_at_raylet_id =
absl::optional<NodeID>()) LOCKS_EXCLUDED(mutex_);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both those cases, we need the finalizer to remove the local ref even though we don't manually add the local ref when that object is created. I think that, in general, every instance of ObjectRef
needs to "clean up" after itself like this, but not all of them need to add a local reference on initialization. So, should we move the add local ref out of hte constructor? I don't think so, because we don't necessarily control every code path by which an ObjectRef might be created (i.e., copy), so for safety's sake I think its best to default to always adding a local ref on construction.
Something's causing the eval stuff to segfault, will look into it tomorrow. |
.method("ObjectIDFromNil", []() { | ||
auto id = ObjectID::Nil(); | ||
ObjectID id_deref = id; | ||
return id_deref; | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add a test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's going on here exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObjectID::Nil
returns a ref to an ObjectID
(unlike all the otehr FromX
functions). that means taht when comparing two objectIDs we need to handle BOTH values AND refs, and that was even more annoying than I epxected (was getting weird method errors even using @cxxdereference
because of slightly less generic fallbacks generated by CxxWrap itself). So I just decided to punt here and return a value even though it's a bit less efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needs a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it'll eventually be covered by the tests for teh ownership registration (there weren't any other uses of it or pre-existing tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and none of the other FromX
have direct tests either
src/object_ref.jl
Outdated
# oid::ray_jll.ObjectIDAllocated | ||
oid_hex::String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on why we should store the hex string instead of the oid
? maybe we should store both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup. was getting some wacky segfaults that went away when I did it this way. I think probably they were due to double deallocation after copying. So, we could keep the ObjectID around but we'd need to special case handling of that every time we create an instance. Even on normal construction, if you construct >1 object ref from a single ObjectID instance, if you don't create a new instance then when you go to finalize the last one living you'll be accessing memory that's already been freed.
Basically it just feels safer to construct an ObjectID on demand when we need it, rather than trying to save a tiny amount of extra allocations by holding onto an instance that's managed by C++.
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Getting segfault/deadlock weirdness on #138 (during remove local ref calls in finalizer) even though it seems like the tests all passed, which makes me think the finalizer stuff is a bit fragile unfortunately. I'm re-running the CI to see if I can get failures to occur here too or if there's something meaningful about the changes made on that branch or if it's general flakiness |
The segfaults are definitely flaky/non-deterministic but seem to be happening during the test teardown process, which makes me think there's a race condition between the async objectref finalizers and the core worker cleanup stuff. The C++ stack trace is different each time which is sus. I've tried inserting a GC/yield before tearing down the core worker, which seems to have fixed the segfaults on #138 , and cherry picked that to this branch. The other thing that's probably worth trying at some point is checking whether the core worker is actually initialized in the finalizer, but I'm not sure that would actually solve the problem.... |
.method("ObjectIDFromNil", []() { | ||
auto id = ObjectID::Nil(); | ||
ObjectID id_deref = id; | ||
return id_deref; | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needs a test
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor things. Also see: #126 (comment)
##### | ||
##### Reference counting | ||
##### | ||
|
||
""" | ||
get_all_reference_counts() | ||
|
||
For testing/debugging purposes, returns a | ||
`Dict{ray_jll.ObjectID,Tuple{Int,Int}}` containing the reference counts for each | ||
object ID that the local raylet knows about. The first count is the "local | ||
reference" count, and the second is the count of submitted tasks depending on | ||
the object. | ||
""" | ||
function get_all_reference_counts() | ||
worker = ray_jll.GetCoreWorker() | ||
counts_raw = ray_jll.GetAllReferenceCounts(worker) | ||
|
||
# we need to convert this to a dict we can actually work with. we use the | ||
# hex representation of the ID so we can avoid messing with the internal | ||
# ObjectID representation... | ||
counts = Dict(ray_jll.Hex(k) => Tuple(Int.(ray_jll._getindex(counts_raw, k))) | ||
for k in ray_jll._keys(counts_raw)) | ||
return counts | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be moved to ray_julia_jll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? we have lots of functions in Ray.jl that wrap API calls like this and provide the high-level julia interface (i.e., Ray.put
, Ray.get
, Ray.get_job_id
, etc.). as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I can tell there aren't really clear criteria for what belongs in Ray.jl and what belongs in the jll.
we should probably do something about that at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case we mainly use this for:
local_count(oid_hex) = first(get(Ray.get_all_reference_counts(), oid_hex, 0))
Which is something I thought we'd want to reuse in the ray_julia_jll
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something I thought we'd want to reuse in the ray_julia_jll
maybe? I don't know off the top of my head what we'd use it for, but if it does become more convenient to define there we can revisit.
if we do need it, it might be more direct to just directly define that via something like
local_count(oid_hex) = first(_getindex(GetAllReferenceCounts(GetCoreWorker()), FromHex(ObjectID, oid_hex)))
Low level JLL: wraps
CoreWorker::AddLocalReference
CoreWorker::RemoveLocalReference
CoreWorker::GetAllReferenceCounts
The last one returns a
std::unordered_map
which requires a custom wrapper for the specific type params (based on the resource request map wrap).High level Ray API:
ObjectRef
(can be disabled by flag)Put
andSubmitTask
)deepcopy
appropriately (routes through constructor in order to increment ref count and install finalizer)In order to get the high-level support working without too much pain, I decided to chagne teh
ray_jll.ObjectID
(points to C++ managed memory) to theString
-formatted hex ID, and overloaded thegetproperty
to return a new instance of theObjectID
when it's needed (for passing off to Ray C++ code). I'm not totally happy with this; an alternative would be ot keep theoid
field and change the handling in the deep copy, but I'm a bit wary about multiple-deallocation/segfaults on finalization...