segfault when accessing metadata buffer #212

Open · kleinschmidt opened this issue Oct 20, 2023 · 5 comments
@kleinschmidt (Member)

This was observed during some internal Beacon benchmarking of a large job (~25k tasks + a reduce step) with KubeRay. I haven't been able to reproduce (re-submitting exactly the same job with no changes has been running smoothly).

Ray.get here is being called in the context of an async reduction step. The tasks being reduced over are generated by mapping a task that returns a DataFrame, and the results are then reduced like this:

    out = DataFrame()
    progress = Progress(length(ray_objects); desc="Assembling segments...", showspeed=true)
    @sync for object in ray_objects
        # fetch objects as they are ready (may be out of order)
        @async begin
            result = Ray.get(object)
            append!(out, result; cols=:union)
            next!(progress)
        end
    end

My hunch is that there may be some kind of race condition here, where the Julia async Tasks are somehow yielding in such a way that the underlying memory gets freed, but that's really only a hunch. The other thing I can think of off the top of my head is that there's something we're not handling around non-LocalMemoryBuffer buffers, but it's hard to say. The mysterious thing is that we're basically only interacting with the metadata via CoreWorker API methods (we gate the call to GetMetadata behind a call to HasMetadata):

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/src/ray_julia_jll/common.jl#L258-L267

both of which are directly wrapping the C++ methods:

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/build/wrapper.cc#L649-L661
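
For context, the Julia-level pattern is roughly the following (a paraphrase, not the exact linked source; the inner accessor name here is a placeholder inferred from the stacktrace below):

    # Rough paraphrase of the gated metadata access (illustrative only; the real
    # code is at the common.jl link above). `has_metadata`/`metadata_buffer` stand
    # in for the thin CxxWrap bindings to CoreWorker's HasMetadata/GetMetadata,
    # and `take!` copies the buffer contents out via the C++ Data()/Size() accessors.
    function get_metadata(ray_obj)
        has_metadata(ray_obj) || return nothing
        buffer = metadata_buffer(ray_obj)   # std::shared_ptr<ray::Buffer>
        return take!(buffer)                # Vector{UInt8}; this is where the segfault hits
    end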

Logs
Assembling segments...   4%|▋              |  ETA: 1:51:31 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:50:47 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:50:04 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:49:17 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:48:35 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:52:43 ( 0.29  s/it)
Assembling segments...   5%|▊              |  ETA: 1:43:25 ( 0.27  s/it)
Assembling segments...   5%|▊              |  ETA: 1:40:53 ( 0.26  s/it)
Assembling segments...   5%|▊              |  ETA: 1:40:15 ( 0.26  s/it)
[4357] signal (11.1): Segmentation fault
in expression starting at /tmp/ray/session_2023-10-20_14-17-46_414642_1/runtime_resources/working_dir_files/_ray_pkg_47145e2793bb888e/migration.jl:7
_ZNSt17_Function_handlerIFPhRKN3ray6BufferEEZN5jlcxx11TypeWrapperIS2_E6methodIS0_S2_JEEERS8_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEMT0_KFT_DpT1_EEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_ at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
_ZN5jlcxx6detail11CallFunctorIPhJRKN3ray6BufferEEE5applyEPKvNS_13WrappedCppPtrE at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
Data at /usr/local/share/julia-depot/ab14e38af3/packages/CxxWrap/aXNBY/src/CxxWrap.jl:624 [inlined]
take! at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_julia_jll/common.jl:284
get_metadata at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_julia_jll/common.jl:263
deserialize_from_ray_object at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_serializer.jl:117
get at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)

---------------------------------------
Job 'raysubmit_uYMShKRzc8EEcqKJ' failed
---------------------------------------

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
get at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)
@omus (Member) commented Oct 20, 2023

@kleinschmidt this was using Ray.jl v0.0.2 right?

@kleinschmidt (Member, Author)

It was on 7f3aec0a39b6db86356ad658ab89d29531303caa (the commit before the workflows landed), and everything was built from source in the Docker image (using the Ray.jl-provided Dockerfile as the base).

@kleinschmidt (Member, Author)

Just hit this again, same code, same circumstances (just started a fresh cluster and submitted the job; got through ~4% of the work).

@kleinschmidt (Member, Author)

...and again

@kleinschmidt (Member, Author)

Can reproduce this pretty regularly (unfortunately not in a way that I can share since it's internal data).

I also ruled out the async reducer as the root cause; using a channel to make sure the gets happen synchronously didn't fix it.
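
The channel-based variant was something along these lines (a sketch, not the exact code; assumes Ray and DataFrames are loaded):

    # Sketch of the channel-based variant (illustrative only): a single consumer
    # pulls refs off a channel and fetches them one at a time, so no two Ray.get
    # calls can interleave.
    refs = Channel{Any}(length(ray_objects))
    foreach(obj -> put!(refs, obj), ray_objects)
    close(refs)

    out = DataFrame()
    for object in refs
        result = Ray.get(object)
        append!(out, result; cols=:union)
    end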

I did a bit of poking around at whether we can check isnull(buffer[]) in take! as a guard against this, but when buffer[] is a null pointer we get the RayBuffer-deleted error, not a segfault (at least on my M1 Mac; maybe it manifests differently on Linux, where I was running this...).
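
The guard I was poking at looked roughly like this (sketch only, not an actual diff; the body of take! and the Data/Size accessor calls are approximations of what the wrapper does):

    # Sketch of the attempted guard (illustrative; error text is made up):
    function take!(buffer)
        isnull(buffer[]) && error("ray::Buffer has been deleted")  # guard instead of segfault
        ptr = Data(buffer[])   # unsigned char* from the C++ wrapper
        len = Size(buffer[])
        return copy(unsafe_wrap(Vector{UInt8}, ptr, len; own=false))
    end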

The next step for debugging this would probably be to just print all of the bytes during deserialization whenever HasMetadata returns true, to see if we can reconstruct what's going on in a self-contained way.
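
i.e. something along the lines of this debugging sketch (the accessor names here are placeholders, not the actual wrapper API):

    # Debugging sketch (illustrative only): dump the raw metadata + data bytes of
    # a RayObject so a failing payload can be replayed outside the cluster.
    # `has_metadata`/`get_metadata`/`get_data` are placeholder accessor names.
    function dump_ray_object_bytes(io::IO, ray_obj)
        if has_metadata(ray_obj)
            println(io, "metadata: ", bytes2hex(get_metadata(ray_obj)))
        end
        println(io, "data: ", bytes2hex(get_data(ray_obj)))
    end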
