segfault when accessing metadata buffer #212

Open · kleinschmidt opened this issue Oct 20, 2023 · 5 comments
@kleinschmidt (Member)

This was observed during some internal Beacon benchmarking of a large job (~25k tasks + a reduce step) with KubeRay. I haven't been able to reproduce (re-submitting exactly the same job with no changes has been running smoothly).

Ray.get here is being called in the context of an async reduction step. The tasks being reduced over are generated by mapping a task that returns a DataFrame, and the results are then reduced like this:

    out = DataFrame()
    progress = Progress(length(ray_objects); desc="Assembling segments...", showspeed=true)
    @sync for object in ray_objects
        # fetch objects as they are ready (may be out of order)
        @async begin
            result = Ray.get(object)
            append!(out, result; cols=:union)
            next!(progress)
        end
    end

My hunch is that there may be some kind of race condition here, where the Julia async Tasks are somehow yielding in such a way that the underlying memory gets freed, but that's really only a hunch. The other thing I can think of off the top of my head is that there's something we're not handling around non-LocalMemoryBuffer buffers, but it's hard to say. The mysterious thing is that we're basically only interacting with the metadata via CoreWorker API methods (we gate the call to GetMetadata behind a call to HasMetadata):

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/src/ray_julia_jll/common.jl#L258-L267

both of which are directly wrapping the C++ methods:

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/build/wrapper.cc#L649-L661
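
For context, the Julia-level pattern is roughly the following (a paraphrase, not the exact linked source; the inner accessor name here is a placeholder inferred from the stacktrace below):

    # Rough paraphrase of the gated metadata access (illustrative only; the real
    # code is at the common.jl link above). `has_metadata`/`metadata_buffer` stand
    # in for the thin CxxWrap bindings to CoreWorker's HasMetadata/GetMetadata,
    # and `take!` copies the buffer contents out via the C++ Data()/Size() accessors.
    function get_metadata(ray_obj)
        has_metadata(ray_obj) || return nothing
        buffer = metadata_buffer(ray_obj)   # std::shared_ptr<ray::Buffer>
        return take!(buffer)                # Vector{UInt8}; this is where the segfault hits
    end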

Logs
Assembling segments...   4%|▋              |  ETA: 1:51:31 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:50:47 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:50:04 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:49:17 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:48:35 ( 0.28  s/it)
Assembling segments...   4%|▋              |  ETA: 1:52:43 ( 0.29  s/it)
Assembling segments...   5%|▊              |  ETA: 1:43:25 ( 0.27  s/it)
Assembling segments...   5%|▊              |  ETA: 1:40:53 ( 0.26  s/it)
Assembling segments...   5%|▊              |  ETA: 1:40:15 ( 0.26  s/it)
[4357] signal (11.1): Segmentation fault
in expression starting at /tmp/ray/session_2023-10-20_14-17-46_414642_1/runtime_resources/working_dir_files/_ray_pkg_47145e2793bb888e/migration.jl:7
_ZNSt17_Function_handlerIFPhRKN3ray6BufferEEZN5jlcxx11TypeWrapperIS2_E6methodIS0_S2_JEEERS8_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEMT0_KFT_DpT1_EEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_ at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
_ZN5jlcxx6detail11CallFunctorIPhJRKN3ray6BufferEEE5applyEPKvNS_13WrappedCppPtrE at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
Data at /usr/local/share/julia-depot/ab14e38af3/packages/CxxWrap/aXNBY/src/CxxWrap.jl:624 [inlined]
take! at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_julia_jll/common.jl:284
get_metadata at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_julia_jll/common.jl:263
deserialize_from_ray_object at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/ray_serializer.jl:117
get at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)

---------------------------------------
Job 'raysubmit_uYMShKRzc8EEcqKJ' failed
---------------------------------------

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
get at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev/<REDACTED>/src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)
@omus (Member) commented Oct 20, 2023

@kleinschmidt this was using Ray.jl v0.0.2 right?

@kleinschmidt (Member, Author)

It was on 7f3aec0a39b6db86356ad658ab89d29531303caa (the commit before the workflows landed), and everything was built from source in the Docker image (using the Ray.jl-provided Dockerfile as the base).

@kleinschmidt (Member, Author)

Just hit this again, same code, same circumstances (just started a fresh cluster and submitted the job; got through ~4% of the work).

@kleinschmidt (Member, Author)

...and again

@kleinschmidt (Member, Author)

Can reproduce this pretty regularly (unfortunately not in a way that I can share since it's internal data).

I also ruled out the async reducer as the root cause; using a channel to make sure the gets happen synchronously didn't fix it.
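
The channel-based variant was something along these lines (a sketch, not the exact code; assumes Ray and DataFrames are loaded):

    # Sketch of the channel-based variant (illustrative only): a single consumer
    # pulls refs off a channel and fetches them one at a time, so no two Ray.get
    # calls can interleave.
    refs = Channel{Any}(length(ray_objects))
    foreach(obj -> put!(refs, obj), ray_objects)
    close(refs)

    out = DataFrame()
    for object in refs
        result = Ray.get(object)
        append!(out, result; cols=:union)
    end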

I did a bit of poking around at whether we can check isnull(buffer[]) in take! as a guard against this, but when buffer[] is a null pointer we get the RayBuffer-deleted error, not a segfault (at least on my M1 Mac; maybe it manifests differently on Linux, where I was running this...).
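
The guard I was poking at looked roughly like this (sketch only, not an actual diff; the body of take! and the Data/Size accessor calls are approximations of what the wrapper does):

    # Sketch of the attempted guard (illustrative; error text is made up):
    function take!(buffer)
        isnull(buffer[]) && error("ray::Buffer has been deleted")  # guard instead of segfault
        ptr = Data(buffer[])   # unsigned char* from the C++ wrapper
        len = Size(buffer[])
        return copy(unsafe_wrap(Vector{UInt8}, ptr, len; own=false))
    end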

The next step for debugging this would probably be to just print all of the bytes during deserialization whenever HasMetadata returns true, to see if we can reconstruct what's going on in a self-contained way.
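
i.e. something along the lines of this debugging sketch (the accessor names here are placeholders, not the actual wrapper API):

    # Debugging sketch (illustrative only): dump the raw metadata + data bytes of
    # a RayObject so a failing payload can be replayed outside the cluster.
    # `has_metadata`/`get_metadata`/`get_data` are placeholder accessor names.
    function dump_ray_object_bytes(io::IO, ray_obj)
        if has_metadata(ray_obj)
            println(io, "metadata: ", bytes2hex(get_metadata(ray_obj)))
        end
        println(io, "data: ", bytes2hex(get_data(ray_obj)))
    end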
