Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSX Travis fails spawn test with InterruptException #17626

Closed
tkelman opened this issue Jul 26, 2016 · 25 comments · Fixed by #20073 or #20174
Closed

OSX Travis fails spawn test with InterruptException #17626

tkelman opened this issue Jul 26, 2016 · 25 comments · Fixed by #20073 or #20174
Assignees
Labels
domain:ci Continuous integration domain:io Involving the I/O subsystem: libuv, read, write, etc. kind:heisenbug This bug occurs unpredictably system:mac Affects only macOS test This change adds or pertains to unit tests

Comments

@tkelman
Copy link
Contributor

tkelman commented Jul 26, 2016

e.g. https://travis-ci.org/JuliaLang/julia/jobs/147328890
or https://travis-ci.org/JuliaLang/julia/jobs/146808733

    From worker 6:       * spawn                       [stdio passthrough ok]
    From worker 6:  Error During Test
    From worker 6:    Test threw an exception of type InterruptException
    From worker 6:    Expression: success(truecmd)
    From worker 6:    InterruptException:
    From worker 6:     in push! at ./array.jl:444 [inlined]
    From worker 6:     in wait(::Condition) at ./event.jl:14
    From worker 6:     in stream_wait(::Base.Process, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
    From worker 6:     in wait at ./process.jl:712 [inlined]
    From worker 6:     in success(::Base.Process) at ./process.jl:602
    From worker 6:     in success(::Cmd) at ./process.jl:608
    From worker 6:     in include_string(::String, ::String) at ./loading.jl:380
    From worker 6:     in include_from_node1(::String) at ./loading.jl:429
    From worker 6:     in macro expansion at ./util.jl:180 [inlined]
    From worker 6:     in runtests(::String) at /private/tmp/julia/share/julia/test/testdefs.jl:7
    From worker 6:     in (::Base.Serializer.__deserialized_types__.##277)(::String) at /private/tmp/julia/share/julia/test/runtests.jl:44
    From worker 6:     in (::Base.##499#501{Base.CallMsg{:call_fetch}})() at ./multi.jl:1193
    From worker 6:     in run_work_thunk(::Base.##499#501{Base.CallMsg{:call_fetch}}, ::Bool) at ./multi.jl:844
    From worker 6:     in macro expansion at ./multi.jl:1193 [inlined]
    From worker 6:     in (::Base.##498#500{Base.CallMsg{:call_fetch},Base.MsgHeader,TCPSocket})() at ./event.jl:46
@tkelman tkelman added system:mac Affects only macOS test This change adds or pertains to unit tests kind:heisenbug This bug occurs unpredictably labels Jul 26, 2016
@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 26, 2016

@yuyichao Could this be a SIGSEGV happening, but getting picked up by the catch_exception_raise, jl_addr_is_safepoint fiter? The backtrace seems to be too consistent, unless that ccall is really just the first safepoint?

@yuyichao
Copy link
Contributor

Unless there's actually a SIGINT or further corruption (i.e. the signal pending flag is corrupted), the SEGV at safepoint should throw a interruption. If the signal pending flag is not set, a safepoint that didn't wait for the GC is assumed to be a GC safepoint that is triggered after the GC is finished and shouldn't throw anything.

@JeffBezanson JeffBezanson added the status:priority This should be addressed urgently label Jul 29, 2016
@JeffBezanson
Copy link
Sponsor Member

This is happening fairly often. Will need to fix this to get reliable CI.

@tkelman
Copy link
Contributor Author

tkelman commented Aug 10, 2016

slightly different backtrace on this one, but probably same underlying cause

From worker 6:       * cmdlineargs          Error During Test
    From worker 6:    Test threw an exception of type InterruptException
    From worker 6:    Expression: readchomp(pipeline(ignorestatus(@cmd("\$exename --startup-file=no -p")),stderr=catcmd)) == "ERROR: option `-p/--procs` is missing an argument"
    From worker 6:    InterruptException:
    From worker 6:     in push! at ./array.jl:476 [inlined]
    From worker 6:     in #spawn#417(::Nullable{Base.ProcessChain}, ::Function, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Ptr{Void}}) at ./process.jl:472
    From worker 6:     in (::Base.#kw##spawn)(::Array{Any,1}, ::Base.#spawn, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Ptr{Void}}) at ./<missing>:0
    From worker 6:     in #spawn#416(::Nullable{Base.ProcessChain}, ::Function, ::Base.ErrOrCmds, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:385
    From worker 6:     in spawn(::Base.ErrOrCmds, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:378
    From worker 6:     in open(::Base.ErrOrCmds, ::String, ::Base.DevNullStream) at ./process.jl:524
    From worker 6:     in read(::Base.ErrOrCmds, ::Base.DevNullStream) at ./process.jl:559
    From worker 6:     in readstring; at ./process.jl:566 [inlined] (repeats 2 times)
    From worker 6:     in readchomp; at ./io.jl:324 [inlined]
    From worker 6:     in macro expansion; at /private/tmp/julia/share/julia/test/cmdlineargs.jl:250 [inlined]
    From worker 6:     in anonymous at ./<missing>:?
    From worker 6:     in include_string(::String, ::String) at ./loading.jl:380
    From worker 6:     in include_from_node1(::String) at ./loading.jl:429
    From worker 6:     in macro expansion at ./util.jl:226 [inlined]
    From worker 6:     in runtests(::String) at /private/tmp/julia/share/julia/test/testdefs.jl:7
    From worker 6:     in (::Base.Serializer.__deserialized_types__.##16#24)(::String) at /private/tmp/julia/share/julia/test/runtests.jl:44
    From worker 6:     in (::Base.##627#629{Base.CallMsg{:call_fetch}})() at ./multi.jl:1421
    From worker 6:     in run_work_thunk(::Base.##627#629{Base.CallMsg{:call_fetch}}, ::Bool) at ./multi.jl:1001
    From worker 6:     in macro expansion at ./multi.jl:1421 [inlined]
    From worker 6:     in (::Base.##626#628{Base.CallMsg{:call_fetch},Base.MsgHeader,TCPSocket})() at ./event.jl:68
    From worker 7:       * examples              in  23.67 seconds, maxrss  579.75 MB
Exception running test cmdlineargs :
On worker 6:
LoadError: There was an error during testing
 in record at ./test.jl:397
 in do_test at ./test.jl:281
 in macro expansion; at /private/tmp/julia/share/julia/test/cmdlineargs.jl:250 [inlined]
 in anonymous at ./<missing>:?
 in include_string at ./loading.jl:380
 in include_from_node1 at ./loading.jl:429
 in macro expansion at ./util.jl:226 [inlined]
 in runtests at /private/tmp/julia/share/julia/test/testdefs.jl:7
 in #16 at /private/tmp/julia/share/julia/test/runtests.jl:44
 in #627 at ./multi.jl:1421
 in run_work_thunk at ./multi.jl:1001
 in macro expansion at ./multi.jl:1421 [inlined]
 in #626 at ./event.jl:68
while loading /private/tmp/julia/share/julia/test/cmdlineargs.jl, in expression starting on line 11
ERROR: LoadError: Some tests exited with errors.
 in (::##11#19)() at /private/tmp/julia/share/julia/test/runtests.jl:72
 in cd(::##11#19, ::String) at ./file.jl:59
 in include_from_node1(::String) at ./loading.jl:426
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
while loading /private/tmp/julia/share/julia/test/runtests.jl, in expression starting on line 26

@tkelman
Copy link
Contributor Author

tkelman commented Sep 14, 2016

not Travis specific: buildbot log https://build.julialang.org/builders/build_osx10.9-x64/builds/1751/steps/shell_2/logs/stdio

Exception running test spawn :
On worker 7:
LoadError: InterruptException:
 in typeinf_edge at ./inference.jl:1627
 in typeinf_ext at ./inference.jl:1650
 in #pipeline at ./<missing>:0
 in jlcall_#pipeline_19864 at /Users/osx/buildbot/slave/build_osx10_9-x64/build/usr/lib/julia/sys.dylib:?
 in include_string at ./loading.jl:478
 in jlcall_include_string_33102 at /Users/osx/buildbot/slave/build_osx10_9-x64/build/usr/lib/julia/sys.dylib:?
 in include_from_node1 at ./loading.jl:544
 in jlcall_include_from_node1_18844 at /Users/osx/buildbot/slave/build_osx10_9-x64/build/usr/lib/julia/sys.dylib:?
 in macro expansion at ./util.jl:232 [inlined]
 in runtests at /Users/osx/buildbot/slave/build_osx10_9-x64/build/test/testdefs.jl:7
 in #619 at ./multi.jl:1425
 in run_work_thunk at ./multi.jl:1005
 in macro expansion at ./multi.jl:1425 [inlined]
 in #618 at ./event.jl:68
while loading /Users/osx/buildbot/slave/build_osx10_9-x64/build/test/spawn.jl, in expression starting on line 207
ERROR: LoadError: Some tests exited with errors.
 in (::##11#18)() at /Users/osx/buildbot/slave/build_osx10_9-x64/build/test/runtests.jl:72

@vtjnash ideas?

@staticfloat
Copy link
Sponsor Member

Yep, just wanted to pop in and say this is still happening on the buildtest buildbots.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 21, 2017

shucks, on the merge commit too: https://travis-ci.org/JuliaLang/julia/jobs/194054914

@tkelman
Copy link
Contributor Author

tkelman commented Jan 24, 2017

still failing https://travis-ci.org/JuliaLang/julia/jobs/194719508

Error in testset spawn:
Error During Test
  Test threw an exception of type TypeError
  Expression: ex::Base.UVError.code == Base.UV_EMFILE
  TypeError: anonymous: in typeassert, expected Base.UVError, got InterruptException

@tkelman tkelman reopened this Jan 24, 2017
@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 24, 2017

Seems to be a new / different failure, in that it consistently occurs in a different test.

@tkelman
Copy link
Contributor Author

tkelman commented Jan 27, 2017

@JeffBezanson
Copy link
Sponsor Member

@iblislin
Copy link
Member

https://travis-ci.org/JuliaLang/julia/jobs/200079892

I got similar failure on FreeBSD (4GB ram + 4GB swap on it).
https://julia.iblis.cnmc.tw/#/builders/1/builds/178/steps/6/logs/stdio
Seems this is caused by OS's OOM killer.

└─[iblis@ns]% uname -a
FreeBSD ns 12.0-CURRENT FreeBSD 12.0-CURRENT #3 r313500M: Fri Feb 10 22:39:13 CST 2017     root@ns:/usr/obj/usr/src/sys/GENERIC  amd64
┌─[~]
| [-- INSERT --]
└─[iblis@ns]% dmesg | tail
swap_pager: out of swap space
swap_pager_getswapspace(16): failed
pid 48631 (julia), uid 1001, was killed: out of swap space
swap_pager: out of swap space
swap_pager_getswapspace(16): failed
pid 70675 (julia), uid 1001, was killed: out of swap space
pid 70676 (julia), uid 1001, was killed: out of swap space
swap_pager: out of swap space
swap_pager_getswapspace(16): failed
pid 13530 (julia), uid 1001, was killed: out of swap space

@StefanKarpinski
Copy link
Sponsor Member

Seems this is caused by OS's OOM killer.

I suspected it was something like this. I guess the solution would be to get Travis to raise some limit.

@JeffBezanson JeffBezanson added domain:ci Continuous integration domain:io Involving the I/O subsystem: libuv, read, write, etc. labels Sep 24, 2017
@ViralBShah ViralBShah added the status:triage This should be discussed on a triage call label Oct 6, 2017
@ViralBShah ViralBShah modified the milestones: 0.6.x, 1.0 Oct 6, 2017
@ViralBShah
Copy link
Member

ViralBShah commented Oct 6, 2017

I think this should be fixed for 1.0 since it is breaking tests on the buildbot and @vtjnash says he knows what to do.

cc @ararslan

@ararslan
Copy link
Member

ararslan commented Oct 6, 2017

Once there's a fix, it should be backported to 0.6. Luckily the failure is only spurious, so in the meantime we can keep running the macOS builds on the buildbot repeatedly until one of them goes through...

@ViralBShah
Copy link
Member

@vtjnash said the fix may be only for 1.0. Anyways shouldn't hold 0.6.1 on it.

@StefanKarpinski
Copy link
Sponsor Member

"things you wanted fixed soon" ≠ "release blocker"; throw the "priority" label on it for that.

@ararslan
Copy link
Member

ararslan commented Oct 6, 2017

If Jameson knows how to fix it already, is there anything to triage?

@StefanKarpinski StefanKarpinski removed this from the 1.0 milestone Oct 8, 2017
@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Oct 12, 2017
@StefanKarpinski
Copy link
Sponsor Member

Can I assume that this is no longer relevant?

@ararslan
Copy link
Member

It still happens occasionally though less often than it used to.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Oct 3, 2018

fixed by #22041

@vtjnash vtjnash closed this as completed Oct 3, 2018
vtjnash added a commit that referenced this issue Dec 4, 2019
I am observing this value get set to EINTR sometimes on Linux. It is not
generally supposed to do that (though not prohibited either). That is
causing us to sometimes drop signals (usually SIGINT). Since we just
need this code for Darwin (ref #17626, #22041, #20174), we can remove
this from everywhere else.
vtjnash added a commit that referenced this issue Dec 11, 2019
I am observing this value get set to EINTR sometimes on Linux. It is not
generally supposed to do that (though not prohibited either). That is
causing us to sometimes drop signals (usually SIGINT). Since we just
need this code for Darwin (ref #17626, #22041, #20174), we can remove
this from everywhere else.
KristofferC pushed a commit that referenced this issue Apr 11, 2020
I am observing this value get set to EINTR sometimes on Linux. It is not
generally supposed to do that (though not prohibited either). That is
causing us to sometimes drop signals (usually SIGINT). Since we just
need this code for Darwin (ref #17626, #22041, #20174), we can remove
this from everywhere else.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:ci Continuous integration domain:io Involving the I/O subsystem: libuv, read, write, etc. kind:heisenbug This bug occurs unpredictably system:mac Affects only macOS test This change adds or pertains to unit tests
Projects
None yet
9 participants