Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

makes_jobserver_used hangs randomly #7858

Open
ehuss opened this issue Feb 3, 2020 · 6 comments
Open

makes_jobserver_used hangs randomly #7858

ehuss opened this issue Feb 3, 2020 · 6 comments
Labels
A-jobserver Area: jobserver, concurrency, parallelism A-testing-cargo-itself Area: cargo's tests C-bug Category: bug O-macos OS: macOS S-needs-mentor Status: Issue or feature is accepted, but needs a team member to commit to helping and reviewing.

Comments

@ehuss
Copy link
Contributor

ehuss commented Feb 3, 2020

Problem
The makes_jobserver_used test is occasionally hanging. (azure log).

I'm able to reproduce on macos with a failure rate a little under 1%. I wasn't able to repro on windows or linux.

It seems unrelated to #7731 or any of the recent jobserver releases. I was able to reproduce it with older versions.

Steps
Repeatedly run the makes_jobserver_used test on macos until it hangs.

Notes
I haven't been able to make sense of what is happening. Best I can tell, the sequence of events is:

  1. cargo builds the 3 build scripts.
  2. one of the rustc jobs gets stuck after it finishes.
  3. one of the build scripts runs.
  4. the test is waiting for the second build script to run, but it never starts. cargo build already has 2 jobs running (the waiting build script, and the stuck rustc job), and thus can't spawn the next build script.

The stuck rustc job is a bit confusing. The rustc process has finished and is in the zombie state. On the Cargo side, it seems to be caught in a loop in read2. One of the fd's is closed (either stdout or stderr, it seems random), but somehow the other one seems to still be open, but every attempt to read on it returns WouldBlock.

@ehuss ehuss added C-bug Category: bug A-testing-cargo-itself Area: cargo's tests labels Feb 3, 2020
@alexcrichton
Copy link
Member

Hm so one thing about these sorts of tests I've seen historically is that they have awful error reporting if something goes wrong internally. For example rustc keeled over somehow (maybe?) but we wouldn't be able to see it due to how the test is constructed.

Assuming you were able to reproduce this locally, what do you mean how a rustc job was stuck? Was there literally a rustc process hanging around that didn't want to exit? Or was Cargo not reading the end result correctly?

(also, if stuck around, is a stack trace of rustc available?)

My gues is that some form of fatal error is happening unrelated to this test and this test just isn't handling it well. Hard to imagine what such an error would be though...

@ehuss
Copy link
Contributor Author

ehuss commented Feb 4, 2020

By "stuck" I mean rustc finished (successfully) and exited, but Cargo never called waitpid so it is stuck as a zombie process. I'll dump some details here:

Cargo process backtrace details

Thread 1: Looping in JobQueue::wait_for_events every 0.5 seconds.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7be86866 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff7bf4556e libsystem_pthread.dylib`_pthread_cond_wait + 722
    frame #2: 0x000000010f020fba cargo`std::sys::unix::condvar::Condvar::wait_timeout::h9a144851bd2e03ee at condvar.rs:153:16 [opt]
    frame #3: 0x000000010f00a784 cargo`std::thread::park_timeout::h720fb68cea6bbedc [inlined] std::sys_common::condvar::Condvar::wait_timeout::h9432be4cfcee4660 at condvar.rs:61:8 [opt]
    frame #4: 0x000000010f00a779 cargo`std::thread::park_timeout::h720fb68cea6bbedc [inlined] std::sync::condvar::Condvar::wait_timeout::h89fe015b8e4005ec at condvar.rs:398 [opt]
    frame #5: 0x000000010f00a75f cargo`std::thread::park_timeout::h720fb68cea6bbedc at mod.rs:1014 [opt]
    frame #6: 0x000000010e464a91 cargo`crossbeam_channel::context::Context::wait_until::h6901c1def5496560(self=0x00007ffee2188c58, deadline=Option @ 0x00007ffee2188a68) at context.rs:166:20
    frame #7: 0x000000010deefdcc cargo`crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::_$u7b$$u7b$closure$u7d$$u7d$::he6a471e506a65c40(cx=0x00007ffee2188c58) at list.rs:468:26
    frame #8: 0x000000010e332917 cargo`crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hf57354f6418ee4e9(cx=0x00007ffee2188c58) at context.rs:50:12
    frame #9: 0x000000010e3321c0 cargo`crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h66e35c75e7d4bf9d(cell=0x00007fa29750b6b0) at context.rs:58:30
    frame #10: 0x000000010e4b20ab cargo`std::thread::local::LocalKey$LT$T$GT$::try_with::h91d55931bd9ff94d(self=0x000000010f2492f8, f=closure-1 @ 0x00007ffee2188cd0) at local.rs:262:15
    frame #11: 0x000000010e331967 cargo`crossbeam_channel::context::Context::with::hdf51612656ff8915(f=) at context.rs:53:8
    frame #12: 0x000000010deefcd8 cargo`crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::hee8a499186302f48(self=0x00007fa287728080, deadline=Option @ 0x00007ffee2188dd8) at list.rs:458:12
    frame #13: 0x000000010decea44 cargo`crossbeam_channel::channel::Receiver$LT$T$GT$::recv_timeout::h62ff668aef36d82d(self=0x00007ffee218a1c8, timeout=(secs = 0, nanos = 500000000)) at channel.rs:787:42
    frame #14: 0x000000010e43f750 cargo`cargo::core::compiler::job_queue::DrainState::wait_for_events::hd1a60109c70cdce6(self=0x00007ffee218a108) at job_queue.rs:606:22
    frame #15: 0x000000010e43fde4 cargo`cargo::core::compiler::job_queue::DrainState::drain_the_queue::h6a56f69fac3b9f15(self=DrainState @ 0x00007ffee218a108, cx=0x00007ffee2197ab0, plan=0x00007ffee218f170, scope=0x00007ffee218d028, jobserver_helper=0x00007ffee218a9c8) at job_queue.rs:652:25
    frame #16: 0x000000010e233a79 cargo`cargo::core::compiler::job_queue::JobQueue::execute::_$u7b$$u7b$closure$u7d$$u7d$::h9d78c34a0b3db416(scope=0x00007ffee218d028) at job_queue.rs:388:52
    frame #17: 0x000000010dcaed1b cargo`crossbeam_utils::thread::scope::_$u7b$$u7b$closure$u7d$$u7d$::hd6a4bc76b00512f6 at thread.rs:161:64
    frame #18: 0x000000010dccd5c3 cargo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h68d1c373a83542f5(self=, _args=) at panic.rs:318:8
    frame #19: 0x000000010e7c2578 cargo`std::panicking::try::do_call::h67cacc3b91681367(data="\n") at panicking.rs:305:39
    frame #20: 0x000000010f029d8b cargo`__rust_maybe_catch_panic at lib.rs:86:7 [opt]
    frame #21: 0x000000010e7c2033 cargo`std::panicking::try::h3b0e05f2fd64da40(f=) at panicking.rs:281:12
    frame #22: 0x000000010dccd7dd cargo`std::panic::catch_unwind::hd8a792e70d2f44f4(f=) at panic.rs:394:13
    frame #23: 0x000000010dcae716 cargo`crossbeam_utils::thread::scope::h213d35650410ab80(f=closure-3 @ 0x00007ffee218e678) at thread.rs:161:17
    frame #24: 0x000000010e43bde9 cargo`cargo::core::compiler::job_queue::JobQueue::execute::h0bd053de42cc4521(self=JobQueue @ 0x00007ffee218f328, cx=0x00007ffee2197ab0, plan=0x00007ffee218f170) at job_queue.rs:388:8
    frame #25: 0x000000010e6ff0d7 cargo`cargo::core::compiler::context::Context::compile::h8f21a492f15ad76f(self=Context @ 0x00007ffee2197ab0, units=&[cargo::core::compiler::unit::Unit] @ 0x00007ffee218ef60, export_dir=Option @ 0x00007ffee21981a8, exec=0x00007ffee21988b0) at mod.rs:162:8
    frame #26: 0x000000010e4a6c13 cargo`cargo::ops::cargo_compile::compile_ws::h42a9ba37c7535750(ws=0x00007ffee21989a0, options=0x00007ffee2198ec8, exec=0x00007ffee21988b0) at cargo_compile.rs:469:8
    frame #27: 0x000000010e4a3f62 cargo`cargo::ops::cargo_compile::compile_with_exec::hd245ded500839023(ws=0x00007ffee21989a0, options=0x00007ffee2198ec8, exec=0x00007ffee21988b0) at cargo_compile.rs:259:4
    frame #28: 0x000000010e4a3e49 cargo`cargo::ops::cargo_compile::compile::h3c971c8afce10067(ws=0x00007ffee21989a0, options=0x00007ffee2198ec8) at cargo_compile.rs:248:4
    frame #29: 0x000000010dabf6f8 cargo`cargo::commands::build::exec::hcb2addeb1d65e2ff(config=0x00007ffee219c1a8, args=0x00007fa297600708) at build.rs:77:4
    frame #30: 0x000000010da8fa2c cargo`cargo::cli::execute_subcommand::h28890818fd6e1a7a(config=0x00007ffee219c1a8, cmd=(data_ptr = "build", length = 5), subcommand_args=0x00007fa297600708) at cli.rs:218:15
    frame #31: 0x000000010da8d5f2 cargo`cargo::cli::main::h07b06c703f643fde(config=0x00007ffee219c1a8) at cli.rs:109:4
    frame #32: 0x000000010da76408 cargo`cargo::main::he5d83abf3175b8b6 at main.rs:39:12
    frame #33: 0x000000010dad20d2 cargo`std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h2336f328244924f7 at rt.rs:67:33

Thread 2 (system thread)

Thread 3:
Jobserver helper thread, looping waiting for tokens.

* thread #3
  * frame #0: 0x00007fff7be8b36a libsystem_kernel.dylib`poll + 10
    frame #1: 0x000000010ef9ead9 cargo`jobserver::imp::Client::acquire_allow_interrupts::h0ba496f560c512b8(self=0x00007fa29750b9b0) at unix.rs:143:19
    frame #2: 0x000000010ef9fb57 cargo`jobserver::imp::spawn_helper::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h491d913bb26f80d8(helper=0x00007fa287727ff0) at unix.rs:236:18
    frame #3: 0x000000010efa98c9 cargo`jobserver::HelperState::for_each_request::h58965db9ffb7bfaa(self=0x00007fa287727ff0, f=closure-0 @ 0x0000700002acda20) at lib.rs:483:12
    frame #4: 0x000000010ef9fdb0 cargo`jobserver::imp::spawn_helper::_$u7b$$u7b$closure$u7d$$u7d$::hf2d372804fc2f72d at unix.rs:235:8
    frame #5: 0x000000010efaae00 cargo`std::sys_common::backtrace::__rust_begin_short_backtrace::ha2e20f37be1e59e0(f=) at backtrace.rs:129:4
    frame #6: 0x000000010efa3fa0 cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h43756dbfc52207a0 at mod.rs:475:16
    frame #7: 0x000000010efaea80 cargo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h32a4cbf5658ade24(self=, _args=) at panic.rs:318:8
    frame #8: 0x000000010efab125 cargo`std::panicking::try::do_call::h3f00bc1ba2828415(data="��r\x87�\x7f") at panicking.rs:305:39
    frame #9: 0x000000010f029d8b cargo`__rust_maybe_catch_panic at lib.rs:86:7 [opt]
    frame #10: 0x000000010efab036 cargo`std::panicking::try::h224a94160f48021f(f=) at panicking.rs:281:12
    frame #11: 0x000000010efaeb60 cargo`std::panic::catch_unwind::ha85c6b001151600c(f=) at panic.rs:394:13
    frame #12: 0x000000010efa3dee cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hd1546c0998e1769b at mod.rs:474:29
    frame #13: 0x000000010efaef25 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::he2cc030fe6263a34((null)=0x0000700002acde70, (null)=) at function.rs:232:4
    frame #14: 0x000000010f009fce cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h6c88753ef435a3bf at boxed.rs:1015:8 [opt]
    frame #15: 0x000000010f0292be cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h1acca448dca83f76 at boxed.rs:1015:8 [opt]
    frame #16: 0x000000010f0292b2 cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] std::sys_common::thread::start_thread::h059abcaef02775d8 at thread.rs:13 [opt]
    frame #17: 0x000000010f02923e cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d at thread.rs:80 [opt]

Thread 4:
rustc job thread. This is stuck inside read2. This should have noticed that the rustc process has finished, and exited. One pipe has closed (stderr in this case), the other is still open. Stepping through in the debugger, poll returns 0x20000e6, and the stdout revents == 0, and it loops back to the top. I can't tell if that return value is an artifact of attaching the debugger or not, it's a strange return code. Curiously google found this file mentioning that number with poll, but I don't know what it means.

* thread #4
  * frame #0: 0x00007fff7be8b36a libsystem_kernel.dylib`poll + 10
    frame #1: 0x000000010ddc7718 cargo`cargo::util::read2::imp::read2::h17d20e8aef5cf265(out_pipe=ChildStdout @ 0x00007000030d4cf0, err_pipe=ChildStderr @ 0x00007000030d4cf4, data=&mut FnMut<(bool, &mut alloc::vec::Vec, bool)> @ 0x00007000030d4cf8) at read2.rs:37:29
    frame #2: 0x000000010df6cf35 cargo`cargo::util::process_builder::ProcessBuilder::exec_with_streaming::_$u7b$$u7b$closure$u7d$$u7d$::h1e2c965411a81f3f at process_builder.rs:236:12
    frame #3: 0x000000010ded4c9b cargo`cargo::util::process_builder::ProcessBuilder::exec_with_streaming::hb6f46f224abd1bd2(self=0x00007000030d5e28, on_stdout_line=&mut FnMut<(&str)> @ 0x00007000030d51b8, on_stderr_line=&mut FnMut<(&str)> @ 0x00007000030d51c8, capture_output=false) at process_builder.rs:232:21
    frame #4: 0x000000010df6f1a2 cargo`_$LT$cargo..core..compiler..DefaultExecutor$u20$as$u20$cargo..core..compiler..Executor$GT$::exec::hecff8be66ca7608c(self=0x00007fa29761dab0, cmd=ProcessBuilder @ 0x00007000030d5e28, _id=PackageId @ 0x00007000030d56d0, _target=0x00007000030d61a8, _mode=CompileMode @ 0x00007000030d56e0, on_stdout_line=&mut FnMut<(&str)> @ 0x00007000030d56e8, on_stderr_line=&mut FnMut<(&str)> @ 0x00007000030d56f8) at mod.rs:95:8
    frame #5: 0x000000010e2353cc cargo`cargo::core::compiler::rustc::_$u7b$$u7b$closure$u7d$$u7d$::hadf57bb76a99e021(state=0x00007000030d6760) at mod.rs:283:12
    frame #6: 0x000000010df19960 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h34916101505489b8((null)=0x00007000030d6070, (null)=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d6058) at function.rs:232:4
    frame #7: 0x000000010dc731d0 cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h22a1b965beb28182(self=Box> @ 0x00007000030d6348, args=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d6358) at boxed.rs:1015:8
    frame #8: 0x000000010e3003a0 cargo`cargo::core::compiler::job::Work::call::hdfafac0fcaf19121(self=(inner = core::ops::function::Box > @ 0x00007000030d63a0), tx=0x00007000030d6760) at job.rs:31:8
    frame #9: 0x000000010de34fcb cargo`cargo::core::compiler::job::Work::then::_$u7b$$u7b$closure$u7d$$u7d$::h5137216155ee19e2(state=0x00007000030d6760) at job.rs:36:12
    frame #10: 0x000000010df19860 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h20efd4e0575109e8((null)=0x00007000030d6480, (null)=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d6468) at function.rs:232:4
    frame #11: 0x000000010dc731d0 cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h22a1b965beb28182(self=Box> @ 0x00007000030d64d8, args=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d64e8) at boxed.rs:1015:8
    frame #12: 0x000000010e3003a0 cargo`cargo::core::compiler::job::Work::call::hdfafac0fcaf19121(self=(inner = core::ops::function::Box > @ 0x00007000030d6530), tx=0x00007000030d6760) at job.rs:31:8
    frame #13: 0x000000010de34fcb cargo`cargo::core::compiler::job::Work::then::_$u7b$$u7b$closure$u7d$$u7d$::h5137216155ee19e2(state=0x00007000030d6760) at job.rs:36:12
    frame #14: 0x000000010df19860 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h20efd4e0575109e8((null)=0x00007000030d6610, (null)=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d65f8) at function.rs:232:4
    frame #15: 0x000000010dc731d0 cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h22a1b965beb28182(self=Box> @ 0x00007000030d6668, args=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000030d6678) at boxed.rs:1015:8
    frame #16: 0x000000010e3003a0 cargo`cargo::core::compiler::job::Work::call::hdfafac0fcaf19121(self=(inner = core::ops::function::Box > @ 0x00007000030d66c0), tx=0x00007000030d6760) at job.rs:31:8
    frame #17: 0x000000010e30046f cargo`cargo::core::compiler::job::Job::run::h44ffe3f43fa7a0e5(self=, state=0x00007000030d6760) at job.rs:51:8
    frame #18: 0x000000010e233dd2 cargo`cargo::core::compiler::job_queue::DrainState::run::_$u7b$$u7b$closure$u7d$$u7d$::h8cfa99aa6c05c901 at job_queue.rs:782:28
    frame #19: 0x000000010e233fe5 cargo`cargo::core::compiler::job_queue::DrainState::run::_$u7b$$u7b$closure$u7d$$u7d$::hdc8d8a9e61daa4d1((null)=0x00007000030d69f8) at job_queue.rs:827:37
    frame #20: 0x000000010dcae396 cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::hcbcb64ec9d714614 at thread.rs:415:30
    frame #21: 0x000000010dcae30b cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h5755aa15dec5125e at thread.rs:423:38
    frame #22: 0x000000010dc7314a cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnMut$LT$A$GT$$GT$::call_mut::hd67203c897695a0b(self=0x00007000030d6b70, args=) at boxed.rs:1022:8
    frame #23: 0x000000010dcae289 cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h3dce2d09fcec8066 at thread.rs:431:43
    frame #24: 0x000000010e5be26d cargo`std::sys_common::backtrace::__rust_begin_short_backtrace::h77bf83b3d928866d(f=(__0 = core::ops::function::Box > @ 0x00007000030d6ba0)) at backtrace.rs:129:4
    frame #25: 0x000000010dbb415d cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h5ba90093d9717ee9 at mod.rs:475:16
    frame #26: 0x000000010dccd65d cargo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::he87a609418b6efc6(self=AssertUnwindSafe @ 0x00007000030d6be8, _args=) at panic.rs:318:8
    frame #27: 0x000000010e7c25fa cargo`std::panicking::try::do_call::hcf8a3898bce38318(data="�,@\x97�\x7f") at panicking.rs:305:39
    frame #28: 0x000000010f029d8b cargo`__rust_maybe_catch_panic at lib.rs:86:7 [opt]
    frame #29: 0x000000010e7c2400 cargo`std::panicking::try::hdd190907da3f6244(f=AssertUnwindSafe @ 0x00007000030d6cb8) at panicking.rs:281:12
    frame #30: 0x000000010dccd76d cargo`std::panic::catch_unwind::hac6cbe449b531953(f=AssertUnwindSafe @ 0x00007000030d6d60) at panic.rs:394:13
    frame #31: 0x000000010dbb3a46 cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::h2196c079dbb8d7ee at mod.rs:474:29
    frame #32: 0x000000010df19c65 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h6deb7312ad632587((null)=0x00007000030d6e80, (null)=) at function.rs:232:4
    frame #33: 0x000000010f009fce cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h6c88753ef435a3bf at boxed.rs:1015:8 [opt]
    frame #34: 0x000000010f0292be cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h1acca448dca83f76 at boxed.rs:1015:8 [opt]
    frame #35: 0x000000010f0292b2 cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] std::sys_common::thread::start_thread::h059abcaef02775d8 at thread.rs:13 [opt]
    frame #36: 0x000000010f02923e cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d at thread.rs:80 [opt]
(lldb) p out_done
(bool) $2 = false
(lldb) p err_done
(bool) $3 = true

(lldb) p r
(int) $5 = 33554662
(lldb) p fds
(libc::unix::pollfd [2]) $6 = {
  [0] = (fd = 8, events = 1, revents = 0)
  [1] = (fd = 18, events = 1, revents = 17)
}

Thread 5:
This is the build-script job, waiting for it to finish. The build script is waiting for the testsuite to tell it to exit.

  * frame #0: 0x00007fff7be8b36a libsystem_kernel.dylib`poll + 10
    frame #1: 0x000000010ddc7718 cargo`cargo::util::read2::imp::read2::h17d20e8aef5cf265(out_pipe=ChildStdout @ 0x00007000032d7850, err_pipe=ChildStderr @ 0x00007000032d7854, data=&mut FnMut<(bool, &mut alloc::vec::Vec, bool)> @ 0x00007000032d7858) at read2.rs:37:29
    frame #2: 0x000000010df6cf35 cargo`cargo::util::process_builder::ProcessBuilder::exec_with_streaming::_$u7b$$u7b$closure$u7d$$u7d$::h1e2c965411a81f3f at process_builder.rs:236:12
    frame #3: 0x000000010ded4c9b cargo`cargo::util::process_builder::ProcessBuilder::exec_with_streaming::hb6f46f224abd1bd2(self=0x00007000032d9348, on_stdout_line=&mut FnMut<(&str)> @ 0x00007000032d7d18, on_stderr_line=&mut FnMut<(&str)> @ 0x00007000032d7d28, capture_output=true) at process_builder.rs:232:21
    frame #4: 0x000000010e438027 cargo`cargo::core::compiler::custom_build::build_work::_$u7b$$u7b$closure$u7d$$u7d$::h1398015f1dc885b5(state=0x00007000032d9760) at custom_build.rs:325:21
    frame #5: 0x000000010df19890 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h272bb37990640691((null)=0x00007000032d9300, (null)=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000032d92e8) at function.rs:232:4
    frame #6: 0x000000010dc731d0 cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h22a1b965beb28182(self=Box> @ 0x00007000032d94d8, args=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000032d94e8) at boxed.rs:1015:8
    frame #7: 0x000000010e3003a0 cargo`cargo::core::compiler::job::Work::call::hdfafac0fcaf19121(self=(inner = core::ops::function::Box > @ 0x00007000032d9530), tx=0x00007000032d9760) at job.rs:31:8
    frame #8: 0x000000010de34fcb cargo`cargo::core::compiler::job::Work::then::_$u7b$$u7b$closure$u7d$$u7d$::h5137216155ee19e2(state=0x00007000032d9760) at job.rs:36:12
    frame #9: 0x000000010df19860 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h20efd4e0575109e8((null)=0x00007000032d9610, (null)=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000032d95f8) at function.rs:232:4
    frame #10: 0x000000010dc731d0 cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h22a1b965beb28182(self=Box> @ 0x00007000032d9668, args=(&cargo::core::compiler::job_queue::JobState) @ 0x00007000032d9678) at boxed.rs:1015:8
    frame #11: 0x000000010e3003a0 cargo`cargo::core::compiler::job::Work::call::hdfafac0fcaf19121(self=(inner = core::ops::function::Box > @ 0x00007000032d96c0), tx=0x00007000032d9760) at job.rs:31:8
    frame #12: 0x000000010e30046f cargo`cargo::core::compiler::job::Job::run::h44ffe3f43fa7a0e5(self=, state=0x00007000032d9760) at job.rs:51:8
    frame #13: 0x000000010e233dd2 cargo`cargo::core::compiler::job_queue::DrainState::run::_$u7b$$u7b$closure$u7d$$u7d$::h8cfa99aa6c05c901 at job_queue.rs:782:28
    frame #14: 0x000000010e233fe5 cargo`cargo::core::compiler::job_queue::DrainState::run::_$u7b$$u7b$closure$u7d$$u7d$::hdc8d8a9e61daa4d1((null)=0x00007000032d99f8) at job_queue.rs:827:37
    frame #15: 0x000000010dcae396 cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::hcbcb64ec9d714614 at thread.rs:415:30
    frame #16: 0x000000010dcae30b cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h5755aa15dec5125e at thread.rs:423:38
    frame #17: 0x000000010dc7314a cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnMut$LT$A$GT$$GT$::call_mut::hd67203c897695a0b(self=0x00007000032d9b70, args=) at boxed.rs:1022:8
    frame #18: 0x000000010dcae289 cargo`crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h3dce2d09fcec8066 at thread.rs:431:43
    frame #19: 0x000000010e5be26d cargo`std::sys_common::backtrace::__rust_begin_short_backtrace::h77bf83b3d928866d(f=(__0 = core::ops::function::Box > @ 0x00007000032d9ba0)) at backtrace.rs:129:4
    frame #20: 0x000000010dbb415d cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h5ba90093d9717ee9 at mod.rs:475:16
    frame #21: 0x000000010dccd65d cargo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::he87a609418b6efc6(self=AssertUnwindSafe @ 0x00007000032d9be8, _args=) at panic.rs:318:8
    frame #22: 0x000000010e7c25fa cargo`std::panicking::try::do_call::hcf8a3898bce38318(data="p:@\x97�\x7f") at panicking.rs:305:39
    frame #23: 0x000000010f029d8b cargo`__rust_maybe_catch_panic at lib.rs:86:7 [opt]
    frame #24: 0x000000010e7c2400 cargo`std::panicking::try::hdd190907da3f6244(f=AssertUnwindSafe @ 0x00007000032d9cb8) at panicking.rs:281:12
    frame #25: 0x000000010dccd76d cargo`std::panic::catch_unwind::hac6cbe449b531953(f=AssertUnwindSafe @ 0x00007000032d9d60) at panic.rs:394:13
    frame #26: 0x000000010dbb3a46 cargo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::h2196c079dbb8d7ee at mod.rs:474:29
    frame #27: 0x000000010df19c65 cargo`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h6deb7312ad632587((null)=0x00007000032d9e80, (null)=) at function.rs:232:4
    frame #28: 0x000000010f009fce cargo`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h6c88753ef435a3bf at boxed.rs:1015:8 [opt]
    frame #29: 0x000000010f0292be cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h1acca448dca83f76 at boxed.rs:1015:8 [opt]
    frame #30: 0x000000010f0292b2 cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d [inlined] std::sys_common::thread::start_thread::h059abcaef02775d8 at thread.rs:13 [opt]
    frame #31: 0x000000010f02923e cargo`std::sys::unix::thread::Thread::new::thread_start::h00f0d1c3fc7b9f5d at thread.rs:80 [opt]

@alexcrichton
Copy link
Member

Aha, I think I see what's going on.

We read pipes to the end, meaning we wait for the pipes to be completely closed. That typically happens through the natural process exit (kernel cleanup). Turns out it doesn't happen specifically on OSX sometimes. We non-atomically create a pipe and then set CLOEXEC on it on OSX. On Linux we use pipe2 and on Windows we use some rough equivalent that's atomic I believe. This means that we can leak pipe file descriptors into child processes on macOS. For example we can leak a pipe file descriptor into a build script that's hanging indefinitely, causing the original pipe to never get closed!

I think to fix this we need to either:

  • Figure out how to atomically set CLOEXEC on macOS and update libstd (I don't think this is possible)
  • Orchestrate a dance where we wait for process exit, then set fds to nonblocking, then read everything remaining
  • Close all fds in our ever-blocking build script

It's probably worth double-checking the first and failing that implement the third?

@ehuss
Copy link
Contributor Author

ehuss commented Feb 5, 2020

Oh, that totally makes sense! For some reason I was assuming they were all CLOEXEC on macos.

More details: rust-lang/rust#24237

@alexcrichton
Copy link
Member

Another possible fix would be to, at Cargo's startup, figure out a set of fds that we shouldn't close, and then close everything that isn't in that set inside of our child processes, but only on OSX. This is a pretty unfortunate bug no matter how we slice it...

AFAIK this is basically a "wontfix" in terms of atomically cloexec and it's up to us to figure out the most colorful workaround for this. It is a general problem for all macos users though and it's not just limited to Cargo's test suite.

Another possible random option would be to have a global lock for pipe creation and process spawning within Cargo itself. That may be the least invasive solution here actually?

@luser
Copy link
Contributor

luser commented Feb 11, 2020

Another possible fix would be to, at Cargo's startup, figure out a set of fds that we shouldn't close, and then close everything that isn't in that set inside of our child processes, but only on OSX. This is a pretty unfortunate bug no matter how we slice it...

n.b. This is tricky to get right, although maybe with a known set of FDs it'd be OK. There's code that does this in Chromium you can look at.

@epage epage added the S-triage Status: This issue is waiting on initial triage. label Nov 3, 2023
@ehuss ehuss added O-macos OS: macOS A-jobserver Area: jobserver, concurrency, parallelism S-needs-mentor Status: Issue or feature is accepted, but needs a team member to commit to helping and reviewing. and removed S-triage Status: This issue is waiting on initial triage. labels Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-jobserver Area: jobserver, concurrency, parallelism A-testing-cargo-itself Area: cargo's tests C-bug Category: bug O-macos OS: macOS S-needs-mentor Status: Issue or feature is accepted, but needs a team member to commit to helping and reviewing.
Projects
None yet
Development

No branches or pull requests

4 participants