Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in libp2p-websocket at 'SinkImpl::poll_ready called after error.' #2598

Closed
hrxi opened this issue Mar 29, 2022 · 24 comments
Closed

Panic in libp2p-websocket at 'SinkImpl::poll_ready called after error.' #2598

hrxi opened this issue Mar 29, 2022 · 24 comments

Comments

@hrxi
Copy link
Contributor

hrxi commented Mar 29, 2022

Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    0: log_panics::init::{{closure}}
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    1: std::panicking::rust_panic_with_hook
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:              at /rustc/3c17c84a386e7badf6b2c6018d172496b3a28a04/library/std/src/panicking.rs:702:17
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    2: std::panicking::begin_panic::{{closure}}
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    4: std::panicking::begin_panic
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    6: <rw_stream_sink::RwStreamSink<S> as futures_io::if_std::AsyncWrite>::poll_write
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    7: <multistream_select::negotiated::Negotiated<TInner> as futures_io::if_std::AsyncWrite>::poll_write
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    8: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:    9: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_flush
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   10: <libp2p_noise::io::NoiseOutput<T> as futures_io::if_std::AsyncWrite>::poll_flush
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   11: multistream_select::negotiated::Negotiated<TInner>::poll
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   12: <multistream_select::negotiated::Negotiated<TInner> as futures_io::if_std::AsyncWrite>::poll_close
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   14: <S as futures_core::stream::TryStream>::try_poll_next
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   15: <futures_util::stream::try_stream::ErrInto<St,E> as futures_core::stream::Stream>::poll_next
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   16: <libp2p_yamux::Yamux<S> as libp2p_core::muxing::StreamMuxer>::close
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   17: <libp2p_core::muxing::Wrap<T> as libp2p_core::muxing::StreamMuxer>::close
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   18: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   19: tokio::runtime::task::harness::poll_future
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   20: tokio::runtime::task::harness::Harness<T,S>::poll
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   21: std::thread::local::LocalKey<T>::with
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   22: tokio::runtime::thread_pool::worker::Context::run_task
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   23: tokio::runtime::thread_pool::worker::Context::run
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   24: tokio::macros::scoped_tls::ScopedKey<T>::set
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   25: tokio::runtime::thread_pool::worker::run
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   26: tokio::runtime::task::harness::Harness<T,S>::poll
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   27: tokio::runtime::blocking::pool::Inner::run
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   28: std::sys_common::backtrace::__rust_begin_short_backtrace
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   29: core::ops::function::FnOnce::call_once{{vtable.shim}}
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   30: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:              at /rustc/3c17c84a386e7badf6b2c6018d172496b3a28a04/library/alloc/src/boxed.rs:1853:9
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:              at /rustc/3c17c84a386e7badf6b2c6018d172496b3a28a04/library/alloc/src/boxed.rs:1853:9
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:       std::sys::unix::thread::Thread::new::thread_start
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:              at /rustc/3c17c84a386e7badf6b2c6018d172496b3a28a04/library/std/src/sys/unix/thread.rs:108:17
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   31: start_thread
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   32: clone

I don't know how to reproduce the bug, but it seemed likely to be fixable without reproduction.

CC nimiq/core-rs-albatross#732

hrxi added a commit to hrxi/rust-libp2p that referenced this issue Mar 30, 2022
Return an error instead. `quicksink` panics if you call a method after
it returned an error once.

Fixes libp2p#2598.
@tomaka
Copy link
Member

tomaka commented Jun 13, 2022

I believe the issue might be here:

match self.as_mut().poll_flush(cx) {
Poll::Ready(Ok(())) => {}
Poll::Pending => return Poll::Pending,
Poll::Ready(Err(e)) => {
// If the remote closed the stream, it is important to still
// continue reading the data that was sent, if any.
if e.kind() != io::ErrorKind::WriteZero {
return Poll::Ready(Err(e.into()));
}
}
}

For context, this object is a stream onto which a protocol has been negotiated. Because the negotiation is reported in the API has having succeeded before the underlying stream has been flushed, reading on the stream also has the side effect of flushing the stream.

However, if an error happens when flushing, we don't keep track of it, meaning that if the API user then calls poll_flush on the Negotiated, it will effectively call pull_flush a second time on the underlying stream after the first time has errored.

I don't know if this is the cause of this bug, but it is certainly a bug.

@tomaka
Copy link
Member

tomaka commented Jun 13, 2022

Actual, what happens, I think is:

The yamux muxer reads from the connection, which calls Negotiated::poll_read. Negotiated::poll_read calls QuickSink::poll_flush, which returns an error, and the error is returned from poll_read.

Yamux detects the errors, and when yamux detects an error when reading, it switches to "shutdown mode" and tries to close the connection in a clean way. This closing calls Negotiated::poll_close, which calls QuickSink::poll_flush again, which panics.

@NZT48
Copy link

NZT48 commented Jul 24, 2022

I updated my parachain to Polkadot 0.9.26 dependencies and this issue is breaking node after around 100 blocks.

@mxinden
Copy link
Member

mxinden commented Jul 27, 2022

The yamux muxer reads from the connection, which calls Negotiated::poll_read. Negotiated::poll_read calls QuickSink::poll_flush, which returns an error, and the error is returned from poll_read.

Yamux detects the errors, and when yamux detects an error when reading, it switches to "shutdown mode" and tries to close the connection in a clean way. This closing calls Negotiated::poll_close, which calls QuickSink::poll_flush again, which panics.

In such case libp2p-core would never call close on the StreamMuxer, though the line below from the panic above indicates that:

Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   16: <libp2p_yamux::Yamux<S> as libp2p_core::muxing::StreamMuxer>::close
Mar 24 02:52:40 ip-172-31-13-88 nimiq-client[36147]:   17: <libp2p_core::muxing::Wrap<T> as libp2p_core::muxing::StreamMuxer>::close

@tomaka am I missing something?

@mxinden
Copy link
Member

mxinden commented Jul 27, 2022

@NZT48 @hrxi @jasl @doutv

It would be very helpful to get more logs right before (~10 seconds) a crash like the one above. Any chance one of you can provide a log output on debug level RUST_LOG=debug with the logs 10 seconds before up until the crash?

@NZT48
Copy link

NZT48 commented Jul 27, 2022

@mxinden Thanks for helping, here are the logs with 10 seconds before: https://we.tl/t-6TuATzVqV4

@mxinden
Copy link
Member

mxinden commented Aug 1, 2022

A suspicion I have:

When closing the Yamux muxer, we call inner.control.poll_close and inner.incoming.poll_next_unpin concurrently.

https://github.com/libp2p/rust-libp2p/blob/master/muxers/yamux/src/lib.rs#L132-L150

This could result in both the control_command and the frame Future to be ready at the same time.

https://github.com/libp2p/rust-yamux/blob/master/src/connection.rs#L492-L504

The frame future might be ready with an Error from the underlying socket (i.e. here libp2p-websocket). Though given that the result of the control_command Future is handled first, on_control_command is called despite frame having returned an Error. on_control_command itself may try to write to the underlying socket, which will panic given that the socket returned an error earlier via the frame Future.

I suggest I prepare a patch for rust-yamux. To validate the above assumption, could one of you (e.g. @NZT48 @hrxi @jasl @doutv) then run a version of Polkadot with that patch? You would simply need to build Polkadot with a dependency override.

mxinden added a commit to mxinden/rust-yamux that referenced this issue Aug 1, 2022
> The `frame` future might be _ready_ with an `Error` from the underlying
socket (i.e. here `libp2p-websocket`). Though given that the result of the
`control_command` `Future` is handled first, `on_control_command` is called
despite `frame` having returned an `Error`. `on_control_command` itself may try
to write to the underlying socket, which will panic given that the socket
returned an error earlier via the `frame` `Future`.

Patch to validate suspicion in
libp2p/rust-libp2p#2598.
@mxinden
Copy link
Member

mxinden commented Aug 1, 2022

@NZT48 @hrxi @jasl @doutv @kpp

Would very much appreciate help testing libp2p/rust-yamux#137. Instructions are in the pull request description.

@NZT48
Copy link

NZT48 commented Aug 2, 2022

@mxinden It works on my side also :)

@jasl
Copy link
Contributor

jasl commented Aug 2, 2022

Running nearly a day with patched binaries, haven't crash yet

@mxinden
Copy link
Member

mxinden commented Aug 2, 2022

@NZT48 @jasl those are wonderful news. Would you mind keeping those deployments running for a bit longer, just to make sure?

@jasl
Copy link
Contributor

jasl commented Aug 2, 2022

@NZT48 @jasl those are wonderful news. Would you mind keeping those deployments running for a bit longer, just to make sure?

that's no problem, I patched our testnet so it will keep running

@mxinden
Copy link
Member

mxinden commented Aug 3, 2022

Will try to prepare a proper patch for yamux tomorrow. This can be released as patch release, i.e. yamux v0.10.2 and thus only needs an update in the Polkadot Cargo.lock.

mxinden added a commit to mxinden/rust-yamux that referenced this issue Aug 4, 2022
> The `frame` future might be _ready_ with an `Error` from the underlying
socket (i.e. here `libp2p-websocket`). Though given that the result of the
`control_command` `Future` is handled first, `on_control_command` is called
despite `frame` having returned an `Error`. `on_control_command` itself may try
to write to the underlying socket, which will panic given that the socket
returned an error earlier via the `frame` `Future`.

With this patch, once any of `next_stream_command`, `next_control_command` or
`next_frame` `Future` is ready, the result is processed right away, without
additionally polling the remaining pending `Future`s, thus surfacing errors as
early as possible.

See libp2p/rust-libp2p#2598 for details.
@mxinden
Copy link
Member

mxinden commented Aug 4, 2022

I prepared libp2p/rust-yamux#138, which in my eyes represents a valid bug fix, replacing libp2p/rust-yamux#137. libp2p/rust-yamux#138 still needs to pass review.

Given that this is a very subtle issue, and given that libp2p/rust-yamux#138 differs from libp2p/rust-yamux#137, I would appreciate help testing libp2p/rust-yamux#138 as well.

@NZT48 @hrxi @jasl @doutv @kpp in case one of you has more capacity to run tests, would you mind deploying libp2p/rust-yamux#138 on one of your test instances?

@jasl
Copy link
Contributor

jasl commented Aug 4, 2022

I prepared libp2p/rust-yamux#138, which in my eyes represents a valid bug fix, replacing libp2p/rust-yamux#137. libp2p/rust-yamux#138 still needs to pass review.

Given that this is a very subtle issue, and given that libp2p/rust-yamux#138 differs from libp2p/rust-yamux#137, I would appreciate help testing libp2p/rust-yamux#138 as well.

@NZT48 @hrxi @jasl @doutv @kpp in case one of you has more capacity to run tests, would you mind deploying libp2p/rust-yamux#138 on one of your test instances?

no problem

@mxinden
Copy link
Member

mxinden commented Aug 5, 2022

libp2p/rust-yamux#138 is reviewed and approved thanks to @elenaf9 and @thomaseizinger.

@jasl once I get the green light from you, I will cut a release.

@jasl
Copy link
Contributor

jasl commented Aug 5, 2022

libp2p/rust-yamux#138 is reviewed and approved thanks to @elenaf9 and @thomaseizinger.

@jasl once I get the green light from you, I will cut a release.

Great! I can't wait

BTW, could you help to backport the upgrade to polkadot-v0.9.27 ?

@mxinden
Copy link
Member

mxinden commented Aug 5, 2022

libp2p/rust-yamux#138 is reviewed and approved thanks to @elenaf9 and @thomaseizinger.
@jasl once I get the green light from you, I will cut a release.

Great! I can't wait

Just to make sure, do I understand correctly that you ran libp2p/rust-yamux#138 and that you are not seeing any panics @jasl?

BTW, could you help to backport the upgrade to polkadot-v0.9.27 ?

I can not, as I am no longer directly involved in the project. Maybe @kpp or @bkchr can help with the backport.

@jasl
Copy link
Contributor

jasl commented Aug 5, 2022

libp2p/rust-yamux#138 is reviewed and approved thanks to @elenaf9 and @thomaseizinger.
@jasl once I get the green light from you, I will cut a release.

Great! I can't wait

Just to make sure, do I understand correctly that you ran libp2p/rust-yamux#138 and that you are not seeing any panics @jasl?

BTW, could you help to backport the upgrade to polkadot-v0.9.27 ?

I can not, as I am no longer directly involved in the project. Maybe @kpp or @bkchr can help with the backport.

For a night I don't see panic

mxinden added a commit to libp2p/rust-yamux that referenced this issue Aug 5, 2022
> The `frame` future might be _ready_ with an `Error` from the underlying
socket (i.e. here `libp2p-websocket`). Though given that the result of the
`control_command` `Future` is handled first, `on_control_command` is called
despite `frame` having returned an `Error`. `on_control_command` itself may try
to write to the underlying socket, which will panic given that the socket
returned an error earlier via the `frame` `Future`.

With this patch, once any of `next_stream_command`, `next_control_command` or
`next_frame` `Future` is ready, the result is processed right away, without
additionally polling the remaining pending `Future`s, thus surfacing errors as
early as possible.

See libp2p/rust-libp2p#2598 for details.
@mxinden
Copy link
Member

mxinden commented Aug 5, 2022

v0.10.2 of the yamux crate is now released.

libp2p/rust-yamux#138 (comment)

Thanks everyone for the help!

I am closing here. Since this is a patch release, there is no need to update libp2p, but instead an upgrade to v0.10.2 at the binary level suffices.

@mxinden mxinden closed this as completed Aug 5, 2022
jasl added a commit to Phala-Network/khala-parachain that referenced this issue Aug 5, 2022
@bkchr
Copy link

bkchr commented Aug 5, 2022

libp2p/rust-yamux#138 is reviewed and approved thanks to @elenaf9 and @thomaseizinger.
@jasl once I get the green light from you, I will cut a release.

Great! I can't wait

Just to make sure, do I understand correctly that you ran libp2p/rust-yamux#138 and that you are not seeing any panics @jasl?

BTW, could you help to backport the upgrade to polkadot-v0.9.27 ?

I can not, as I am no longer directly involved in the project. Maybe @kpp or @bkchr can help with the backport.

No need to backport this. Parachains can just run cargo update -p yamux to get the latest version of the crate. As this is nothing critical and doesn't happen that often there is no need for a full polkadot release. It can wait and go into the next polkadot release.

@kpp
Copy link
Contributor

kpp commented Aug 5, 2022

there is no need to update libp2p

Still the yamux dep in libp2p needs to be increased in order to force people to download the new version upon upgrade.

@mxinden
Copy link
Member

mxinden commented Aug 8, 2022

there is no need to update libp2p

Still the yamux dep in libp2p needs to be increased in order to force people to download the new version upon upgrade.

I would expect downstream users to use tools like @dependabot and thus keeping their dependencies up to date. That said, I am happy to merge a pull request updating the yamux version in the libp2p-yamux crate.

@tomaka
Copy link
Member

tomaka commented Aug 8, 2022

Still the yamux dep in libp2p needs to be increased in order to force people to download the new version upon upgrade.

There's no need to force people. People are expected to understand that they need to run cargo update to make bugs go away.
Increasing the version on the other hand prevents people from downgrading to a previous version if they need to.

xmas7 pushed a commit to RubyOnWorld/rust-yamux that referenced this issue Sep 6, 2022
> The `frame` future might be _ready_ with an `Error` from the underlying
socket (i.e. here `libp2p-websocket`). Though given that the result of the
`control_command` `Future` is handled first, `on_control_command` is called
despite `frame` having returned an `Error`. `on_control_command` itself may try
to write to the underlying socket, which will panic given that the socket
returned an error earlier via the `frame` `Future`.

With this patch, once any of `next_stream_command`, `next_control_command` or
`next_frame` `Future` is ready, the result is processed right away, without
additionally polling the remaining pending `Future`s, thus surfacing errors as
early as possible.

See libp2p/rust-libp2p#2598 for details.
jsdanielh added a commit to nimiq/core-rs-albatross that referenced this issue Mar 1, 2023
Update the `yamux` dependency from version 0.10.1 to version 0.10.2
to pull a fix for a call to `SinkImpl::poll_ready` after error as
described in [this rust-libp2p
issue](libp2p/rust-libp2p#2598).

This fixes #1319.
viquezclaudio pushed a commit to nimiq/core-rs-albatross that referenced this issue Mar 1, 2023
Update the `yamux` dependency from version 0.10.1 to version 0.10.2
to pull a fix for a call to `SinkImpl::poll_ready` after error as
described in [this rust-libp2p
issue](libp2p/rust-libp2p#2598).

This fixes #1319.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants