This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Websocket connection errors on startup should be clearer and not crash IPFS #1326

Closed
shessenauer opened this issue Apr 25, 2018 · 10 comments
Labels
exp/expert Having worked on the specific codebase is important kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up

Comments

@shessenauer

  • Version: "ipfs": "^0.28.2"
  • Platform: Linux on AWS EC2 (Dockerized API)
  • Subsystem: /ipfs/src/core/boot.js or p2plib?

Type: Bug

Severity: Critical - System crash, application panic.

Description:

Upon starting the application, it boots normally, but as soon as the IPFS js node is initialized, the application throws a critical error with the following:

App starting..

events.js:165
      throw er; // Unhandled 'error' event
      ^

Error: websocket error
    at WS.Transport.onError (/path/to/application/node_modules/engine.io-client/lib/transport.js:64:13)
    at WebSocket.ws.onerror (/path/to/application/node_modules/engine.io-client/lib/transports/websocket.js:150:10)
    at WebSocket.onError (/path/to/application/node_modules/ws/lib/EventTarget.js:109:16)
    at WebSocket.emit (events.js:180:13)
    at WebSocket.finalize (/path/to/application/node_modules/ws/lib/WebSocket.js:182:41)
    at ClientRequest._req.on (/path/to/application/node_modules/ws/lib/WebSocket.js:653:12)
    at ClientRequest.emit (events.js:180:13)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:539:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:117:17)
    at TLSSocket.socketOnData (_http_client.js:444:20)
Emitted 'error' event at:
    at done (/path/to/application/node_modules/ipfs/src/core/boot.js:58:19)
    at /path/to/application/node_modules/async/internal/parallel.js:39:9
    at /path/to/application/node_modules/async/internal/once.js:12:16
    at iterateeCallback (/path/to/application/node_modules/async/internal/eachOfLimit.js:44:17)
    at /path/to/application/node_modules/async/internal/onlyOnce.js:12:16
    at /path/to/application/node_modules/async/internal/parallel.js:36:13
    at done (/path/to/application/node_modules/ipfs/src/core/components/start.js:15:16)
    at series (/path/to/application/node_modules/ipfs/src/core/components/start.js:39:25)
    at /path/to/application/node_modules/async/internal/parallel.js:39:9
    at /path/to/application/node_modules/async/internal/once.js:12:16

In this configuration, I have a Node.js application that uses IPFS-js. The application is dockerized, as is a full IPFS node that connects as a peer to the IPFS-js app over websockets. This worked perfectly fine and I had no problems until last night, when every single one of my environments started throwing the same error out of nowhere. Now none of my team can spin up the API because it has the same issue. One odd thing: in the window of time before the crash, we can spam the API with curl and get back IPFS results for a quick second; otherwise it fails. I have confirmed it is not a websocket issue on my end and that it is in fact something with the IPFS-js package in the Node.js application. Guidance is much appreciated.

Steps to reproduce the error:

  1. npm install ipfs
  2. Set the options for websockets:
let options = {
  config: {
    Addresses: {
      Swarm: [
        // '/dns4/wrtc-star.discovery.libp2p.io/tcp/443/wss/p2p-webrtc-star'
        '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
      ]
    }
  }
}
  3. Initialize the node:
    const IPFS = require('ipfs')
    const ipfsNode = new IPFS(options)
  4. Connect to the swarm (a consolidated sketch with an error handler follows after these steps):
ipfsNode.on('ready', () => {
  // Your node is now ready to use \o/
  console.log('IPFS Online Status: ', ipfsNode.isOnline())
  // ipfsPeer is the websocket multiaddr of our dockerized IPFS node
  ipfsNode.swarm.connect(ipfsPeer, (err, result) => {
    console.log('connecting to peers: ', result)
    ipfsNode.swarm.peers((err, peerCount) => {
      console.log('There are this many peers: ', peerCount)
    })
  })
})
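
A consolidated sketch of the steps above with an 'error' listener attached. This is not a fix for the underlying connection problem; it only keeps Node.js from throwing on the unhandled 'error' event (which is what produces the crash in the stack trace above) so the failure can be logged and investigated. The ipfsPeer multiaddr from step 4 is omitted here as a placeholder.

const IPFS = require('ipfs')

const options = {
  config: {
    Addresses: {
      Swarm: [
        '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
      ]
    }
  }
}

const ipfsNode = new IPFS(options)

// Without this listener, Node.js throws on the unhandled 'error' event and the process exits.
ipfsNode.on('error', (err) => {
  console.error('IPFS node error (e.g. websocket transport failure):', err.message)
})

ipfsNode.on('ready', () => {
  console.log('IPFS Online Status: ', ipfsNode.isOnline())
})
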
@victorb
Member

victorb commented Apr 26, 2018

I just found this morning that the ws-star server was hung. It responded to normal HTTP requests properly, meaning our health-checks thought everything was fine, but the actual websocket endpoints didn't work.

I've restarted the server and confirmed it is working again, and I'm adding better health-checks to it now. Could you please retry and report back whether it's working for you?
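
A health-check at the websocket level (rather than plain HTTP) would catch this kind of hang. A minimal sketch using the ws package; the wss://ws-star.discovery.libp2p.io endpoint is derived from the swarm multiaddr above and the 5-second timeout is an illustrative choice:

const WebSocket = require('ws')

// Open a real websocket to the endpoint; an HTTP 200 alone does not prove it works.
const socket = new WebSocket('wss://ws-star.discovery.libp2p.io')
const timer = setTimeout(() => {
  console.error('health-check failed: websocket did not open within 5s')
  process.exit(1)
}, 5000)

socket.on('open', () => {
  clearTimeout(timer)
  console.log('health-check ok: websocket opened')
  socket.close()
})

socket.on('error', (err) => {
  clearTimeout(timer)
  console.error('health-check failed:', err.message)
  process.exit(1)
})
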

@shessenauer
Author

@victorbjelkholm It is resolved now due to that fix, thank you so much! Much appreciated.

@Mr0grog
Contributor

Mr0grog commented Apr 26, 2018

It does seem like we could surface a better error here, though, saying something like could not connect to websocket server <address>. It wouldn’t fix the problem of a lot of people all depending on a particular server that went down, but it would at least make it clearer what happened.

@victorb
Member

victorb commented Apr 27, 2018

@Mr0grog That's a good point; it has been raised before in #804 (comment)

Basically, the current implementation crashes when the node can't start listening on a swarm address (because the endpoint isn't responding), which is what leads to this crash. While I agree that the error message could be better, I would also say that it should be a warning instead of an error, and the daemon should continue booting even if a swarm address fails to open.
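
A generic sketch of that warn-and-continue idea (not the actual js-ipfs boot code; listenOnSwarmAddresses and listenOn are made-up names): try every configured swarm address, log a warning for each one that fails, and only treat startup as failed when none of them could be opened.

// Hypothetical helper illustrating the proposed behaviour.
function listenOnSwarmAddresses (addresses, listenOn, callback) {
  if (addresses.length === 0) return callback(null)

  const failures = []
  let remaining = addresses.length

  addresses.forEach((addr) => {
    listenOn(addr, (err) => {
      if (err) {
        failures.push(addr)
        console.warn(`warning: failed to listen on swarm address ${addr}: ${err.message}`)
      }
      if (--remaining === 0) {
        // Fatal only if every single address failed.
        callback(failures.length === addresses.length
          ? new Error('could not listen on any swarm address')
          : null)
      }
    })
  })
}
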

@Mr0grog
Contributor

Mr0grog commented Apr 27, 2018

I think I’d agree with all those points 😄

@shessenauer
Author

@victorbjelkholm I am getting this same error again on all my applications. You may need to restart your ws-star server again.

@Astrovicis

Astrovicis commented Jun 6, 2018

Coworker of shessenauer here: we're still getting this error. Are there other servers we could add that would be more stable? Even better, is there an option that would automatically add swarm nodes connected to the ones listed in options.config.Addresses.Swarm, so that the list of servers is dynamic and therefore more resilient? That would be fantastic. Is that congruent with the idea of a swarm?

Question: will this happen if any one of the endpoints in options.config.Addresses.Swarm is down? To put it another way: do all of the servers in options.config.Addresses.Swarm have to be up to avoid this error?

Thanks guys.

@Astrovicis

OK, after looking at it a little more closely, the problem is that IPFS never gets out of the 'starting' state after a websocket connection error is thrown. Is there, or could there be, a reconnect option for this sort of thing?
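
In the absence of a built-in reconnect, one application-level workaround is to tear the node down on startup failure and retry after a delay. A rough sketch, assuming the 'ready'/'error' events and the stop() callback behave as in js-ipfs 0.28; createNode and RETRY_DELAY_MS are made-up names, and the stop() call is best-effort since the node may only be half-started:

const IPFS = require('ipfs')

const RETRY_DELAY_MS = 5000

const options = {
  config: {
    Addresses: {
      Swarm: ['/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star']
    }
  }
}

function createNode (opts) {
  const node = new IPFS(opts)

  node.on('ready', () => {
    console.log('IPFS node is ready')
  })

  node.on('error', (err) => {
    console.warn('IPFS startup failed, retrying shortly:', err.message)
    // Best-effort teardown of the half-started node, then retry with a fresh instance.
    node.stop(() => setTimeout(() => createNode(opts), RETRY_DELAY_MS))
  })

  return node
}

createNode(options)
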

@Mr0grog Mr0grog changed the title Ipfs-js websocket error on startup Websocket connection errors on startup should be clearer and not crash IPFS Jun 11, 2018
@Mr0grog
Contributor

Mr0grog commented Jun 11, 2018

Re-opening (and re-titling) this to track things a little more clearly. There are two things to address here:

  1. This error message needs to be more clear (e.g. “failed to connect to websocket ${address}” or something).

  2. A connection failure (probably for any swarm connection?) should not be fatal for IPFS.

These are tied in with #1325, but are concrete enough we should be able to address them sooner and more directly.
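
For point 1, the change could be as small as wrapping the low-level transport error with the address that failed before it is surfaced; a sketch of the idea (withAddressContext is a made-up helper, not existing js-ipfs API):

// Hypothetical helper: attach the failing multiaddr to the low-level transport error.
function withAddressContext (err, address) {
  const wrapped = new Error(`failed to connect to websocket ${address}: ${err.message}`)
  wrapped.original = err
  return wrapped
}
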

@Mr0grog Mr0grog reopened this Jun 11, 2018
@fsdiogo fsdiogo self-assigned this Jun 12, 2018
@fsdiogo fsdiogo added kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up status/in-progress In progress exp/expert Having worked on the specific codebase is important labels Jun 12, 2018
@fsdiogo fsdiogo removed their assignment Jul 4, 2018
@alanshaw
Member

This was resolved by #1793

@ghost ghost removed the status/in-progress In progress label Mar 12, 2019