
Strange Behavior With Provider Correlated With Long Running Contexts #7

Closed
bonedaddy opened this issue Jun 12, 2019 · 8 comments

@bonedaddy

Currently, when instantiating the provider a context is passed in. This context is then used internally when providing records to the network.

In my usage of this library, the context passed in during initialization is never cancelled until my custom node service stops. The issue appears to be that if the context used when providing a record to the network never cancels, the provide call never actually completes. This has led to strange behaviour when I start and stop my custom node multiple times with a different peerID each time: seemingly each time the node "starts", records that were previously being announced to the network start showing up.

Digging a little further, that isn't actually what's happening. What's really happening is that the context is only cancelled when the node service stops, which finally triggers the Provide call to finish.

Digging further still, I forked the repository to introduce some custom behaviour: the provide path initializes a temporary context with a timeout of 1 minute. With this change, records are provided to the network as expected and everything is okay.

So the root of the issue appears to be that records aren't actually provided to the network until the context passed into the Provide call is cancelled.
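
For reference, the workaround in my fork looks roughly like the sketch below. The helper name and the go-libp2p-core/routing import are illustrative rather than the exact code; the point is simply to derive a short-lived context for each provide call instead of reusing the node's long-lived one.

	package main

	import (
		"context"
		"time"

		cid "github.com/ipfs/go-cid"
		routing "github.com/libp2p/go-libp2p-core/routing"
	)

	// provideWithTimeout derives a temporary context with a one-minute
	// timeout for each provide call, so the announcement completes (or
	// errors) independently of the node's shutdown.
	func provideWithTimeout(ctx context.Context, router routing.ContentRouting, c cid.Cid) error {
		tctx, cancel := context.WithTimeout(ctx, time.Minute)
		defer cancel()
		return router.Provide(tctx, c, true)
	}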

@lanzafame

> This has led to strange behaviour when I start and stop my custom node multiple times with a different peerID each time: seemingly each time the node "starts", records that were previously being announced to the network start showing up.

I have noticed this behaviour when testing go-libp2p-dht-overlay; I ended up persisting the identity of the peers as a workaround.

@lanzafame

@postables which content router are you passing to NewProvider in your custom nodes?

@bonedaddy
Author

Just a regular *dht.IpfsDHT

These are the exact options I use when building my libp2p host and DHT. The datastore I'm using is a mutex-wrapped map datastore, along with an in-memory peerstore. Additionally, I'm using a forked version of the interface connection manager:

	opts = append(opts,
		libp2p.Identity(hostKey),
		libp2p.ListenAddrs(listenAddrs...),
		// disabled because it is racy
		// see https://github.com/libp2p/go-nat/issues/11
		//	libp2p.NATPortMap(),
		libp2p.EnableRelay(circuit.OptHop),
		libp2p.ConnectionManager(
			connmgr.NewConnManager(
				ctx,
				wg,
				logger,
				200,
				600,
				time.Minute,
			),
		),
		libp2p.DefaultMuxers,
		libp2p.DefaultTransports,
		libp2p.DefaultSecurity,
	)
	h, err := libp2p.New(ctx, opts...)
	if err != nil {
		return nil, nil, err
	}

	idht, err := dht.New(ctx, h,
		dopts.Validator(record.NamespacedValidator{
			"pk":   record.PublicKeyValidator{},
			"ipns": ipns.Validator{KeyBook: pstore},
		}),
		dopts.Datastore(dstore),
	)

@hsanjuan
Contributor

I understand that what is hanging is dht.Provide(). This calls GetClosestPeers, which returns a channel that might be hanging waiting for more peers until the context is closed (it says it gets K peers).

git blame shows @whyrusleeping's name all over this logic, written 5 years ago. I wonder if K=20 actually makes the Provide calls hang on smaller DHTs which do not reach 20 peers (I would hope not, but the logic is not super straightforward and I haven't had time to read through it).
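
To illustrate the suspicion, a simplified sketch (not the actual dht code; peerCh stands in for the channel from GetClosestPeers, and announce for sending the provider record): if the channel is only closed once K peers have been found, the only way out on a smaller DHT is the caller's context, which here is the node's long-lived one.

	package main

	import (
		"context"

		peer "github.com/libp2p/go-libp2p-core/peer"
	)

	// drainClosestPeers blocks until either the peer channel is closed
	// (K peers found) or the context is cancelled. With a long-lived
	// context and fewer than K peers, neither happens until shutdown.
	func drainClosestPeers(ctx context.Context, peerCh <-chan peer.ID, announce func(peer.ID)) {
		for {
			select {
			case p, ok := <-peerCh:
				if !ok {
					return // channel closed: got the K closest peers
				}
				announce(p)
			case <-ctx.Done():
				return // may only fire when the node shuts down
			}
		}
	}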

@michaelavila
Contributor

michaelavila commented Jun 12, 2019

So the root of the issue appears to be that records aren't actually being provided to the network, until the context that's passed into the Provide call is cancelled.

@postables cancelling the context is one way to complete the Provide() call, but it's a short-circuit intended to be used in the event that we need to kill ipfs before a Provide() has returned. I think the thing to focus on is the fact that Provide() itself does not return in a timely manner.

I suspect something along the lines of what @hsanjuan mentioned.

@michaelavila
Contributor

@postables I've added a timeout to Provide() to match what is done in go-bitswap. I'm going to do some root cause analysis today, but this should alleviate the never-returning bug you pointed out.

#8
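
If it helps with verifying the change, here is an illustrative caller-side helper (not part of this repo or the PR; the name and signature are made up for the example) that reports when a provide call fails to return within a bound, which is how the original hang shows up from the caller's side.

	package main

	import (
		"fmt"
		"time"
	)

	// waitForProvide runs the supplied provide function and returns an
	// error if it has not come back within the given bound.
	func waitForProvide(provide func() error, bound time.Duration) error {
		done := make(chan error, 1)
		go func() { done <- provide() }()
		select {
		case err := <-done:
			return err
		case <-time.After(bound):
			return fmt.Errorf("provide did not return within %s", bound)
		}
	}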

@bonedaddy
Author

bonedaddy commented Jun 13, 2019

Ok, I've done some brief testing and these are my findings:

  1. As expected, if no timeout is given, the issue still stands.
  2. With a timeout, provide submissions to the DHT work as expected.
  3. Providing here, vs providing via go-bitswap, seems to be a bit slower. I have no "hard proof" of this and am merely basing it on non-scientific observations, so take it with a grain of salt.

I'll also take a look into the dht code, since it may help to have some extra eyes on the issue.

edit 1: I'll post this on the PR as well

edit 2: to clarify what I mean by "slower", the propagation of the message that node X is providing content Y seems slower when using go-ipfs-provider than when using bitswap.

@michaelavila
Contributor

Merged.
