Skip to content
This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Feature: PubSub 🌟 #644

Merged
merged 55 commits into from
Jan 16, 2017
Merged

Feature: PubSub 🌟 #644

merged 55 commits into from
Jan 16, 2017

Conversation

daviddias
Copy link
Member

@daviddias daviddias commented Dec 8, 2016

Continues #610

@daviddias daviddias added the status/in-progress In progress label Dec 8, 2016
@daviddias
Copy link
Member Author

daviddias commented Dec 9, 2016

Things are starting to look good, right now there is only an issue with the stress tests, apparently js-pubsub is too slow for 10K messages under 80seconds (crypto is hard)

  core
    --both
      .pubsub
        callback API
Swarm listening on /ip4/127.0.0.1/tcp/52279
Swarm listening on /ip4/127.0.0.1/tcp/52281
          .publish
            ✓ message from string
            ✓ message from buffer
          .subscribe
            ✓ to one topic
            ✓ errors on double subscription
            ✓ discover options
          subscription
            ✓ .cancel and wait for callback
            ✓ .cancel and wait for end event
          .peers
            ✓ returns an error when not subscribed to a topic
            - returns no peers within 10 seconds
            - doesn't return extra peers
            - returns peers for a topic - one peer
            - lists peers for a topic - multiple peers
          .ls
            ✓ empty list when no topics are subscribed
            ✓ list with 1 subscribed topic
            ✓ list with 3 subscribed topicss
          multiple nodes
            ✓ receive messages from different node (1010ms)
            ✓ receive multiple messages (1010ms)
          load tests

@gavinmcdermott
Copy link

gavinmcdermott commented Dec 9, 2016

Hey @haadcode: Great day today, and things are looking good...so here are some of the remaining items left to pick-off for tonight's handoff if you're interested / have time...

  • Convert the subscription object stream returned via http to ndjson. Here's a link to the specific point in code
  • When we return streams from a subscribe call, we need to track of a pool of proxy streams and only cancel the primary stream when subscriptions no longer exist (Thanks @diasdavid for finding this one..)
  • Update any / all of the remaining tests to pass—this specifically deals with the load tests
  • Send the subscriptions immediately on peer connection—this is an update to floodsub here's where you'll send them through.

@diasdavid and @haadcode - thanks for the help and let's get this to the finish line!

@haadcode
Copy link
Member

haadcode commented Dec 9, 2016

@gavinmcdermott @diasdavid I just fixed all(-1) the tests. They're now running except one! The one that is currently failing is the 10k load test (I'm so happy I added that!).

See the changes to the tests here ipfs-inactive/interface-js-ipfs-core@501c304

They weren't running properly when run all together (other tests would trigger subscription.on('data', ...) callbacks in previous tests and the reason was that we weren't cancelling the subscriptions properly. This was fixed with: faa6e33#diff-dcea902c16515396959a8b7e3b1ff40eR50. With that line, and the other changes to the tests, they all(-1) pass now correctly.

So the 10k one that is failing:

apparently js-pubsub is too slow for 10K messages under 80seconds (crypto is hard)

This is not the reason. The reason is that the receiving stream doesn't start processing the messages until the sender has sent all 10k. Ie. the 10k messages are first buffered to one stream and only after all of them have been buffered will the receiving stream start getting 'data' events. This takes too much time and the test fails (and fails the rest of them too). If you set the number of messages to send to 2k, the test will pass, so I'm very happy the arbitrary 10k was correct :) The expected result is: messages are sent and received in parallel. You can verify the correct behaviour by looking at the sent/received numbers in the console while running the test. (This is exactly why I added that test because the same thing was happening in js-ipfs-api)

I tried to fix it in libp2p-floodsub where there was a TODO for the 2 PassThrough streams to be converted to pull-pushable. I believe fixing the streams here (or elsewhere in floodsub) will result in correct behaviour and a passing test. I'm sure you will have an idea how to fix this.

Another small note was: faa6e33#diff-dcea902c16515396959a8b7e3b1ff40eR120. This part (the filtering) should be moved to floodsub, and perhaps floodsub should just implement .peers([topic]) instead of .getPeerSet() (leaving the transformation Peer -> idB58Str there should be fine I think). This will clarify the responsibility of core/components/pubsub and keep it thin (I'm thinking over time we can merge the public facing layers of both js-ipfs-api and js-ipfs so that we don't have 2 different pubsub objects doing the same thing). @whyrusleeping changed the go-ipfs one to work in a way that if no topic is passed as arg, it'll return "all peers we're currently pubsubbing with" and if a topic is provided, only peers subscribed to that topic will be returned. I think this makes sense and we should have it similar in js-floodsub.

@diasdavid you added https://github.com/ipfs/interface-ipfs-core/pull/101/files#diff-b84db3db8b011618f99fed4c49d338c6R143 and I was wondering why that is? Are the peers not connected when the callback is fired? If so, we should fix this in wherever it can be fixed so that we don't need that hack (because if we need in tests, the users will need it in their programs).

I'm gonna leave this to you now to pick up again. Great work so far everyone! We can land this! :)

My plan next is to start trying it out with Orbit and see what is needed. @diasdavid how should the peer discovery work in the browser? Will the peers (that you normally get) just get picked up for pubsub automatically or is there a need for connecting peers manually?

@haadcode
Copy link
Member

haadcode commented Dec 9, 2016

Ok, tried it with Orbit. The good news is: it works! \o/ 🎉 The bad news is that as a whole, it's not quite working.

What I'm observing is that while some peers can send messages through to others, not all of them can. The version I'm using is https://ipfs.io/ipfs/Qmf2AKhyKodLq2nEq6g4rgPV5SGAvAYqKhRfJxRt5syzEW/ (should work for you too).

See the videos below (they're sequential). That's what's happening. The last picture is the swarm.peers at the end of the second gif.

Initial state:
screen shot 2016-12-09 at 14 35 39

Messages from 2nd and 3rd peer are not going through to first peer:
orbit-js-pubsub1

Still not going through:
orbit-js-pubsub2

Peers after the gifs, the old peer is still there:
screen shot 2016-12-09 at 14 32 51

@daviddias
Copy link
Member Author

They weren't running properly when run all together (other tests would trigger subscription.on('data', ...) callbacks in previous tests and the reason was that we weren't cancelling the subscriptions properly. This was fixed with: faa6e33#diff-dcea902c16515396959a8b7e3b1ff40eR50. With that line, and the other changes to the tests, they all(-1) pass now correctly.

This is unfortunately not the right solution as it, it will kill all the subscriptions silently if one of the subscribers decides to cancel.

@gavinmcdermott had listed the right solution as a TODO item:

  • When we return streams from a subscribe call, we need to track of a pool of proxy streams and only cancel the primary stream when subscriptions no longer exist (Thanks @diasdavid for finding this one..)

This is not the reason. The reason is that the receiving stream doesn't start processing the messages until the sender has sent all 10k. Ie. the 10k messages are first buffered to one stream and only after all of them have been buffered will the receiving stream start getting 'data' events.

Awesome catch, strange though because the stream is being 'drained' as expected https://github.com/libp2p/js-libp2p-floodsub/blob/master/src/mount-floodsub.js#L43


@diasdavid you added https://github.com/ipfs/interface-ipfs-core/pull/101/files#diff-b84db3db8b011618f99fed4c49d338c6R143 and I was wondering why that is? Are the peers not connected when the callback is fired? If so, we should fix this in wherever it can be fixed so that we don't need that hack (because if we need in tests, the users will need it in their programs).

This is a nuance because dialing streams in js-ipfs is non blocking, full context here: https://github.com/libp2p/js-libp2p-swarm/issues/78. TL;DR is: Once this -- https://github.com/libp2p/js-libp2p-floodsub/blob/master/src/dial-floodsub.js#L53-L55 -- gets implemented, no timeout will be needed.


#644 (comment)

Thank you for the visualization! :D

@haadcode
Copy link
Member

haadcode commented Dec 9, 2016

They weren't running properly when run all together (other tests would trigger subscription.on('data', ...) callbacks in previous tests and the reason was that we weren't cancelling the subscriptions properly. This was fixed with: faa6e33#diff-dcea902c16515396959a8b7e3b1ff40eR50. With that line, and the other changes to the tests, they all(-1) pass now correctly.
This is unfortunately not the right solution as it, it will kill all the subscriptions silently if one of the subscribers decides to cancel.

Ah, you're right, we need to cancel that one particular listener.

When we return streams from a subscribe call, we need to track of a pool of proxy streams and only cancel the primary stream when subscriptions no longer exist (Thanks @diasdavid for finding this one..)

Not sure what is meant by this, but you go ahead and find/implement the right solution for this.

Once this ... gets implemented, no timeout will be needed.

Awesome!

@daviddias
Copy link
Member Author

fixed the buffering problem

          load tests
Send/Receive 10k messages took: 72792 ms, 137 ops / s
            ✓ send/receive 10k messages (73796ms)
            ✓ call publish 1k times
            ✓ call subscribe 1k times
            ✓ subscribe/unsubscribe 1k times (61ms)

@gavinmcdermott
Copy link

@diasdavid - I expect some free time this afternoon to knock out the rest of the items on the list. Either way, I'll keep you updated and push the latest!

@daviddias
Copy link
Member Author

I've all the tests passing locally \o/, now waiting for CI.

@daviddias
Copy link
Member Author

bubbling all the deps to reflect the changes in multistream-select. Apparently CI npm install has different rules of what is a patch update (??!!?)

@daviddias
Copy link
Member Author

Updated all the dependencies, CI should be happy now (tests continue to pass locally at least)

@daviddias daviddias changed the title feat/pubsub \o/ Feature: PubSub 🌟 Jan 11, 2017
@daviddias
Copy link
Member Author

daviddias commented Jan 11, 2017

last mile is all about unsupported .includes in Node.js 4

     Uncaught TypeError: peers.includes is not a function
      at node_modules/interface-ipfs-core/src/pubsub.js:20:31

//cc @haadcode

will fix asap

@dignifiedquire
Copy link
Member

@diasdavid what did you do about the browser tests?

@daviddias
Copy link
Member Author

pushed fix for #644 (comment)

@daviddias
Copy link
Member Author

The tests were misleading, however after fixing them and more testing, I learned that the implementation works as expected in Node.js 6 and 7, but the message reception clogs at certain point in Node.js 4.

Digging through the rabbit hole.

@daviddias
Copy link
Member Author

Ok, just found the issue, the way that floodsub is now set up, makes it so that there is no guarantee that messages written in order are delivered in order correctly, this is both due:

  • Timecache is being flooded and starts discarding things
  • The callback of the publish gets called before the write to the wire happens.

Adding tests to cover this bugs in floodsub while the fixes are underway

🌟 And this is why we have tests in 3 Node.js versions, 2 CI systems and try them in two different browsers.

@daviddias
Copy link
Member Author

EVERYTHING IS GREEEN \o/ 💚🍀✅🥝

@daviddias daviddias merged commit 5078ddc into master Jan 16, 2017
@daviddias daviddias deleted the feat/pubsub branch January 16, 2017 11:57
@daviddias daviddias removed the status/in-progress In progress label Jan 16, 2017
@daviddias
Copy link
Member Author

Thank you all, this was a long and enduring process, really happy with all the developments, bug fixes, collaborative API design and all the testing that was done for this API ❤️❤️❤️❤️

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants