Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix OurViewChange small race #5356

Merged
merged 3 commits into from
Aug 14, 2024
Merged

Fix OurViewChange small race #5356

merged 3 commits into from
Aug 14, 2024

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Aug 14, 2024

Always queue OurViewChange event before we send view changes to our peers, because otherwise we risk the peers sending us a message that can be processed by our subsystems before OurViewChange.

Normally, this is not really a problem because the latency of the ViewChange we send to our peers is way higher that our subsystem processing OurViewChange, however on testnets like versi where CPU is sometimes overcommitted this race gets triggered occasionally, so let's fix it by sending the messages in the right order.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh added the R0-silent Changes should not be mentioned in any release notes label Aug 14, 2024
alexggh and others added 2 commits August 14, 2024 15:46
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. and removed R0-silent Changes should not be mentioned in any release notes labels Aug 14, 2024
@alexggh alexggh enabled auto-merge August 14, 2024 13:13
@alexggh alexggh added this pull request to the merge queue Aug 14, 2024
Merged via the queue into master with commit 05a8ba6 Aug 14, 2024
194 of 211 checks passed
@alexggh alexggh deleted the alexggh/fix_signal_order branch August 14, 2024 14:41
@bkchr
Copy link
Member

bkchr commented Aug 14, 2024

Always queue OurViewChange event before we send view changes to our peers, because otherwise we risk the peers sending us a message that can be processed by our subsystems before OurViewChange.

So the message traveled faster via TCP to a different peer and this answered us faster with a new message that got processed before the ourviewchange? And wouldn't the rx side also be first required to process these messages before they are propagated to the subsystem?

ordian added a commit that referenced this pull request Aug 14, 2024
* master: (35 commits)
  Fix OurViewChange small race (#5356)
  Make ticket non-optional and add ensure_successful method to Consideration trait (#5359)
  [tests] dedup test code, add more tests, improve naming and docs (#5338)
  Stop running the wishlist workflow on forks (#5297)
  Migrate foreign assets v3::Location to v4::Location (#4129)
  Minor clean up (#5284)
  [Pools] Ensure members can always exit the pool gracefully (#4998)
  StorageWeightReclaim: set to node pov size if higher (#5281)
  [Bot] Add prdoc generation (#5331)
  Small nits found accidentally along the way (#5341)
  Create subsystem-benchmarks.yml (#5325)
  Bump libp2p-identity from 0.2.8 to 0.2.9 (#5232)
  Bump authoring duration for async backing to 2s. (#5195)
  Fix spelling issues (#5206)
  Bump the known_good_semver group across 1 directory with 3 updates (#5315)
  `polkadot-node-core-pvf-common`: Fix test compilation error (#5310)
  ci: Paused `cmd-action` commenter (#5287)
  Remove unnecessary mut (#5318)
  chain-spec: minor clarification on the genesis config patch (#5324)
  fix av-distribution Jaeger spans mem leak (#5321)
  ...
ordian added a commit that referenced this pull request Aug 14, 2024
* master: (35 commits)
  Fix OurViewChange small race (#5356)
  Make ticket non-optional and add ensure_successful method to Consideration trait (#5359)
  [tests] dedup test code, add more tests, improve naming and docs (#5338)
  Stop running the wishlist workflow on forks (#5297)
  Migrate foreign assets v3::Location to v4::Location (#4129)
  Minor clean up (#5284)
  [Pools] Ensure members can always exit the pool gracefully (#4998)
  StorageWeightReclaim: set to node pov size if higher (#5281)
  [Bot] Add prdoc generation (#5331)
  Small nits found accidentally along the way (#5341)
  Create subsystem-benchmarks.yml (#5325)
  Bump libp2p-identity from 0.2.8 to 0.2.9 (#5232)
  Bump authoring duration for async backing to 2s. (#5195)
  Fix spelling issues (#5206)
  Bump the known_good_semver group across 1 directory with 3 updates (#5315)
  `polkadot-node-core-pvf-common`: Fix test compilation error (#5310)
  ci: Paused `cmd-action` commenter (#5287)
  Remove unnecessary mut (#5318)
  chain-spec: minor clarification on the genesis config patch (#5324)
  fix av-distribution Jaeger spans mem leak (#5321)
  ...
@alexggh
Copy link
Contributor Author

alexggh commented Aug 14, 2024

So the message traveled faster via TCP to a different peer and this answered us faster with a new message that got processed before the ourviewchange?

Yes, that was the chronology from timestamps on versi where the CPU is 8 times over-committed.

And wouldn't the rx side also be first required to process these messages before they are propagated to the subsystem?

Yes they would come through rx, I guess your question is where is the context switch between this task and the rx one, right ?

ordian added a commit that referenced this pull request Aug 15, 2024
* master: (167 commits)
  Upgrade accidentally downgraded deps (#5365)
  [Pools] Fix issues with member migration to `DelegateStake` (#4822)
  Unify `no_genesis` check (#5360)
  [CI] Fix prdoc command (#5358)
  Beefy: add benchmarks for `report_fork_voting()` (#5188)
  Fix OurViewChange small race (#5356)
  Make ticket non-optional and add ensure_successful method to Consideration trait (#5359)
  [tests] dedup test code, add more tests, improve naming and docs (#5338)
  Stop running the wishlist workflow on forks (#5297)
  Migrate foreign assets v3::Location to v4::Location (#4129)
  Minor clean up (#5284)
  [Pools] Ensure members can always exit the pool gracefully (#4998)
  StorageWeightReclaim: set to node pov size if higher (#5281)
  [Bot] Add prdoc generation (#5331)
  Small nits found accidentally along the way (#5341)
  Create subsystem-benchmarks.yml (#5325)
  Bump libp2p-identity from 0.2.8 to 0.2.9 (#5232)
  Bump authoring duration for async backing to 2s. (#5195)
  Fix spelling issues (#5206)
  Bump the known_good_semver group across 1 directory with 3 updates (#5315)
  ...
@bkchr
Copy link
Member

bkchr commented Aug 15, 2024

Yes they would come through rx, I guess your question is where is the context switch between this task and the rx one, right ?

Just seen that the orchestra and network event handler are two different tasks. Still crazy that this worked, but explains how it could work. Ty!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

5 participants