
Collection compliance, epochmgr Engines Implement component.Component #3248

Conversation

jordanschalm (Member) commented Sep 19, 2022:

⚠️ Depends on #3121

This PR updates collection/compliance.Engine and collection/epochmgr.Engine to implement component.Component (#3121, but for collection nodes).

@jordanschalm jordanschalm changed the title Collection Compliance, Epoch Manager Engines Implement component.Component Collection Compliance Engine Implements component.Component Sep 19, 2022
@jordanschalm jordanschalm marked this pull request as ready for review September 19, 2022 16:24
@jordanschalm jordanschalm marked this pull request as draft September 19, 2022 18:24
@jordanschalm jordanschalm changed the title Collection Compliance Engine Implements component.Component Collection compliance, epochmgr Engines Implement component.Component Sep 20, 2022
Base automatically changed from jordan/6263-compliance-component-pattern to feature/active-pacemaker September 20, 2022 14:41
Resolved review thread on engine/collection/compliance/engine.go
Comment on lines 57 to 60
// wait for shutdown to be commenced
<-ctx.Done()
// wait for compliance engine and event loop to shut down
<-util.AllDone(components.prop, components.sync, components.voteAggregator, components.timeoutAggregator)
Member commented:
Can it shut itself down before the other components shut down?

Would it miss a few events if it's shut down earlier?

jordanschalm (Member, Author) replied:

> Can it shut itself down before the other components shut down?

Do you mean, "can the epochmgr engine shut down before the epoch components shut down?" (What is "it" referring to?)

> Would it miss a few events if it's shut down earlier?

Yes, although I think we can miss events regardless of the order in which we shut down. Shutting down means we are either turning off an old cluster consensus or shutting down the whole node. In neither case are missed events problematic from my perspective.

Member replied:

Oh, I see. The ctx is actually the shutdown signal itself.

I thought it was an engine's Done channel; actually, we are moving away from that pattern.
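
For readers following along, here is a minimal, self-contained sketch of the lifecycle being discussed: the context passed to the worker is the shutdown signal, so the worker blocks on ctx.Done() and only then waits for its sub-components to finish. The subcomponent type and allDone helper below are illustrative stand-ins for the PR's epoch components and util.AllDone, not flow-go's actual types.

package main

import (
	"context"
	"fmt"
	"time"
)

// subcomponent stands in for the epoch components (prop, sync, vote
// aggregator, ...). done closes once the component has fully shut down.
type subcomponent struct {
	done chan struct{}
}

func (s *subcomponent) start(ctx context.Context) {
	go func() {
		<-ctx.Done()                 // cancellation of ctx is the shutdown signal
		time.Sleep(time.Millisecond) // pretend to flush in-flight work
		close(s.done)
	}()
}

// allDone mirrors the idea behind util.AllDone: it returns a channel that
// closes once every sub-component has finished shutting down.
func allDone(subs ...*subcomponent) <-chan struct{} {
	out := make(chan struct{})
	go func() {
		for _, s := range subs {
			<-s.done
		}
		close(out)
	}()
	return out
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	prop := &subcomponent{done: make(chan struct{})}
	sync := &subcomponent{done: make(chan struct{})}
	prop.start(ctx)
	sync.start(ctx)

	cancel()              // commence shutdown
	<-ctx.Done()          // wait for shutdown to be commenced
	<-allDone(prop, sync) // wait for all sub-components to shut down
	fmt.Println("parent finished only after its sub-components")
}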

codecov-commenter commented Sep 20, 2022:

Codecov Report

Merging #3248 (1f8f2e9) into feature/active-pacemaker (03fb58b) will increase coverage by 0.00%.
The diff coverage is 67.08%.

@@                    Coverage Diff                    @@
##           feature/active-pacemaker    #3248   +/-   ##
=========================================================
  Coverage                     55.26%   55.27%           
=========================================================
  Files                           746      748    +2     
  Lines                         69325    69350   +25     
=========================================================
+ Hits                          38314    38330   +16     
- Misses                        27880    27889    +9     
  Partials                       3131     3131           
Flag       Coverage Δ
unittests  55.27% <67.08%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ
module/epochs/base_client.go                              0.00% <0.00%> (ø)
module/epochs/qc_client.go                                0.00% <0.00%> (ø)
module/epochs/qc_voter.go                                58.82% <5.00%> (-6.28%) ⬇️
engine/collection/synchronization/engine.go              68.40% <50.00%> (-0.58%) ⬇️
engine/consensus/compliance/engine.go                    63.79% <70.00%> (+2.38%) ⬆️
module/epochs/errors.go                                  72.72% <72.72%> (ø)
engine/collection/compliance/engine.go                   66.48% <74.62%> (+2.92%) ⬆️
engine/collection/epochmgr/engine.go                     74.23% <76.55%> (-2.10%) ⬇️
engine/collection/epochmgr/epoch_components.go          100.00% <100.00%> (ø)
...e/orchestrator/mock_corruptible_conduit_factory.go    80.59% <0.00%> (-2.99%) ⬇️
... and 9 more



@jordanschalm jordanschalm marked this pull request as ready for review September 23, 2022 15:44
jordanschalm (Member, Author) commented Sep 23, 2022:

CI Test Failure:

--- FAIL: TestClusterSwitchover_MultiCollectorCluster (11.04s)
    cluster_switchover_test.go:262: expecting transaction 08f3dc749f467a58c4e7da5a4d5607df6defa9403dd7c276517d143f962d70a7 in epoch 1 for cluster 0
    cluster_switchover_test.go:403: got guarantee from 909b9654e4c1256195ff459ab9f56a3135d403f886f8c53e5bc5355085ced986
    cluster_switchover_test.go:403: got guarantee from 4e4cdb4c29b887885c54c628be0daab61d8602df492034f66a15d8dfa87e6b96
    cluster_switchover_test.go:262: expecting transaction ec7bf89c3d4b59ce6977d34d02a26b8792c9cade28b998137172317aca7513da in epoch 2 for cluster 0
    cluster_switchover_test.go:403: got guarantee from 909b9654e4c1256195ff459ab9f56a3135d403f886f8c53e5bc5355085ced986
    cluster_switchover_test.go:403: got guarantee from 4e4cdb4c29b887885c54c628be0daab61d8602df492034f66a15d8dfa87e6b96
    irrecoverable.go:17: 
        	Error Trace:	/home/runner/work/flow-go/flow-go/engine/collection/test/irrecoverable.go:17
        	            				/home/runner/work/flow-go/flow-go/engine/collection/test/asm_amd64.s:1594
        	Error:      	Received unexpected error:
        	            	could not process proposal feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226: could not add block (feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226) to vote aggregator: could not process block: feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226, internal error updating VoteProcessor's status from VoteCollectorStatusCaching to VoteCollectorStatusVerifying for block feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226: failed to create VerifyingVoteProcessor for block feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226: instantiating vote processor for block feee1c1dae702e7645966b68249d6d7d230b1d573b2001476fa10840fa556226 failed: error retrieving consensus participants: could not retrieve resource: key not found
        	Test:       	TestClusterSwitchover_MultiCollectorCluster
        	Messages:   	observed unexpected irrecoverable error
FAIL
coverage: [no statements]
FAIL	github.com/onflow/flow-go/engine/collection/test	32.680s

CI Run

⚠️ Problem with the Collection-Node version: we can't guarantee we'll know the Identities needed to satisfy IdentitiesByBlock.

It is only a problem in the tests, because cluster.Mutator checks that the reference block ID is known.

durkmurder (Member) left a comment:

First batch of comments

return e.lm.Stopped()
}
e.log.Info().Msg("starting hotstuff")
e.core.hotstuff.Start(ctx)
Member commented:

Since both HotStuff and compliance are separate modules which implement component.Component, do we want to start HotStuff from the compliance engine? I would say we could start it as a separate component; that would be more explicit.

Member commented:

Moreover, it will simplify the design of the compliance engine.

jordanschalm (Member, Author) replied:

Totally agree; it might be better to do that along with MessageHub, though. I feel like changing it now would largely get clobbered by the MessageHub rewiring.
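
As context for the two wiring options being discussed, here is a minimal sketch contrasting starting HotStuff from inside the compliance engine versus starting both as sibling components from the node builder. All type names are illustrative stand-ins, not flow-go's actual types.

package main

import "context"

// component stands in for component.Component: something that can be started
// with a context that doubles as its shutdown signal.
type component interface {
	Start(ctx context.Context)
}

type hotstuff struct{}

func (h *hotstuff) Start(ctx context.Context) { /* run consensus loops until ctx is done */ }

// Option A (as in this PR): the compliance engine owns HotStuff and starts it
// from within its own startup logic, so HotStuff's lifecycle is hidden inside
// the compliance engine.
type complianceWithHotstuff struct {
	hotstuff component
}

func (e *complianceWithHotstuff) Start(ctx context.Context) {
	e.hotstuff.Start(ctx)
	// ... start compliance workers ...
}

// Option B (the suggestion): compliance only starts its own workers, and the
// node builder starts HotStuff and compliance as explicit sibling components.
type compliance struct{}

func (e *compliance) Start(ctx context.Context) { /* start compliance workers only */ }

func startNode(ctx context.Context, components ...component) {
	for _, c := range components {
		c.Start(ctx) // each lifecycle is visible and managed at the top level
	}
}

func main() {
	ctx := context.Background()
	startNode(ctx, &hotstuff{}, &compliance{})
}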

Resolved review threads (outdated): engine/collection/epochmgr/epoch_components.go (2), engine/collection/epochmgr/engine.go (3)
Resolved review thread: network/errors.go
}

e.cm = component.NewComponentManagerBuilder().
AddWorker(e.handleEpochSetupPhaseStartedLoop).
Member commented:

Do we really need three different worker threads to dispatch very infrequent events? It feels unnecessary.
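
To make the alternative concrete, here is a minimal sketch of collapsing the dispatch loops into a single worker that selects over all the (infrequent) epoch-lifecycle events. The channel names are illustrative, not the engine's actual fields; the trade-off is that events are handled serially by one goroutine, which seems acceptable for events this rare.

package main

import (
	"context"
	"fmt"
	"time"
)

// epochEventsLoop is a single worker that dispatches all infrequent
// epoch-lifecycle notifications, instead of one worker per notification type.
func epochEventsLoop(ctx context.Context, setupPhaseStarted, commitPhaseStarted, epochTransition <-chan struct{}) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-setupPhaseStarted:
			fmt.Println("handle EpochSetup phase started")
		case <-commitPhaseStarted:
			fmt.Println("handle EpochCommitted phase started")
		case <-epochTransition:
			fmt.Println("handle epoch transition")
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	setup := make(chan struct{}, 1)
	commit := make(chan struct{}, 1)
	transition := make(chan struct{}, 1)

	setup <- struct{}{} // simulate an EpochSetup phase notification
	go epochEventsLoop(ctx, setup, commit, transition)
	time.Sleep(10 * time.Millisecond) // give the loop a moment to handle it
}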

Resolved review thread (outdated): engine/collection/epochmgr/engine.go
e.comp.SubmitLocal(synced)
// forward the block to the compliance engine for validation and processing
// we use the network.MessageProcessor interface here because the block is un-validated
err := e.comp.Process(channels.SyncCluster(block.Header.ChainID), originID, synced)
Member commented:

I think we need to pass me.NodeID() here instead of originID; the original sender is already part of SyncedClusterBlock.

jordanschalm (Member, Author) replied:

Yeah, I feel like we're mixing practices a bit here.

  • We decided to use the Process API because the messages have not been validated (they should be treated as originating from another node).
  • But passing me.NodeID() as originID indicates the message originates from this node (trusted).

The MessageProcessor interface says:

// MessageProcessor represents a component which receives messages from the
// networking layer. Since these messages come from other nodes, which may
// be Byzantine, implementations must expect and handle arbitrary message inputs
// (including invalid message types, malformed messages, etc.). Because of this,
// node-internal messages should NEVER be submitted to a component using Process.

Overall I think the API doesn't fit well with this usage pattern, where an untrusted message passes through an internal component without being validated first. Though setting originID to the local node ID for an untrusted message seems like the worse of two not-great options here.

Member replied:

I am not sure here. We need to define clear rules for how we should treat such cases.
For instance, the sync engine doesn't do any checks, so technically it's untrusted input; on the other hand, it didn't come from the network layer but from the sync engine.
I would treat this case as the general one, as if the message came from an unknown origin, and only use trusted callbacks when we actually need different processing logic.

jordanschalm (Member, Author) replied:

OK, for now I will do that and pass me.NodeID with a comment.
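
For illustration, here is a self-contained sketch of the convention agreed on above; the types are simplified stand-ins (NodeID for flow.Identifier, messageProcessor for network.MessageProcessor), not the actual flow-go interfaces, and the forwarding function is hypothetical.

package main

import "fmt"

// NodeID is a simplified stand-in for flow.Identifier.
type NodeID string

// messageProcessor is a simplified stand-in for network.MessageProcessor:
// Process must treat every message as untrusted, whatever originID says.
type messageProcessor interface {
	Process(channel string, originID NodeID, message interface{}) error
}

type complianceEngine struct{}

func (c *complianceEngine) Process(channel string, originID NodeID, message interface{}) error {
	fmt.Printf("processing untrusted message on %q, delivered by %s\n", channel, originID)
	return nil
}

// forwardSyncedBlock shows the agreed convention: the sync engine forwards the
// still-unvalidated block through the untrusted Process API, but sets originID
// to the local node ID, because the message reaches compliance via an internal
// component; the true sender is carried inside the synced-block message itself.
func forwardSyncedBlock(comp messageProcessor, localID NodeID, syncedBlock interface{}) error {
	return comp.Process("sync-cluster", localID, syncedBlock)
}

func main() {
	_ = forwardSyncedBlock(&complianceEngine{}, NodeID("local-node"), struct{}{})
}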

durkmurder (Member) left a comment:

Looks good to me, thanks, very solid. The only thing is we need to decide on a message-passing pattern between different components; otherwise all good.

jordanschalm (Member, Author):

bors merge

@bors bors bot merged commit 42c7f37 into feature/active-pacemaker Oct 4, 2022
@bors bors bot deleted the jordan/6263-compliance-component-pattern-ln-engine branch October 4, 2022 19:28
bors bot added a commit that referenced this pull request Oct 12, 2022
3294: Front-Load HotStuff Verification Pt. 1: Update Compliance Engines r=jordanschalm a=jordanschalm

This PR modifies compliance engines to front-load HotStuff verification:
* Once a block connects to the finalized state, we validate the proposal using HotStuff first, and only then attempt to extend the state (compliance checks)
* This means, all blocks in storage have been fully validated (there is no need for a separate `MarkValid` step)

⚠️ Depends on #3248
👉  Removing `MarkValid` will be done in a separate PR: #3303

Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
bors bot added a commit that referenced this pull request Oct 12, 2022
3294: Front-Load HotStuff Verification Pt. 1: Update Compliance Engines r=jordanschalm a=jordanschalm

This PR modifies compliance engines to front-load HotStuff verification:
* Once a block connects to the finalized state, we validate the proposal using HotStuff first, and only then attempt to extend the state (compliance checks)
* This means, all blocks in storage have been fully validated (there is no need for a separate `MarkValid` step)

⚠️ Depends on #3248
👉  Removing `MarkValid` will be done in a separate PR: #3303

Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
bors bot added a commit that referenced this pull request Oct 21, 2022
3294: Front-Load HotStuff Verification Pt. 1: Update Compliance Engines r=jordanschalm a=jordanschalm

This PR modifies compliance engines to front-load HotStuff verification:
* Once a block connects to the finalized state, we validate the proposal using HotStuff first, and only then attempt to extend the state (compliance checks)
* This means, all blocks in storage have been fully validated (there is no need for a separate `MarkValid` step)

⚠️ Depends on #3248
👉  Removing `MarkValid` will be done in a separate PR: #3303

Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
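
To illustrate the ordering described in the referenced PR above, here is a minimal, self-contained sketch (illustrative names only, not the actual flow-go compliance code): a proposal that connects to the finalized state is validated by HotStuff first, and only then is the state extended, so every stored block has already been fully validated and no separate MarkValid step is needed.

package main

import (
	"errors"
	"fmt"
)

type block struct {
	id                  string
	connectsToFinalized bool
}

// validateWithHotStuff stands in for the HotStuff proposal validation step.
func validateWithHotStuff(b block) error {
	if b.id == "" {
		return errors.New("invalid proposal")
	}
	return nil
}

// extendState stands in for the compliance checks that extend protocol state.
func extendState(b block) error {
	fmt.Printf("block %s added to state (already fully validated)\n", b.id)
	return nil
}

// onBlockProposal sketches the front-loaded ordering: HotStuff validation
// happens before the state is extended.
func onBlockProposal(b block) error {
	if !b.connectsToFinalized {
		return fmt.Errorf("block %s does not connect to the finalized state yet", b.id)
	}
	if err := validateWithHotStuff(b); err != nil {
		return fmt.Errorf("invalid proposal %s: %w", b.id, err)
	}
	return extendState(b)
}

func main() {
	_ = onBlockProposal(block{id: "feee1c1d", connectsToFinalized: true})
}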
This pull request was closed.