
Audit Logging support K8s NetworkPolicy #4047

Merged: 12 commits merged into antrea-io:main on Aug 9, 2022
Conversation

@qiyueyao (Contributor) commented Jul 21, 2022

Previously, audit logging only worked for Antrea NetworkPolicy and ClusterNetworkPolicy, by setting the enableLogging field in the rule. This PR addresses this issue.

A method overview of this PR is available in this design doc.

To enable logging for K8s NetworkPolicies, set the annotation policy.antrea.io/enable-np-logging: "true" in the Namespace's metadata. K8s NetworkPolicies applied in that Namespace will then be logged to /var/log/antrea/networkpolicy/np.log.
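For example, the annotation can be applied with a Namespace manifest like the following (the Namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: default   # illustrative; any Namespace can opt in
  annotations:
    policy.antrea.io/enable-np-logging: "true"
```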

codecov bot commented Jul 21, 2022

Codecov Report

Merging #4047 (dca7cee) into main (454df5d) will increase coverage by 3.89%.
The diff coverage is 70.54%.

❗ Current head dca7cee differs from pull request most recent head 6b8a815. Consider uploading reports for the commit 6b8a815 to get more accurate results

Impacted file tree graph

```diff
@@            Coverage Diff             @@
##             main    #4047      +/-   ##
==========================================
+ Coverage   64.25%   68.15%   +3.89%
==========================================
  Files         294      297       +3
  Lines       44805    45699     +894
==========================================
+ Hits        28789    31145    +2356
+ Misses      13687    12202    -1485
- Partials     2329     2352      +23
```

| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| e2e-tests | 40.68% <58.75%> (?) | |
| integration-tests | 35.44% <42.46%> (?) | Carriedforward from 26ac871 |
| kind-e2e-tests | 51.08% <71.42%> (+0.79%) ⬆️ | Carriedforward from 26ac871 |
| unit-tests | 44.37% <51.30%> (+0.04%) ⬆️ | Carriedforward from 26ac871 |

*This pull request uses carry forward flags. Click here to find out more.

| Impacted Files | Coverage Δ |
| --- | --- |
| multicluster/cmd/multicluster-controller/leader.go | 0.00% <0.00%> (ø) |
| multicluster/cmd/multicluster-controller/member.go | 0.00% <0.00%> (ø) |
| pkg/agent/openflow/client.go | 70.30% <ø> (+4.26%) ⬆️ |
| pkg/agent/controller/networkpolicy/reconciler.go | 71.51% <16.66%> (ø) |
| ...g/controller/networkpolicy/clusternetworkpolicy.go | 61.85% <55.55%> (-6.33%) ⬇️ |
| ...uster/controllers/multicluster/stale_controller.go | 69.09% <64.51%> (+17.37%) ⬆️ |
| pkg/agent/openflow/network_policy.go | 79.46% <88.57%> (+1.89%) ⬆️ |
| ...kg/agent/controller/networkpolicy/audit_logging.go | 90.71% <90.00%> (+6.68%) ⬆️ |
| ...lers/multicluster/commonarea/remote_common_area.go | 69.79% <100.00%> (+42.04%) ⬆️ |
| pkg/agent/controller/networkpolicy/packetin.go | 76.35% <100.00%> (+3.37%) ⬆️ |

... and 66 more

@qiyueyao qiyueyao force-pushed the log-k8s-drop branch 2 times, most recently from 185d85e to 66fdf99 on July 26, 2022
@qiyueyao qiyueyao added the area/network-policy label Jul 26, 2022
@qiyueyao qiyueyao added this to the Antrea v1.8 release milestone Jul 26, 2022
Resolved review threads: docs/antrea-network-policy.md (×7), pkg/agent/openflow/network_policy.go (×2), pkg/agent/openflow/pipeline.go, test/e2e/framework.go, pkg/agent/controller/networkpolicy/audit_logging.go
@tnqn tnqn added the kind/api-change, kind/feature and action/release-note labels, then removed kind/api-change, Aug 2, 2022
@qiyueyao qiyueyao requested a review from wenyingd August 2, 2022 05:54
@qiyueyao (Contributor, Author) commented Aug 4, 2022

Reviewers, there's a problem with the Namespace UPDATE event. When modifying the drop flow in an UPDATE event, we might need to compare the previous drop flow with the newly generated one to determine whether a modification is necessary. Code here.
Currently the approach is to always modify the drop flow, due to a lack of available information. If we want to compare the flows, we need to change the Flow struct to add useful methods, which involves many more changes to the OVS struct.
Any suggestions or thoughts? Thanks.

Resolved review thread: pkg/controller/networkpolicy/clusternetworkpolicy.go
```go
if oldNamespace.Annotations[EnableNPLoggingAnnotationKey] != curNamespace.Annotations[EnableNPLoggingAnnotationKey] {
    affectedNPs, err := n.networkPolicyLister.NetworkPolicies(curNamespace.Name).List(labels.Everything())
    if err != nil {
        klog.Errorf("Error fetching NetworkPolicies in the Namespace %s: %v", curNamespace.Name, err)
```
Member: Please use structured logging for new logs. But here the error can be ignored: this is an in-memory operation, and the List call will never return an error.
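For reference, a structured-logging equivalent with klog (a sketch, not the committed code) would be:

```go
// Structured form: message plus key/value pairs instead of format verbs.
klog.ErrorS(err, "Error fetching NetworkPolicies in the Namespace", "namespace", curNamespace.Name)
```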

```go
        return
    }
    for _, np := range affectedNPs {
        n.updateNetworkPolicy(np, np)
```
Member: Calling the event handler directly would introduce another race condition similar to the ones I'm trying to fix via #4028:

  1. updateNamespace gets the NP from the lister and calls updateNetworkPolicy.
  2. An NP update event updates the NP in the lister and triggers updateNetworkPolicy with the new object.
  3. The updateNetworkPolicy triggered in step 2 finishes and updates the internal NP.
  4. The updateNetworkPolicy triggered in step 1 finishes and overrides the internal NP with a stale state.

But I don't have a good solution to quickly fix it in this patch without refactoring the workflow like I did in #4028. Perhaps we could handle the annotation update case later and wait for #4028 to refactor the workflow first; otherwise we would risk NetworkPolicies working in a stale state.

@tnqn (Member) commented Aug 5, 2022

> Reviewers, there's a problem with the Namespace UPDATE event. When modifying the drop flow in an UPDATE event, we might need to compare the previous drop flow with the newly generated one to determine whether a modification is necessary. Code here. Currently the approach is to always modify the drop flow, due to a lack of available information. If we want to compare the flows, we need to change the Flow struct to add useful methods, which involves many more changes to the OVS struct. Any suggestions or thoughts? Thanks.

If enableLogging changes, the ID of the rule also changes: the original rule would be deleted and the new rule would be installed:

```go
rule := &rule{
    Direction:       r.Direction,
    From:            r.From,
    To:              r.To,
    Services:        r.Services,
    Action:          r.Action,
    Priority:        r.Priority,
    PolicyPriority:  policy.Priority,
    TierPriority:    policy.TierPriority,
    AppliedToGroups: appliedToGroups,
    Name:            r.Name,
    PolicyUID:       policy.UID,
    SourceRef:       policy.SourceRef,
    EnableLogging:   r.EnableLogging,
}
rule.ID = hashRule(rule)
```

So it's already handled. Otherwise the current code wouldn't work, because the reconciler doesn't call functions like AddPolicyRuleAddress for a previously installed rule if there is no address change.

I guess the linked code change is not needed?
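Since EnableLogging is one of the hashed fields, toggling the Namespace annotation yields a different rule ID. A minimal illustration of the idea (a sketch; Antrea's actual hashRule implementation differs):

```go
import (
    "fmt"
    "hash/fnv"
)

// rule is a trimmed stand-in for the struct above; the point is that
// EnableLogging participates in the hash, so toggling it changes the ID.
type rule struct {
    Name          string
    EnableLogging bool
    // ... other fields elided
}

func hashRule(r *rule) string {
    h := fnv.New32a()
    fmt.Fprintf(h, "%v", *r) // every field feeds the hash
    return fmt.Sprintf("%x", h.Sum32())
}
```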

@qiyueyao (Contributor, Author) commented Aug 6, 2022

> So it's already handled. Otherwise the current code wouldn't work, because the reconciler doesn't call functions like AddPolicyRuleAddress for a previously installed rule if there is no address change.
>
> I guess the linked code change is not needed?

Probably I didn't express it clearly. For the flow generated directly by the rule, it indeed changes due to the rule ID update, so the IngressRuleTable is updated correspondingly.
However, for K8s NetworkPolicies, the default drop flow in IngressDefaultTable would not be updated, since the address has not changed. The pathway is Reconcile -> add -> installOFRule -> InstallPolicyRuleFlows -> calculateMatchFlowChangesForRule / calculateChangesForRuleCreation -> addAddrFlows -> addConjunctiveMatchFlow. This function looks up globalConjMatchFlowCache, whose key depends only on tableID, priority and matchPairs; it does not depend on enableLogging.
I did some tests as well. Code here is the place to update the default drop flow for a K8s NP, but the problem remains whether to compare the flows so that we only update when necessary. If we need to compare but there is no easy way to modify Flow, then I believe the UPDATE event should be marked TODO, to be solved together with the race condition in a future PR. Thanks.
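The mismatch can be sketched like this (illustrative types and field names, not Antrea's exact code): because the cache key omits enableLogging, an annotation change alone always hits the existing cache entry, so the drop flow is never regenerated.

```go
import (
    "sort"
    "strings"
)

// conjMatchKey mirrors the idea that globalConjMatchFlowCache is keyed only
// by table, priority and match pairs.
type conjMatchKey struct {
    tableID  uint8
    priority uint16
    matches  string
}

func makeConjMatchKey(tableID uint8, priority uint16, matchPairs []string) conjMatchKey {
    sorted := append([]string(nil), matchPairs...)
    sort.Strings(sorted) // canonical, order-independent encoding
    // enableLogging is deliberately absent: toggling it yields the same key,
    // so the cached drop flow is found and left untouched.
    return conjMatchKey{tableID: tableID, priority: priority, matches: strings.Join(sorted, ",")}
}
```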

@tnqn (Member) commented Aug 8, 2022

@qiyueyao thanks for the explanation. Now I understand: the problem is that the dropFlow is not specific to a rule but to a match condition. Then would it work if we store enableLogging in conjMatchFlowContext and update the dropFlow when it changes?

@GraysonWu (Contributor) commented:

I think Quan's suggestion could work. Since we only use it for dropFlow, maybe name it something like dropFlowEnableLogging?
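A minimal sketch of the agreed approach (illustrative names; Antrea's conjMatchFlowContext has many more fields): cache the logging state next to the drop flow, so an annotation change can be detected and the flow modified in place.

```go
// Flow is a stand-in for Antrea's OVS flow type.
type Flow interface{}

type conjMatchFlowContext struct {
    dropFlow              Flow
    dropFlowEnableLogging bool
}

// updateDropFlowIfNeeded regenerates the drop flow only when the desired
// logging state differs from what was installed with the cached flow.
// It returns the new flow and whether a modification is needed.
func (c *conjMatchFlowContext) updateDropFlowIfNeeded(enableLogging bool, generate func(bool) Flow) (Flow, bool) {
    if c.dropFlow == nil || c.dropFlowEnableLogging == enableLogging {
        return nil, false
    }
    c.dropFlow = generate(enableLogging)
    c.dropFlowEnableLogging = enableLogging
    return c.dropFlow, true
}
```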

Signed-off-by: Qiyue Yao <yaoq@vmware.com> (on each of the 12 commits)
@qiyueyao (Contributor, Author) commented Aug 9, 2022

Storing enableLogging in conjMatchFlowContext works; the latest commit updates that and is ready. Thanks!

```diff
@@ -659,6 +659,9 @@ for all NetworkPolicies in the Namespace. Packets of any connection that match
 a NetworkPolicy rule will be logged with a reference to the NetworkPolicy name,
 but packets dropped by the implicit "default drop" (not allowed by any NetworkPolicy)
 will only be logged with consistent name `K8sNetworkPolicy` for reference.
+Note that currently, Antrea only retrieves the logging Annotation once when adding
+NetworkPolicies and in case of agent restart, users should not update Namespace
+logging Annotations, otherwise it would risk NetworkPolicies working in a stale state.
```
Contributor: As a reader (probably a bad one :)), I understand this as: if an agent is restarted, users should no longer update the Namespace logging annotation.

If I am not mistaken, the concept to convey is that events where the Namespace annotation is processed concurrently with the NetworkPolicy, such as an agent restart, might risk realizing stale NetworkPolicy info. If my understanding is correct, we might consider a little rephrasing above.

Member: @salv-orlando The race condition was explained here: #4047 (comment). I think we should remove this restriction in this release. It would be a simple change after #4028 is merged. I was waiting for these feature PRs to be merged first; otherwise all of them would need to resolve many conflicts caused by the #4028 refactoring.

Contributor: I do not get what exact issues can happen under what exact conditions either. Could you explain more? @qiyueyao

Contributor: I guess we are talking about two issues here:

  1. We do not handle annotation changes after a NetworkPolicy is created.
  2. On agent restart (before? after? during?), users cannot change the Namespace annotation, otherwise we can run into some issues (but what issues? What does "working in a stale state" mean?).

Contributor (Author): Discussed offline and opened follow-up PR #4099.

```diff
@@ -731,6 +734,15 @@ func (c *clause) addConjunctiveMatchFlow(featureNetworkPolicy *featureNetworkPol
 			changeType: insertion,
 		}
 	}
+} else if context.dropFlowEnableLogging != enableLogging {
```
Contributor: I assume this will also work if the Namespace annotation is changed while the Antrea agent is down, as this code will be invoked upon agent restart. Is that correct?

Contributor (Author): Actually, there is another pathway for agent restart: BatchInstallPolicyRuleFlows -> addRuleToConjunctiveMatch -> addActionToConjunctiveMatch, which is specific to batch install and is triggered upon Antrea agent restart.

```go
        if ns.Status.Phase == corev1.NamespaceTerminating {
            return fmt.Errorf("error when creating '%s' Namespace: namespace exists but is in 'Terminating' phase", namespace)
        }
    }
    return nil
}

// createNamespaceWithAnnotations creates the namespace with Annotations.
func (data *TestData) UpdateNamespace(namespace string, mutateFunc func(*corev1.Namespace)) error {
    ns, _ := data.clientset.CoreV1().Namespaces().Get(context.TODO(), namespace, metav1.GetOptions{})
```
Member: It should not ignore the error, since this accesses the K8s API, which could fail for several reasons; otherwise the following code would panic because ns is nil.
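A minimal sketch of the error handling the reviewer is asking for, assuming UpdateNamespace mutates the object and writes it back (the body past the Get call is inferred, not quoted from the PR):

```go
func (data *TestData) UpdateNamespace(namespace string, mutateFunc func(*corev1.Namespace)) error {
    // Propagate the Get error instead of discarding it, so a failed API call
    // cannot leave ns nil and panic below.
    ns, err := data.clientset.CoreV1().Namespaces().Get(context.TODO(), namespace, metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("error when getting Namespace '%s': %v", namespace, err)
    }
    mutateFunc(ns)
    if _, err := data.clientset.CoreV1().Namespaces().Update(context.TODO(), ns, metav1.UpdateOptions{}); err != nil {
        return fmt.Errorf("error when updating Namespace '%s': %v", namespace, err)
    }
    return nil
}
```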

```go
// testAuditLoggingEnableNP tests that audit logs are generated when K8s NP is applied
// tests both Allow traffic by K8s NP and Drop traffic by implicit K8s policy drop
func testAuditLoggingEnableNP(t *testing.T, data *TestData) {
    data.updateNamespaceWithAnnotations(namespaces["x"], map[string]string{networkpolicy.EnableNPLoggingAnnotationKey: "true"})
```
Member: The returned error should be checked:

```go
failOnError(data.updateNamespaceWithAnnotations(namespaces["x"], map[string]string{networkpolicy.EnableNPLoggingAnnotationKey: "true"}), t)
```

Contributor (Author): Changed in follow-up PR #4099.

@tnqn (Member) commented Aug 9, 2022

/test-all

@tnqn (Member) left a review: Will merge the PR as is. We could improve the tests via a separate PR.

@tnqn tnqn merged commit 3454520 into antrea-io:main Aug 9, 2022
@tnqn (Member) commented Aug 9, 2022

@qiyueyao I had to rewrite the commit message and close the corresponding issue manually when merging the PR. Could you please write a detailed commit message (and perhaps squash commits in a timely manner) for future PRs? If you prefer appending commits instead of force-pushing, the first commit should have a proper commit message that can be used when merging.

The PR description and commit message should use the well-known format to link the issue, like "Fixes #XXX", instead of a hyperlink ("This PR is to address this issue"), which cannot close the issue automatically. You can find more details here.

@qiyueyao qiyueyao deleted the log-k8s-drop branch August 10, 2022 22:18
@qiyueyao qiyueyao added the action/backport label Sep 3, 2022
@qiyueyao qiyueyao restored the log-k8s-drop branch March 29, 2023 00:17
@qiyueyao qiyueyao deleted the log-k8s-drop branch March 29, 2023 00:44
Labels: action/backport, action/release-note, area/network-policy, kind/feature