
IPPool counters #3072

Merged: 1 commit merged into antrea-io:main on Aug 2, 2022

Conversation

@ksamoray (Contributor) commented Dec 1, 2021

Add IPPools usage counters, and expose them via CRD.

Signed-off-by: Kobi Samoray ksamoray@vmware.com

@codecov-commenter commented Dec 1, 2021

Codecov Report

Merging #3072 (1d4e269) into main (231b09d) will increase coverage by 2.88%.
The diff coverage is 77.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3072      +/-   ##
==========================================
+ Coverage   64.40%   67.28%   +2.88%     
==========================================
  Files         293      297       +4     
  Lines       43658    45365    +1707     
==========================================
+ Hits        28116    30526    +2410     
+ Misses      13253    12461     -792     
- Partials     2289     2378      +89     
Flag Coverage Δ
e2e-tests 40.66% <5.40%> (?)
integration-tests 36.06% <0.00%> (?)
kind-e2e-tests 50.39% <77.02%> (-0.37%) ⬇️
unit-tests 44.33% <77.02%> (+0.16%) ⬆️
Impacted Files Coverage Δ
pkg/controller/ipam/antrea_ipam_controller.go 77.89% <73.84%> (+1.47%) ⬆️
pkg/controller/externalippool/controller.go 91.51% <100.00%> (+6.69%) ⬆️
pkg/ipam/ipallocator/allocator.go 88.14% <100.00%> (-0.40%) ⬇️
pkg/ipam/poolallocator/allocator.go 55.71% <100.00%> (+7.52%) ⬆️
...s/multicluster/memberclusterannounce_controller.go 56.64% <0.00%> (-15.16%) ⬇️
pkg/agent/multicast/mcast_controller.go 44.92% <0.00%> (-7.02%) ⬇️
pkg/util/k8s/node.go 80.90% <0.00%> (-6.72%) ⬇️
pkg/agent/multicast/mcast_discovery.go 56.44% <0.00%> (-5.17%) ⬇️
pkg/controller/networkpolicy/tier.go 50.00% <0.00%> (-5.00%) ⬇️
pkg/agent/controller/networkpolicy/reject.go 82.48% <0.00%> (-3.39%) ⬇️
... and 70 more

Usage IPPoolUsage `json:"usage,omitempty"`
}

type IPPoolUsage struct {
Contributor:

Do you think we can combine this type with ExternalIPPoolUsage?

Contributor Author:

Technically this is doable, of course. I don't know enough about the external pool functionality to conclude whether these two features will have the same counters forever.
@tnqn do you have an opinion?

Member:

I think it's a good suggestion. They should be the same from the pool's perspective; I can't think of any difference between the two cases for now.
I'm fine with making ExternalIPPoolStatus use IPPoolUsage if it doesn't impact API compatibility.
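
For reference, a minimal sketch of what a shared usage type could look like; the field names and the third type below are assumptions for illustration, not necessarily the API that was finally merged:

// Sketch only: one usage type shared by IPPoolStatus and ExternalIPPoolStatus.
type IPPoolUsage struct {
	// Total is the total number of IPs covered by the pool's ranges.
	Total int `json:"total,omitempty"`
	// Used is the number of IPs currently allocated or reserved from the pool.
	Used int `json:"used,omitempty"`
}

type IPPoolStatus struct {
	Usage IPPoolUsage `json:"usage,omitempty"`
	// ...other status fields...
}

// As suggested above, ExternalIPPoolStatus could reuse the same type,
// provided this does not break API compatibility.
type ExternalIPPoolStatus struct {
	Usage IPPoolUsage `json:"usage,omitempty"`
}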

@ksamoray force-pushed the ipam-stats branch 3 times, most recently from e42b216 to 02fffe9, on December 14, 2021 at 09:10

@@ -132,6 +132,9 @@ func (a *IPPoolAllocator) appendPoolUsage(ipPool *v1alpha2.IPPool, ip net.IP, st
}

newPool.Status.IPAddresses = append(newPool.Status.IPAddresses, usageEntry)
// CRD is not updated yet, therefore the newly allocated IP will not be reflected by Used(). Therefore +1
newPool.Status.Usage.Used = a.Used() + 1
Member:

Why don't we use len(newPool.Status.IPAddresses)?
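
A minimal sketch of what that suggestion could look like at this point in appendPoolUsage; the surrounding names are taken from the diff above, and this is an illustration rather than the merged code:

newPool.Status.IPAddresses = append(newPool.Status.IPAddresses, usageEntry)
// Derive the counter from the entries just written, instead of a separate Used() helper.
newPool.Status.Usage.Used = len(newPool.Status.IPAddresses)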

@@ -160,6 +163,10 @@ func (a *IPPoolAllocator) removePoolUsage(ipPool *v1alpha2.IPPool, ip net.IP) er

newPool.Status.IPAddresses = newList

// CRD is not updated yet, therefore the newly deleted allocation will not be reflected by Used(). Therefore -1
newPool.Status.Usage.Used = a.Used() - 1
Member:

ditto

@@ -384,3 +391,21 @@ func (a *IPPoolAllocator) HasContainer(containerID string) (bool, error) {
}
return false, nil
}

func (a IPPoolAllocator) Used() int {
Member:

Maybe this is not needed if we just use len(newPool.Status.IPAddresses).

_, allocators, _ := a.readPoolAndInitIPAllocators()
total := 0
for _, allocator := range allocators {
total += allocator.Total()
Member:

Instead of reading the pool and initializing the allocators another time, could we use the existing allocators to get Total directly, maybe passing it as an argument of appendPoolUsage and removePoolUsage? And I think you have added a Total method to MultiIPAllocator; we could call it directly.
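
A rough sketch of that suggestion; the extra allocators parameter, the other parameter names, and the MultiIPAllocator type's package are assumptions for illustration only:

// Sketch: pass the already-initialized allocators down instead of re-reading the pool.
func (a *IPPoolAllocator) appendPoolUsage(ipPool *v1alpha2.IPPool, ip net.IP,
	state v1alpha2.IPAddressPhase, allocators ipallocator.MultiIPAllocator) error {
	newPool := ipPool.DeepCopy()
	// ... existing bookkeeping of newPool.Status.IPAddresses and the status update ...
	newPool.Status.Usage.Total = allocators.Total() // no second readPoolAndInitIPAllocators call
	newPool.Status.Usage.Used = len(newPool.Status.IPAddresses)
	return nil
}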

@ksamoray force-pushed the ipam-stats branch 3 times, most recently from 8deb058 to 3010632, on December 19, 2021 at 15:33
}

func (c *AntreaIPAMMetricsHandler) Run(stopCh <-chan struct{}) {
c.ipPoolInformer.Informer().AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{
Contributor:

nit: I notice many places in the code use AddEventHandlerWithResyncPeriod while supplying 0 for resyncPeriod, but I think you can just use AddEventHandler without the resyncPeriod parameter
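
A minimal sketch of that suggestion, assuming a plain shared informer and a caller-supplied enqueue function (the names below are illustrative, not the PR's actual code):

// Sketch (assumed names): register IPPool event handlers without an explicit resync period.
// (import assumed: k8s.io/client-go/tools/cache)
func registerIPPoolHandlers(informer cache.SharedIndexInformer, enqueue func(obj interface{})) {
	// Per the nit above: when no per-handler resync period is needed,
	// the plain AddEventHandler call is enough.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})
}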

func (c *AntreaIPAMMetricsHandler) updateIPPoolCounters(obj interface{}) {
ipPool := obj.(*crdv1a2.IPPool)

allocator, err := poolallocator.NewIPPoolAllocator(ipPool.Name, c.crdClient)
Contributor:

This will need to change a bit once #3046 is merged (pool informer is already included in this class so not a big deal)

Contributor Author:

So maybe it's better if I rebase on top of #3046

total := allocator.Total()

// Used is gathered from IP allocation status within the CRD - as it can be set by each one of the agents
used := len(ipPool.Status.IPAddresses)
Contributor:

I'm not sure whether we want to count reserved addresses here. Those would be the addresses reserved for a StatefulSet but not actually allocated to Pods.

Contributor Author:

Makes sense that we count them: these numbers should reflect the usage, whether an address is reserved or allocated to a Pod.
Otherwise total minus used won't reflect the available IPs in the pool, right?

if pool.Status.Usage.Used == used && pool.Status.Usage.Total == total {
return
}
pool.Status.Usage.Used = used
Contributor:

Would it make more sense to update these counters along with other Status updates (when addresses are actually allocated/released), rather than asynchronously here? This would also save the need for an extra CRD update.

Contributor Author:

I actually tried this; here are the problems, though:
If I update the total in the other status update, it won't catch IPPool creation or updates where an IPRange is added, so for the total it makes sense to use an informer.
As for updating the used count, I could indeed place it where the other updates are done, but it would clutter the code a bit.
Another option is using a mutating webhook, though I didn't try it.
What do you think?

Contributor:

Yes, this makes sense to me, as long as we perform the update here only if totals are changed.

Member:

I guess a mutating webhook doesn't work here because "PUT/POST/PATCH requests to the custom resource ignore changes to the status stanza." Spec and Status cannot be updated via the same request.
At least for spec updates, an asynchronous counter update might be the only way.

But do we need to introduce a metrics controller for this? The name sounds like it's for Prometheus metrics. We don't expose other counters as Prometheus metrics, no real requirement is asking for it, and it's somewhat duplicated with the IPPool status. We already have an AntreaIPAMController which basically handles the IPPool Status stuff; I wonder whether we should just add the logic there.

Contributor Author:

@tnqn the change exposes Prometheus metrics as well as the stats via the CRD. As this PR has been around for a while, I can't recall why that is; I can remove it.
This change has to be revised; maybe discard the Prometheus code as you suggest.

@ksamoray (Contributor Author) commented Mar 3, 2022

@annakhm @tnqn is this code still relevant? Or should I abandon that change?

@annakhm (Contributor) commented Mar 3, 2022

I think it's relevant, yes.

@@ -0,0 +1,3585 @@
apiVersion: apiextensions.k8s.io/v1
Contributor:

It seems this file should be removed.

Contributor Author:

Indeed, thanks!

@github-actions bot commented Jul 3, 2022

This PR is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jul 3, 2022
@tnqn added this to the Antrea v1.8 release milestone on Jul 5, 2022
@tnqn added the api-review label (Categorizes an issue or PR as actively needing an API review.) and removed the lifecycle/stale label on Jul 5, 2022
@tnqn added the action/release-note label (Indicates a PR that should be included in release notes.) on Jul 5, 2022
@ksamoray force-pushed the ipam-stats branch 3 times, most recently from b98890e to b04c91d, on July 21, 2022 at 19:12
Comment on lines 370 to 385
if err != nil {
klog.Warningf("Failed to initialize allocator for IPPool %s", ipPool.Name)
}
Member:

The code shouldn't continue to run if err is not nil; otherwise it would panic.
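
A minimal sketch of the early return being asked for here; the constructor signature is the one shown in an earlier diff and may differ after the #3046 rebase, and the assumption is that the enclosing function returns an error:

allocator, err := poolallocator.NewIPPoolAllocator(ipPool.Name, c.crdClient)
if err != nil {
	// Bail out instead of falling through with a nil allocator.
	klog.ErrorS(err, "Failed to initialize allocator for IPPool", "IPPool", ipPool.Name)
	return err
}
total := allocator.Total()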

pool.Status.Usage.Used = used
pool.Status.Usage.Total = total

_, err = c.crdClient.CrdV1alpha2().IPPools().UpdateStatus(context.TODO(), pool, metav1.UpdateOptions{})
Member:

1. Event handlers are executed sequentially, so doing all the work here may not scale well, especially when there is an external API call.
2. Every update to a pool invokes the event handler once, so doing all the work here is less efficient than doing it in workers: multiple updates to a pool that happen close together in time would trigger only one round of processing in a worker.

A more typical and efficient pattern (see the sketch below) is: event handlers extract the pool's key and enqueue it; multiple workers get keys from the queue and do the actual job.
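
A condensed sketch of that handler-plus-workqueue pattern, reusing the c.statusQueue name already present in this PR; the worker count and the updateIPPoolStatus helper are assumptions for illustration:

// (imports assumed: time, k8s.io/apimachinery/pkg/util/wait,
//  k8s.io/client-go/tools/cache, k8s.io/client-go/util/workqueue)

// Event handlers only enqueue the pool's name; workers do the actual status update.
c.statusQueue = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "ipPoolStatus")

ipPoolInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { c.statusQueue.Add(obj.(*crdv1a2.IPPool).Name) },
	UpdateFunc: func(oldObj, newObj interface{}) { c.statusQueue.Add(newObj.(*crdv1a2.IPPool).Name) },
})

// In Run(): start a few workers that drain the queue until stopCh is closed.
for i := 0; i < 2; i++ {
	go wait.Until(func() {
		for {
			key, quit := c.statusQueue.Get()
			if quit {
				return
			}
			// updateIPPoolStatus is a hypothetical worker that recomputes Total/Used
			// for the named pool and calls UpdateStatus only if the counters changed.
			if err := c.updateIPPoolStatus(key.(string)); err != nil {
				c.statusQueue.AddRateLimited(key)
			} else {
				c.statusQueue.Forget(key)
			}
			c.statusQueue.Done(key)
		}
	}, time.Second, stopCh)
}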

if ipPool.Status.Usage.Used == used && ipPool.Status.Usage.Total == total {
return nil
}
ipPool.Status.Usage.Used = used
Member:

Objects returned here must be treated as read-only as they are shared among all consumers.

ipPoolToUpdate := ipPool.DeepCopy()
ipPoolToUpdate.Status.Usage.Used = used
...
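
Put together, a sketch of the read-only-lister pattern described above; the ipPoolLister field and the way total is obtained are assumptions, while the UpdateStatus call matches the diff:

ipPool, err := c.ipPoolLister.Get(poolName)
if err != nil {
	return err
}
// total is assumed to be computed from the pool's IP ranges (e.g. the allocators' Total()).
used := len(ipPool.Status.IPAddresses) // includes reserved entries, per the discussion above
if ipPool.Status.Usage.Used == used && ipPool.Status.Usage.Total == total {
	return nil
}
// Never mutate the object returned by the lister; it is shared with other consumers.
ipPoolToUpdate := ipPool.DeepCopy()
ipPoolToUpdate.Status.Usage.Used = used
ipPoolToUpdate.Status.Usage.Total = total
_, err = c.crdClient.CrdV1alpha2().IPPools().UpdateStatus(context.TODO(), ipPoolToUpdate, metav1.UpdateOptions{})
return err
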

c.statusQueue.Add(ipPool.Name)
}

func (c *AntreaIPAMController) updateHandler(oldObj, newObj interface{}) {
Member:

It's not registered to the informer.

@tnqn added the kind/feature label (Categorizes issue or PR as related to a new feature.) on Aug 2, 2022
@tnqn (Member) left a comment:

Code LGTM, but I was confused by the new test.

Comment on lines 162 to 164
// This test verifies correct behavior in case of an update conflict. Allocation should be retried,
// taking into account the latest status.
func TestAllocateConflict(t *testing.T) {
Member:

I don't see how this test simulates an update conflict, or how it differs from TestAllocateNext: there is a one-line difference about whether the allocator is checked for nil.

Contributor Author:

Seems like a merge error; that test was present in prior revisions of allocator_test.go. I'll remove it.

Add IPPools usage counters, and expose them via CRD.

Signed-off-by: Kobi Samoray <ksamoray@vmware.com>
@tnqn (Member) left a comment:

LGTM

@tnqn (Member) commented Aug 2, 2022

/test-all

@tnqn (Member) commented Aug 2, 2022

/test-e2e

@tnqn (Member) commented Aug 2, 2022

/skip-e2e (the e2e run failed on the known flaky test ACNPFQDNPolicy)

@tnqn merged commit b66ffda into antrea-io:main on Aug 2, 2022
@ksamoray deleted the ipam-stats branch on August 2, 2022 at 15:34
Labels
action/release-note Indicates a PR that should be included in release notes.
api-review Categorizes an issue or PR as actively needing an API review.
kind/feature Categorizes issue or PR as related to a new feature.

Projects
None yet

5 participants