
adding support for azure managed prometheus #4256

Merged · 30 commits merged into kedacore:main on Mar 6, 2023

Conversation

@raggupta-ms (Contributor) commented Feb 21, 2023

Adding Azure Workload Identity and Azure Pod Identity auth support to the existing Prometheus scaler. This will enable the Azure Managed Prometheus use case with KEDA.
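
To make the mechanism concrete, here is a minimal sketch of the idea, not the PR's actual code: when the trigger uses Azure Workload Identity or Pod Identity, the scaler's HTTP client attaches an AAD bearer token to each query against the managed Prometheus endpoint. The `azureBearerTransport` type, its fields, and the package name are hypothetical names introduced for illustration.

```go
package prometheusauth

import (
	"fmt"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
)

// azureBearerTransport wraps a base RoundTripper and injects an AAD bearer
// token obtained from a workload identity or pod identity credential.
// Hypothetical sketch only; this is not KEDA's actual implementation.
type azureBearerTransport struct {
	base  http.RoundTripper      // underlying transport, e.g. http.DefaultTransport
	cred  azcore.TokenCredential // e.g. an azidentity workload identity credential
	scope string                 // token scope for the Azure Monitor query endpoint
}

func (t *azureBearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Acquire a token for the configured scope.
	tok, err := t.cred.GetToken(req.Context(), policy.TokenRequestOptions{Scopes: []string{t.scope}})
	if err != nil {
		return nil, fmt.Errorf("getting AAD token: %w", err)
	}
	// Clone before mutating: a RoundTripper must not modify the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("Authorization", "Bearer "+tok.Token)
	return t.base.RoundTrip(r)
}
```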

Checklist

Relates to #4153

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@raggupta-ms (Contributor Author)

@tomkerkhove @JorTurFer Created this PR. I am working on the e2e tests, changelog, and documentation in parallel. I understand that this PR will not be merged while those are pending, but I am raising it now so that you can help review the changes and we are not blocked in case you suggest any code changes.

Please take a look. Thanks!

@JorTurFer (Member) left a comment


Looking good in general. I'll wait until everything is ready, but in the meantime I have left some comments inline.
BTW, thanks for this improvement.
For the e2e tests, do you need any help?
You have to open a PR to the infra repo using Terraform in order to create the needed resources.

@tomkerkhove (Member)

> You have to open a PR to the infra repo using Terraform in order to create the needed resources.

Terraform does not support it yet, so we did it manually (as I shared offline with you :))

@raggupta-ms (Contributor Author) commented Feb 21, 2023

> For the e2e tests, do you need any help?

@JorTurFer Thanks for offering the help; I might actually need some! I am a little confused about the flow of the e2e tests. For example, are all tests run in the Azure environment? Is it a requirement to run tests on existing AKS clusters? What if I create a new one for Azure Managed Prometheus: how would nightly and PR runs be affected? Azure Managed Prometheus and native Prometheus might not work on one cluster at the same time. I can think of cleanup, but does the test framework currently support installing and uninstalling an "addon" on an AKS cluster? And maybe a couple more questions.

I am afraid that figuring all that out by myself will take time, and I may miss the next release. Is it possible to connect on a quick call? What would be your preferred time? I can ping you on Slack.

@JorTurFer (Member) commented Feb 21, 2023

> Terraform does not support it yet, so we did it manually (as I shared offline with you :))

We can deploy it using Terraform + ARM, but I wouldn't add infra manually if it isn't the only option. I can tackle that PR for adding the managed Prometheus if @raggupta-ms runs into problems.

@tomkerkhove (Member)

> Terraform does not support it yet, so we did it manually (as I shared offline with you :))

> We can deploy it using Terraform + ARM, but I wouldn't add infra manually if it isn't the only option. I can tackle that PR for adding the managed Prometheus if @raggupta-ms runs into problems.

As mentioned, it's not supported yet :)

@JorTurFer (Member)

> As mentioned, it's not supported yet :)

I'm almost there :)
All the resources are already created; only one thing is still missing. I need to figure out why the cluster can't access Prometheus, but I think all the needed resources are created 😄

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@tomkerkhove (Member)

Thank you @JorTurFer for the automation!

@JorTurFer (Member)

You're welcome. It has been an interesting exercise 😝

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@raggupta-ms (Contributor Author)

> As mentioned, it's not supported yet :)

> I'm almost there :) All the resources are already created; only one thing is still missing. I need to figure out why the cluster can't access Prometheus, but I think all the needed resources are created 😄

Thank you very much @JorTurFer for helping out on this. It was not straightforward, but you made it work :)

@raggupta-ms (Contributor Author) commented Feb 24, 2023

@tomkerkhove @JorTurFer The code changes are done and updated in the PR. Please review.
I am working on the e2e tests and will hopefully update the PR with them by EOD today if all goes well.
I will also update the docs and the other pending tasks in the checklist.

By the way, do you know why the e2e tests are not running on this PR? It seems like they are queued but not getting picked up by any worker. Do I need to do anything special to trigger tests on the PR?

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member)

I'm not sure whether we should place the e2e test inside the Prometheus folder or inside the Azure folder. @kedacore/keda-core-contributors ?

@JorTurFer (Member) commented Feb 27, 2023

/run-e2e prometheus*
Update: You can check the progress here

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) commented Feb 27, 2023

/run-e2e azure_managed_prometheus*
Update: You can check the progress here

@raggupta-ms (Contributor Author)

> I'm not sure whether we should place the e2e test inside the Prometheus folder or inside the Azure folder. @kedacore/keda-core-contributors ?

I added it there since Azure Managed Prometheus doesn't have its own scaler; it is an extension of the existing Prometheus scaler. But I am open to putting it beside the other Azure scalers, since that also makes sense, based on what people think.

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) commented Mar 3, 2023

/run-e2e prometheus*
Update: You can check the progress here

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) left a comment


/run-e2e prometheus*

@JorTurFer (Member) commented Mar 4, 2023

/run-e2e prometheus*
Update: You can check the progress here

@JorTurFer (Member) commented Mar 4, 2023

/run-e2e prometheus*
Update: You can check the progress here

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) commented Mar 5, 2023

/run-e2e prometheus*
Update: You can check the progress here

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) commented Mar 6, 2023

/run-e2e prometheus*
Update: You can check the progress here

@JorTurFer (Member) commented Mar 6, 2023

/run-e2e prometheus*
Update: You can check the progress here

@JorTurFer (Member)

I have noticed that the first try always fails and the second try works; maybe we need to increase timeouts somewhere.


@tomkerkhove (Member)

Given that it works and this is a test optimization, we might want to do that in a separate PR then. Does that make sense?

@JorTurFer (Member) commented Mar 6, 2023

> Given that it works and this is a test optimization, we might want to do that in a separate PR then. Does that make sense?

I think that a test which always needs to be executed twice is a flaky test; it isn't an optimization, IMO.
I'd like to see the e2e test passing correctly on the first execution. The retries are a hack to reduce failing tests, not a solution for flaky tests.

@tomkerkhove (Member)

Agreed, but my point was more that I don't think it should block this PR from being merged; we can do it async?

@zroubalik (Member) commented Mar 6, 2023

We should try to avoid merging flaky tests if possible 🙏 This one seems like an easy fix, so it could easily be included in this PR.

And the current behaviour also extends the execution time of the e2e test suite, which is not great 😄

@raggupta-ms (Contributor Author) commented Mar 6, 2023

@tomkerkhove @JorTurFer @zroubalik Regarding the flakiness of the test, I think it makes sense to increase the timeout. Since this is managed Prometheus running in the cloud, there are delays, roughly like this:

- collector agent reacting to the config map update: 1 min
- scrape interval: 1 min
- ingestion delay: 1-2 min
- query lookback in the ScaledObject: 2 min

So the test can be flaky, as it only waits 5 minutes for the scale out/in. I will increase that timeout to 7 minutes to be safe. Let me know if there are any concerns. It is a small fix; I will update the PR shortly.

Edit:
I just noticed that the timeout in the test is already 10 minutes. I am adding 2 more minutes now; if it is still flaky, we will see. There isn't much I can do about the ingestion delays. I can try modifying the scrape interval or the query lookback if it still fails.
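
To make the arithmetic concrete: the delays listed above can add up to roughly five to six minutes before the metric is even queryable, so a 10-minute polling window leaves little slack. Below is a hedged sketch of the kind of change involved, assuming KEDA's e2e polling helper `WaitForDeploymentReplicaReadyCount(t, kc, name, namespace, replicas, iterations, intervalSeconds)`; the constants and the wrapper function are illustrative, not the PR's exact code.

```go
package azure_managed_prometheus_test

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"k8s.io/client-go/kubernetes"

	. "github.com/kedacore/keda/v2/tests/helper" // assumed home of WaitForDeploymentReplicaReadyCount
)

const (
	deploymentName  = "test-deployment"               // illustrative
	testNamespace   = "azure-managed-prometheus-test" // illustrative
	maxReplicaCount = 2
)

// assertScaleOut polls every 5s for up to 144 iterations (12 minutes),
// up from 120 iterations (10 minutes), to absorb the ingestion delays.
func assertScaleOut(t *testing.T, kc *kubernetes.Clientset) {
	assert.True(t,
		WaitForDeploymentReplicaReadyCount(t, kc, deploymentName, testNamespace, maxReplicaCount, 144, 5),
		"replica count should be %d within 12 minutes", maxReplicaCount)
}
```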

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@JorTurFer (Member) commented Mar 6, 2023

@raggupta-ms
I have noticed that you deploy the Prometheus configMap at the end of the setup file. I'd move it above the workload identity tests, because the tests in the file are executed from top to bottom. If you move it up, the configMap that configures Prometheus will be applied a few minutes earlier.
If you move it above `func TestSetupWorkloadIdentityComponents(t *testing.T) {`, you gain 4-6 minutes of having the configMap ready in the cluster before the scraping starts.

My proposal is to move this code:

```go
func TestSetupAzureManagedPrometheusComponents(t *testing.T) {
	// This installs the config map in the kube-system namespace, as needed by the Azure Managed Prometheus collector agent.
	KubectlApplyWithTemplate(t, helper.EmptyTemplateData{}, "azureManagedPrometheusConfigMapTemplate", helper.AzureManagedPrometheusConfigMapTemplate)
}
```

between TestSetupHelm and TestSetupWorkloadIdentityComponents.
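
For illustration, the resulting order of the setup file would then look like the sketch below (bodies elided; by default, Go's testing package runs the Test functions of a file in the order they are defined, which is what the top-to-bottom execution mentioned above relies on):

```go
package azure_managed_prometheus_test

import "testing"

// Only the top-to-bottom order matters here; bodies are elided.
func TestSetupHelm(t *testing.T)                             { /* install Helm charts */ }
func TestSetupAzureManagedPrometheusComponents(t *testing.T) { /* moved up: apply the configMap early */ }
func TestSetupWorkloadIdentityComponents(t *testing.T)       { /* install workload identity components */ }
```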

Signed-off-by: Raghav Gupta <guptaraghav16@gmail.com>
@raggupta-ms (Contributor Author)

> @raggupta-ms
> I have noticed that you deploy the Prometheus configMap at the end of the setup file. I'd move it above the workload identity tests, because the tests in the file are executed from top to bottom. [...]
>
> My proposal is to move this code [...] between TestSetupHelm and TestSetupWorkloadIdentityComponents.

Good callout; I made that change. I also increased the timeout by 2 minutes. It should succeed now. Let's see; please trigger the e2e tests.

@JorTurFer (Member) commented Mar 6, 2023

/run-e2e prometheus*
Update: You can check the progress here

@JorTurFer (Member) commented Mar 6, 2023

/run-e2e prometheus*
Update: You can check the progress here

@JorTurFer (Member)

It seems that the change to the setup was enough.

@JorTurFer (Member) left a comment


LGTM!

@JorTurFer merged commit 7238408 into kedacore:main on Mar 6, 2023
@raggupta-ms (Contributor Author)

@tomkerkhove @JorTurFer @zroubalik Thank you for all the help and support throughout! I learned a lot during this contribution.

@JorTurFer (Member)

You're welcome! Thanks to you for the contribution! :)

xoanmm pushed a commit to xoanmm/keda that referenced this pull request Mar 22, 2023