Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5.0.0 - OperatorHealthCheck race condition? #489

Open
bhellema opened this issue Jan 19, 2023 · 8 comments
Open

5.0.0 - OperatorHealthCheck race condition? #489

bhellema opened this issue Jan 19, 2023 · 8 comments
Assignees

Comments

@bhellema
Copy link

bhellema commented Jan 19, 2023

Since upgrading to 5.0.0 we're noticing different behavior when it comes to starting up our operator. Sometimes we see the following health check status in our logs indicating that Quarkus Operator SDK health check is DOWN.

{"time":"2023-01-19T06:03:38.017Z","class":"io.smallrye.health","level":"INFO","message":"SRHCK01001: Reporting health down status: {\"status\":\"DOWN\",\"checks\":[{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"DOWN\"},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"},{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"UP\",\"data\":{\"myfirstcontroller\":\"OK\",\"mysecondcontroller\":\"OK\"}},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"}]}"}

Could there be a race condition here on this line?

@metacosm any chance you have some insight here wrt to this potentially being a race condition? I can have mixed results when starting our operator up.

@csviri csviri self-assigned this Jan 26, 2023
@csviri
Copy link
Contributor

csviri commented Jan 26, 2023

@bhellema how often you see this ? is it resolvedd shortly after?

thinking if this might be an issue in JOSDK:

https://github.com/java-operator-sdk/java-operator-sdk/blob/9d274534f8a222c931b05dc05490b744a9ea6ea4/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/health/InformerHealthIndicator.java#L11-L14

So, the ... && hasSynced() && isWatching() is also part of the expression. So hasSynced might be problem on startup. But isWatching could be problem any time, since watch might be down.

However this coin has two sides, if the informer is running but not watching it is a problem. But only if it happens for a longer time period. Since it might resemble a real problem with the watch (like lost permissions to watch a resource). I will add log messages for the next patch release, so we can verify if this is the problem.

@csviri
Copy link
Contributor

csviri commented Jan 26, 2023

@bhellema
Copy link
Author

bhellema commented Jan 28, 2023

@csviri the issue showed up pretty frequently and wouldn't resolve.

Is there a release with this debug logging that I can try?

@csviri
Copy link
Contributor

csviri commented Jan 30, 2023

@metacosm
Copy link
Member

metacosm commented Feb 8, 2023

The corresponding Quarkus extension release is 5.0.4. Please let us know if that addresses your issue.

@metacosm
Copy link
Member

I believe this has been addressed in the 5.0.4 version so closing this issue for now. Please re-open if you're still encounter the problem.

@bhellema
Copy link
Author

bhellema commented May 10, 2023

@metacosm, sorry for the very long delay in getting back to this, I finally had a bit of time to try to upgrade things. I'm still having issues using the latest quarkus-operator-sdk 6.0.1.

It looks like we're having issues getting the Kubernetes Client injected into the Health check for some reason after having upgraded. I'm still investigating.

{"time":"2023-05-10T17:28:52.489Z","class":"io.smallrye.health","level":"ERROR","message":"SRHCK01000: Error processing Health Checks [Error Occurred After Shutdown]","stackTrace":"java.lang.RuntimeException: Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.doCreate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat io.quarkus.arc.impl.AbstractSharedContext.createInstanceHandle(AbstractSharedContext.java:113)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:37)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.LazyValue.get(LazyValue.java:26)\n\tat io.quarkus.arc.impl.ComputingCache.computeIfAbsent(ComputingCache.java:69)\n\tat io.quarkus.arc.impl.AbstractSharedContext.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.ClientProxies.getApplicationScopedDelegate(ClientProxies.java:21)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.arc$delegate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.call(Unknown Source)\n\tat io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)\n\tat io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:28)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnFailureFlatMap.subscribe(UniOnFailureFlatMap.java:31)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnItemTransform.subscribe(UniOnItemTransform.java:22)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$UniHandler.subscribe(UniAndCombination.java:233)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.run(UniAndCombination.java:87)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination.subscribe(UniAndCombination.java:53)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniBlockingAwait.await(UniBlockingAwait.java:60)\n\tat io.smallrye.mutiny.groups.UniAwait.atMost(UniAwait.java:65)\n\tat io.smallrye.health.SmallRyeHealthReporter.getReadiness(SmallRyeHealthReporter.java:232)\n\tat io.smallrye.health.SmallRyeHealthReporter_ClientProxy.getReadiness(Unknown Source)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.getHealth(SmallRyeReadinessHandler.java:11)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.doHandle(SmallRyeHealthHandlerBase.java:44)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:31)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.handle(SmallRyeReadinessHandler.java:7)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:19)\n\tat io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)\n\tat io.vertx.core.impl.ContextBase.lambda$null$0(ContextBase.java:137)\n\tat io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)\n\tat io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)\n\tat io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)\n\tat org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)\n\tat org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)\n\tat org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)\n\tat org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: java.lang.NullPointerException\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\t... 43 more\n","errorType":"java.lang.RuntimeException","errorMessage":"Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient"}

@metacosm metacosm reopened this May 11, 2023
@metacosm
Copy link
Member

@bhellema are you still experiencing issues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants