
Redpanda fails to start and console failed to validate SASL config after 5.8.9 upgrade #1381

zegerius opened this issue Jun 25, 2024 · 0 comments


What happened?

After upgrading our development cluster from 5.8.8 to 5.8.9, starting Redpanda produces the following errors:

Redpanda doesn't become healthy:

  Warning  Unhealthy  25s (x64 over 9m55s)  kubelet            Readiness probe failed: command "/bin/sh -c set -x\nRESULT=$(rpk cluster health)\necho $RESULT\necho $RESULT | grep 'Healthy:.*true'\n" timed out

Which is odd, because when I run the rpk cluster health command inside the container:

redpanda@redpanda-0:/$ rpk cluster health
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    0
All nodes:                        [0]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []

That is potentially due to a faulty readiness check command:

redpanda@redpanda-0:/$ rpk cluster health
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    0
All nodes:                        [0]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []
redpanda@redpanda-0:/$ /bin/sh -c set -x\nRESULT=$(rpk cluster health)\necho $RESULT\necho $RESULT | grep 'Healthy:.*true'\n
redpanda@redpanda-0:/$ echo $?
1
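
A plausible explanation for that exit code (an assumption on my part, not verified against the kubelet): when the probe string is pasted into an interactive shell, the unquoted `\n` sequences collapse to the literal letter `n`, so the shell runs a mangled one-liner in which the final grep pattern becomes `Healthy:.*truen` and can never match. The kubelet, by contrast, passes real newlines to `/bin/sh -c`, so the interactive reproduction is not the same script the probe actually runs. A minimal demonstration:

```shell
# Unquoted backslash-n in a POSIX shell is just an escaped 'n': the backslash
# is dropped and the letter survives, gluing the adjacent words together.
echo set -x\nRESULT=42
# prints: set -xnRESULT=42

# The pasted one-liner therefore ends with: grep 'Healthy:.*true'n
# i.e. the pattern 'Healthy:.*truen', which cannot match the health output.
echo 'Healthy: true' | grep 'Healthy:.*true'n
echo $?
# prints: 1
```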

Which could be simplified to:

redpanda@redpanda-0:/$ rpk cluster health | grep -q 'Healthy:.*true'
redpanda@redpanda-0:/$ echo $?
0
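
For reference, a readiness probe using that simplified one-liner might look like the sketch below (thresholds copied from the `statefulset.readinessProbe` section of the computed values; the chart's actual template may wire this differently):

```yaml
# sketch only -- not the chart's actual template
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - rpk cluster health | grep -q 'Healthy:.*true'
  initialDelaySeconds: 1
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```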

But as far as I can tell, the readiness check itself did not change compared to the previous version.

I am not sure whether this is related to the timeout that Kubernetes reports, or whether something else is going on.

Additionally, the console fails to start with an error I have not seen before. Could this be related to the Go migration?

{
	"level": "fatal",
	"msg": "failed to validate config",
	"error": "failed to validate Kafka config: failed to validate sasl config: given sasl mechanism '%!s(<nil>)' is invalid"
}
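
`%!s(<nil>)` is how Go's fmt package renders a nil value through the `%s` verb, which suggests Console received no SASL mechanism at all rather than an invalid one. As a workaround sketch (the `kafka.sasl` keys come from Console's config reference; whether the chart forwards a nested `console.console.config` block this way is an assumption on my part), the mechanism could be pinned explicitly:

```yaml
# hedged sketch: pin the SASL mechanism for Console explicitly
# (kafka.sasl keys per Console's config reference; chart path assumed)
console:
  console:
    config:
      kafka:
        sasl:
          enabled: true
          mechanism: SCRAM-SHA-512
```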

What did you expect to happen?

Normal startup.

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

$ helm get values <redpanda-release-name> -n <redpanda-release-namespace> --all
COMPUTED VALUES:
affinity: {}
auditLogging:
  clientMaxBufferSize: 16777216
  enabled: false
  enabledEventTypes: null
  excludedPrincipals: null
  excludedTopics: null
  listener: internal
  partitions: 12
  queueDrainIntervalMs: 500
  queueMaxBufferSizePerShard: 1048576
  replicationFactor: null
auth:
  sasl:
    enabled: true
    mechanism: SCRAM-SHA-512
    secretRef: redpanda-users
    users:
    - name: redpanda
      password: redpanda
clusterDomain: cluster.local
commonLabels: {}
config:
  cluster:
    auto_create_topics_enabled: true
    default_topic_replications: 3
  node:
    crash_loop_limit: 5
    developer_mode: true
  pandaproxy_client: {}
  rpk: {}
  schema_registry_client: {}
  tunable:
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_batch_max_bytes: 1048576
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    log_segment_size_max: 268435456
    log_segment_size_min: 16777216
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
connectors:
  deployment:
    create: false
  enabled: false
  test:
    create: false
console:
  affinity: {}
  annotations: {}
  automountServiceAccountToken: true
  autoscaling:
    enabled: false
    maxReplicas: 100
    minReplicas: 1
    targetCPUUtilizationPercentage: 80
  commonLabels: {}
  config: {}
  configmap:
    create: false
  console:
    config: {}
  deployment:
    create: false
  enabled: true
  enterprise:
    licenseSecretRef:
      key: ""
      name: ""
  extraContainers: []
  extraEnv: []
  extraEnvFrom:
  - configMapRef:
      name: redpanda-console-protobuf
  extraVolumeMounts: []
  extraVolumes: []
  fullnameOverride: ""
  global: {}
  image:
    pullPolicy: IfNotPresent
    registry: docker.redpanda.com
    repository: redpandadata/console
    tag: ""
  imagePullSecrets: []
  ingress:
    annotations:
      cert-manager.io/cluster-issuer: ca-issuer
    className: nginx
    enabled: true
    hosts:
    - host: redpanda.infra.localenv
      paths:
      - path: /
        pathType: ImplementationSpecific
    tls:
    - hosts:
      - redpanda.infra.localenv
      secretName: redpanda.infra.localenv-tls
  initContainers:
    extraInitContainers: ""
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 0
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  nameOverride: ""
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext:
    fsGroup: 99
    runAsUser: 99
  priorityClassName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  replicaCount: 1
  resources: {}
  secret:
    create: false
    enterprise: {}
    kafka: {}
    login:
      github: {}
      google: {}
      jwtSecret: ""
      oidc: {}
      okta: {}
    redpanda:
      adminApi: {}
  secretMounts: []
  securityContext:
    runAsNonRoot: true
  service:
    annotations: {}
    port: 8080
    targetPort: null
    type: ClusterIP
  serviceAccount:
    annotations: {}
    automountServiceAccountToken: true
    create: true
    name: ""
  strategy: {}
  tests:
    enabled: true
  tolerations: []
  topologySpreadConstraints: {}
enterprise:
  license: ""
  licenseSecretRef: {}
external:
  domain: redpanda.infra.svc.cluster.local
  enabled: true
  service:
    enabled: false
  type: NodePort
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: docker.redpanda.com/redpandadata/redpanda
  tag: ""
imagePullSecrets: []
license_key: ""
license_secret_ref: {}
listeners:
  admin:
    external:
      default:
        advertisedPorts:
        - 31644
        port: 9645
        tls:
          cert: external
    port: 9644
    tls:
      cert: default
      requireClientAuth: false
  http:
    authenticationMethod: null
    enabled: true
    external:
      default:
        advertisedPorts:
        - 30082
        authenticationMethod: null
        port: 8083
        tls:
          cert: external
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8082
    tls:
      cert: default
      requireClientAuth: false
  kafka:
    authenticationMethod: null
    external:
      default:
        advertisedPorts:
        - 9094
        authenticationMethod: null
        port: 9094
        tls:
          cert: external
    port: 9093
    tls:
      cert: default
      requireClientAuth: false
  rpc:
    port: 33145
    tls:
      cert: default
      requireClientAuth: false
  schemaRegistry:
    authenticationMethod: null
    enabled: true
    external:
      default:
        advertisedPorts:
        - 30081
        authenticationMethod: null
        port: 8084
        tls:
          cert: external
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8081
    tls:
      cert: default
      requireClientAuth: false
logging:
  logLevel: info
  usageStats:
    enabled: true
monitoring:
  enabled: false
  labels: {}
  scrapeInterval: 30s
nameOverride: ""
nodeSelector: {}
post_install_job:
  affinity: {}
  enabled: true
post_upgrade_job:
  affinity: {}
  enabled: true
rackAwareness:
  enabled: false
  nodeAnnotation: topology.kubernetes.io/zone
rbac:
  annotations: {}
  enabled: false
resources:
  cpu:
    cores: 200m
    overprovisioned: true
  memory:
    container:
      max: 2.5Gi
      min: 512Mi
    redpanda:
      memory: 400Mi
      reserveMemory: 112Mi
serviceAccount:
  annotations: {}
  create: false
  name: ""
statefulset:
  additionalRedpandaCmdFlags: []
  additionalSelectorLabels: {}
  annotations: {}
  budget:
    maxUnavailable: 1
  extraVolumeMounts: ""
  extraVolumes: ""
  initContainerImage:
    repository: busybox
    tag: latest
  initContainers:
    configurator:
      extraVolumeMounts: ""
      resources: {}
    extraInitContainers: ""
    fsValidator:
      enabled: false
      expectedFS: xfs
      extraVolumeMounts: ""
      resources: {}
    setDataDirOwnership:
      enabled: false
      extraVolumeMounts: ""
      resources: {}
    setTieredStorageCacheDirOwnership:
      extraVolumeMounts: ""
      resources: {}
    tuning:
      extraVolumeMounts: ""
      resources: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
  nodeSelector: {}
  podAffinity: {}
  podAntiAffinity:
    custom: {}
    topologyKey: kubernetes.io/hostname
    type: hard
    weight: 100
  podTemplate:
    annotations: {}
    labels: {}
    spec:
      containers: []
  priorityClassName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
  replicas: 1
  securityContext:
    fsGroup: 101
    fsGroupChangePolicy: OnRootMismatch
    runAsUser: 101
  sideCars:
    configWatcher:
      enabled: true
      extraVolumeMounts: ""
      resources: {}
      securityContext: {}
    controllers:
      createRBAC: true
      enabled: false
      healthProbeAddress: :8085
      image:
        repository: docker.redpanda.com/redpandadata/redpanda-operator
        tag: v2.1.10-23.2.18
      metricsAddress: :9082
      resources: {}
      run:
      - all
      securityContext: {}
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 1
    periodSeconds: 10
  terminationGracePeriodSeconds: 90
  tolerations: []
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  updateStrategy:
    type: RollingUpdate
storage:
  hostPath: ""
  persistentVolume:
    annotations: {}
    enabled: true
    labels: {}
    nameOverwrite: ""
    size: 20Gi
    storageClass: ""
  tiered:
    config:
      cloud_storage_access_key: ""
      cloud_storage_api_endpoint: ""
      cloud_storage_azure_container: null
      cloud_storage_azure_managed_identity_id: null
      cloud_storage_azure_shared_key: null
      cloud_storage_azure_storage_account: null
      cloud_storage_bucket: ""
      cloud_storage_cache_size: 5368709120
      cloud_storage_credentials_source: config_file
      cloud_storage_enable_remote_read: true
      cloud_storage_enable_remote_write: true
      cloud_storage_enabled: false
      cloud_storage_region: ""
      cloud_storage_secret_key: ""
    credentialsSecretRef:
      accessKey:
        configurationKey: cloud_storage_access_key
      secretKey:
        configurationKey: cloud_storage_secret_key
    hostPath: ""
    mountType: emptyDir
    persistentVolume:
      annotations: {}
      labels: {}
      storageClass: ""
tests:
  enabled: true
tls:
  certs:
    default:
      caEnabled: true
      issuerRef:
        kind: ClusterIssuer
        name: internal-ca-issuer
    external:
      caEnabled: true
      issuerRef:
        kind: ClusterIssuer
        name: ca-issuer
  enabled: true
tolerations: []
tuning:
  tune_aio_events: true

Anything else we need to know?

I have reviewed the changes redpanda-5.8.8...redpanda-5.8.9, but I didn't see any glaring breaking change.

Which are the affected charts?

Redpanda

Chart Version(s)

$ helm -n <redpanda-release-namespace> list 
redpanda        infra           1               2024-06-25 13:20:17.426293 +0200 CEST   deployed  redpanda-5.8.9           v24.1.8

Cloud provider

N/A

JIRA Link: K8S-266
