Discrepancy in labels using api/v1/series and api/v1/query when external label and internal label have the same key #6844
Are all components on the same version?
@GiedriusS sidecars have 0.32.4, the rest 0.32.5.
Hey, are you able to share some downstream blocks so we can try to reproduce locally? Also, which downstream store APIs are you querying?
Not sure what downstream store APIs are. But we have a global Thanos, which is querying a Querier on another cluster via gRPC, which works with Thanos Sidecar. Data is stored in a GCP bucket.
Can you help me with some guidance? Should I just send you some chunks from the bucket? If so, how can I find the necessary chunks (we have too many of them and no "created at" in the GCP bucket)?
Mh, ok so sharing might not be practical. With "downstream store API" I meant essentially "--endpoint"s!
Can you bump sidecars to 0.32.5? There was #6816 (Store: fix prometheus store label values matches for external labels), which feels somewhat related.
@MichaHoffmann
Well, we have many endpoints.
The one which has the metrics in question is thanos.dev.int.company.live:443
@andrejshapal Can you try bumping up the version? Seems it is the same bug fixed in v0.32.5 |
@yeya24
Hey @andrejshapal, can you share the configuration of the offending
@MichaHoffmann Sure:
spec:
project: application-support
sources:
- repoURL: https://helm.onairent.live
chart: any-resource
targetRevision: "0.1.0"
helm:
values: |
anyResources:
- repoURL: https://charts.bitnami.com/bitnami
chart: thanos
targetRevision: "12.13.12"
helm:
values: |
fullnameOverride: thanos-sidecar-querier
query:
dnsDiscovery:
enabled: true
sidecarsService: kube-prometheus-stack-thanos-discovery
sidecarsNamespace: monitoring
service:
annotations:
traefik.ingress.kubernetes.io/service.serversscheme: h2c
serviceGrpc:
annotations:
traefik.ingress.kubernetes.io/service.serversscheme: h2c
ingress:
grpc:
enabled: true
ingressClassName: traefik-internal
annotations:
traefik.ingress.kubernetes.io/router.tls.options: monitoring-thanos@kubernetescrd
hostname: thanos.dev.int.company.live
extraTls:
- hosts:
- thanos.dev.int.company.live
secretName: thanos-client-server-cert-1
bucketweb:
enabled: false
compactor:
enabled: false
storegateway:
enabled: false
receive:
enabled: false
metrics:
enabled: true
serviceMonitor:
enabled: true
labels:
            prometheus: main

I also noticed it returns one cluster until 07:00 27/10/2023 (local time; now is 12:41), and at 07:05 already 2 "clusters".
Can you share the Prometheus configurations from the instances that monitor the offending Cassandra cluster too, please?
We use kube-prometheus-stack. Nothing really special:
- repoURL: https://prometheus-community.github.io/helm-charts
chart: kube-prometheus-stack
targetRevision: "50.3.1"
helm:
values: |
fullnameOverride: kube-prometheus-stack
commonLabels:
prometheus: main
defaultRules:
create: false
kube-state-metrics:
fullnameOverride: kube-state-metrics
prometheus:
monitor:
enabled: true
additionalLabels:
prometheus: main
metricRelabelings:
- action: labeldrop
regex: container_id
- action: labeldrop
regex: uid
- sourceLabels: [__name__]
action: drop
regex: 'kube_configmap_(annotations|created|info|labels|metadata_resource_version)'
collectors:
- certificatesigningrequests
- configmaps
- cronjobs
- daemonsets
- deployments
- endpoints
- horizontalpodautoscalers
- ingresses
- jobs
- limitranges
- mutatingwebhookconfigurations
- namespaces
- networkpolicies
- nodes
- persistentvolumeclaims
- persistentvolumes
- poddisruptionbudgets
- pods
- replicasets
- replicationcontrollers
- resourcequotas
- secrets
- services
- statefulsets
- storageclasses
- validatingwebhookconfigurations
- volumeattachments
metricLabelsAllowlist:
- pods=[version]
kubeScheduler:
enabled: false
kubeEtcd:
enabled: false
kubeProxy:
enabled: false
kubeControllerManager:
enabled: false
prometheus-node-exporter:
fullnameOverride: node-exporter
extraArgs:
- --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)
- --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
prometheus:
monitor:
enabled: true
additionalLabels:
prometheus: main
relabelings:
- action: replace
sourceLabels:
- __meta_kubernetes_pod_node_name
targetLabel: instance
coreDns:
enabled: false
kubelet:
enabled: true
serviceMonitor:
cAdvisorMetricRelabelings:
- sourceLabels: [__name__]
action: drop
regex: 'container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)'
- sourceLabels: [__name__]
action: drop
regex: 'container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)'
- sourceLabels: [__name__]
action: drop
regex: 'container_memory_(mapped_file|swap)'
- sourceLabels: [__name__]
action: drop
regex: 'container_(file_descriptors|tasks_state|threads_max)'
- sourceLabels: [__name__]
action: drop
regex: 'container_spec.*'
- sourceLabels: [id, pod]
action: drop
regex: '.+;'
- action: labeldrop
regex: id
- action: labeldrop
regex: name
- action: labeldrop
regex: uid
cAdvisorRelabelings:
- action: replace
sourceLabels: [__metrics_path__]
targetLabel: metrics_path
probesMetricRelabelings:
- action: labeldrop
regex: pod_uid
probesRelabelings:
- action: replace
sourceLabels: [__metrics_path__]
targetLabel: metrics_path
resourceRelabelings:
- action: replace
sourceLabels: [__metrics_path__]
targetLabel: metrics_path
relabelings:
- action: replace
sourceLabels: [__metrics_path__]
targetLabel: metrics_path
grafana:
enabled: false
alertmanager:
enabled: false
prometheus:
enabled: true
monitor:
additionalLabels:
prometheus: main
serviceAccount:
create: true
name: "prometheus"
thanosService:
enabled: true
thanosServiceMonitor:
enabled: true
ingress:
enabled: true
annotations:
kubernetes.io/ingress.class: traefik-internal
hosts:
- prometheus.dev.int.company.live
tls:
- hosts:
- prometheus.dev.int.company.live
secretName: wildcard-dev-int-company-live
prometheusSpec:
enableRemoteWriteReceiver: true
serviceAccountName: prometheus
enableAdminAPI: true
disableCompaction: true
scrapeInterval: 10s
retention: 2h
additionalScrapeConfigsSecret:
enabled: false
storageSpec:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 20Gi
externalLabels:
cluster: cit1-k8s
replica: prometheus-cit1-1
additionalAlertManagerConfigs:
- scheme: https
static_configs:
- targets:
- alertmanager.company.live
thanos:
image: quay.io/thanos/thanos:v0.32.5
objectStorageConfig:
name: thanos-objstore
key: objstore.yml
ruleSelector:
matchLabels:
evaluation: prometheus
serviceMonitorSelector:
matchLabels:
prometheus: main
podMonitorSelector:
matchLabels:
prometheus: main
probeSelector:
matchLabels:
prometheus: main
resources:
requests:
cpu: "3.2"
memory: 14Gi
limits:
cpu: 8
            memory: 20Gi
Is there another replica somewhere maybe? Asking since it has the external "replica" label |
@MichaHoffmann Nope. We have HA prometheuses on some clusters, but added replica label everywhere just for consistency. |
Having a replica label on things that are not replicas of one another feels like it could be an issue.
@MichaHoffmann I can try to remove the replica label. But this should not be an issue, since it is just used as a deduplication label?
@MichaHoffmann I have removed the replica label, but it had no effect on the issue in question.
Ah well, an attempt was made. Do you have the same issue if you uncheck "Use Deduplication" ? |
@MichaHoffmann In Thanos Query, with or without deduplication, the issue is not noticeable. I don't think that querying goes via api/v1/series.
You can specify
dedup false:
dedup true:
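For reference, the two endpoints being compared can be hit directly; below is a sketch that only builds the request URLs (the Querier address and port are placeholders, and the `dedup` parameter is passed explicitly so both calls are comparable):

```python
from urllib.parse import urlencode

# Hypothetical Querier address; replace with your own endpoint.
base = "http://localhost:10902/api/v1"

matcher = '{__name__="collectd_collectd_queue_length"}'

# /api/v1/query evaluates the matcher through the query path.
query_url = f"{base}/query?" + urlencode({"query": matcher, "dedup": "true"})

# /api/v1/series only lists the matching series; pass dedup explicitly
# so the comparison with the query path is apples-to-apples.
series_url = f"{base}/series?" + urlencode({"match[]": matcher, "dedup": "false"})

print(query_url)
print(series_url)
```

Comparing the `cluster` value in both responses for the same matcher makes the discrepancy easy to spot.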
Would it be possible to send
@MichaHoffmann Sorry for the long wait, I had a busy week.
Hey, I did a small local setup of Prometheus, Sidecar, and Querier (on latest main) with your data, and I can reproduce!
With Prometheus configured like
Querier and Sidecar are configured mostly as default. Thanks, I'll look into this in the debugger a bit later!
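The exact Prometheus configuration used for this repro was not captured above; a minimal sketch of a config that produces the collision (my assumption, with placeholder targets, not the reproducer's actual file) is:

```yaml
global:
  external_labels:
    cluster: cit1-k8s            # external label, as set via the operator's externalLabels
scrape_configs:
  - job_name: cassandra
    static_configs:
      - targets: ["localhost:9103"]  # placeholder exporter address
        labels:
          cluster: cassandra     # scraped/internal label with the same key
```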
Ok, I think I have found the issue and have a fix; I was able to reproduce it in a minimal acceptance test case.
Hello,
I am using Thanos 0.32.5.
Issue:
I noticed a flaky issue: Thanos always exposes the external label when executing queries, but it randomly returns the external label or the internal label value when both have the same key and api/v1/series is queried.
Here is the Prometheus output:
And after applying the external label, the new label cluster shows up in Thanos:
But when I query the api/v1/series endpoint, it randomly gives the value of cluster:
{
  "status": "success",
  "data": [
    {
      "__name__": "collectd_collectd_queue_length",
      "cassandra_datastax_com_cluster": "cassandra",
      "cassandra_datastax_com_datacenter": "dc1",
      "cluster": "cassandra",
      "collectd": "write_queue",
      "container": "cassandra",
      "dc": "dc1",
      "endpoint": "prometheus",
      "exported_instance": "10.2.150.192",
      "instance": "10.2.150.192:9103",
      "job": "cassandra-dc1-all-pods-service",
      "namespace": "cit1-core",
      "pod": "cassandra-dc1-r2-sts-0",
      "prometheus": "monitoring/kube-prometheus-stack-prometheus",
      "prometheus_replica": "prometheus-kube-prometheus-stack-prometheus-0",
      "rack": "r2",
      "service": "cassandra-dc1-all-pods-service"
    },
    {
      "__name__": "collectd_collectd_queue_length",
      "cassandra_datastax_com_cluster": "cassandra",
      "cassandra_datastax_com_datacenter": "dc1",
      "cluster": "cassandra",
      "collectd": "write_queue",
      "container": "cassandra",
      "dc": "dc1",
      "endpoint": "prometheus",
      "exported_instance": "10.2.151.7",
      "instance": "10.2.151.7:9103",
      "job": "cassandra-dc1-all-pods-service",
      "namespace": "cit1-core",
      "pod": "cassandra-dc1-r3-sts-0",
      "prometheus": "monitoring/kube-prometheus-stack-prometheus",
      "prometheus_replica": "prometheus-kube-prometheus-stack-prometheus-0",
      "rack": "r3",
      "service": "cassandra-dc1-all-pods-service"
    }
  ]
}
Expected:
Any API call should prioritise the external label and return it as the result of the request.
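That expectation can be expressed as a simple merge rule; the sketch below is my reading of the expected semantics, not Thanos's actual implementation:

```python
def merge_labels(series_labels: dict, external_labels: dict) -> dict:
    """External labels should override clashing series labels, so every
    API (query and series alike) returns the same value for the key."""
    merged = dict(series_labels)
    merged.update(external_labels)  # external label wins on key collision
    return merged

series = {"__name__": "collectd_collectd_queue_length", "cluster": "cassandra"}
external = {"cluster": "cit1-k8s"}
print(merge_labels(series, external))  # cluster deterministically "cit1-k8s"
```

The reported bug is that api/v1/series sometimes behaves as if the update step were skipped, leaking the internal value.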
Possible solution:
The current workaround is to rename the internal label in the scrape config. But we mostly use configs via Helm out of the box, meaning we do not set custom scrape configs. Therefore, there is a chance the external label matches some random label on some random metric. Because of that, this discrepancy is worth fixing.
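For completeness, the rename workaround could look like the following ServiceMonitor-style fragment (a sketch; the replacement label name `cassandra_cluster` is made up for illustration):

```yaml
metricRelabelings:
  # Copy the scraped "cluster" value to a non-clashing label...
  - action: replace
    sourceLabels: [cluster]
    targetLabel: cassandra_cluster
  # ...then drop the original so only the external "cluster" label remains.
  - action: labeldrop
    regex: cluster
```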
The Series API endpoint is used by Grafana.