dropping store, external labels are not unique while using AWS NLB #1356

Closed
pablokbs opened this issue Jul 25, 2019 · 9 comments

Comments

@pablokbs
Contributor

Hello, I'm having some issues using an AWS NLB (Network Load Balancer). It seems like the IP addresses of the NLB nodes are confusing the thanos-query service:

thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.520410254Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.18.76:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2
thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.52043309Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.17.176:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2

Below is a diagram of my scenario

                                                       +-------------------+                                  
                                                       |                   |                                  
                                         +------------ |                   |                                  
                                         |             |      Grafana      |                                  
                                         |             |                   |                                  
        srv+ ?                 +---------|--------+    |                   |                                  
                               |                  |    +-------------------+                                  
   10 10901 aws nlb1           |   Thanos Proxy   |                                                           
   10 10901 aws nlb2           |   (thanos query) |                                                           
   10 10901 aws nlb3           |                  |                                                           
                               +------------------+                                                           
                                         |                                                                    
              +----------------------- ------------------------------+                                        
              |                          |                           |                                        
      +-------|-------+          +-------+-------+           +-------|-------+                                
      |    AWS NLB 1  |          |    AWS NLB 2  |           |    AWS NLB 3  |                                
      +---------------+          +---------------+           +---------------+                                
                                                                                                              
 +----------+ +----------+   +----------+ +----------+   +----------+ +----------+                            
 |          | |          |   |          | |          |   |          | |          |                            
 |Query     | |Query     |   |Query     | |Query     |   |Query     | |Query     |                            
 |(nodeport)| |(nodeport)|   |(nodeport)| |(nodeport)|   |(nodeport)| |(nodeport)|                            
 |          | |          |   |          | |          |   |          | |          |                            
 +----------+ +----------+   +----------+ +----------+   +----------+ +----------+                            
                                                                                                              
  label: cluster= cluster1   label: cluster= cluster2    label: cluster= cluster3                             

I believe the problem is that thanos-query uses the IP address (which belongs to the AWS NLB nodes, not to my Kubernetes workers) to decide whether the labels are unique, i.e. it compares IP addresses against labels.

If I modify the SRV records to point to the nodes directly (using a single thanos-query pod per cluster), everything works fine, but that means I can't move that pod to another node.

Is there a way to make thanos-query not use the IP address for the comparison?

Thanks!

PS:

Thanos v0.6.0
Prometheus v2.7.2
@GiedriusS
Member

Hi, it doesn't use the IP address to compare the connected nodes, only the external labels. I believe the problem here is that the leaf nodes have the same external labels, so they get marked as duplicates. It's hard to tell from your diagram what is happening, but it seems like both of the leaf nodes get picked up through DNS discovery. Is this what's happening?
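To make the point about external labels concrete, here is a minimal sketch in Go of label-based duplicate detection. It is not the code in `storeset.go`; the `store` type and the `labelKey` and `dropDuplicates` helpers are hypothetical names made up for illustration. The sketch assumes that every endpoint sharing a label set is dropped, which is consistent with the two warnings (`duplicates=2`) in the log above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// store is a hypothetical stand-in for a discovered Store API endpoint.
type store struct {
	Addr   string
	Labels map[string]string // external labels announced by the endpoint
}

// labelKey renders a label set in a canonical, order-independent form.
func labelKey(ls map[string]string) string {
	names := make([]string, 0, len(ls))
	for n := range ls {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", n, ls[n]))
	}
	return "{" + strings.Join(parts, ",") + "}"
}

// dropDuplicates keeps only the stores whose external label set is unique.
// The address is never consulted, only the labels.
func dropDuplicates(stores []store) (kept, dropped []store) {
	count := map[string]int{}
	for _, s := range stores {
		count[labelKey(s.Labels)]++
	}
	for _, s := range stores {
		if count[labelKey(s.Labels)] > 1 {
			dropped = append(dropped, s)
		} else {
			kept = append(kept, s)
		}
	}
	return kept, dropped
}

func main() {
	ls := map[string]string{
		"cluster":            "use1-prev-1",
		"prometheus":         "monitoring/k8s",
		"prometheus_replica": "prometheus-k8s-0",
	}
	// The same backend reached through two NLB addresses.
	stores := []store{
		{Addr: "10.10.18.76:9090", Labels: ls},
		{Addr: "10.10.17.176:9090", Labels: ls},
	}
	kept, dropped := dropDuplicates(stores)
	fmt.Println("kept:", len(kept), "dropped:", len(dropped))
}
```

Under this model, two addresses that front the same backend look exactly like two separate backends with identical labels, because nothing but the label set is compared.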

@pablokbs
Contributor Author

pablokbs commented Jul 26, 2019

I believe this answers your question:

thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.520410254Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.18.76:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2
thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.52043309Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.17.176:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2

Both lines come from the exact same pod, replying through 2 different IP addresses (because the NLB has its own backend nodes).

Edit: If I remove the NLB and point to the node IP directly, everything works fine

@pablokbs
Contributor Author

pablokbs commented Jul 29, 2019

I've also found something else: it seems like one of the thanos-query pods (the ones running on each cluster) is missing the "Announced LabelSets" labels, which is weird as they are all generated by the exact same Terraform code:

image

EDIT: Never mind, those had some DNS issues trying to resolve their Kubernetes services.

@pablokbs
Contributor Author

So, going back to the issue with having 2 pods exposing the labels through their NLB IP address:

image

I've tried enabling the NLB's Proxy Protocol v2 setting to expose the IP addresses of the nodes instead of those of the NLB nodes, but it seems to break gRPC:

thanos-query-787b59b6c7-lzlrd thanos level=warn ts=2019-07-29T21:45:40.165622344Z caller=storeset.go:322 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Unavailable desc = transport is closing" address=10.80.8.83:9090
thanos-query-787b59b6c7-5b6vs thanos level=warn ts=2019-07-29T21:45:44.204853593Z caller=storeset.go:322 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.80.8.83:9090
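A possible explanation, not confirmed anywhere in this thread: with Proxy Protocol v2 enabled, the load balancer prepends a binary header to each connection it opens to the target, so a gRPC server that is not configured to parse it sees that header where it expects the HTTP/2 client preface. The sketch below, in Go, only classifies the first bytes of a connection; `classify` is a hypothetical helper, and the assumption is a plaintext gRPC listener that does not understand Proxy Protocol.

```go
package main

import (
	"bytes"
	"fmt"
)

// proxyV2Signature is the fixed 12-byte prefix of every PROXY protocol v2 header.
var proxyV2Signature = []byte{
	0x0D, 0x0A, 0x0D, 0x0A, 0x00, 0x0D, 0x0A, 0x51, 0x55, 0x49, 0x54, 0x0A,
}

// http2Preface is the client connection preface an HTTP/2 (gRPC) server expects first.
var http2Preface = []byte("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n")

// classify reports what the first bytes of a connection look like.
func classify(first []byte) string {
	switch {
	case bytes.HasPrefix(first, proxyV2Signature):
		return "proxy-protocol-v2"
	case bytes.HasPrefix(first, http2Preface):
		return "http2"
	default:
		return "unknown"
	}
}

func main() {
	// Direct connection: the server sees the HTTP/2 preface first.
	fmt.Println(classify(http2Preface))

	// Through a balancer with Proxy Protocol v2: signature, then
	// version/command (0x21), family/protocol (0x11, TCP over IPv4),
	// a 2-byte length (12), and 12 bytes of addresses and ports.
	hdr := append([]byte{}, proxyV2Signature...)
	hdr = append(hdr, 0x21, 0x11, 0x00, 0x0C)
	hdr = append(hdr, 10, 80, 8, 83, 10, 80, 8, 1, 0x23, 0x82, 0x23, 0x82)
	stream := append(hdr, http2Preface...)
	fmt.Println(classify(stream))
}
```

If that is what is happening, the server would need to strip the header before handing the connection to gRPC; the addresses and ports in the example header are made up.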

@pablokbs
Contributor Author

I think this is related to #1338. I'll try a newer version of Thanos that includes this fix.

@pablokbs
Contributor Author

Using the latest master fixes the issue, but I'm concerned about the fix: it seems to just choose one of the IP addresses and keep using it forever (or until it becomes unhealthy), which can cause issues because the traffic will always go to a single NLB node. The solution is not ideal.
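The behaviour described in this comment can be sketched as a "sticky" address picker. This is only a model of what the comment describes, not the code from the actual fix; the `picker` type and its `pick` method are hypothetical.

```go
package main

import "fmt"

// picker is a hypothetical helper, not a Thanos type. It remembers the
// address it chose last time.
type picker struct {
	current string
}

// pick sticks with the current address while it is still listed and
// healthy, and only then falls back to the first healthy candidate.
func (p *picker) pick(addrs []string, healthy func(string) bool) string {
	for _, a := range addrs {
		if a == p.current && healthy(a) {
			return p.current
		}
	}
	for _, a := range addrs {
		if healthy(a) {
			p.current = a
			return a
		}
	}
	p.current = ""
	return ""
}

func main() {
	addrs := []string{"10.10.18.76:9090", "10.10.17.176:9090"}
	down := map[string]bool{}
	healthy := func(a string) bool { return !down[a] }

	p := &picker{}
	fmt.Println(p.pick(addrs, healthy)) // first healthy address
	fmt.Println(p.pick(addrs, healthy)) // same address again
	down["10.10.18.76:9090"] = true
	fmt.Println(p.pick(addrs, healthy)) // switches only after a failure
}
```

In this model the second NLB address receives no traffic until the first one fails, which is the concentration of traffic the comment is worried about.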

@GiedriusS
Member

Yes and that's why you need a load balancer like Envoy/Nginx in front of those two nodes with identical labels. Thanos here is doing the correct thing and protecting you from having needless 2x load. What do you think Thanos should do in such cases?

@pablokbs
Contributor Author

These are not two nodes; it's only one Kubernetes node behind 2 NLB nodes (AWS creates a load balancer and adds its own nodes to it; each has its own IP address, and those are the addresses returned when you resolve myawsnlb.amazon.com).

So my Kubernetes node (the one that runs the thanos-query pod) sits behind that NLB, but the IP addresses exposed to the thanos-query proxy are those of the NLB nodes. As a result, thanos-query can receive the same traffic from 1 pod under 2 different IP addresses.

I use a load balancer so that I can register a group of nodes that may run the thanos-query pods; I can't maintain a list of nodes manually if the pods move around between nodes.

If I add an Envoy/Nginx load balancer behind an NLB, I'll have the same problem: the thanos-query proxy will see those Nginx/Envoy nodes under the NLB IP addresses, of which there is usually more than 1.

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this as completed Jan 18, 2020