Getting high volumes of 404 and 500 errors. Any insight? #469
Comments
OK, so most of the 500 errors disappeared after I added some more registries. Should've caught that! I am still getting a few 404s though. Are these normal? Example below:
|
The 404 should not really be at error level, but the logging is a bit simplistic. That is just Spegel telling the caller that it cannot help and that the caller should try the next configured mirror for that registry, so that is in some sense normal. |
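To illustrate the fallback described above, here is a minimal sketch of a client walking an ordered mirror list and moving on when a mirror answers 404. It is not containerd's or Spegel's actual code: `pull_with_fallback`, `fake_fetch`, and the host names are hypothetical and exist only for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical fetch signature: takes a mirror host, returns (HTTP status, body).
Fetch = Callable[[str], Tuple[int, bytes]]


def pull_with_fallback(
    mirrors: Sequence[str], fetch: Fetch
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each mirror in order and return the first one that answers 200."""
    for mirror in mirrors:
        status, body = fetch(mirror)
        if status == 200:
            return mirror, body
        # A 404 (or any other non-200) means this mirror cannot help,
        # so the loop simply moves on to the next configured one.
    return None, None


def fake_fetch(mirror: str) -> Tuple[int, bytes]:
    # Assumed responses: the local mirror misses, the upstream registry hits.
    responses = {
        "spegel.local": (404, b""),
        "upstream.example": (200, b"blob"),
    }
    return responses[mirror]


served_by, data = pull_with_fallback(["spegel.local", "upstream.example"], fake_fetch)
```

In this sketch the 404 from the first mirror is not a failure of the pull, only a signal to try the next entry.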
Thanks @bittrance. That's reassuring. Out of curiosity, do you know about this one too?
|
Maybe it's normal actually? The msg does say |
We had reports of the "expected 200 but was 500" before. I think those errors imply that the Spegel cluster is internally inconsistent. I think the error occurs because some Spegel node advertised the blob hash, but when the acting mirror proxied the request to it, it errored. I'm not sure whether that is exactly an error or just eventual consistency in action. In your errors above, is it possible that Error 1 on one node could be resulting in Error 2 on another node? The timestamps suggest not, but maybe they are for different requests? |
Hmm, I'm not sure. I do see some errors like the below that might actually suggest that error 2 causes error 1 but it's quite noisy so it's a bit hard to tell. One interesting thing (might be coincidental) is that the ips are the same for the bottom 3 logs,
|
Another interesting observation. It seems I might not be getting the best performance either because of these errors... Below is a graph of |
I will start off by clarifying something about Spegel. It is a best-effort cache, meaning that images may or may not exist on other nodes when an image is pulled. On top of that, images may not be discovered due to the distributed nature of the DHT. Having said that, I do not think that this is the issue you are facing. I made some changes before v0.0.22 to make sure that a 404 is returned when an image is not found and a 500 is returned for other types of errors. I saw that you wrote that the errors seem to come specifically from images pulled from ECR. I am not sure why this would be the case, but there might be something specific that is done when EKS pulls images from ECR? Could you share the values configuration used to deploy Spegel? |
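The 404-versus-500 split described above can be sketched roughly as follows. This is only an illustration of the rule as stated in the comment, not Spegel's implementation: `NotFoundError` and `status_for` are hypothetical names.

```python
from typing import Optional


class NotFoundError(Exception):
    """Hypothetical error: no peer is known to hold the requested image or blob."""


def status_for(err: Optional[Exception]) -> int:
    """Map the outcome of a mirror lookup to an HTTP status code."""
    if err is None:
        return 200  # content was found and served
    if isinstance(err, NotFoundError):
        return 404  # expected miss: the caller should try the next mirror
    return 500  # anything else is treated as an unexpected failure


miss = status_for(NotFoundError("blob not advertised by any peer"))
failure = status_for(ConnectionError("peer closed the connection"))
```

Read this way, a 404 in the logs is an ordinary cache miss, while a 500 points at something that went wrong after a lookup was attempted.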
Understood. Thanks for clarifying @phillebaba
That sounds plausible to me. Here are my values. Appreciate the help!
|
Have you applied the userData patch to the EKS nodes so that containerd keeps the intermediate layers after a pull? |
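One way to check this on a node is to look for the `discard_unpacked_layers` setting in the output of containerd config dump. The sketch below is an assumption-laden example: it presumes the setting appears as a plain `key = value` line (as in the containerd 1.x CRI plugin section), and `SAMPLE` is made-up input, not output from a real node.

```python
import re
from typing import Optional

# Matches a line such as "  discard_unpacked_layers = false".
_SETTING = re.compile(
    r"^\s*discard_unpacked_layers\s*=\s*(true|false)\s*$", re.MULTILINE
)


def discard_unpacked_layers(config_dump: str) -> Optional[bool]:
    """Return the configured value, or None if the setting is not present."""
    match = _SETTING.search(config_dump)
    if match is None:
        return None
    return match.group(1) == "true"


# Made-up fragment standing in for real `containerd config dump` output.
SAMPLE = """
[plugins."io.containerd.grpc.v1.cri".containerd]
  discard_unpacked_layers = false
"""

# Layers are kept after a pull only when the setting is explicitly false.
layers_kept = discard_unpacked_layers(SAMPLE) is False
```

If the function returns True, containerd discards the layers after unpacking, so other nodes would have nothing to fetch from this one.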
Had the same problem. Fixed it by allowing d39mqg4b1dx9z1.cloudfront.net in DNS; I had blocked it out of safety concerns. The CloudFront host belongs to registry.k8s.io
https://github.com/kubernetes/k8s.io/blob/main/dns/zone-configs/k8s.io._0_base.yaml error in logs
hope this helps - cheers. |
registry.k8s.io does not offer a fixed list of backends. If you need to filter aggressively, you need to pull from a mirror you control. |
https://registry.k8s.io#stability has pointers to mirroring guides and other details. |
I do not understand how changing DNS settings would affect Spegel. Spegel is a best-effort mirror that does not make any upstream requests to other registries. This specific I refactored the logging in #488 and am working on #494 to help users understand a bit better what is going on when they see an error. I was hoping to get #472 done before the next release, but that has been blocked by an account access issue preventing me from running the benchmarks. So I might just cut a release now to see if the updated error logs give some more insight into the actual problem. |
The 500 will most likely show up at the point when Spegel checks for the S3 buckets hosting the container image blob pointers, which on my end ended in a DNS error. It's further down the road code-wise. And for the other comments... yeah, the "DOH ! READ DOCS !!11" signs are visible. For a public K8S OCI registry the security of this setup is questionable at best (supply chain). Cheers |
Closing as this issue has been used to discuss different topics and has low activity. Please open a new issue for future discussions. |
Spegel version
v0.0.22
Kubernetes distribution
EKS
Kubernetes version
1.29.0
CNI
AWS VPC CNI
Describe the bug
UPDATE 1: After some further investigation, it seems that adding my team's internal ECR registry caused a rise in errors and performance to degrade a bit (below). Seems like things aren't going as they should be... wondering what's happening and how to fix...🤔
Hello, apologies if this is very simple. I'm new to this tool. I just deployed Spegel after modifying the containerd settings in my EKS cluster according to the compatibility guide here: https://github.com/spegel-org/spegel/blob/main/docs/COMPATIBILITY.md.
The pods are running, so I think on some level it is working according to https://github.com/spegel-org/spegel/blob/main/docs/FAQ.md, but I noticed that when a new node comes up I am getting a lot of 404/500 errors, which leads me to believe the mirroring isn't working as expected. I'm not familiar with what these errors indicate. Is this normal?
I did find #217 but not sure if it's related. Using containerd config dump, I confirmed the settings should be OK. Any insight is much appreciated! Thank you. 😄
Error 1
Error 2
Error 3