[BUG] Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) #1652
Comments
Okay, this is strange. When I deliberately use a user with insufficient rights to list snapshots:
Note the part between ** **
Shouldn't manage_snapshots be listed in the bold part? |
Okay, it's getting stranger and stranger. Same user as before:
Result same as before:
OpenSearch logging: Admin_XXXX has admin access, so it's not a regular user. Okay, problem with the roles... nope, watch this:
Gives:
So, not a roles / security problem. Okay, I can restore, if I just exclude .kibana_1?
Result:
Okay, just to play with it, let's just delete the kibana index and try this: DELETE .kibana*
Result:
Problem seems to be in the excluded indices combined with the security module? |
@anirudha maybe you can help with this? |
Any news? :) If this problem occurs more often, it could disrupt restoring failed clusters. |
I'll move this into the security repo and someone should follow up. |
Would be nice if a bug fix could make it to the next release. For now I just snapshot the stuff that is really really necessary, but it would be nice to just do it the default way. |
This appears to impact 1.3 as well. Even adding the snapshot_restore and manage_snapshot roles to our admin users via OpenID did not resolve the issue. I was able to get past it by using the admin certificate. |
[Triage] @cliu123 Could you look into this issue and recommend next steps? |
Any news? |
@cliu123 did you have a chance to look into this? |
Guys? :) would be cool if this could be part of the 2.0 :) |
Guys, with 2.0.1 my workaround of restoring specific indexes doesn't work anymore. Basically I can't recover my cluster (which was wiped, don't ask ;) ) with my default admin credentials. I do not want to be pushy, but this is getting really annoying and breaks the snapshotting functionality completely. curl also fails. |
Trick with the admin certificate works though. |
Guys... although we have solved it by using certificates, this is quite an irritating bug: if you don't understand what's happening, you won't be able to restore. When people are already in a panic over a cluster that has exploded, this can be frustrating. |
@ict-one-nl could you elaborate exactly what the workaround is for this? |
Use certificate-based authentication instead of basic auth, then it won't be a problem. We switched over to certificate-based auth because of this. |
@ict-one-nl Can you explain how to generate a certificate for certificate-based authentication in AWS? |
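For anyone trying the workaround, here is a rough Python sketch of a restore call that authenticates with a client certificate rather than basic auth. It is only an illustration: the host, repository name, snapshot name, index pattern, and certificate file names are all placeholders (assumptions, not values from this issue), and the certificate must be one your cluster already accepts as an admin certificate.

```python
import json
import ssl
import urllib.request


def build_restore_request(base_url, repository, snapshot, indices):
    """Build (but do not send) a snapshot restore request."""
    url = f"{base_url}/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        "indices": indices,             # comma-separated index patterns
        "include_global_state": False,  # as in the report: no global state
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def client_cert_context(ca_file, cert_file, key_file):
    """TLS context that presents a client certificate instead of basic auth."""
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


# All names below are hypothetical placeholders.
request = build_restore_request(
    "https://localhost:9200", "my-repo", "snapshot-1", "-.kibana*"
)

# Sending needs real certificate files, so it is left commented out:
# context = client_cert_context("root-ca.pem", "admin.pem", "admin-key.pem")
# with urllib.request.urlopen(request, context=context) as response:
#     print(response.read().decode("utf-8"))
```

No `Authorization` header is set on purpose; the identity comes from the certificate presented during the TLS handshake.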
We have deployed it locally in kubernetes, so I'm afraid I have little knowledge of the AWS setup :) |
Update 12/13: I went ahead and setup a Kubernetes deployment to begin the diagnosis process. I am going to reproduce the issue and start moving out from the |
Sounds great, thanks for the effort :) |
Update 12/14:
|
Update 12/15: I have reproduced the main errors listed in the top couple of posts of this issue. I have also verified that the configuration is valid. As mentioned, it does not seem that restore works properly with basic auth unless the user is a super admin. Will continue to look into fixes tomorrow. |
Update 12/16: I believe I have narrowed down the issue to the two most likely causes.
As can be seen in the first attachment, when attempting to restore an index the SnapshotRestoreHelper is unable to read the information from the snapshot repository. In this case, I had 7 snapshots (the loop ran 7 times, which is correct if the name is not found), but the names are not shown in the logs as they should be. The correct formatting of the log is also verified by the original code which reads
Another possible issue is in the resolveIndexPatterns function of the IndexResolverReplacer. In the second attachment, you can see that the requested patterns have swapped from my original input of snapshot '3' to '*'. This would not be allowed for anyone other than a super admin (requiring the admin certificate), as we can see in the ProtectedIndexAccessEvaluator that requests for '_all' indices are not allowed for a regular user (this includes the admin user).
I am not yet sure which of these two problems is the root issue or if it is a combo of the two, but I hope this provides some insight and shows that it is being addressed.
Edit: I also found that it is possible to get a hit in the SnapshotRestoreEvaluator for-loop on an old snapshot created with OpenSearch version 1.3 that is now being restored in 3.0.0. This behavior is odd because it shows that either something works differently in creating the snapshots from 2.0 onwards or that the snapshots are always discoverable but the logger is not able to process the newer ones. |
Great work, good that other stuff became apparent as well. Next week I won't have access to a computer much, but if I can help after that week let me know. |
Update 12/19: I added further logs today and went through trying to determine the exact flow of a request and where we could be running into an issue that would cause the request information to be lost. From what I could find, the earliest step of the process which is RestoreSnapshot specific is the Snapshot Restore Evaluator at which point we see that the request is still null. Continuing forward, I found that throughout the entire flow of the request the request information evaluates to null. To the best of my understanding this should not be the case. It seems like the Thread name is not being set either. Attached is another snippet of the logs from my testing. A relevant section to this particular issue with the security exception you are encountering is that after failing to find |
Update 12/20: I went through the logs and found the point where the request gets swapped from its original form to the wildcard all. I believe it happens when the DlsFlsValveImpl invokes() the snapshot restore request. At this point the request elements read correctly as |
Update 12/21: Went through logs further and tried bypassing a check for marking complete the presponse in SnapshotRestoreEvaluator. This did not fix the issue but did allow me to see further steps of the execution path. I added further logs to the execution path to look at possible jumps in the execution over needed request logic. Will continue to work on the issue and may bring in another contributor. Current changes: https://github.com/scrawfor99/security/tree/restore-snapshot Steps to test & reproduce:
This will create a snapshot repository for you to store a snapshot in. The repository name is just
You should encounter the error discussed with the issue. |
Update 12/22: @cwperks lent me a second pair of eyes and a different angle today and was able to determine that the issue appears to be coming from the parsing of the indices done by the opensearch-core snapshot utils. Basically when it is reading the indices, it treats the ordering differently based on whether a negated indices statement appears first or not in the This line is the culprit. Testing will need to be added and hopefully a correction does not break any existing functionality with the SnapshotUtils but I am happy that @cwperks and I were able to make some progress on this today. |
TLDR; Order matters in the RCA for the bug: To find the root cause of the issue I added a new debugging statement after this line to print out the resolved indices: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/privileges/SnapshotRestoreEvaluator.java#L97 Then I ran the following queries on a snapshot that contained:
These were the requests and resolved indices:
Wait a minute, why do these resolve to a different set of indices??? That led us to look into SnapshotUtils which is resolving the request to a list of indices to resolve and that's where we found specific logic for when a negation was the first in the list (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/snapshots/SnapshotUtils.java#L91-L95):
All The thing is, if the negation is later on in the list then it does not add all available indices to the set and only removes the index in the negation if it was previously added to the HashSet. The quickest solution I can think of is to add sorting logic in |
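To make the order dependence concrete, here is a minimal Python model of the behavior described in this comment. It is a simplified sketch, not the actual SnapshotUtils code: the function name and index names are made up, and `fnmatchcase` stands in for OpenSearch's own wildcard matching.

```python
from fnmatch import fnmatchcase


def filter_indices_order_dependent(available, selected):
    """Simplified model: a leading negation seeds the result with all indices."""
    result = None
    for position, pattern in enumerate(selected):
        add = True
        if pattern.startswith("-"):
            if position == 0:
                # Negation comes first: start from every available index.
                result = set(available)
            add = False
            pattern = pattern[1:]
        if result is None:
            # Positive pattern comes first: start from an empty set.
            result = set()
        for index in available:
            if fnmatchcase(index, pattern):
                if add:
                    result.add(index)
                else:
                    # Only has an effect if the index was added earlier.
                    result.discard(index)
    return sorted(result or [])


available = ["logs-1", "metrics-1", ".kibana_1"]  # hypothetical index names

negation_first = filter_indices_order_dependent(available, ["-.kibana*", "logs-*"])
negation_last = filter_indices_order_dependent(available, ["logs-*", "-.kibana*"])

print(negation_first)  # ['logs-1', 'metrics-1']
print(negation_last)   # ['logs-1']
```

The same two patterns resolve to different index sets depending only on their order, which is the mismatch seen in the resolved indices above.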
The issue really is with a mix of positive (
I'd like to think through scenarios, but when there is both positive and negative then I think the positive should be evaluated first and the negative taken away from what those have resolved to. |
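A minimal sketch of that idea in Python, under stated assumptions: it is not the code from any PR, the function and index names are hypothetical, `fnmatchcase` approximates OpenSearch wildcard matching, and treating a negation-only request as "everything minus the negations" is my own assumption.

```python
from fnmatch import fnmatchcase


def matches_any(index, patterns):
    """True if the index name matches at least one wildcard pattern."""
    return any(fnmatchcase(index, pattern) for pattern in patterns)


def filter_indices_positive_first(available, selected):
    """Resolve positive patterns first, then subtract whatever the negations match."""
    positives = [p for p in selected if not p.startswith("-")]
    negatives = [p[1:] for p in selected if p.startswith("-")]

    if positives:
        candidates = {i for i in available if matches_any(i, positives)}
    else:
        # Assumption: with only negations, start from all available indices.
        candidates = set(available)

    return sorted(i for i in candidates if not matches_any(i, negatives))


available = ["logs-1", "metrics-1", ".kibana_1"]  # hypothetical index names

print(filter_indices_positive_first(available, ["-.kibana*", "logs-*"]))  # ['logs-1']
print(filter_indices_positive_first(available, ["logs-*", "-.kibana*"]))  # ['logs-1']
print(filter_indices_positive_first(available, ["-.kibana*"]))  # ['logs-1', 'metrics-1']
```

Because the patterns are split into two groups before anything is resolved, their position in the list no longer affects the result.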
Thanks again for all the effort guys, sounds like you're homing in on the actual problem. Great work! |
Opened a PR to resolve the issue: opensearch-project/OpenSearch#5626 |
@ict-one-nl, @davidlago , @cwperks , @CEHENKLE I am happy to share that the PR fixing this behavior has been merged into the main branch of the core repository and will be backported. I want to take the opportunity to thank you for your patience in the resolution of this issue since its original opening as well as taking the time to attend our public triaging meeting to bring further attention to this issue. Supporting our community is always the number one priority. I have copied over the updated functionality so that you know what to expect when making calls under the corrected restore logic. The new behavior is as follows:
The behavior is now preserved regardless of the order of positive and negated wildcard queries. Thank you and Happy New Year! |
@scrawfor99 Did we add a test that verifies the bug is fixed in the security repo? |
@peternied we have not. Since the issue was related to the incorrect parsing of the snapshot indices in core, the tests were added in core. If you feel additional tests in security are required as well they can be added though it will mostly be duplicating the tests in core just going through the entire snapshot restore process as well instead of just checking the parsing which was failing. Do you think that is something that should be added given this? Let me know and if so I will take care of it once the core build finishes! :) |
If we can, I'd recommend we test in scenarios that most closely reflect what customers encounter. I'd hate for another change in core that seems innocuous to go in and break snapshot restore via the security plugin. |
nice thank you! |
@peternied, I opened a PR in the security codebase to add a test which will check that the behavior is consistent during the SnapshotRestore operation. So once this gets merged everything should be set and this issue will be double covered with new testing in core and security. |
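As a rough illustration of what a consistency check could look like (this is not the test from the PR), the Python sketch below runs a resolver over every ordering of a pattern list and collects the distinct results. The `resolve` function is a hypothetical stand-in resolver, and `fnmatchcase` only approximates OpenSearch wildcard matching.

```python
from fnmatch import fnmatchcase
from itertools import permutations


def resolve(available, patterns):
    """Hypothetical order-independent resolver used only to exercise the check."""
    include = [p for p in patterns if not p.startswith("-")] or ["*"]
    exclude = [p[1:] for p in patterns if p.startswith("-")]
    return {
        name
        for name in available
        if any(fnmatchcase(name, p) for p in include)
        and not any(fnmatchcase(name, p) for p in exclude)
    }


def resolutions_by_order(resolver, available, patterns):
    """Return the distinct results produced by every ordering of the patterns."""
    return {
        frozenset(resolver(available, list(order)))
        for order in permutations(patterns)
    }


available = ["logs-1", "metrics-1", ".kibana_1"]  # hypothetical index names
distinct = resolutions_by_order(
    resolve, available, ["logs-*", "-.kibana*", "metrics-*"]
)

# A resolver whose output depends on pattern order would yield more than one set.
assert len(distinct) == 1, f"resolution depends on pattern order: {distinct}"
```

Passing the resolver in as a parameter keeps the check reusable against whichever implementation is under test.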
@scrawfor99 When you get a chance can you confirm if your change was in 2.6.0? |
@peternied @scrawfor99 I left a comment on the PR in core. The backport label was not added to the PR so it was not backported and released in 2.6.0. The soonest it will be released is 2.7.0. |
It's on the list for 2.7? cool :) |
Describe the bug
I'm using the S3 repository plugin to store snapshots. I have tested this previously with the same scripting, but now I can't restore snapshots anymore. I don't know the exact cause; two things have changed: I have moved to 1.2.4 and I have moved from SAML to OpenID. The calls for snapshot create/restore/list/etc. are still being done through basic auth.
The strange thing is, I can list all the snapshots, I can create snapshots, I just can't restore them. I'm not including global state or the security index:
Error:
Listing works:
Creating snapshot works:
Restoring fails:
all_access is mapped to the admin backend role:
Expected behavior
Snapshot is restored
Plugins
Default docker 1.2.4 plus s3 repo plugin
Also tried default docker 1.1 plus s3 repo plugin
Host/Environment (please complete the following information):
Docker 1.2.4 image on kubernetes