Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) #1652

Closed
sandervandegeijn opened this issue Jan 31, 2022 · 57 comments
Assignees
Labels
bug Something isn't working help wanted Community contributions are especially encouraged for these issues. Plugins

Comments

@sandervandegeijn
Copy link

sandervandegeijn commented Jan 31, 2022

Describe the bug
I'm using the S3 repository plugin to store snapshots. I've have tested this previously with the same scripting, but now I can't restore snapshots anymore. Don't know the exact cause, two things have changed: I have moved to 1.2.4 and I have moved from SAML to openid. The calls for the snapshot create/restore/list/etc are still being done through basic auth.

The strange thing is, I can list all the snapshots, I can create snapshots, I just can't restore them. I'm not including global state or the security index:

Error:

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
    },
    "status": 403
}

Listing works:

image

Creating snapshot works:

image

Restoring fails:

image

all_access is mapped to the admin backend role:

image

Expected behavior
Snapshot is restored

Plugins
Default docker 1.2.4 plus s3 repo plugin
Also tried default docker 1.1 plus s3 repo plugin

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):
Docker 1.2.4 image on kubernetes

Additional context
Add any other context about the problem here.

@sandervandegeijn sandervandegeijn added bug Something isn't working untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jan 31, 2022
@sandervandegeijn sandervandegeijn changed the title [BUG] Restoring snapshot failes with security_exception [BUG] 1.2.4 Restoring snapshot failes with security_exception (creating works, listing as well) Jan 31, 2022
@sandervandegeijn sandervandegeijn changed the title [BUG] 1.2.4 Restoring snapshot failes with security_exception (creating works, listing as well) [BUG] 1.2.4 Restoring snapshot: security_exception (creating OK, listing OK) Jan 31, 2022
@sandervandegeijn
Copy link
Author

sandervandegeijn commented Jan 31, 2022

Okay, this is strange, when I deliberately use an user with insufficient rights to list snapshots:

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permissions for **[cluster:admin/snapshot/get]** and User [name=telegraf_XXX, backend_roles=[telegraf], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions for **[cluster:admin/snapshot/get]** and User [name=telegraf_XXXX, backend_roles=[telegraf], requestedTenant=null]"
    },
    "status": 403
}

Note the part between ** **
While with the admin user restoring snapshots:

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permission**s for []** and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions **for [] a**nd User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
    },
    "status": 403
}

Shouldnt the manage_snapshots be listed in the bold part?

@sandervandegeijn
Copy link
Author

sandervandegeijn commented Feb 1, 2022

Okay, it's getting stranger and stranger. Same user as before:

{
    "indices": ["-.opendistro_security", "-.kibana_1"],
    "include_global_state" : false
}

Result same as before:

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
    },
    "status": 403
}

Opensearch logging:
[2022-02-01T15:18:09,678][WARN ][o.o.s.p.SecurityIndexAccessEvaluator] [opensearch-master-nodes-0] cluster:admin/snapshot/restore for '_all' indices is not allowed for a regular user

Admin_XXXX has admin access, so it's not an regular user

Okay, problem with the roles...nope, watch this:

{
    "indices": ["-.opendistro_security", "collector*" ],
    "include_global_state" : false
}

Gives:

{
    "error": {
        "root_cause": [
            {
                "type": "snapshot_restore_exception",
                "reason": "[CPS_POC_Elasticsearch:2022-01-31-21-18-4300002134/u-r3dC3FSpGH-icbXi9yng] cannot restore index [.kibana_1] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
            }
        ],
        "type": "snapshot_restore_exception",
        "reason": "[CPS_POC_Elasticsearch:2022-01-31-21-18-4300002134/u-r3dC3FSpGH-icbXi9yng] cannot restore index [.kibana_1] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
    },
    "status": 500
}

So, not a roles / security problem. Okay, I can restore, if I just exclude .kibana_1?

{
    "indices": ["-.opendistro_security", "-.kibana_1", "collector*"],
    "include_global_state" : false
}

Result:

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions for [] and User [name=admin_XXXX, backend_roles=[admin], requestedTenant=null]"
    },
    "status": 403
}

Okay, just to play with it, let's just delete the kibana index and try this:

DELETE .kibana*

{
    "indices": ["-.opendistro_security", "collector*"],
    "include_global_state" : false
}

Result:

{
    "accepted": true
}

Problem seems to be in the excluded indices combined with the security module?

@sandervandegeijn sandervandegeijn changed the title [BUG] 1.2.4 Restoring snapshot: security_exception (creating OK, listing OK) [BUG] 1.2.4 Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) Feb 1, 2022
@sandervandegeijn sandervandegeijn changed the title [BUG] 1.2.4 Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) [BUG] 1.1-1.2.4 Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) Feb 1, 2022
@anasalkouz anasalkouz added Plugins and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Feb 1, 2022
@dblock
Copy link
Member

dblock commented Feb 2, 2022

@anirudha maybe you can help with this?

@sandervandegeijn
Copy link
Author

Any news? :) If this problem is occuring more often it could disrupt restoring failed clusters.

@dblock
Copy link
Member

dblock commented Mar 1, 2022

I'll move this into the security repo and someone should followup.

@dblock dblock transferred this issue from opensearch-project/OpenSearch Mar 1, 2022
@sandervandegeijn
Copy link
Author

Would be nice if a bug fix could make it to the next release. For now I just snapshot the stuff that is really really necessary, but it would be nice to just do it the default way.

@JeffBolle
Copy link

This appears to impact 1.3 as well.

Even adding the snapshot_restore and manage_snapshot roles to our admin users via OpenID did not resolve the issue. I was able to get past it by using the admin certificate.

@peternied
Copy link
Member

peternied commented Mar 25, 2022

[Triage] @cliu123 Could you look into this issue and recommend next steps?

@peternied peternied changed the title [BUG] 1.1-1.2.4 Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) [BUG] Restoring snapshot: indices exclusion triggers security_exception (creating OK, listing OK) May 2, 2022
@peternied peternied added the help wanted Community contributions are especially encouraged for these issues. label May 2, 2022
@igorneos
Copy link

Any news?

@peternied
Copy link
Member

@cliu123 did you have a chance to look into this?

@sandervandegeijn
Copy link
Author

Guys? :) would be cool if this could be part of the 2.0 :)

@sandervandegeijn
Copy link
Author

sandervandegeijn commented Jun 22, 2022

Guys, with 2.0.1 my workaround with restoring specific indexes doesn't work anymore. Basically I can't recover my cluster (which was wiped, don't ask ;) ) with my default admin credentials.

I do not want to be pushy, but this is getting really annoying and breaks the snapshotting functionality completely......

image

image

curl also failes.

@sandervandegeijn
Copy link
Author

Trick with the admin certificate works though.

@sandervandegeijn
Copy link
Author

Guys... although we have solved it by using certificates this is quite an irritating bug, if you don't understand what's happening you won't be able to restore. When people are already in panic with a cluster that has exploded this can be frustrating.

@smlx
Copy link

smlx commented Jul 29, 2022

@ict-one-nl could you elaborate exactly what the workaround is for this?

@sandervandegeijn
Copy link
Author

Use certificate based authentication instead of basic auth, then it won't be a problem. We switched over to certificate based auth because of this.

@hquan5397
Copy link

hquan5397 commented Aug 26, 2022

@ict-one-nl Can you provide how to generate certificate to perform certificate based authentication in AWS?

@sandervandegeijn
Copy link
Author

We have deployed it locally in kubernetes, so I'm afraid I have little knowledge of the AWS setup :)

@stephen-crawford stephen-crawford self-assigned this Dec 13, 2022
@stephen-crawford
Copy link
Collaborator

Update 12/13: I went ahead and setup a Kubernetes deployment to begin the diagnosis process. I am going to reproduce the issue and start moving out from the snapshotRestoreEvaluator as this seems central to the issue.

@sandervandegeijn
Copy link
Author

Sounds great, thanks for the effort :)

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 14, 2022

Update 12/14:

  1. Completed setup of deployment using minikube, helm, postman. Am in the process of creating demo snapshot to test restore.
  2. Installed s3 repository plugin to all nodes and began looking at ways to isolate the issue. A big thing has been trying to determine the fastest way to modify the security version running with the helm chart installation.

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 15, 2022

Update 12/15:

I have reproduced the main errors listed in the top couple of posts of this issue. I have also verified that the configuration is valid. As mentioned, it does not seem that restore works properly with basic auth unless the user is a super admin.

Will continue to look into fixes tomorrow.

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 16, 2022

Update 12/16: I believe I have narrowed down the issue to the two most likely causes. As can be seen in the first attachment, when attempting to restore an index the SnapshotRestoreHelper is unable to read the information from the snapshot repository. In this case, I had 7 snapshots (the loop happened 7 times which is correct if the name is not found), but the names are not shown in the logs as they should be. The correct formatting of the log is also verified by the original code which reads log.debug("snapshot found: {} (UUID: {})", snapshotId.getName(), snapshotId.getUUID()); -- this shows it is not a datatype issue causing the log to not display the info.

Another possible issue is in the resolveIndexPatterns function of the IndexResolverReplacer. In the second attachment, you can see that the requested patterns has swapped from my original input of snapshot '3' to '*'. This would not be allowed by anyone other than a super admin (requiring the admin certificate) as we can see in the ProtectedIndexAccessEvalutor that requests for '_all' indices are not allowed by a regular user (this includes the admin user).

I am not yet sure which of these two problems is the root issue or if it is a combo of the two but hope this provides some insight and that you know it is being addressed.

Screen Shot 2022-12-16 at 4 05 57 PM

Screen Shot 2022-12-16 at 3 22 13 PM

Edit:

I also found that it is possible to get a hit in the SnapshotRestoreEvaluator for-loop on an old snapshot created with OpenSearch version 1.3 and now trying to be restored in 3.0.0. This behavior is odd because it shows that either something works differently in creating the snapshots from 2.0 onwards or that the snapshots are always discoverable but that the logger is not able to process the newer ones.

Screen Shot 2022-12-16 at 4 52 34 PM

@sandervandegeijn
Copy link
Author

Great work, good that other stuff became apparent as well. Next week I won't have access to a computer much, but if I can help after that week let me know.

@stephen-crawford
Copy link
Collaborator

Update 12/19: I added further logs today and went through trying to determine the exact flow of a request and where we could be running into an issue that would cause the request information to be lost. From what I could find, the earliest step of the process which is RestoreSnapshot specific is the Snapshot Restore Evaluator at which point we see that the request is still null. Continuing forward, I found that throughout the entire flow of the request the request information evaluates to null. To the best of my understanding this should not be the case. It seems like the Thread name is not being set either.

Attached is another snippet of the logs from my testing. A relevant section to this particular issue with the security exception you are encountering is that after failing to find snapshot repository 'my-fs-repository', snapshot 'my_snapshot_no_kib' the IndexResolverReplacer interprets the request as looking for [*] which is a Local All pattern and would be blocked from non-super-admins.

Screen Shot 2022-12-19 at 4 53 32 PM

@stephen-crawford
Copy link
Collaborator

Update 12/20: I went through the logs and found the point where the request gets swapped from its original form to the wildcard all. I believe it happens when the DlsFlsValveImpl invokes() the snapshot restore request. At this point the request elements read correctly as "indices" : ["-.opendistro_security"] but then after the resolved request becomes aliases=[*], allIndices=[*]. I could be mis-interpretting this but I will be looking into this specific exchange next.

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 21, 2022

Update 12/21: Went through logs further and tried bypassing a check for marking complete the presponse in SnapshotRestoreEvaluator. This did not fix the issue but did allow me to see further steps of the execution path. I added further logs to the execution path to look at possible jumps in the execution over needed request logic. Will continue to work on the issue and may bring in another contributor. Current changes: https://github.com/scrawfor99/security/tree/restore-snapshot

Steps to test & reproduce:

  1. Clone/fork branch linked above
  2. Navigate to directory and execute ./gradlew clean and then ./gradlew assemble
  3. Follow OpenSearch the beginning few steps of the documentation for setting up the docker-compose version of OpenSearch https://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml (be able to use docker-compose up to launch an OpenSearch instance)
  4. Create a local repository for your snapshots i.e. /Users/Username/MySnapshots and clone https://github.com/scrawfor99/OS-snapshot-restore
  5. Configure the paths properly to match your repository directory -- this will be to your snapshots repo in to wherever you end up putting the assembled security zip (check volumes section of the file under both nodes for this information)
  6. execute docker-compose up
  7. Navigate to dashboards on your web browser and sign in with the default credentials admin:admin
  8. Navigate to the dev tools page and execute
{
  "type": "fs",
  "settings": {
    "location": "/mnt/snapshots"
  }
}

This will create a snapshot repository for you to store a snapshot in. The repository name is just my-fs-repository
9. Add a snapshot to the repo with PUT _snapshot/my-fs-repository/my-snapshot
10. Now try to restore the snapshot using

POST _snapshot/my-fs-repository/my_snapshot/_restore
{
  "indices": "-.opendistro_security",
  "include_global_state": false
}

You should encounter the error discussed with the issue.
11. You can now return to the docker logs (the terminal you called docker-compose up from
12. You can search through the logs for the various debug statements I have added to help follow the execution process.

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 22, 2022

Update 12/22: @cwperks lent me a second pair of eyes and a different angle today and was able to determine that the issue appears to be coming from the parsing of the indices done by the opensearch-core snapshot utils. Basically when it is reading the indices, it treats the ordering differently based on whether a negated indices statement appears first or not in the indices field. That being said, because this issue has been long standing, I am going to continue to prioritize a fix based on the core end. That way the issue does not get buried as it had before.

This line is the culprit. Testing will need to be added and hopefully a correction does not break any existing functionality with the SnapshotUtils but I am happy that @cwperks and I were able to make some progress on this today.

@cwperks
Copy link
Member

cwperks commented Dec 22, 2022

TLDR; Order matters in the indices section of the request. The issue is coming from SnapshotUtils in core when a negated index is in the front. Relevant lines: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/snapshots/SnapshotUtils.java#L91-L95

RCA for the bug:

To find the root cause of the issue I added a new debugging statement after this line to print out the resolved indices: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/privileges/SnapshotRestoreEvaluator.java#L97

Then I ran the following queries on a snapshot that contained:

.opendistro_security
.kibana_1
opensearch_dashboards_sample_data_flights
security-auditlog-2022.12.22

These were the requests and resolved indices:

// -.opendistro_security in the front of the list
POST _snapshot/my-first-repo/20221222-snapshot-all/_restore
{
    "indices": ["-.opendistro_security", "open*", "sec*"],
    "include_global_state" : false
}

// Restoring indices: [opensearch_dashboards_sample_data_flights, security-auditlog-2022.12.22, .kibana_1]
// -.opendistro_security in the middle of the list
POST _snapshot/my-first-repo/20221222-snapshot-all/_restore
{
    "indices": ["open*", "-.opendistro_security", "sec*"],
    "include_global_state" : false
}

// Restoring indices: [opensearch_dashboards_sample_data_flights, security-auditlog-2022.12.22]

Wait a minute, why do these resolve to a different set of indices???

That led us to look into SnapshotUtils which is resolving the request to a list of indices to resolve and that's where we found specific logic for when a negation was the first in the list (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/snapshots/SnapshotUtils.java#L91-L95):

// if its the first, fill it with all the indices...
if (i == 0) {
    result = new HashSet<>(availableIndices);
}

All availableIndices are being added to the HashSet and the negated index is removed later on here: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/snapshots/SnapshotUtils.java#L115

The thing is, if the negation is later on in the list then it does not add all available indices to the set and only removes the index in the negation if it was previously added to the HashSet.

The quickest solution I can think of is to add sorting logic in SnapshotUtils.filterIndices to sort selectedIndices so that negations always appear at the end of the array.

@cwperks
Copy link
Member

cwperks commented Dec 22, 2022

The issue really is with a mix of positive (my-index*) and negative (-.opendistro_security) patterns in the indices portion.

  1. If indices only contains negative. i.e. "indices": ["-.opendistro_security", "-.kibana_1"] then its interpreted as everything except these 2 indices.

  2. If indices only contains positive. i.e. "indices": ["open*", "sec*"] then its interpreted as everything matching these patterns.

  3. It gets tricky with a mix of positive and negative. i.e. "indices": ["-.opendistro_security", "open*"] which is where this behavior arose. If you only want indices matching a certain pattern then you don't need the negation in the restore snapshot query.

I'd like to think through scenarios, but when there is both positive and negative then I think the positive should be evaluated first and the negative taken away from what those have resolved to.

@sandervandegeijn
Copy link
Author

Thanks again for all the effort guys, sounds like you're homing in on the actual problem. Great work!

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Dec 23, 2022

Opened a PR to resolve the issue: opensearch-project/OpenSearch#5626

@stephen-crawford
Copy link
Collaborator

@ict-one-nl, @davidlago , @cwperks , @CEHENKLE

I am happy to share that the PR fixing this behavior has been merged into the main branch of the core repository and will be backported. I want to take the opportunity to thank you for your patience in the resolution of this issue since its original opening as well as taking the time to attend our public triaging meeting to bring further attention to this issue.

Supporting our community is always the number one priority. I have copied over the updated functionality so that you know what to expect when making calls under the corrected restore logic.

The new behavior is as follows:

  1. (Querying a single negated snapshot): If the body of the request reads indices: "-bar", the query will return all indices which are not "bar".
  2. (Querying a single negated snapshot wildcard): If the body of the request reads indices: "-bar*", the query will return all indices which are not a subset of "bar*".
  3. (Querying multiple negated snapshots): If the body of the request reads indices: "-bar, -baz", the query will return all indices which are not "bar" or `"baz".
  4. (Querying multiple negated snapshot wildcards): If the body of the request reads indices: "-bar*, -baz*", the query will return all indices which are not a subset of "bar*" or a subset of `"baz*".
  5. (Querying a single snapshot): If the body of the request reads indices: "bar", the query will return "bar".
  6. (Querying multiple snapshot wildcards): If the body of the request reads indices: "bar*, baz*", the query will return all indices which are a subset of "bar*" or a subset of `"baz*".
  7. (Querying multiple mixed snapshot wildcards): If the body of the request reads indices: "-bar*, ba*", the query will return all indices which are a subset of "ba*" but not a subset of "bar*". This was not the case previously.
  8. (Querying multiple mixed snapshot wildcards): If the body of the request reads indices: "ba*, -bar*", the query will return all indices which are a subset of "ba*" but not a subset of "bar*".

Now behavior is persevered irregardless of the order of positive and negated wildcards queries.

Thank you and Happy New Year!

@peternied
Copy link
Member

@scrawfor99 Did we add a test that verifies the bug is fixed in the security repo?

@stephen-crawford
Copy link
Collaborator

stephen-crawford commented Jan 3, 2023

@peternied we have not. Since the issue was related to the incorrect parsing of the snapshot indices in core, the tests were added in core. If you feel additional tests in security are required as well they can be added though it will mostly be duplicating the tests in core just going through the entire snapshot restore process as well instead of just checking the parsing which was failing. Do you think that is something that should be added given this?

Let me know and if so I will take care of it once the core build finishes! :)

@peternied
Copy link
Member

If we can I'd recommend we test in scenarios that most closely reflect what customers encounter, I'd hate for another change in core to go in that seems innocuous but it breaks snapshot restore via the security plugin

@sandervandegeijn
Copy link
Author

nice thank you!

@stephen-crawford
Copy link
Collaborator

@peternied, I opened a PR in the security codebase to add a test which will check that the behavior is consistent during the SnapshotRestore operation. So once this gets merged everything should be set and this issue will be double covered with new testing in core and security.

@sandervandegeijn
Copy link
Author

sandervandegeijn commented Mar 9, 2023

Has the PR been merged? I just tried to do a disaster recovery as a test and encountered a very familiar error:

image

2.5.0
image

2.6.0
image

Dashboards 2.5.0 and 2.6.0 tested.

Authentication over openid through AzureAD

:)

@peternied
Copy link
Member

@scrawfor99 When you get a chance can you confirm if your change was in 2.6.0?

@cwperks
Copy link
Member

cwperks commented Mar 10, 2023

@peternied @scrawfor99 I left a comment on the PR in core. The backport label was not added to the PR so it was not backported and released in 2.6.0. The soonest it will be released is 2.7.0.

@sandervandegeijn
Copy link
Author

It's on the list for 2.7? cool :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Community contributions are especially encouraged for these issues. Plugins
Projects
None yet
Development

No branches or pull requests