[Fleet] refactored bulk update tags retry #147594

Merged: 13 commits, Dec 20, 2022

Conversation

juliaElastic (Contributor) commented Dec 15, 2022

Summary

Fixes #144161

As discussed in #144161 (comment), the existing implementation of update tags doesn't work well with real agents: there are many version conflicts with agent checkins, even when adding or removing a single tag.
Refactored the logic to make retries more efficient:

  • Instead of aborting the whole bulk action on conflicts, changed the conflict strategy to 'proceed'. If an action on 50k agents hits 1k conflicts, only the 1k conflicted agents are retried instead of all 50k, which makes another conflict on retry less likely.
  • Because of this, a retry has to know which agents don't yet have the tag added or removed. For this, added an additional filter to the updateByQuery request. The filter is only added when there is exactly one tagsToAdd or exactly one tagsToRemove. This is the main use case from the UI, and handling other cases would complicate the logic (each additional tag to add or remove would add another OR clause, which would match more agents and make conflicts more likely).
  • Added the same filter to the initial request as well (not only to retries) to avoid unnecessary work. E.g. if the user adds a tag to 50k agents but 48k already have it, only the remaining 2k agents need to be updated.
  • As a side effect, 'Agent activity' shows the number of agents actually updated, not the total selected. I think this is not really a problem for update tags.
  • Cleaned up some of the UI logic, because conflicts are now fully handled on the backend.
  • Locally I couldn't reproduce the conflict with agent checkins, even with 1k horde agents. I'll try to test in cloud with more real agents.
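
For readers who want to see the shape of the request, here is a minimal sketch of the filter idea described in the list above. It is not the Fleet code: `tagFilter`, `buildRequest`, and the plain `tags` term query are hypothetical, and treating "exactly one tag" as "one tag on one side and none on the other" is an assumption of this sketch. Only the `conflicts: 'proceed'` option and the one-tag rule come from the description.

```typescript
// Hypothetical helper names; a sketch of the idea, not the Fleet implementation.
type EsQuery = Record<string, unknown>;

interface UpdateTagsRequest {
  conflicts: 'proceed'; // keep going on version conflicts instead of aborting
  query: EsQuery;
}

// Build the extra filter only when exactly one tag is being added or removed.
function tagFilter(tagsToAdd: string[], tagsToRemove: string[]): EsQuery | undefined {
  if (tagsToAdd.length === 1 && tagsToRemove.length === 0) {
    // Adding: match only agents that do not have the tag yet.
    return { bool: { must_not: [{ term: { tags: tagsToAdd[0] } }] } };
  }
  if (tagsToRemove.length === 1 && tagsToAdd.length === 0) {
    // Removing: match only agents that still have the tag.
    return { term: { tags: tagsToRemove[0] } };
  }
  return undefined; // any other combination: no extra filter (assumption)
}

// Combine the user's selection (baseQuery) with the optional tag filter.
function buildRequest(
  baseQuery: EsQuery,
  tagsToAdd: string[],
  tagsToRemove: string[]
): UpdateTagsRequest {
  const extra = tagFilter(tagsToAdd, tagsToRemove);
  const filter = extra ? [baseQuery, extra] : [baseQuery];
  return { conflicts: 'proceed', query: { bool: { filter } } };
}
```

Because the filter excludes agents that already have the desired state, the same request can serve as both the initial call and every retry: agents updated in an earlier attempt simply stop matching.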

To verify:

  • Enroll 50k agents (I used 50k with the create_agents script and 1k with horde). Enroll 50k with horde if possible.
  • Select all agents in the UI and try to add or remove one or more tags.
  • Expect the changes to propagate quickly (up to 1 minute). It might take a few refreshes to see the result in the agent list and tags list, because the UI polls the agents every 30s. The tags list is expected to show incorrect data temporarily, because the action is async.

E.g. removed the test3 tag and added the add tag in quick succession:
image
image

The logs show how many version_conflicts occurred on each run; the count decreased with each retry.

[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
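
The retry behaviour visible in the log output above can be sketched as a small loop that keeps going only while the last attempt reported version conflicts. This is an illustration under stated assumptions: `runWithRetries` and `RetrySummary` are hypothetical names, the attempt callback is synchronous for simplicity (the real action runs asynchronously in scheduled tasks), and the three responses are copied from the log lines for action 90acd541-19ac-4738-b3d3-db32789233de.

```typescript
// Subset of the update-by-query response fields that appear in the logs.
interface UpdateByQueryResult {
  total: number;
  updated: number;
  version_conflicts: number;
}

interface RetrySummary {
  attempts: number;
  updated: number;
  remainingConflicts: number;
}

// `attempt` stands in for one updateByQuery call (hypothetical, synchronous here).
function runWithRetries(
  attempt: (attemptNo: number) => UpdateByQueryResult,
  maxAttempts: number
): RetrySummary {
  let attempts = 0;
  let updated = 0;
  let remainingConflicts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    const res = attempt(attempts);
    updated += res.updated;
    remainingConflicts = res.version_conflicts;
    if (remainingConflicts === 0) break; // nothing left to retry
  }
  return { attempts, updated, remainingConflicts };
}

// Replay of the three responses logged for the first action.
const logged: UpdateByQueryResult[] = [
  { total: 52000, updated: 41143, version_conflicts: 10857 },
  { total: 10857, updated: 9857, version_conflicts: 1000 },
  { total: 1000, updated: 1000, version_conflicts: 0 },
];
const summary = runWithRetries((n) => logged[n - 1], logged.length);
```

Each attempt's `total` equals the previous attempt's `version_conflicts`, which is the effect of retrying only the conflicted agents; the three `updated` values add up to the 52000 agents originally selected.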

Checklist

juliaElastic added the release_note:skip, v8.7.0, and v8.6.1 labels on Dec 15, 2022
juliaElastic requested a review from a team as a code owner on December 15, 2022 09:32
juliaElastic self-assigned this on Dec 15, 2022
botelastic (bot) added the Team:Fleet label on Dec 15, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/fleet (Team:Fleet)

juliaElastic added the ci:cloud-deploy label on Dec 15, 2022
juliaElastic requested a review from a team on December 16, 2022 08:13

juliaElastic (Contributor Author) commented Dec 16, 2022

Tested on ECE with 20k horde agents. The new logic works fine, and the conflicted agents are updated within a few retries.
The only issue I found is that Agent activity does not record the results for the conflicted agents; investigating.

EDIT: found the reason: the logic that generates ids for action results did not produce unique ids across retries (it always assigned 0, 1, 2, ...). Changed it to generate a UUID when the real agentId is not known.
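
A small sketch of why position-based ids lose results across retries, and how a generated id avoids it. All names here (`indexIds`, `resultId`, `countDistinct`) are hypothetical, and the id generator is injected so the example stays self-contained; a real caller would pass a UUID generator.

```typescript
// Pre-fix behaviour as described: ids are positions in the batch,
// so every retry starts again at "0".
function indexIds(count: number): string[] {
  const ids: string[] = [];
  for (let i = 0; i < count; i++) ids.push(String(i));
  return ids;
}

// Post-fix idea: use the agent id when it is known, otherwise a fresh id.
function resultId(agentId: string | undefined, newId: () => string): string {
  return agentId !== undefined ? agentId : newId();
}

// Number of distinct document ids, i.e. how many results would survive
// if documents with the same id collide.
function countDistinct(ids: string[]): number {
  const seen: Record<string, boolean> = {};
  for (let i = 0; i < ids.length; i++) seen[ids[i]] = true;
  return Object.keys(seen).length;
}

// First attempt writes 3 results, the retry writes 2 more.
const collidingIds = indexIds(3).concat(indexIds(2));

let counter = 0;
const fakeUuid = () => 'generated-' + String(counter++); // stand-in for a UUID
const uniqueIds: string[] = [];
for (let i = 0; i < 5; i++) uniqueIds.push(resultId(undefined, fakeUuid));
```

With position-based ids the five results collapse to three distinct ids, which matches the symptom of retry results going missing; with generated ids all five stay distinct.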

image

image

juliaElastic (Contributor Author) commented:

With the latest fix, I tried again with 20k horde agents, and the action results now show up correctly.
Hosted agents are ignored in async execution: e.g. 20001 agents were actioned but 20000 were actually updated, because the hosted agent is filtered out.
image
image

jlind23 (Contributor) commented Dec 16, 2022

Thanks @juliaElastic, glad to hear that the retry mechanism solves almost all our problems at scale.

juliaElastic (Contributor Author) commented Dec 16, 2022

I noticed a discrepancy in Agent activity at around 40k agents.
The update tags action is stuck in the In progress state because the cardinality aggregation is not precise. In reality, all agents completed the update tags action.
I'll tweak the Agent activity logic to just use the doc count for the update tags action. (The cardinality agg was introduced for cases where agents have to ack an action, which can happen more than once. Update tags doesn't involve an agent ack.)
See 3db4cb6

image

juliaElastic (Contributor Author) commented:

@elasticmachine merge upstream

{ pitId: '' }
).runActionAsyncWithRetry();
}
return await new UpdateAgentTagsActionRunner(
Review comment (Contributor Author):

Simplified the logic to use retry for all update tags kuery actions.
As reported here, the version conflict happened even with fewer than 10k agents, a case that was not retried before.

Could reproduce in an ECE instance by adding a tag on 5k horde agents and getting this response from bulk API:

{"statusCode":500,"error":"Internal Server Error","message":"version conflict of 1865 agents"}

Review comment (Contributor Author):

Tested on the PR cloud deployment: updating tags on 5000 agents completes successfully (async).
image

nchaulet self-requested a review on December 19, 2022 17:39

nchaulet (Member) left a comment:

Looks good to me!

hop-dev (Contributor) left a comment:

Tested with "fake" agents locally and works, a few nits but nothing blocking 👍

: Math.min(
docCount,
// only using cardinality count when count lower than precision threshold
docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount,
Review comment (Contributor):

Aside: Why is cardinalityCount used, can't we always use the docCount here?

Review comment (Contributor Author):

Cardinality was introduced for actions that agents can potentially ack multiple times, e.g. upgrade, so that acks from one agent are counted only once.
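
To make the counting rule in this thread concrete, here is a hedged sketch. The function name `completedCount`, the boolean flag, and the value 40000 for `PRECISION_THRESHOLD` are assumptions for illustration; the PR only shows the `Math.min(...)` expression and the constant's name. The underlying fact is that the Elasticsearch cardinality aggregation is approximate, with accuracy controlled by `precision_threshold`.

```typescript
// Assumed value for illustration; the PR only references the constant by name.
const PRECISION_THRESHOLD = 40000;

// docCount: number of action result documents.
// cardinalityCount: approximate number of distinct agents among those documents.
function completedCount(
  isUpdateTags: boolean,
  docCount: number,
  cardinalityCount: number
): number {
  // Update tags has no agent ack, so each agent produces one result document
  // and the exact doc count can be used directly.
  if (isUpdateTags) return docCount;
  // Other actions can be acked more than once per agent, so prefer the
  // distinct count, but only while it is below the precision threshold.
  return Math.min(
    docCount,
    docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount
  );
}
```

For example, an upgrade with 10 ack documents from 7 distinct agents counts as 7, while an update tags action with 45000 result documents counts as 45000 even if the approximate cardinality comes back slightly lower.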

juliaElastic enabled auto-merge (squash) on December 20, 2022 08:42

kibana-ci (Collaborator) commented Dec 20, 2022

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| fleet | 898.9KB | 898.7KB | -218.0B |

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 61 | 67 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 439 | 445 | +6 |
| total | | | +20 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 70 | 76 | +6 |
| osquery | 110 | 117 | +7 |
| securitySolution | 516 | 522 | +6 |
| total | | | +21 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

juliaElastic removed the ci:cloud-deploy label on Dec 20, 2022
juliaElastic merged commit 687987a into elastic:main on Dec 20, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request on Dec 20, 2022
### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 687987a)
kibanamachine (Contributor) commented:

💚 All backports created successfully

Status Branch Result
8.6

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Dec 20, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] refactored bulk update tags retry
(#147594)](#147594)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-20T09:36:36Z","message":"[Fleet]
refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes
#144161
discussed\r\n[here](#144161 (comment)
existing implementation of update tags doesn't work well with
real\r\nagents, as there are many conflicts with checkin, even when
trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries
more efficient:\r\n- Instead of aborting the whole bulk action on
conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if
an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but
only the 1k conflicts,\r\nthis makes it less likely to conflict on
retry.\r\n- Because of this, on retry we have to know which agents don't
yet have\r\nthe tag added/removed. For this, added an additional filter
to the\r\n`updateByQuery` request. Only adding the filter if there is
exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use
case from the\r\nUI, and handling other cases would complicate the logic
more (each\r\nadditional tag to add/remove would result in another OR
query, which\r\nwould match more agents, making conflicts more
likely).\r\n- Added this additional query on the initial request as well
(not only\r\nretries) to save on unnecessary work e.g. if the user tries
to add a tag\r\non 50k agents, but 48k already have it, it is enough to
update the\r\nremaining 2k agents.\r\n- This improvement has the effect
that 'Agent activity' shows the real\r\nupdated agent count, not the
total selected. I think this is not really\r\na problem for update
tags.\r\n- Cleaned up some of the UI logic, because the conflicts are
fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce
the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try
to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll
50k agents (I used 50k with create_agents script, and 1k with\r\nhorde).
Enroll 50k with horde if possible.\r\n- Select all on UI and try to
add/remove one or more tags\r\n- Expect the changes to propagate quickly
(up to 1m). It might take a\r\nfew refreshes to see the result on agent
list and tags list, because the\r\nUI polls the agents every 30s. It is
expected that the tags list\r\ntemporarily shows incorrect data because
the action is async.\r\n\r\nE.g. removed `test3` tag and added `add` tag
quickly:\r\n<img width=\"1776\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img
width=\"422\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe
logs show the details of how many `version_conflicts` were there,\r\nand
it decreased with
retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet]
{\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet]
{\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 10857
agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 26245
agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet]
{\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet]
Retry #1 of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
failed: version conflict of 1000
agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO
][plugins.fleet] processed 26245 agents, took
5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet]
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO
][plugins.fleet] processed 1000 agents, took
379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"687987aa9ce56ce359f722485330179a4807d79a","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147594,"url":"#147594
","sha":"687987aa9ce56ce359f722485330179a4807d79a"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"#147594
","sha":"687987aa9ce56ce359f722485330179a4807d79a"}},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request Dec 23, 2022
## Summary

Fixes elastic#144161

As discussed
[here](elastic#144161 (comment)),
the existing implementation of update tags doesn't work well with real
agents: updates frequently conflict with agent check-ins, even when
adding or removing a single tag.
Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the
conflict strategy to 'proceed'. If an action on 50k agents hits 1k
conflicts, only those 1k agents are retried instead of all 50k, which
makes another conflict on retry less likely.
- Because of this, a retry has to know which agents don't yet have the
tag added/removed. To find them, added an additional filter to the
`updateByQuery` request. The filter is only added when there is exactly
one `tagsToAdd` or one `tagsToRemove`. This is the main use case from the
UI, and handling other cases would complicate the logic (each
additional tag to add/remove would result in another OR query, which
would match more agents, making conflicts more likely).
- Added this additional filter to the initial request as well (not only
retries) to avoid unnecessary work. For example, if the user tries to add
a tag to 50k agents but 48k already have it, only the remaining 2k agents
need to be updated.
- As a side effect, 'Agent activity' now shows the number of agents
actually updated, not the total selected. I think this is not really
a problem for update tags.
- Cleaned up some of the UI logic, because conflicts are now fully
handled on the backend.
- Locally I couldn't reproduce the conflict with agent checkins, even
with 1k horde agents. I'll try to test in cloud with more real agents.
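
The filter and conflict strategy described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `TagUpdate` shape, the helper names, the `tags` field, and the `.fleet-agents` index name are assumptions, and the update script is omitted. Only `conflicts: 'proceed'` is taken directly from the description.

```typescript
// Hypothetical input shape; the names follow the PR description.
interface TagUpdate {
  tagsToAdd: string[];
  tagsToRemove: string[];
}

type QueryClause = Record<string, unknown>;

// Returns an extra filter only when exactly one tag is added or removed.
// In every other case the request is left un-narrowed.
function buildTagFilter({ tagsToAdd, tagsToRemove }: TagUpdate): QueryClause | undefined {
  if (tagsToAdd.length === 1 && tagsToRemove.length === 0) {
    // Match only agents that do not have the tag yet (assumed field: `tags`).
    return { bool: { must_not: { term: { tags: tagsToAdd[0] } } } };
  }
  if (tagsToRemove.length === 1 && tagsToAdd.length === 0) {
    // Match only agents that still carry the tag.
    return { term: { tags: tagsToRemove[0] } };
  }
  return undefined;
}

// Builds the request parameters; the update script is intentionally left out.
function buildUpdateByQueryRequest(baseQuery: QueryClause, update: TagUpdate) {
  const extra = buildTagFilter(update);
  return {
    index: '.fleet-agents', // assumed index name
    conflicts: 'proceed' as const, // keep going on version conflicts
    query: { bool: { filter: extra ? [baseQuery, extra] : [baseQuery] } },
  };
}
```

Because the same filter is applied on the first request and on every retry, each attempt only matches agents that still need the change.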

To verify:
- Enroll 50k agents (I used 50k with the create_agents script, and 1k with
horde). Enroll 50k with horde if possible.
- Select all in the UI and try to add/remove one or more tags.
- Expect the changes to propagate quickly (within 1 minute). It might take a
few refreshes to see the result in the agent list and tags list, because the
UI polls the agents every 30s. The tags list is expected to temporarily
show incorrect data because the action is async.

E.g. removed `test3` tag and added `add` tag quickly:
<img width="1776" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show how many `version_conflicts` occurred on each attempt;
the count decreased with each retry.

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```
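
The retry decision visible in these logs can be sketched as a small pure function. This is an illustration under stated assumptions, not the Fleet implementation: `decideNextStep`, the `NextStep` type, and the `maxRetries` value of 3 are hypothetical. The three responses replayed below are the ones logged above for action `90acd541-19ac-4738-b3d3-db32789233de`.

```typescript
// Subset of the update-by-query response fields that appear in the logs.
interface UpdateByQueryResult {
  total: number;
  updated: number;
  version_conflicts: number;
}

type NextStep =
  | { kind: 'done' }
  | { kind: 'retry'; remaining: number }
  | { kind: 'failed'; remaining: number };

// Decides what to do after one attempt: finish, schedule a retry for the
// conflicting agents, or give up once the retry budget is used.
function decideNextStep(
  result: UpdateByQueryResult,
  retryCount: number,
  maxRetries: number
): NextStep {
  if (result.version_conflicts === 0) {
    return { kind: 'done' };
  }
  if (retryCount < maxRetries) {
    return { kind: 'retry', remaining: result.version_conflicts };
  }
  return { kind: 'failed', remaining: result.version_conflicts };
}

// Responses copied from the log: initial attempt, retry #1, retry #2.
const logged: UpdateByQueryResult[] = [
  { total: 52000, updated: 41143, version_conflicts: 10857 },
  { total: 10857, updated: 9857, version_conflicts: 1000 },
  { total: 1000, updated: 1000, version_conflicts: 0 },
];

const maxRetries = 3; // assumed retry budget
const steps = logged.map((result, retryCount) =>
  decideNextStep(result, retryCount, maxRetries)
);
```

With the logged numbers, `steps` is a retry for 10857 agents, a retry for 1000 agents, and then `done`. In each response `updated + version_conflicts` equals `total`, and each retry's `total` equals the previous attempt's `version_conflicts`, which is what retrying only the conflicting agents should produce.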

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Labels
release_note:skip (Skip the PR/issue when compiling release notes), Team:Fleet (Team label for Observability Data Collection Fleet team), v8.6.0, v8.6.1, v8.7.0
Development

Successfully merging this pull request may close these issues.

[Fleet] ActionId for Tag removal completes but the tag is still in the list of tags
7 participants