
ILM stuck waiting for snapshot while already executed #62581

Open
cambierr opened this issue Sep 17, 2020 · 3 comments
Labels
>bug · :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · Team:Data Management (Meta label for data/management team)

Comments

cambierr commented Sep 17, 2020

Elasticsearch version (bin/elasticsearch --version): 7.8.0

Plugins installed: none (well, just xpack basic)

JVM version (java -version): OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14.0.1+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system): Debian 10.1

Description of the problem including expected versus actual behavior:

I configured a snapshot policy called monthly-export that runs on the first of every month at 5 AM and targets all indices of the previous month using the <*_{now/M-1M{yyyy.MM}}.*> pattern. Snapshots are executed without any failure.

I then configured an ILM policy with a delete phase 7 days after index creation, gated by a wait_for_snapshot action pointing at my monthly-export SLM policy.

I would expect that on each 1st of the month, all indices of the previous month from the 1st to the 23rd (i.e. those already older than 7 days) would be deleted according to the ILM policy.

Instead, they all stay in a status like waiting for policy 'monthly-export' to be executed since Mon Aug 10 14:46:58 UTC 2020.
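
(That status is what the ILM explain API reports; for reference, a minimal way to check it, using one of the index names from the logs below as an example:)

GET xxx-production_metrics-raw_2020.08.04/_ilm/explain

The step_info field of the response carries the "waiting for policy ..." message quoted above.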

Steps to reproduce:
Create an SLM policy with this body (the policy ID has to match the one referenced by the ILM policy below, i.e. monthly-export):

PUT _slm/policy/monthly-export
{
  "name": "<monthly-{now/M-1M{yyyy.MM}}>",
  "schedule": "0 0 5 1 * ?",
  "repository": "eu-west-2-elasticsearch-snapshots",
  "config": {
    "indices": "<*_{now/M-1M{yyyy.MM}}.*>",
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "365d"
  }
}
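
(For reference, assuming the policy fires on 2020-09-01 at 05:00, the date math above should resolve roughly like this; an illustration, not output from the cluster:)

<monthly-{now/M-1M{yyyy.MM}}>   ->  monthly-2020.08
<*_{now/M-1M{yyyy.MM}}.*>       ->  *_2020.08.*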

Then create this ILM policy:

PUT _ilm/policy/logs_production
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "wait_for_snapshot": {
            "policy": "monthly-export"
          },
          "delete": {}
        }
      }
    }
  }
}
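
(For completeness: the indices pick up this policy via their index.lifecycle.name setting; a minimal sketch of how that can be wired up with an index template. The template name and index pattern below are placeholders, not the exact ones used here.)

PUT _index_template/metrics-raw-template
{
  "index_patterns": ["xxx-production_metrics-raw_*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_production"
    }
  }
}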

Wait for the first of the next month and... nothing happens; the indices are not deleted :(

Provide logs (if relevant):

[2020-09-17T18:37:38,032][ERROR][o.e.c.s.MasterService    ] [xxx-elastic-1] exception thrown by listener notifying of failure from [ilm-execute-cluster-state-steps [{"phase":"delete","action":"wait_for_snapshot","name":"wait-for-snapshot"} => {"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}]]
org.elasticsearch.ElasticsearchException: policy [metrics-raw_production] for index [xxx-production_metrics-raw_2020.08.04] failed on step [{"phase":"delete","action":"wait_for_snapshot","name":"wait-for-snapshot"}].
	at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.onFailure(ExecuteStepsUpdateTask.java:203) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:513) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.notifyFailedTasks(MasterService.java:446) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:220) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.8.0.jar:7.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
	Suppressed: java.lang.IllegalArgumentException: step [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] for index [xxx-production_metrics-raw_2020.08.04] with policy [metrics-raw_production] does not exist
		at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:84) ~[?:?]
		at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:105) ~[?:?]
		at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.execute(ExecuteStepsUpdateTask.java:135) ~[?:?]
		at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.8.0.jar:7.8.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.lang.IllegalArgumentException: step [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] for index [xxx-production_metrics-raw_2020.08.04] with policy [metrics-raw_production] does not exist
	at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:84) ~[?:?]
	at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:105) ~[?:?]
	at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.execute(ExecuteStepsUpdateTask.java:135) ~[?:?]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 10 more
cambierr added the >bug and needs:triage (Requires assignment of a team area label) labels on Sep 17, 2020
@cambierr (Author) commented:

Also, could you confirm whether the wait_for_snapshot action requires at least one snapshot that actually contains the index, rather than just at least one "run" of the given policy, even if that run did not contain the index?
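
(For reference, the two timestamps that seem to be compared can be inspected with the following calls; a sketch, assuming the policy ID monthly-export and one of the index names from the logs above:)

GET _slm/policy/monthly-export
GET xxx-production_metrics-raw_2020.08.04/_ilm/explain

The first shows the policy's last_success and next_execution times, the second the time at which the index entered its current phase/step.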

costin added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management) label on Sep 18, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

elasticmachine added the Team:Data Management (Meta label for data/management team) label on Sep 18, 2020
danielmitterdorfer removed the needs:triage (Requires assignment of a team area label) label on Oct 13, 2020
@stefnestor (Contributor) commented:

FWIW, this potentially relates to #69642 and #62164 when using date math in the SLM config.indices?
