
ILM stuck waiting for snapshot while already executed #62581

Open
cambierr opened this issue Sep 17, 2020 · 3 comments
Labels
>bug · :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · Team:Data Management (Meta label for data/management team)

Comments

cambierr commented Sep 17, 2020

Elasticsearch version (bin/elasticsearch --version): 7.8.0

Plugins installed: none (well, just xpack basic)

JVM version (java -version): OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14.0.1+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system): Debian 10.1

Description of the problem including expected versus actual behavior:

I configured a snapshot policy called monthly-export that runs on the first of every month at 5 AM and targets all indices of the previous month using the <*_{now/M-1M{yyyy.MM}}.*> pattern. Snapshots are executed without any failure.

I then configured an ILM policy with a delete phase 7 days after index creation, gated by a wait_for_snapshot action pointing at my monthly-export SLM policy.

I would expect that on each 1st of the month, all indices of the previous month from the 1st to the 23rd (i.e. those already older than 7 days) would be deleted according to the ILM policy.

Instead, they all stay in a status like waiting for policy 'monthly-export' to be executed since Mon Aug 10 14:46:58 UTC 2020.
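
(That status is what the ILM explain API reports; for reference, a minimal way to check it, using one of the index names from the logs below as an example:)

GET xxx-production_metrics-raw_2020.08.04/_ilm/explain

The step_info field of the response carries the "waiting for policy ..." message quoted above.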

Steps to reproduce:
Create an SLM policy with this body (the policy ID has to match the one referenced by the ILM policy below, i.e. monthly-export):

PUT _slm/policy/monthly-export
{
  "name": "<monthly-{now/M-1M{yyyy.MM}}>",
  "schedule": "0 0 5 1 * ?",
  "repository": "eu-west-2-elasticsearch-snapshots",
  "config": {
    "indices": "<*_{now/M-1M{yyyy.MM}}.*>",
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "365d"
  }
}
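
(For reference, assuming the policy fires on 2020-09-01 at 05:00, the date math above should resolve roughly like this; an illustration, not output from the cluster:)

<monthly-{now/M-1M{yyyy.MM}}>   ->  monthly-2020.08
<*_{now/M-1M{yyyy.MM}}.*>       ->  *_2020.08.*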

Then create this ILM policy:

PUT _ilm/policy/logs_production
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "wait_for_snapshot": {
            "policy": "monthly-export"
          },
          "delete": {}
        }
      }
    }
  }
}
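
(For completeness: the indices pick up this policy via their index.lifecycle.name setting; a minimal sketch of how that can be wired up with an index template. The template name and index pattern below are placeholders, not the exact ones used here.)

PUT _index_template/metrics-raw-template
{
  "index_patterns": ["xxx-production_metrics-raw_*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_production"
    }
  }
}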

Wait for the first of the next month and... nothing happens; the indices are not deleted :(

Provide logs (if relevant):

[2020-09-17T18:37:38,032][ERROR][o.e.c.s.MasterService    ] [xxx-elastic-1] exception thrown by listener notifying of failure from [ilm-execute-cluster-state-steps [{"phase":"delete","action":"wait_for_snapshot","name":"wait-for-snapshot"} => {"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}]]
org.elasticsearch.ElasticsearchException: policy [metrics-raw_production] for index [xxx-production_metrics-raw_2020.08.04] failed on step [{"phase":"delete","action":"wait_for_snapshot","name":"wait-for-snapshot"}].
	at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.onFailure(ExecuteStepsUpdateTask.java:203) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:513) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.notifyFailedTasks(MasterService.java:446) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:220) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.8.0.jar:7.8.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
	Suppressed: java.lang.IllegalArgumentException: step [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] for index [xxx-production_metrics-raw_2020.08.04] with policy [metrics-raw_production] does not exist
		at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:84) ~[?:?]
		at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:105) ~[?:?]
		at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.execute(ExecuteStepsUpdateTask.java:135) ~[?:?]
		at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.8.0.jar:7.8.0]
		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.8.0.jar:7.8.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.lang.IllegalArgumentException: step [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] for index [xxx-production_metrics-raw_2020.08.04] with policy [metrics-raw_production] does not exist
	at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:84) ~[?:?]
	at org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:105) ~[?:?]
	at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.execute(ExecuteStepsUpdateTask.java:135) ~[?:?]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 10 more
cambierr added the >bug and needs:triage (Requires assignment of a team area label) labels on Sep 17, 2020
@cambierr (Author) commented:

Also, could you confirm whether the wait_for_snapshot action requires at least one snapshot that actually contains the index, rather than just at least one "run" of the given policy, even if that run did not contain the index?
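
(For reference, the two timestamps that seem to be compared can be inspected with the following calls; a sketch, assuming the policy ID monthly-export and one of the index names from the logs above:)

GET _slm/policy/monthly-export
GET xxx-production_metrics-raw_2020.08.04/_ilm/explain

The first shows the policy's last_success and next_execution times, the second the time at which the index entered its current phase/step.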

costin added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management) label on Sep 18, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

elasticmachine added the Team:Data Management (Meta label for data/management team) label on Sep 18, 2020
danielmitterdorfer removed the needs:triage (Requires assignment of a team area label) label on Oct 13, 2020
@stefnestor (Contributor) commented:

FWIW, this potentially relates to #69642 and #62164 when using date math in the SLM config.indices?
