[BUG] 2.16.0 Auto-expand replicas causes cluster yellow state when cluster nodes are above low watermark #15919

Open
sandervandegeijn opened this issue Sep 12, 2024 · 1 comment

sandervandegeijn commented Sep 12, 2024

Describe the bug

We have encountered this bug multiple times, also on versions before 2.16.0.

When cluster nodes are already above the low watermark, so that new indices are routed to the remaining nodes, the cluster can end up in a yellow state. The cause seems to be the default setting on system indices, auto_expand_replicas: "1-all": the cluster tries to allocate replicas to nodes that cannot accept more data because of the watermark.
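
For reference, the effective value on the security index can be checked with the index settings API (a minimal sketch; the exact response shape may vary):

GET .opendistro_security/_settings/index.auto_expand_replicas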

This seems to happen when Kubernetes reschedules OpenSearch nodes onto different k8s compute nodes.

Cluster state:

{
  "cluster_name": "xxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 17,
  "number_of_data_nodes": 12,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 2631,
  "active_shards": 3140,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.90454979319122
}
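
The state above is the cluster health API output; requesting per-index detail shows which indices hold the three unassigned shards (a sketch):

GET _cluster/health?level=indices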

It tries to allocate the replicas, but every node rejects the allocation:

{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-09-12T14:45:27.211Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "02CeBVQKTa2lD1Qx0GAS3Q",
      "node_name": "opensearch-data-nodes-hot-6",
      "transport_address": "10.244.33.33:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.175061087167675%]"
        }
      ]
    },
    {
      "node_id": "Balhhxf2T2uNpUP6rq88Ag",
      "node_name": "opensearch-data-nodes-hot-2",
      "transport_address": "10.244.86.36:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.615515861288957%]"
        }
      ]
    },
    {
      "node_id": "DppvPjxgR0u8CVQVyAX0UA",
      "node_name": "opensearch-data-nodes-hot-7",
      "transport_address": "10.244.97.29:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[DppvPjxgR0u8CVQVyAX0UA], [R], s[STARTED], a[id=Q9PoLV1wRGumidM22EKveQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.463841799195983%]"
        }
      ]
    },
    {
      "node_id": "LQSYXzHbTfqowAOj3nrU3w",
      "node_name": "opensearch-data-nodes-hot-4",
      "transport_address": "10.244.70.30:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [7.916677463242952%]"
        }
      ]
    },
    {
      "node_id": "Ls8ptyo7ROGtFeO8hY5c5Q",
      "node_name": "opensearch-data-nodes-hot-9",
      "transport_address": "10.244.54.37:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[Ls8ptyo7ROGtFeO8hY5c5Q], [R], s[STARTED], a[id=j_FrjkN7R0aCEokKa4tjCA]]"
        }
      ]
    },
    {
      "node_id": "O_CCkTbmRtiuJU3cV93EaA",
      "node_name": "opensearch-data-nodes-hot-1",
      "transport_address": "10.244.83.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.445263138130201%]"
        }
      ]
    },
    {
      "node_id": "OfBmEaQsSsuJtJ4TKadLnQ",
      "node_name": "opensearch-data-nodes-hot-10",
      "transport_address": "10.244.37.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.538695394244522%]"
        }
      ]
    },
    {
      "node_id": "RC5KMwpWRMCVrGaF_7oGBA",
      "node_name": "opensearch-data-nodes-hot-0",
      "transport_address": "10.244.99.67:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.185368398769644%]"
        }
      ]
    },
    {
      "node_id": "S_fk2yqhQQuby8HM4hJXVA",
      "node_name": "opensearch-data-nodes-hot-8",
      "transport_address": "10.244.45.64:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [10.432421573093784%]"
        }
      ]
    },
    {
      "node_id": "_vxbOtloQmapzz0DbXBsjA",
      "node_name": "opensearch-data-nodes-hot-5",
      "transport_address": "10.244.79.58:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[_vxbOtloQmapzz0DbXBsjA], [P], s[STARTED], a[id=hY9WcHR-S_6TN3kTj4NZJA]]"
        }
      ]
    },
    {
      "node_id": "pP5muAyTSA2Z45yO8Ws0VA",
      "node_name": "opensearch-data-nodes-hot-3",
      "transport_address": "10.244.101.66:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.424146099675534%]"
        }
      ]
    },
    {
      "node_id": "zRdO9ndKSbuJ97t77-OLLw",
      "node_name": "opensearch-data-nodes-hot-11",
      "transport_address": "10.244.113.26:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[zRdO9ndKSbuJ97t77-OLLw], [R], s[STARTED], a[id=O7z4RvkiQXGMcfhRSPm8lQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.883587901703455%]"
        }
      ]
    }
  ]
}
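
For reference, an explanation like the one above can be reproduced with the allocation explain API (index and shard values taken from this report):

GET _cluster/allocation/explain
{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false
}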

So with 12 data nodes, it tries to allocate 11 replicas when a node restarts, but that fails because several nodes are above the low watermark (why not distribute the free space more evenly?). The only solutions seem to be lowering the auto-expand setting or manually redistributing shards across the nodes to even out disk usage.
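
A rough sketch of both workarounds (the index name in the reroute command is a placeholder, the node names are taken from the storage table below, and system index protection may block or the security plugin may revert changes to .opendistro_security):

# Cap auto-expansion instead of "1-all"
PUT .opendistro_security/_settings
{
  "index": {
    "auto_expand_replicas": "1-2"
  }
}

# Manually move a shard off a nearly full node to even out disk usage
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "some-large-index",
        "shard": 0,
        "from_node": "opensearch-data-nodes-hot-4",
        "to_node": "opensearch-data-nodes-hot-9"
      }
    }
  ]
}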

Cluster storage state:

n                            id   v      r rp      dt      du   dup hp load_1m load_5m load_15m
opensearch-master-nodes-0    twM5 2.16.0 m 60   9.5gb 518.1mb  5.32 56    1.74    1.38     1.17
opensearch-data-nodes-hot-5  _vxb 2.16.0 d 96 960.1gb 649.6gb 67.66 41    1.14    1.14     1.10
opensearch-master-nodes-2    nQD7 2.16.0 m 59   9.5gb 518.1mb  5.32 37    1.15    1.06     1.09
opensearch-data-nodes-hot-11 zRdO 2.16.0 d 92 960.1gb   859gb 89.47 31    2.33    3.13     3.62
opensearch-data-nodes-hot-6  02Ce 2.16.0 d 90 960.1gb 848.5gb 88.38 62    1.40    1.40     1.60
opensearch-data-nodes-hot-4  LQSY 2.16.0 d 95 960.1gb 886.5gb 92.33 35    2.33    2.40     2.56
opensearch-data-nodes-hot-10 OfBm 2.16.0 d 96 960.1gb 861.7gb 89.75 58    3.69    4.27     4.21
opensearch-ingest-nodes-0    bx4Z 2.16.0 i 65    19gb  1016mb  5.21 73    2.31    2.60     2.54
opensearch-data-nodes-hot-3  pP5m 2.16.0 d 61 960.1gb 869.6gb 90.58 35    1.71    1.64     1.89
opensearch-data-nodes-hot-9  Ls8p 2.16.0 d 95 960.1gb 643.2gb 66.99 27    0.72    1.00     1.02
opensearch-data-nodes-hot-7  Dppv 2.16.0 d 91 960.1gb 842.4gb 87.74 53    1.29    1.87     1.74
opensearch-data-nodes-hot-2  Balh 2.16.0 d 63 960.1gb 867.8gb 90.38 31    1.93    1.73     1.45
opensearch-data-nodes-hot-8  S_fk 2.16.0 d 64 960.1gb 859.9gb 89.57 42    0.66    0.66     0.71
opensearch-data-nodes-hot-1  O_CC 2.16.0 d 89 960.1gb 884.9gb 92.17 11    1.53    1.48     1.33
opensearch-data-nodes-hot-0  RC5K 2.16.0 d 85 960.1gb 844.8gb 87.99 62    0.77    0.90     1.10
opensearch-master-nodes-1    r70_ 2.16.0 m 58   9.5gb 518.1mb  5.32 58    0.76    0.88     1.05
opensearch-ingest-nodes-1    NX1N 2.16.0 i 61    19gb  1016mb  5.21 17    0.49    1.12     1.77
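
For reference, a node overview like the one above can be produced with the cat nodes API and explicit columns; the exact header list used here is an assumption:

GET _cat/nodes?v&h=name,id,version,node.role,ram.percent,disk.total,disk.used,disk.used_percent,heap.percent,load_1m,load_5m,load_15m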

Related component

Storage

To Reproduce

1. Cluster is nearing capacity (good from a storage cost perspective)
2. Cluster gets rebooted, or individual nodes get rebooted
3. Cluster goes to a yellow state

Expected behavior

Rebalance shards proactively based on the storage usage of the nodes.
System indices could take priority, ignoring the low/high watermark, until cluster disk usage really becomes critical.

Additional Details

Plugins
Default

Screenshots
N/A

Host/Environment (please complete the following information):
Default 2.16.0 docker images

Additional context
N/A

sandervandegeijn added the bug and untriaged labels on Sep 12, 2024
github-actions bot added the Storage label on Sep 12, 2024
ashking94 (Member) commented

@sandervandegeijn Thanks for filing this issue, please feel free to submit a pull request.

ashking94 added the ShardManagement:Routing label and removed the Storage label on Sep 19, 2024