
Failure to create OSD (obtaining uniq OSD.id) blocks further provisioning and possibly does not reach desired multiple OSDs per device #14238

Closed
bdowling opened this issue May 17, 2024 · 12 comments

@bdowling
Contributor

bdowling commented May 17, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

While discovery finds new devices and attempts to create OSDs on the devices, there appears to be a race condition.

"exec: stderr: Error EEXIST: entity osd.2 exists but key does not match" -- it felt like given the number of OSDs that were creating at the same time different nodes may have tried to get the same OSD id.

When this happened, the prepare job failed and crashed out. When I came back to look at the results, I found that a number of nodes did not have the full 4 OSDs per device that I have configured, so I am now left with unused space. Subsequent discovery runs do not appear to identify this unused space or try to create the remaining OSDs.

As shown below, this condition also seems to block any new provisioning: a ceph auth entry is created, but the OSD is never actually added to the tree/map, so future attempts to create an OSD keep trying the same (lowest available) OSD id.

Expected behavior:

  1. The script creating the OSDs should handle the key conflict during OSD creation and retry.
  2. The discovery process should identify unused space on existing devices, recognize that the configured OSDs-per-device count has not been reached, and allocate the additional storage.

How to reproduce it (minimal and precise):

Add multiple nodes, with multiple devices each and see if the race condition occurs.

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
 nvme1n1                                                                                               259:5    0 894.3G  0 disk
  └─ceph--eb0de184--ddb6--4fa2--9937--ad897dd93407-osd--block--26a7835c--0105--4b0c--8fc5--f16991b79284 253:0    0 223.6G  0 lvm
  nvme2n1                                                                                               259:6    0     7T  0 disk
  ├─ceph--9bd13d31--4cc2--4da5--950c--6cc40686825f-osd--block--aba445d0--8a06--4316--9b07--9b9f7d89e4af 253:3    0   1.7T  0 lvm
  ├─ceph--9bd13d31--4cc2--4da5--950c--6cc40686825f-osd--block--f6636619--37b3--4da9--8fa6--3870a3683424 253:4    0   1.7T  0 lvm
  ├─ceph--9bd13d31--4cc2--4da5--950c--6cc40686825f-osd--block--f8a40784--0a98--4183--bb60--16298eee1b95 253:5    0   1.7T  0 lvm
  └─ceph--9bd13d31--4cc2--4da5--950c--6cc40686825f-osd--block--12166941--6942--4349--92bd--85f51f57f52c 253:6    0   1.7T  0 lvm
  nvme3n1                                                                                               259:7    0     7T  0 disk
  ├─ceph--99cb45b2--d01e--4f6b--8df5--105b05df2b25-osd--block--3067192f--c765--4603--b30b--bc649376a60c 253:1    0   1.7T  0 lvm
  └─ceph--99cb45b2--d01e--4f6b--8df5--105b05df2b25-osd--block--382035ea--9895--4a80--af48--97451bbce512 253:2    0   1.7T  0 lvm
  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu
  • Rook version (use rook version inside of a Rook Pod): rook/ceph:v1.13.4 (helm rook-ceph-v1.13.7)
  • Storage backend version (e.g. for ceph do ceph -v): 18.2.2. (helm rook-ceph-cluster-v1.13.7)
  • Kubernetes version (use kubectl version): v1.28.4
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): IKS
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
  cluster:
    id:     3aaaed8f-7f05-4a4e-a780-70d0b19416fb
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 14h)
    mgr: a(active, since 43m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 231 osds: 212 up (since 39m), 212 in (since 39m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 16489 pgs
    objects: 152.89k objects, 493 GiB
    usage:   1.7 TiB used, 139 TiB / 141 TiB avail
    pgs:     16489 active+clean
@BlaineEXE
Member

BlaineEXE commented May 17, 2024

I have seen logs with text similar to "OSD ID exists and does not match my key" when an old OSD is still present on a device that wasn't fully wiped after a previous Rook/Ceph deployment. It's likely that you need to run sgdisk --zap-all on the disk in question.

I suspect this may also resolve the unused space issues you're seeing.
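
For reference, a minimal wipe sketch along the lines of the suggestion above. The device path is a placeholder and the commands are destructive; treat this as an illustration rather than the official Rook teardown procedure:

  # Placeholder device -- replace with the disk that held the old OSD
  DISK=/dev/nvme1n1
  # Destroy GPT/MBR structures and any leftover LVM/Ceph signatures
  sgdisk --zap-all "$DISK"
  wipefs --all "$DISK"
  # Optionally zero the start of the disk so ceph-volume sees it as clean
  dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync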

@bdowling
Contributor Author

bdowling commented May 17, 2024

I'll check, but I'm not sure that is the same issue here. These are all new hosts that have never had OSDs on these disks; they did have LVM volumes that were cleared before running discovery. We did remove a number of old OSDs from other nodes recently, as we are trying to migrate storage to new nodes.

@bdowling
Contributor Author

I may have more issues going on; I found a number of OSDs in a CrashLoop state reporting:

debug 2024-05-17T23:29:41.506+0000 7f048da93700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]

What I discovered is that the key in the keyring file for these OSDs (/var/lib/rook/rook-ceph/*/keyring, with corresponding whoami files) does not match what 'ceph auth get osd.[id]' returns. After manually fixing these keys, the OSDs are able to start; but the keyring files are replaced at some point with the incorrect keys again (restarting the OSD pod seems to trigger this).

I am not sure exactly which process creates these keyring files or why they contain the incorrect key.
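
For illustration, a rough check that compares each local keyring against the key the monitors hold. It assumes the data-dir layout shown above and a standard keyring file format, and needs to run somewhere that has both the OSD data dirs and a working ceph CLI with admin credentials; it is a sketch, not a supported tool:

  for dir in /var/lib/rook/rook-ceph/*/; do
    [ -f "$dir/whoami" ] || continue
    id=$(cat "$dir/whoami")
    # Key recorded in the local keyring file
    local_key=$(awk '/key = / {print $3}' "$dir/keyring")
    # Key the cluster actually holds for this OSD
    cluster_key=$(ceph auth get-key "osd.$id")
    [ "$local_key" = "$cluster_key" ] || echo "osd.$id: local keyring does not match ceph auth ($dir)"
  done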

@bdowling
Contributor Author

I found that we have 31 duplicate OSD IDs on active storage nodes; a sample:

  ng-xdzv-ef254 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_06040623-9669-47d0-83ec-2c3b796474a5/whoami:429
  ng-xdzv-2a0c7 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_c28193f2-f527-43d6-aa1f-14287d5a13ec/whoami:429

  ng-xdzv-65657 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_e39861f0-89b7-4c80-8b32-e23653d8f651/whoami:431
  ng-xdzv-da6f6 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_359674c3-74d0-4836-846c-a22d2249efc9/whoami:431

  ng-xdzv-65657 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_b649ea39-c19d-4577-9388-a2a108804df7/whoami:432
  ng-xdzv-2a0c7 /var/lib/rook/rook-ceph/3aaaed8f-7f05-4a4e-a780-70d0b19416fb_293a8678-d9a3-4227-9df6-562284b9bdbf/whoami:432
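
A quick way to spot these, assuming the whoami values have been collected into a file in the node path:id format shown above (the filename here is hypothetical):

  awk -F: '{print $NF}' whoami-dump.txt | sort | uniq -dc   # IDs claimed by more than one OSD data dir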

@bdowling
Contributor Author

FWIW, here is the error that occurs when creating new OSDs in the prepare job:

2024-05-21 03:24:59.707950 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 4 /dev/nvme5n1
2024-05-21 03:25:00.324811 D | exec: --> DEPRECATION NOTICE
2024-05-21 03:25:00.324835 D | exec: --> You are using the legacy automatic disk sorting behavior
2024-05-21 03:25:00.324838 D | exec: --> The Pacific release will change the default to --no-auto
2024-05-21 03:25:00.324840 D | exec: --> passed data devices: 1 physical, 0 LVM
2024-05-21 03:25:00.324842 D | exec: --> relative data size: 0.25
2024-05-21 03:25:00.324843 D | exec: Running command: /usr/bin/ceph-authtool --gen-print-key
2024-05-21 03:25:00.324846 D | exec: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 89db0ce4-5283-4628-b183-ae28a03a52d9
2024-05-21 03:25:00.324848 D | exec:  stderr: Error EEXIST: entity osd.2 exists but key does not match
2024-05-21 03:25:00.325070 D | exec: Traceback (most recent call last):
2024-05-21 03:25:00.325073 D | exec:   File "/usr/sbin/ceph-volume", line 11, in <module>
2024-05-21 03:25:00.325074 D | exec:     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2024-05-21 03:25:00.325076 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2024-05-21 03:25:00.325078 D | exec:     self.main(self.argv)
2024-05-21 03:25:00.325080 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2024-05-21 03:25:00.325082 D | exec:     return f(*a, **kw)
2024-05-21 03:25:00.325084 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2024-05-21 03:25:00.325086 D | exec:     terminal.dispatch(self.mapper, subcommand_args)
2024-05-21 03:25:00.325088 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2024-05-21 03:25:00.325089 D | exec:     instance.main()
2024-05-21 03:25:00.325091 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
2024-05-21 03:25:00.325092 D | exec:     terminal.dispatch(self.mapper, self.argv)
2024-05-21 03:25:00.325094 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2024-05-21 03:25:00.325095 D | exec:     instance.main()
2024-05-21 03:25:00.325097 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2024-05-21 03:25:00.325099 D | exec:     return func(*a, **kw)
2024-05-21 03:25:00.325100 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 414, in main
2024-05-21 03:25:00.325102 D | exec:     self._execute(plan)
2024-05-21 03:25:00.325104 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 429, in _execute
2024-05-21 03:25:00.325105 D | exec:     p.safe_prepare(argparse.Namespace(**args))
2024-05-21 03:25:00.325107 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 196, in safe_prepare
2024-05-21 03:25:00.325108 D | exec:     self.prepare()
2024-05-21 03:25:00.325110 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2024-05-21 03:25:00.325112 D | exec:     return func(*a, **kw)
2024-05-21 03:25:00.325113 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 236, in prepare
2024-05-21 03:25:00.325115 D | exec:     self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=self.args.osd_id)
2024-05-21 03:25:00.325116 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 154, in create_id
2024-05-21 03:25:00.325117 D | exec:     raise RuntimeError('Unable to create a new OSD id')
2024-05-21 03:25:00.325119 D | exec: RuntimeError: Unable to create a new OSD id

@bdowling
Contributor Author

bdowling commented May 21, 2024

Can someone provide any insight into where to view/repair the source of this conflict? Rook has a number of rook-ceph-osd-NN deployments that no longer seem to match active OSDs in the cluster, and this is preventing new OSDs from being prepared. For example, I tried to use the purge job to remove the conflicting osd.2 above, but it is not in the osd tree/dump...

2024-05-21 03:37:12.809757 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-05-21 03:37:12.809839 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-05-21 03:37:12.809920 I | clusterdisruption-controller: osd "rook-ceph-osd-424" is down but no node drain is detected
2024-05-21 03:37:12.810020 I | clusterdisruption-controller: osd "rook-ceph-osd-54" is down but no node drain is detected
2024-05-21 03:37:12.810101 I | clusterdisruption-controller: osd "rook-ceph-osd-251" is down but no node drain is detected
2024-05-21 03:37:12.810178 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-05-21 03:37:12.810259 I | clusterdisruption-controller: osd "rook-ceph-osd-255" is down but no node drain is detected
2024-05-21 03:37:12.810338 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-05-21 03:41:04.404666 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 03:41:04.733780 I | cephosd: validating status of osd.2
2024-05-21 03:41:04.733835 C | rookcmd: failed to get osd status for osd 2: not found osd.2 in OSDDump
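
One way to enumerate the stale deployments described above is to compare the OSD IDs that Rook has deployments for against the IDs present in the OSD map. This sketch assumes the rook-ceph namespace, the ceph-osd-id label on OSD deployments, and a rook-ceph-tools toolbox deployment; adjust names to your cluster:

  # OSD IDs that Rook has deployments for
  kubectl -n rook-ceph get deploy -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.labels.ceph-osd-id}{"\n"}{end}' | sort > /tmp/deploy-ids
  # OSD IDs actually present in the OSD map (via the toolbox)
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd ls | sort > /tmp/map-ids
  # Deployments whose OSD is no longer in the map
  comm -23 /tmp/deploy-ids /tmp/map-ids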

@satoru-takeuchi
Member

Can someone provide any insight as to where to view/repair the source of this conflict?

I'm the assignee of this issue. I'll read it carefully tomorrow (I'm currently on PTO).

@bdowling
Contributor Author

I think I discovered two things:

  1. Some of the duplicate OSDs existed on nodes with auth that did not match what was in Ceph, so those OSDs could not start because another node was using the same ID with the correct key.

  2. The complaints during discovery came when it tried to reuse an old OSD id that no longer existed in Ceph but whose 'user' auth still existed. Removing the stale user auth for non-existent OSDs has now allowed it to create new OSDs.

bash-4.4$ for id in $(cat); do ceph auth del osd.$id; done
2       12      54      242     245     247     255     258     298     316     396     412     424
6       20      72      244     246     251     256     288     305     371     406     416     431

However, I still believe there is a race condition in selecting IDs and creating auth for new OSDs. With multiple new nodes/drives being discovered, the prepare jobs are crashing out with the same exec: stderr: Error EEXIST: entity osd.NN exists but key does not match error message. I think it would be great if this ID selection were a more atomic operation.

@bdowling
Contributor Author

FYI, I monitored the provisioning process, and any time I saw an auth conflict I deleted the auth entry, after which the job was able to retry and proceed... Below is an example where 4 separate jobs all tried to assign osd.433. While provisioning some 300 OSDs, there were periods where it conflicted 44 times and I had to go wipe the disks and let it retry. Other times, as below, I was able to stay on top of the auth conflicts and help it along.

while true; do for pod in $(kubectl get pods --no-headers -l app=rook-ceph-osd-prepare |grep -P 'Error|Crash'|awk '{print $1}'); do kubectl logs $pod |grep EEXIST; done; sleep 5; done
2024-05-21 06:52:23.076728 D | exec:  stderr: Error EEXIST: entity osd.433 exists but key does not match
[2024-05-21 06:52:23,072][ceph_volume.process][INFO  ] stderr Error EEXIST: entity osd.433 exists but key does not match
rook-ceph-osd-prepare-ng-xdzv7327xu-37018-bhx9l   0/1     CrashLoopBackOff   5 (30s ago)   5m26s
rook-ceph-osd-prepare-ng-xdzv7327xu-5662e-rcksc   0/1     CrashLoopBackOff   5 (32s ago)   5m
rook-ceph-osd-prepare-ng-xdzv7327xu-947c0-8bxn9   0/1     CrashLoopBackOff   5 (30s ago)   5m55s
rook-ceph-osd-prepare-ng-xdzv7327xu-da6f6-fcdhj   0/1     CrashLoopBackOff   5 (28s ago)   5m32s

Check for auth entries that exist without a corresponding OSD (I know this could be faster and would be more robust using JSON; it's just a quick script):

for osd in $(ceph auth ls |grep ^osd.); do ceph osd tree |grep -q -w $osd || echo "DANGLING AUTH: $osd"; done
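
A slightly more robust JSON-based variant of the same check, assuming ceph auth ls -f json exposes its entries under auth_dump and that jq is available:

  comm -23 \
    <(ceph auth ls -f json | jq -r '.auth_dump[].entity | select(startswith("osd."))' | sort) \
    <(ceph osd ls | sed 's/^/osd./' | sort)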

@bdowling changed the title from "Failure to create OSD on device prevents further multiple OSDs per device" to "Failure to create OSD (obtaining uniq OSD.id) blocks further provisioning and possibly does not reach desired multiple OSDs per device" on May 21, 2024
@satoru-takeuchi
Member

@bdowling Many people have encountered a similar or the same problem. To resolve it, ceph auth del has been used. You hit this problem with high frequency because the discovery daemons try to create OSDs in parallel and you have both many disks and many nodes.

I think it would be great if this ID selection was somehow a more atomic operation.

OSD ID allocation is done by the ceph osd new command. Since Rook can't touch this logic, please open an issue in the Ceph issue tracker if you'd like to make it completely atomic.

As a workaround, disabling the discovery daemon (it is disabled by default) might help you. Could you try the following steps (sketched below)?

  1. Set "ROOK_ENABLE_DISCOVERY_DAEMON" to "false" in the rook-ceph-operator-config configmap.
  2. Restart the operator. The operator will then create osd-prepare pods if necessary.

This doesn't resolve the problem completely. However, I believe it reduces the parallelism of OSD creation and therefore reduces OSD ID conflicts.
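
A minimal sketch of those two steps, assuming the default rook-ceph namespace and operator deployment name:

  kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
    --type merge -p '{"data":{"ROOK_ENABLE_DISCOVERY_DAEMON":"false"}}'
  kubectl -n rook-ceph rollout restart deploy/rook-ceph-operator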

@travisn
Member

travisn commented May 22, 2024

@guits How exactly does ceph-volume allocate the OSD ID? This issue with a large number of OSDs being created in parallel makes it clear that it is not atomic, which causes quite a problem for large clusters.

@bdowling
Contributor Author

I'm going to mark this issue closed for now. In testing, I was unable to get duplicate OSD ids with ceph osd new, so I suspect something else was going on, such as the old auth never being deleted when prior OSDs were purged. I'll revisit if I see this recur.

I went looking for where that code actually does this work in Ceph, but got lost in the indirection trying to find the ceph osd new functions.

e.g. simple testing...

( for i in $(seq 16); do ceph osd new "$(uuidgen)" & done; wait ) > /tmp/osd-ids
sort /tmp/osd-ids | uniq -dc    # no output means no duplicate IDs were handed out

@bdowling closed this as not planned on May 24, 2024