Rook-Ceph

Here I will gather a knowledge base for running Rook-Ceph in Kubernetes.

Some prerequisites for following this KB:

  • Access to the rook toolbox container.
  • You might sometimes have to restart the rook operator container to trigger things; delete its Pod to do this.
  • Never, ever restart a kube node hosting an OSD without first draining it with kubectl drain; that way Rook-Ceph can handle the OSD safely (see the drain example below).
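
For example, something like this around a node reboot (node1 is just a placeholder name):

kubectl drain node1 --ignore-daemonsets --delete-emptydir-data   # older kubectl versions use --delete-local-data
# ...reboot or do maintenance on the node...
kubectl uncordon node1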

Remove an OSD

Presumably the OSD you want to remove is already shown as down in ceph osd tree; it might also have shown autoout in ceph osd status.

First of all, ensure all PGs (Placement Groups) are active+clean in ceph status.
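
For example (the exact PG counts will of course differ per cluster):

ceph status    # look for active+clean on the pgs line
ceph pg stat   # one-line summary of PG states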

If you lose too many OSDs you won’t have anywhere to recover PGs to (in other words, nowhere to move the data), so adding another node just to recover PGs onto has worked for me. It can always be drained and removed later.

Once another OSD is available a recovery will start. Track the progress in ceph osd status, and once it’s done you can finally remove your old OSDs that are down.

It should already be marked out, but if not, according to the docs you can ceph osd out osd.<id>.

This is when backfilling should really start, meaning Ceph starts shifting data around to the other OSDs.
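
A rough sketch of these two steps, assuming the OSD being removed is osd.2:

ceph osd out osd.2   # mark it out if it is not already
ceph -s              # shows recovery/backfill progress
ceph osd status      # per-OSD view while data is moving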

Then you can ceph osd purge <id> --yes-i-really-mean-it to remove the OSD from the list.

Verify with ceph osd tree.
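
Still assuming osd.2 as the example:

ceph osd purge 2 --yes-i-really-mean-it
ceph osd tree   # osd.2 should no longer be listed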

And finally, remove the old OSD Deployments from Kubernetes. This should trigger the operator to create new Deployments within a few minutes, but I always end up restarting the rook-ceph operator Pod manually to trigger it.
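
Something along these lines, assuming the default rook-ceph namespace and Rook’s usual rook-ceph-osd-<id> Deployment naming:

kubectl -n rook-ceph delete deployment rook-ceph-osd-2
# give the operator a few minutes, otherwise nudge it by deleting its Pod
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator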

Now the new OSDs will most likely have old auth configured in Ceph. View that with ceph auth list and compare it with each OSD’s keyring file, located under /var/lib/rook/rook-ceph/<osd-id>/keyring on the host node.

If this is the case, continue to OSD Pod will not start after deleting it on this page and do the manual key import.

OSD Pod will not start after deleting it

You may find an OSD is down, or in autoout status, and to restore it you might want to delete the corresponding OSD Pod and have it re-created; a sort of service restart, Kubernetes style.

But it won’t start back up, maybe because the OSD is missing the correct auth key.

The Pod log output might look like this.

failed to fetch mon config (--no-mon-config to skip)

It needs a correct key to communicate with the monitor (mon).

Let’s say the broken OSD is osd.2 and it’s running on node1, just for simplicity. An OSD can run on any node, so the numbers don’t have to match.

Look at the osd list to find which node it was attached to.

ceph osd status
ID  HOST   USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  node2  83.8G  195G   1       5734     0       0        exists,up
 1  node3  87.0G  192G   0       3275     0       0        exists,up
 2  node1  0      0      0       0        0       0        autoout,exists

Now we must compare the key on the OSD node with the one in the ceph auth list.

Log in to the node1 server. With rook-ceph, the current key is located in /var/lib/rook/rook-ceph/<ceph-node-name>/keyring, where <ceph-node-name> is some unique name for your node.
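
For example, assuming the broken OSD is osd.2:

# on node1: the key the OSD daemon has on disk
cat /var/lib/rook/rook-ceph/<ceph-node-name>/keyring

# in the toolbox: the key Ceph expects for that OSD
ceph auth get osd.2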

If this key does not match the corresponding osd key in the ceph auth list output, then we must import the key from the node keyring file into ceph auth, like this.

ceph auth import -i -
[osd.2]
    key = <secret key from keyring file>
    caps mgr = "allow profile osd"
    caps mon = "allow profile osd"
    caps osd = "allow *"

I’m using stdin here because the command is running in a Pod where I’m not allowed to write to the filesystem, but otherwise the Red Hat KB mentions writing a file with the key info and importing that.

Now you can delete the corresponding OSD Pod again and it will be re-created; the key should now match and the OSD should start.
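
Assuming osd.2 and the default namespace, and that Rook has put its usual ceph-osd-id label on the Pod:

kubectl -n rook-ceph delete pod -l ceph-osd-id=2
kubectl -n rook-ceph get pods -l ceph-osd-id=2 -w   # watch the replacement come up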

Ceph alerts

Ceph has internal alerts that might cause ceph_health_status to report a warning, and the ceph status command might not immediately show why.

The Grafana Ceph dashboard will show that there are alerts, so you can run ceph crash ls to list them and ceph crash info <id> to view one.

Then ceph crash archive-all when you’re done.
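
Roughly like this:

ceph health detail       # should point at RECENT_CRASH if crashes are the cause
ceph crash ls
ceph crash info <id>
ceph crash archive-all   # or ceph crash archive <id> for a single one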