Mastering Ceph OSD Removal


This article is a step-by-step guide to identifying, troubleshooting, and safely removing a faulty disk/OSD in a large-scale Ceph storage cluster integrated with OpenStack. It assumes basic familiarity with Ceph commands and requires access to the Ceph management node and the storage nodes.
Step 1: Log In to the Ceph Management Node
SSH into the Ceph management node as root or a privileged user.
Step 2: Check Overall Ceph Cluster Health
Monitor the cluster status in real time to identify issues such as degraded PGs or down OSDs.
Run:
watch ceph -s
Look for warnings such as HEALTH_WARN related to OSDs or PGs. Note any OSDs marked as down and any PGs reported as inconsistent or degraded.
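If the status shows problems, a couple of additional read-only checks help narrow down which OSDs are affected before digging into logs (a quick sketch; the state filter on ceph osd tree assumes a reasonably recent Ceph release):
ceph osd stat        # summary of how many OSDs are up/in versus down/out
ceph osd tree down   # show only OSDs currently marked down, together with their hosts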
Step 3: Identify Disks with Read Errors or Inconsistent PGs
Search Ceph logs for read errors on the management node:
zgrep "read error" /var/log/ceph/ceph.log /var/log/ceph/ceph.log.1*
If no results appear, check for inconsistent PGs and associated OSDs:
ceph health detail
This will list problematic PGs (e.g., "PG 1.2 is inconsistent") and the OSDs involved (e.g., osd.5, osd.10). Note the OSD ID (e.g., 5 for osd.5) and PG IDs for later repair.
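To see exactly which objects and shards are flagged inside an inconsistent PG, the rados tool can dump the scrub results. This is an optional sketch; <pool_name> and <pg_id> are placeholders, and the commands only return data if the PG has been scrubbed recently:
rados list-inconsistent-pg <pool_name>                    # list inconsistent PGs in a given pool
rados list-inconsistent-obj <pg_id> --format=json-pretty  # per-object scrub errors for that PG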
Step 4: Locate the Problematic Disk
For each problematic OSD (replace $OSD_ID with the actual number, e.g., 5):
ceph osd find $OSD_ID
This outputs the host (node) where the OSD resides, along with details such as its address and CRUSH location.
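If you also want to confirm which physical device backs the OSD before logging in to the node, ceph osd metadata reports what the OSD daemon registered at startup. Treat the field names as approximate, since they vary slightly between releases and object stores:
ceph osd metadata $OSD_ID | egrep "hostname|devices"   # hostname plus the backing device name(s), e.g. sda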
Step 5: SSH to the Affected Ceph Storage Node
Log in to the node identified in Step 4.
Check kernel logs for XFS or I/O errors (these often correlate with the faulty disk):
zgrep "end_request" /var/log/kern.log /var/log/kern.log.1*
Verify the OSD mount point matches the expected device (replace sda with the device identified in Step 4, e.g., /dev/sda for osd.5):
mount | grep sda
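As an extra cross-check on the storage node itself, you can list the physical disks and the OSDs bound to them. The second command assumes the OSDs were deployed with ceph-volume; treat both as a sketch:
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT   # physical disks with model and serial number
sudo ceph-volume lvm list                    # maps OSD IDs to their underlying devices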
Step 6: Check for Media Errors and Gather Disk Details
Identify the physical slot number (e.g., slot 1 for sda; this varies by hardware). Use storcli64 to inspect the drive:
storcli64 /c0 /eall /s$SLOT_NUMBER show all | egrep "Error|SN"
Replace $SLOT_NUMBER with the actual slot (e.g., 1). Note the Serial Number (SN) and any error counts (e.g., media errors).
If preserved cache exists, clear it forcefully to avoid issues:
storcli64 /call /vall remove preservedcache force
Record the OSD ID, serial number, error counts, and relevant log excerpts; keep this information for the RMA request.
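Independently of the RAID controller, SMART data from the disk itself makes useful supporting evidence for the RMA. This assumes smartmontools is installed and that the controller passes SMART through to the OS; on some MegaRAID setups the drive must instead be addressed through the controller (e.g., with smartctl's megaraid device type):
sudo smartctl -a /dev/sda   # overall health, reallocated/pending sector counts, and the serial number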
Step 7: Verify No Incomplete PGs Before Removal
On the management node, ensure the cluster is stable:
ceph -s | grep incomplete
If any incomplete PGs are reported, resolve them first (e.g., via repairs in Step 9) before proceeding.
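If you need more detail than the grep above, these commands list the affected PGs directly (assuming a reasonably recent release):
ceph pg ls incomplete        # PGs currently in the incomplete state, with their acting OSDs
ceph pg dump_stuck inactive  # PGs stuck inactive, which also block client I/O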
Step 8: Remove the OSD from the Cluster
Removing the target OSD from the cluster takes several sub-steps. First, on the management node, disable scrubs to reduce load during removal:
ceph osd set noscrub
ceph osd set nodeep-scrub
a. SSH to the OSD host and stop the OSD service:
sudo systemctl stop ceph-osd@<osd_id>
This ensures the OSD is down and no longer active.
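A more conservative variation some operators use is to mark the OSD out first and let recovery drain its data before purging; Ceph can confirm when destroying it is safe (Luminous or later). This is optional and differs from the quicker stop-then-purge flow described here; run it on the management node:
ceph osd out <osd_id>                                              # trigger data migration off this OSD
while ! ceph osd safe-to-destroy osd.<osd_id>; do sleep 60; done   # wait until removal cannot cause data loss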
b. Purge the OSD from the Cluster
ceph osd purge <osd_id> --yes-i-really-mean-it
This removes the OSD from the CRUSH map, deletes its auth key, and clears it from the OSD map in one step.
Monitor recovery progress:
watch ceph -s
Wait until backfilling/degraded objects reach a low number (ideally 0). This may take time depending on cluster size and load.
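For a more compact view of recovery than the full status output, the PG summary line is usually enough (a minor convenience; output format varies by release):
watch ceph pg stat   # one-line summary of PG states plus current recovery/backfill throughput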
c. Re-enable scrubs:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Do not remove another drive until recovery completes.
Step 9: Repair Inconsistent PGs
If inconsistent PGs persist (from Step 3), repair them on the management node. Replace $PG_ID with the actual PG ID (e.g., 1.2), and run the repair multiple times if needed:
ceph pg repair $PG_ID
Monitor progress with ceph health detail or watch ceph -s.
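Once a repair has run, a quick way to confirm the inconsistency is actually gone (assuming the PG has been scrubbed again since the repair) is:
ceph health detail | grep $PG_ID    # should return nothing once the PG is healthy again
rados list-inconsistent-obj $PG_ID  # should report an empty inconsistents list after a fresh scrub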