Mastering Ceph OSD Removal


This article is a step-by-step guide to identifying, troubleshooting, and safely removing a faulty disk/OSD in a large-scale Ceph storage cluster integrated with OpenStack. It assumes basic familiarity with Ceph commands and requires access to the Ceph management node and the storage nodes.
Step 1: Log In to the Ceph Management Node
SSH into the Ceph management node as root or a privileged user.
Step 2: Check Overall Ceph Cluster Health
Monitor the cluster status in real time to identify issues such as degraded PGs or down OSDs.
Run:
watch ceph -s
Look for warnings such as HEALTH_WARN related to OSDs or PGs. Note any OSDs marked as down and any PGs reported as inconsistent or degraded.
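If the status shows problems, a couple of additional read-only checks help narrow down which OSDs are affected before digging into logs (a quick sketch; the state filter on ceph osd tree assumes a reasonably recent Ceph release):
ceph osd stat        # summary of how many OSDs are up/in versus down/out
ceph osd tree down   # show only OSDs currently marked down, together with their hosts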
Step 3: Identify Disks with Read Errors or Inconsistent PGs
Search Ceph logs for read errors on the management node:
zgrep "read error" /var/log/ceph/ceph.log /var/log/ceph/ceph.log.1*
If no results appear, check for inconsistent PGs and associated OSDs:
ceph health detail
This will list problematic PGs (e.g., "PG 1.2 is inconsistent") and the OSDs involved (e.g., osd.5, osd.10). Note the OSD ID (e.g., 5 for osd.5) and PG IDs for later repair.
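To see exactly which objects and shards are flagged inside an inconsistent PG, the rados tool can dump the scrub results. This is an optional sketch; <pool_name> and <pg_id> are placeholders, and the commands only return data if the PG has been scrubbed recently:
rados list-inconsistent-pg <pool_name>                    # list inconsistent PGs in a given pool
rados list-inconsistent-obj <pg_id> --format=json-pretty  # per-object scrub errors for that PG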
Step 4: Locate the Problematic Disk
For each problematic OSD (replace $OSD_ID with the actual number, e.g., 5):
ceph osd find $OSD_ID
This outputs the host (node) where the OSD resides, along with details such as its address and CRUSH location.
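If you also want to confirm which physical device backs the OSD before logging in to the node, ceph osd metadata reports what the OSD daemon registered at startup. Treat the field names as approximate, since they vary slightly between releases and object stores:
ceph osd metadata $OSD_ID | egrep "hostname|devices"   # hostname plus the backing device name(s), e.g. sda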
Step 5: SSH to the Affected Ceph Storage Node
Log in to the node identified in Step 4.
Check kernel logs for XFS or I/O errors (these often correlate with the faulty disk):
zgrep "end_request" /var/log/kern.log /var/log/kern.log.1*
Verify the OSD mount point matches the expected device (replace sda with the device identified in Step 4, e.g., /dev/sda for osd.5):
mount | grep sda
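As an extra cross-check on the storage node itself, you can list the physical disks and the OSDs bound to them. The second command assumes the OSDs were deployed with ceph-volume; treat both as a sketch:
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT   # physical disks with model and serial number
sudo ceph-volume lvm list                    # maps OSD IDs to their underlying devices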
Step 6: Check for Media Errors and Gather Disk Details
Identify the physical slot number (e.g., slot 1 for sda; this varies by hardware). Use storcli64 to inspect the drive:
storcli64 /c0 /eall /s$SLOT_NUMBER show all | egrep "Error|SN"
Replace $SLOT_NUMBER with the actual slot (e.g., 1). Note the Serial Number (SN) and any error counts (e.g., media errors).
If preserved cache exists, clear it forcefully to avoid issues:
storcli64 /call /vall remove preservedcache force
Record the OSD ID, serial number, error counts, and relevant log excerpts; keep this information for the RMA request.
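Independently of the RAID controller, SMART data from the disk itself makes useful supporting evidence for the RMA. This assumes smartmontools is installed and that the controller passes SMART through to the OS; on some MegaRAID setups the drive must instead be addressed through the controller (e.g., with smartctl's megaraid device type):
sudo smartctl -a /dev/sda   # overall health, reallocated/pending sector counts, and the serial number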
Step 7: Verify No Incomplete PGs Before Removal
On the management node, ensure the cluster is stable:
ceph -s | grep incomplete
If any incomplete PGs are reported, resolve them first (e.g., via repairs in Step 9) before proceeding.
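If you need more detail than the grep above, these commands list the affected PGs directly (assuming a reasonably recent release):
ceph pg ls incomplete        # PGs currently in the incomplete state, with their acting OSDs
ceph pg dump_stuck inactive  # PGs stuck inactive, which also block client I/O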
Step 8: Remove the OSD from the Cluster
Removing the target OSD from the cluster takes several sub-steps. First, on the management node, disable scrubs to reduce load during removal:
ceph osd set noscrub
ceph osd set nodeep-scrub
a. SSH to the OSD host and stop the OSD service:
sudo systemctl stop ceph-osd@<osd_id>
This ensures the OSD is down and no longer active.
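A more conservative variation some operators use is to mark the OSD out first and let recovery drain its data before purging; Ceph can confirm when destroying it is safe (Luminous or later). This is optional and differs from the quicker stop-then-purge flow described here; run it on the management node:
ceph osd out <osd_id>                                              # trigger data migration off this OSD
while ! ceph osd safe-to-destroy osd.<osd_id>; do sleep 60; done   # wait until removal cannot cause data loss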
b. Purge the OSD from the Cluster
ceph osd purge <osd_id> --yes-i-really-mean-it
This removes the OSD from the CRUSH map, deletes its auth key, and clears it from the OSD map in one step.
Monitor recovery progress:
watch ceph -s
Wait until backfilling/degraded objects reach a low number (ideally 0). This may take time depending on cluster size and load.
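For a more compact view of recovery than the full status output, the PG summary line is usually enough (a minor convenience; output format varies by release):
watch ceph pg stat   # one-line summary of PG states plus current recovery/backfill throughput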
c. Re-enable scrubs:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Do not remove another drive until recovery completes.
Step 9: Repair Inconsistent PGs
If inconsistent PGs persist (from Step 3), repair them on the management node. Replace $PG_ID with the actual PG ID (e.g., 1.2), and run the repair multiple times if needed:
ceph pg repair $PG_ID
Monitor progress with ceph health detail or watch ceph -s.
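Once a repair has run, a quick way to confirm the inconsistency is actually gone (assuming the PG has been scrubbed again since the repair) is:
ceph health detail | grep $PG_ID    # should return nothing once the PG is healthy again
rados list-inconsistent-obj $PG_ID  # should report an empty inconsistents list after a fresh scrub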