HON’s Wiki # Ceph

Contents

Resources

Info

Guidelines

Usage

Failure Handling

Down + peering:

The placement group is offline because an OSD is unavailable and is blocking peering.

  1. Find out which OSD is blocking peering: ceph pg <pg> query
  2. Try to restart the blocked OSD.
  3. If restarting didn’t help, mark the OSD as lost: ceph osd lost <osd> [--yes-i-really-mean-it]
    • No data loss should occur if using an appropriate replication factor.
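
Reading the blocking OSD out of the query output by eye can be tedious, so a small filter may help. The following is a minimal sketch, not part of Ceph: peering_blockers is a hypothetical helper name, and it assumes the pretty-printed JSON layout that ceph pg <pg> query typically emits, where blockers are listed as "osd": <id> entries inside a "peering_blocked_by" array.

```sh
# Sketch: list the OSD IDs that block peering for a PG.
# Reads the pretty-printed JSON of `ceph pg <pg> query` on stdin.
# Assumption: blockers appear as "osd": <id> entries inside a
# "peering_blocked_by" array; other layouts are not handled.
peering_blockers() {
    awk '
        /"peering_blocked_by"/ { in_list = 1 }     # entering the array
        in_list && /"osd":/ {
            id = $0
            sub(/.*"osd":[ \t]*/, "", id)          # keep what follows the key
            sub(/[^0-9].*/, "", id)                # keep only the leading digits
            if (id != "") print id
        }
        in_list && /\]/ { in_list = 0 }            # array closed
    '
}
```

Possible usage: ceph pg <pg> query | peering_blockers. The text matching is deliberately simple; a JSON-aware tool would be more robust if one is installed.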

Active degraded (X objects unfound):

Data loss has occurred, but metadata about the missing objects still exists.

  1. Check the hardware.
  2. Identify object names: ceph pg <pg> query
  3. Check which images the objects belong to: ceph pg <pg> list_missing (renamed to list_unfound in newer Ceph releases)
  4. Either restore or delete the lost objects: ceph pg <pg> mark_unfound_lost <revert|delete>
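
To map unfound objects back to RBD images, it can help to reduce the object names to their block-name prefixes, which can then be compared with the block_name_prefix line printed by rbd info <pool>/<image>. The sketch below assumes the objects are RBD data objects named rbd_data.<image id>.<16 hex digits>; rbd_prefixes is a hypothetical helper name, and extracting the object names from the JSON output is left to the reader.

```sh
# Sketch: reduce RBD data object names to unique block-name prefixes.
# Reads one object name per line on stdin.
# Assumption: names look like rbd_data.<image id>.<16 hex digits>;
# anything else (headers, non-RBD objects) is silently skipped.
rbd_prefixes() {
    sed -n 's/^\(rbd_data\..*\)\.[0-9a-f]\{16\}$/\1/p' | sort -u
}
```

Each printed prefix should correspond to at most one image in the pool.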

Inconsistent:

Usually combined with other states, and may come up during scrubbing. It is often an early indicator of faulty hardware, so take note of which disk is involved.

  1. Find inconsistent PGs: ceph pg dump pgs_brief | grep -i inconsistent
    • Alternatively: rados list-inconsistent pg <pool>
  2. Repair the PG: ceph pg repair <pg>
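
The grep in step 1 prints whole rows; when the PG IDs are to be fed into ceph pg repair, a filter that prints only the IDs may be handier. This is a minimal sketch assuming the usual pgs_brief column order, with the PG ID in the first column and the state in the second; inconsistent_pgs is a hypothetical helper name.

```sh
# Sketch: print the IDs of PGs whose state includes "inconsistent".
# Reads the output of `ceph pg dump pgs_brief` on stdin.
# Assumption: column 1 is the PG ID and column 2 is the state string.
inconsistent_pgs() {
    awk '$2 ~ /inconsistent/ { print $1 }'
}
```

Possible usage: ceph pg dump pgs_brief | inconsistent_pgs | while read -r pg; do ceph pg repair "$pg"; done. Reviewing the list before starting any repair is advisable, since inconsistencies often point at a failing disk.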

OSD Replacement

  1. Stop the daemon: systemctl stop ceph-osd@<id>
    • Check: systemctl status ceph-osd@<id>
  2. Destroy OSD: ceph osd destroy osd.<id> [--yes-i-really-mean-it]
    • Check: ceph osd tree
  3. Remove OSD from CRUSH map: ceph osd crush remove osd.<id>
  4. Wait for rebalancing: ceph -s [-w]
  5. Remove the OSD: ceph osd rm osd.<id>
    • Check that it’s unmounted: lsblk
    • Unmount it if not: umount <dev>
  6. Replace the physical disk.
  7. Zap the new disk: ceph-disk zap <dev> (on releases where ceph-disk has been removed: ceph-volume lvm zap <dev>)
  8. Create new OSD: pveceph osd create <dev> [options] (Proxmox VE)
    • Optionally specify any WAL or DB devices.
    • See PVE: pveceph(1).
    • Without PVE’s pveceph(1), a longer series of manual steps is required.
    • Check that the new OSD is up: ceph osd tree
  9. Start the OSD daemon: systemctl start ceph-osd@<id>
  10. Wait for rebalancing: ceph -s [-w]
  11. Check the health: ceph health [detail]
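
Because the removal steps are destructive and easy to mistype, one option is to print them for a given OSD ID and review them before running anything. The following sketch only echoes the commands from steps 1 to 5 above and executes nothing; osd_removal_plan is a hypothetical helper name, not a Ceph or PVE tool.

```sh
# Sketch: print (not run) the removal commands for one OSD, in the
# order used by the steps above. Accepts a bare numeric OSD ID.
osd_removal_plan() {
    id=${1:-}
    case $id in
        ''|*[!0-9]*)
            echo "usage: osd_removal_plan <numeric-osd-id>" >&2
            return 1
            ;;
    esac
    printf '%s\n' \
        "systemctl stop ceph-osd@${id}" \
        "ceph osd destroy osd.${id} --yes-i-really-mean-it" \
        "ceph osd crush remove osd.${id}" \
        "ceph -s" \
        "ceph osd rm osd.${id}"
}
```

Possible usage: osd_removal_plan 7. The ceph -s line stands for the rebalancing wait in step 4, so the commands are meant to be run one at a time rather than piped into a shell.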
