Ceph
Info
- Distributed storage for HA.
- Redundant and self-healing without any single point of failure.
- The Ceph Storage Cluster consists of:
- Monitors (typically one per node) for monitoring the state of the cluster and its nodes.
- Managers (at least two for HA) for serving metrics and statuses to users and external services.
- OSDs (object storage daemons) (one per disk) for handling storage of data, replication, etc.
- Metadata Servers (MDSes) for storing metadata needed for POSIX file systems (CephFS) to function properly and efficiently.
- At least three monitors are required for HA, because of quorum.
- Each node connects directly to OSDs when handling data.
- Pools consist of a number of placement groups (PGs) and OSDs, where each PG uses a number of OSDs.
- Replication factor (aka size):
- Replication factor n/m (e.g. 3/2) means replication factor n with minimum replication factor m. One of them is often omitted.
- The replication factor specifies how many copies of the data will be stored.
- The minimum replication factor specifies how many OSDs must have received the data before the write is considered successful and the write operation unblocks.
- Replication factor n means the data will be stored on n different OSDs/disks on different nodes, so that n-1 nodes may fail without losing data. (See the example after this list for getting and setting these values on a pool.)
- When an OSD fails, Ceph will try to rebalance the data (with replication factor over 1) onto other OSDs to regain the correct replication factor.
- A PG must have state active in order to be accessible for RW operations.
- The number of PGs in an existing pool can be increased but not decreased.
- Clients only interact with the primary OSD in a PG.
- The CRUSH algorithm is used for determining storage locations based on hashing the pool and object names. It avoids having to index file locations.
- BlueStore (default OSD back-end):
- Creates two partitions on the disk: One for metadata and one for data.
- The metadata partition uses an XFS FS and is mounted to `/var/lib/ceph/osd/ceph-<osd-id>`.
- The metadata file `block` points to the data partition.
- The metadata file `block.wal` points to the journal device/partition if it exists (it does not by default).
- Separate OSD WAL/journal and DB devices may be set up, typically when using HDDs or a mix of HDDs and SSDs.
- One OSD WAL device can serve multiple OSDs.
- OSD WAL devices should be sized according to how much data they should “buffer”.
- OSD DB devices should be at least 4% as large as the backing OSDs. If they fill up, they will spill onto the OSDs and reduce performance.
- If the fast storage space is limited (e.g. less than 1GB), use it as an OSD WAL. If it is large, use it as an OSD DB.
- Using a DB device will also provide the benefits of a WAL device, as the journal is always placed on the fastest device.
- Losing an OSD WAL/DB device is equivalent to losing all the OSDs using it. (For the older Filestore back-end, it used to be possible to recover it.)
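As a sketch of the replication settings described above, the replication factor and minimum replication factor map to the pool properties size and min_size. The pool name somepool is a placeholder, not from the original notes:

```sh
# Show the current replication factor (size) and minimum replication factor (min_size).
ceph osd pool get somepool size
ceph osd pool get somepool min_size

# Set replication factor 3 with minimum 2 (i.e. 3/2 as described above).
ceph osd pool set somepool size 3
ceph osd pool set somepool min_size 2
```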
Guidelines
- Nodes:
- 3+ required (1 can fail), 4+ recommended (2 can fail).
- CPU:
- MDSes and, to some extent, OSDs are CPU intensive, but managers and monitors are not.
- RAM:
- Depends, check the docs. More is better.
- The recommended target for each OSD is 4GB. 2GB may work, but any less may cause extremely low performance.
- Disks:
- Recommended minimum disk size is 1TB.
- Benchmark the drives before using them. See the docs.
- Network:
- Use a separate, isolated physical network for internal cluster traffic between nodes.
- Consider using 10G or higher with a spine-leaf topology.
- Disk setup:
- SAS/SATA drives should have 1 OSD each, but NVMe drives may yield better performance if using multiple.
- Use a replication factor of at least 3/2.
- Run OSes, OSD data and OSD journals on separate drives.
- Local, fast SSDs may be used for CephFS metadata pools, while keeping the file contents on the “main pool”.
- Consider disabling drives' HW write caches; it might increase performance with Ceph.
- Pool PG count (an example of applying these numbers follows after this list):
- <5 OSDs: 128
- 5-10 OSDs: 512
- 10-50 OSDs: 4096
- >50 OSDs: See [pgcalc](https://ceph.com/pgcalc/).
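A minimal sketch of applying the PG-count guideline when creating a pool. The pool name and OSD count are assumptions for illustration only:

```sh
# Example: a cluster with 6 OSDs falls in the 5-10 OSD range, so use 512 PGs.
ceph osd pool create somepool 512

# Verify the resulting PG count.
ceph osd pool get somepool pg_num
```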
Usage
- General:
- List pools:
rados lspools
or ceph osd lspools
- Show utilization:
rados df
ceph df [detail]
ceph osd df
- Show health and status:
ceph status
ceph health [detail]
ceph osd stat
ceph osd tree
ceph mon stat
ceph osd perf
ceph osd pool stats
ceph pg dump pgs_brief
- Pools:
- Create:
ceph osd pool create <pool> <pg-num>
- Delete:
ceph osd pool delete <pool> [<pool> --yes-i-really-really-mean-it]
- Rename:
ceph osd pool rename <old-name> <new-name>
- Make or delete snapshot:
ceph osd pool <mksnap|rmsnap> <pool> <snap>
- Set or get values:
ceph osd pool <set|get> <pool> <key> [<value>]
- Set quota:
ceph osd pool set-quota <pool> [max_objects <count>] [max_bytes <bytes>]
- Interact with pools directly using RADOS:
- Ceph is built on top of RADOS.
- List files:
rados -p <pool> ls
- Put file:
rados -p <pool> put <name> <file>
- Get file:
rados -p <pool> get <name> <file>
- Delete file:
rados -p <pool> rm <name>
- Manage RBD (Rados Block Device) images:
- Images are spread over multiple objects.
- List images:
rbd -p <pool> ls
- Show usage:
rbd -p <pool> du
- Show image info:
rbd info <pool/image>
- Create image:
rbd create <pool/image> --object-size=<obj-size> --size=<img-size>
- Export image to file:
rbd export <pool/image> <file>
- Mount image: TODO (one possible approach is sketched after this list).
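One possible approach to the mount TODO above, assuming the kernel RBD client (krbd) is available; the pool and image names are placeholders:

```sh
# Map the image to a local block device (udev also creates /dev/rbd/<pool>/<image>).
rbd map <pool>/<image>

# Create a file system the first time only (this erases the image) and mount it.
mkfs.ext4 /dev/rbd/<pool>/<image>
mount /dev/rbd/<pool>/<image> /mnt

# Unmount and unmap when done.
umount /mnt
rbd unmap /dev/rbd/<pool>/<image>
```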
Failure Handling
Down + peering:
The placement group is offline because an OSD is unavailable and is blocking peering.
ceph pg <pg> query
- Try to restart the blocked OSD (a command sketch follows below).
- If restarting didn’t help, mark OSD as lost:
ceph osd lost <osd>
- No data loss should occur if using an appropriate replication factor.
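A minimal command sketch for the steps above. The PG and OSD IDs are placeholders; run the restart on the node hosting that OSD:

```sh
# Identify the OSD blocking peering.
ceph pg <pg> query

# Try restarting the blocked OSD first.
systemctl restart ceph-osd@<id>

# Only if that did not help, mark the OSD as lost.
ceph osd lost <id> [--yes-i-really-mean-it]
```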
Active degraded (X objects unfound):
Data loss has occurred, but metadata about the missing objects still exists.
- Check the hardware.
- Identify object names:
ceph pg <pg> query
- Check which images the objects belong to:
ceph pg <pg> list_missing
- Either restore or delete the lost objects:
ceph pg <pg> mark_unfound_lost <revert|delete>
Inconsistent:
Typically combined with other states. May come up during scrubbing.
Typically an early indicator of faulty hardware, so take note of which disk it is. (A combined check-and-repair sketch follows below.)
- Find inconsistent PGs:
ceph pg dump pgs_brief | grep -i inconsistent
- Alternatively:
rados list-inconsistent-pg <pool>
- Repair the PG:
ceph pg repair <pg>
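A combined check-and-repair sketch based on the commands above; the rados list-inconsistent-obj step (which takes a PG ID rather than a pool name) is an addition not in the original notes:

```sh
# Find inconsistent PGs.
ceph pg dump pgs_brief | grep -i inconsistent
rados list-inconsistent-pg <pool>

# Optionally inspect which objects within a PG are inconsistent.
rados list-inconsistent-obj <pg>

# Repair the PG and check the health afterwards.
ceph pg repair <pg>
ceph health [detail]
```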
OSD Replacement
- Stop the daemon:
systemctl stop ceph-osd@<id>
- Check:
systemctl status ceph-osd@<id>
- Destroy OSD:
ceph osd destroy osd.<id> [--yes-i-really-mean-it]
- Remove OSD from CRUSH map:
ceph osd crush remove osd.<id>
- Wait for rebalancing:
ceph -s [-w]
- Remove the OSD:
ceph osd rm osd.<id>
- Check that it’s unmounted:
lsblk
- Unmount it if not:
umount <dev>
- Replace the physical disk.
- Zap the new disk:
ceph-disk zap <dev>
- Create new OSD:
pveceph osd create <dev> [options] (Proxmox VE)
- Optionally specify any WAL or DB devices.
- See PVE: pveceph(1).
- Without PVE's pveceph(1), a series of steps is required (a sketch follows after this list).
- Check that the new OSD is up:
ceph osd tree
- Start the OSD daemon:
systemctl start ceph-osd@<id>
- Wait for rebalancing:
ceph -s [-w]
- Check the health:
ceph health [detail]
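Without pveceph(1), a sketch of the manual steps using ceph-volume (the successor to ceph-disk); the device paths are placeholders and the separate DB device is optional:

```sh
# Wipe the new disk (add --destroy to also remove LVM/partition data).
ceph-volume lvm zap /dev/sdX

# Create and activate a new BlueStore OSD, optionally with a separate DB device.
ceph-volume lvm create --data /dev/sdX [--block.db /dev/sdY]

# Check that the new OSD is up and that rebalancing proceeds.
ceph osd tree
ceph -s
```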