Zeyuan Hu's page

"Ceph: A Scalable, High-Performance Distributed File System"


How can we design a distributed file system that:

  • are scalable (e.g., supports hundreds of petabytes and beyond; extreme workload case)
  • flexible to adjust to different workloads

while maintaining good performance?


  • Object-based storage (an abstraction layer between application and hard disks)

    • Instead of hard disks, use intelligent object storage devices (OSD) (= CPU + network interface + local cache with an underlying disk or RAID)
    • OSDs allows clients to read or write byte ranges to much larger (variably sized) named objects (no block-level inteface)
    • Distribute low-level block allocation decisions to device themselves
  • Traditional architecture

    • Contact metadata server (MDS) for metadata ops + contact OSDS to perform file I/O
    • Problems:
      • Single MDS is bottleneck
      • Traditional FS interface becomes legacy: allocation list, inode tables
      • OSDs can do more than just storing data

System Design

  • Design assumptions

    • Large systems are built incrementally
    • Node failures are normal
    • Quality and character of workloads are shifting over time
  • High-level design

    • Replace MDS with MDS cluster with dynamic metadata workload distribution
    • Replace file allocation tables with data distribution function (e.g., CRUSH)
    • Use OSDs for management of object replication, cluster expansion, failure detection, and recovery besides just data storage
  • Architecture

    Ceph Architecture

    • Three components:
      • Client
      • MDS cluster: manages namespace (file names and directories); coordinate security, consistency, and coherence
      • OSDs cluster: stores data + metadata
  • Client

    • Runs on each host executing application code
    • Expose a file system interface to applications
    • Can be linked directly by application or mounted as a FUSE-based file system
    • File I/O:
      • client sends a request to MDS on file open; MDS returns file info + striping strategy (i.e., how th e file is mapped into a sequence of objects) + capability (i.e., permitted operations by clients)
    • Synchronization:
      • Client I/O for the same file access has to be synchronized (i.e., blocked until acked by OSDs)
      • For performance-focus scenaro, allow application to relax consistency by providing POSIX I/O interface extensions


    POSIX semantics require: 1. reads reflect any data previously written 2. writes are atomic (i.e., the result of overlapping, concurrent writes will reflect a particular order of occurrence)

    • Namespace operations:
      • Read operations (readdir, stat) and updates (unlink, chmod) are synchronized by MDS
      • Optimize common metadata access pattern (readdir followed by stat) (trade coherence for performance)
      • Offer POSIX interface extension (statlite) for application that don't need coherent behavior
      • Extend existing interface for performance (stat example)
  • Metadata management

    • No file allocation metadata: object names = file inum + stripe number
    • Objects distributed to OSDs using CRUSH
    • Metadata storage
      • Use journals for MSDs to stream updated metadata to the OSD cluster and for MDS failure recovery
      • Inodes are embedded within directories, allowing the MDS to prefetch entire directories with a single OSD read request
      • Use anchor table to keep the rare inode with multiple hard links globally addressable by inum (avoid large but sparse inode table)
    • Dynamic Subtree Partitioning
      • Adjustable to dynamic workloads (vs. static subtree paritioning) and maintain metadata locality and opportunities for metadata prefetching and storage (vs. hash)
      • How it works:
        • Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay
        • Any operation increments the counter on the affected inode and all if its ancestors up to the root directory
        • MDS load values (i.e., counters) are periodically compared, and appropriately-sized subtrees of the directory hierarchy are migrated to maintain load balancing
    • Traffic control
      • The contents of heavily read directories (e.g., many opens) are selectively replicated across multiple nodes
      • Directories that are particularly large or experiencing a heavy write workload (e.g., many file creations) have their contents hashed by file name across the cluster
      • Clients can contact MDS server directly for rare metadata and are provided different MDS node for accessing popular metadata

    Dynamic Subtree Partitioning

  • Distributed object storage

    • Logically as a single logical object store and namespace: Reliable Autonomic Distributed Object Store (RADOS)
    • Data distribution with CRUSH
      • Distributes new data randomly; migrates a random subsample of existing data to new devices; uniformly redistributes data from removed devices
      • How it works:
        • Ceph maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs
        • PGs are assigned to OSDs using CRUSH (Controlled Replication Under Scalable Hashing): a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSDs upon which to store object replicas
          • To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster
          • Cluster map incorporates clusters physical or logical composition and potential sources of failure
    • Replication
      • Using a variant of primary-copy replication
      • How it works:
        • Data is replicated in terms of PGs, each of which is mapped to an ordered list of \(n\) OSDs (for \(n\)-way replication)
        • Clients send all writes to the first non-failed OSD in an object's PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs
        • After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client
        • Reads are directed at the primary
    • Data safety

      • Two requirements:
        1. Low-latency updates (updates should be visible to other clients asap)
        2. Data is safely replicated after writes
      • How it works:
        • The primary forwards the update to replicas, and replies with an ack after it is applied to all OSDs' in-memory buffer caches, allowing synchronous POSIX calls on the client to return (satisfy requirement 1)
        • A final commit is sent when data is safely committed to disk (satisfy requirement 2)

      Data Safety Mechanism in Ceph

    • Failure detection

      • Each OSD monitors those peers with which it shares PGs (existing replication traffic as liveness signal); an explicit ping is sent if an OSD has not heard from a peer recently
      • An unresponsive OSD will have its responsbility (update serialization, replication) temporarily pass to the next OSD in each of its PGs
      • OSD that cannot be recovered will be out of data distribution and another OSD joins to re-replicate its contents
      • A small cluster of monitors collects failure reports and filters transient or systemic problems centrally (to ensure correct and availability of cluster map)
    • Recovery and cluster updates

      • OSDs maintain a version number for each object and a log of recent changes for each PG
      • On cluster updates, OSD checks local PGs and adjust itself to the new PG groups
      • Version number is used to determine the latest PG version number
      • Log is used to determine the correct PG contents
      • Once OSD has the correct PG membership, each OSD independently updates its data by contacting peers

      PGs distributed to OSDs

    • Object storage with EBOFS

      • OSD manages its local object storage with EBOFS (Extent and B-tree based Object File System)
      • Why new file system (instead of using ext3)?
        • POSIX interface donesn't support atomic data and metadata update transactions
        • High latency for journaling and synchronous writes
      • EBOFS
        • User-space file system
        • Update serialization (for synchronization) is different from on-disk commits (for safety)
        • Supports atomic transactions (writes and attribute updates on multiple objects)
        • update function returns when in-memory cache updated with async callbacks on commit
        • Use B-tree to locate objects, manage block allocation, and index collections (PGs)
        • Use extents (instead of block list) for block allocation (for metadata compact)
        • Free block extents are binned by size and sorted by location for locality and avoid fragmentation
        • Metadata (execpt blokc allocation info) is in-memory
        • Use copy-on-write

Additional Reading

comments powered by Disqus