Introduction to LustreFS, BeeGFS, and CephFS

Linh B. Ngo

Lustre

  • Started as a Carnegie Mellon research project in 1999
  • Lustre = Linux + cluster
  • Funding came from the Advanced Simulation and Computing Program (DOE/NNSA - National Nuclear Security Administration)
  • Acquired by Sun Microsystems in 2008

Lustre

  • Massively parallel distributed file system
    • Thousands of clients
    • Large capacities (55 PB at LLNL)
    • High bandwidth (1.5TB/s at ORNL)
    • POSIX compliance
  • Open source (GPLv2)
  • Used by many of the TOP500 supercomputers

Lustre Features

  • File striping across disks and servers
  • Multiple metadata servers
  • Online file system checking
  • HSM (Hierarchical Storage Management) integration (think backup tapes!)
  • User and group quotas

Lustre Features

  • Pluggable Network Request Scheduler
    • Two queues: one global queue and one per-object queue
    • Orders execution of I/O requests that belong to the same data object by offset, keeping them as close together as possible to reduce disk seeks (see the sketch after this list)
  • RDMA support (Remote Direct Memory Access)
  • High availability
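
  A minimal sketch of the idea behind per-object offset ordering (a simplified Python model, not Lustre's actual NRS code; the request tuples are made-up values):

    # Simplified illustration of offset-ordered request scheduling,
    # not Lustre's actual Network Request Scheduler implementation.
    from collections import defaultdict

    def schedule(requests):
        """requests: list of (object_id, offset, size) tuples."""
        per_object = defaultdict(list)   # one queue per data object
        arrival_order = []               # order in which objects first appear
        for obj, offset, size in requests:
            if obj not in per_object:
                arrival_order.append(obj)
            per_object[obj].append((offset, size))

        ordered = []
        for obj in arrival_order:
            # Within one object, execute requests in offset order so the
            # disk head moves mostly forward instead of seeking back and forth.
            for offset, size in sorted(per_object[obj]):
                ordered.append((obj, offset, size))
        return ordered

    # Interleaved requests for objects A and B are regrouped and offset-sorted.
    print(schedule([("A", 8_388_608, 4096), ("B", 0, 4096),
                    ("A", 0, 4096), ("B", 4096, 4096)]))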

Lustre Features

  • I/O routing between networks
  • Multiple backend storage formats (ldiskfs and zfs)
  • Storage pools
  • CPU partitions
  • Recovery features

Lustre Components

  • MDS: Manages filenames and directories, file stripe locations, locking, ACLs, etc.
  • MDT: Block device used by MDS to store metadata information
  • OSS: Handles I/O requests for file data
  • OST: Block device used by OSS to store file data.
  • MGS: Stores configuration information for one or more Lustre file systems
  • MGT: Block device used by MGS

LNET (Lustre Networking) Transport Layer

  • Provides the underlying communication infrastructure
  • Is an abstraction for underlying networking type
    • TCP/IP
    • InfiniBand
    • Cray high-speed interconnects
  • Allows fine-grained control of data flow

Lustre File Striping

  • Basic properties:
    • stripe_count (the number of OSTs to stripe across)
    • stripe_size (how much data is written to one OST before moving to the next)
  • Can be customized
  • The first stripe_size bytes are written to the first OST, the next stripe_size bytes to the second OST, and so on, wrapping around in round-robin fashion (see the sketch below)
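
  The mapping from a byte offset to an OST can be written out directly. A minimal sketch in Python (generic round-robin arithmetic, not Lustre's internal code; the stripe size and count in the example are made-up values):

    # Maps a byte offset in a striped file to (stripe index, OST index
    # within the file's stripe set, offset inside that stripe).
    def locate(offset, stripe_size, stripe_count):
        stripe_index = offset // stripe_size      # which stripe overall
        ost_index = stripe_index % stripe_count   # which OST in the stripe set
        stripe_offset = offset % stripe_size      # position inside that stripe
        return stripe_index, ost_index, stripe_offset

    # Example: 1 MiB stripes over 4 OSTs.
    for off in (0, 1_048_576, 4_194_304, 4_200_000):
        print(off, locate(off, stripe_size=1_048_576, stripe_count=4))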

I/O Flow in Lustre

  • The I/O request is first sent to the MDS
  • The MDS responds with:
    • Which OSTs are used
    • What the stripe size of the file is
    • etc.
  • The client calculates which OST holds the data
  • The client directly contacts the appropriate OST to read/write data

I/O Recommendation

  • Avoid over-striping
    • Having more stripes does not mean faster access
    • For file sizes of O(1 GB), stripe_count = 1 is recommended (a simple stripe-count heuristic is sketched after this list)
  • Avoid under-striping
    • Very large files with low stripe counts can fill up an OST
    • A low stripe count can cause contention if many clients are reading/writing separate portions of the same file
  • Avoid small I/O requests
  • Know your application's I/O pattern
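
  As an illustration of matching stripe count to file size, a toy heuristic in Python (the 100 GiB threshold and the cap of 32 OSTs are made-up illustration values; only the O(1 GB) rule above comes from the slide):

    # Toy heuristic: small files keep stripe_count = 1, larger files get
    # more stripes so that no single OST fills up. Thresholds are
    # illustrative assumptions, not official Lustre guidance.
    def suggest_stripe_count(file_size_bytes, max_osts=32):
        GiB = 1 << 30
        if file_size_bytes <= 1 * GiB:     # O(1 GB): one stripe is enough
            return 1
        if file_size_bytes <= 100 * GiB:   # medium files: a few stripes
            return min(4, max_osts)
        return max_osts                    # very large / widely shared files

    print(suggest_stripe_count(512 << 20))   # 512 MiB -> 1
    print(suggest_stripe_count(10 << 30))    # 10 GiB  -> 4
    print(suggest_stripe_count(2 << 40))     # 2 TiB   -> 32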

CephFS

  • Presented as a paper at OSDI (Operating Systems Design and Implementation) in 2006
  • Funded by DOE
  • Open source (LGPL 2.1)
  • Motivation
    • As the amount of stored data and the number of read/write requests increase, the workload on metadata servers also increases
    • Increased cost of reliability
    • Limited scalability and performance

Design assumptions

  • Large systems are inevitably built incrementally
  • Node failures are the norm rather than the exception
  • Quality and character of workloads are constantly shifting over time

Key System Overview

  • Decoupled Data and Metadata
  • Dynamic Distributed Metadata Management
  • Reliable Autonomic Distributed Object Storage

Decoupled Data and Metadata

  • Metadata operations (open, rename, etc.) are managed by the MDS cluster
  • I/O requests are managed by the OSD (Object Storage Device) cluster
  • There is no allocation list mapping files to specific object names on the OSDs
    • No lookup table (reduces the workload on the MDS)
    • Stripe objects' names are generated using a data distribution function (see the sketch below)
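
  A minimal sketch of the idea in Python (the naming scheme and hash below are illustrative assumptions; Ceph actually uses CRUSH and its own object-naming scheme):

    # Placement without a lookup table: any client can regenerate an
    # object's name from file metadata and hash it to an OSD.
    import hashlib

    def object_name(inode, stripe_index):
        # Deterministic name derived from the file's inode and stripe index.
        return f"{inode:x}.{stripe_index:08d}"

    def place(name, num_osds):
        # Deterministic hash -> OSD, so no central table is consulted.
        digest = hashlib.sha1(name.encode()).hexdigest()
        return int(digest, 16) % num_osds

    name = object_name(inode=0x1234, stripe_index=3)
    print(name, "->", place(name, num_osds=16))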

Dynamic Distributed Metadata Management

  • Previous work shows that metadata operations make up as much as half of typical file system workloads
  • Dynamic Subtree Partitioning distributes the file system directory hierarchy among MDSs (scales up to hundreds)
  • Dynamic partitioning of the file system allows the workload to be adapted based on current access patterns (see the sketch below)
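
  A toy illustration in Python of routing a path to the MDS that owns the longest matching subtree (the assignment table is a made-up snapshot; the real system measures load and migrates subtrees between MDSs dynamically):

    # Made-up snapshot of subtree-to-MDS assignments.
    subtree_owner = {
        "/":           0,   # root handled by MDS 0
        "/home":       1,   # busy subtree delegated to MDS 1
        "/home/alice": 2,   # hot spot split off to MDS 2
    }

    def mds_for(path):
        # Route to the MDS owning the longest matching subtree prefix.
        best, owner = "", subtree_owner["/"]
        for prefix, mds in subtree_owner.items():
            if path.startswith(prefix) and len(prefix) > len(best):
                best, owner = prefix, mds
        return owner

    print(mds_for("/home/alice/data.txt"))  # -> 2
    print(mds_for("/home/bob/notes.md"))    # -> 1
    print(mds_for("/etc/hosts"))            # -> 0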

Reliable Autonomic Distributed Object Storage

  • The following are delegated to the OSDs
    • Data migration
    • Replication
    • Failure detection
    • Failure recovery
  • Leverages the OSDs' computing resources so they can manage themselves

BeeGFS

  • In-house effort from Fraunhofer ITWM
    • Institute for Industrial Mathematics
    • Software simulation of mathematical models
  • First beta available in 2007
  • Spinoff company: ThinkParQ
  • Open source (GPL v2)

Goals

  • Maximum performance and scalability
  • High level of flexibility
  • Robustness and ease of use

Main Services

  • Management service
  • Storage service
  • Metadata service
  • Client service

Management Service

  • Lightweight
  • Monitoring and administrative tools
  • GUI

Metadata Service

  • Stores information about
    • File/directory paths (abstract)
    • Permission, ownership
    • Location (stripe pattern)
  • Scale-out service
  • POSIX compliance
  • Each MS instance is responsible for an exclusive fraction of the global namespace
  • Storage targets have priority based on storage availability
  • Support storage pools

Storage Service

  • Scale-out service
  • Each instance can handle multiple storage targets
  • Utilizes all available memory on the storage server for caching
  • Stripe size and the number of targets per file are managed by the metadata service (defaults can be customized)
  • One chunk file per storage target per file (see the sketch below)
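
  A small sketch in Python of what "one chunk file per storage target per file" implies for on-disk sizes (generic round-robin arithmetic with made-up example values, not BeeGFS's actual on-disk layout):

    # A file striped round-robin over N targets ends up as N chunk files;
    # their sizes follow from the file size and chunk size.
    def chunk_file_sizes(file_size, chunk_size, num_targets):
        full_chunks, tail = divmod(file_size, chunk_size)
        sizes = [0] * num_targets
        for i in range(full_chunks):
            sizes[i % num_targets] += chunk_size
        sizes[full_chunks % num_targets] += tail
        return sizes

    # Example: a 10 MiB file with 1 MiB chunks on 4 targets.
    print(chunk_file_sizes(10 << 20, 1 << 20, 4))   # [3, 3, 2, 2] MiB, in bytes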

Reliability and Fault-tolerance

  • Buddy Mirroring
  • Storage servers pair up storage targets (internally or across servers) and replicate the targets' contents

Other Features

  • BeeOND (BeeGFS On Demand)
    • Quickly spins up a temporary BeeGFS instance directly on the storage devices of compute nodes
  • Cloud Integration
    • Available on AWS and Microsoft Azure

So many PFS, what to choose?

  • Is it free?
  • Does it require high-end hardware?
  • How are the metadata servers implemented?
  • How is the support system?
  • How is the reliability of the system?
  • How is the scalability of the system?
  • How is the storage performance?
  • How is the logging facility?
  • How is the performance with regard to different network infrastructures?
  • How is the performance for small file I/O?
  • How is the performance on interactive commands (ls, du, find, grep)?
  • How are the administrative tools?

Why Lustre/Ceph/OrangeFS were not chosen

  • Lustre:
  • Ceph:
    • Tested in 2013
    • Not ready for production use
  • OrangeFS (PVFS):
    • Too little traffic and too few mentions on the Internet
    • Clemson is also moving away from OrangeFS