Introduction to LustreFS, BeeGFS, and CephFS
Linh B. Ngo
Lustre
Started as a Carnegie Mellon research project in 1999
Lustre = Linux + cluster
Funding came from the Advanced Simulation and Computing Program (DOE/NNSA - National Nuclear Security Administration)
Acquired by Sun Microsystems in 2008
Lustre
Massively parallel distributed file system
Thousands of clients
Large capacities (55PB at LLNL)
High bandwidth (1.5TB/s at ORNL)
POSIX compliance
Open source (GPLv2)
Used by many of the TOP500 supercomputers
Lustre Features
File striping across disks and servers
Multiple metadata servers
Online file system checking
HSM (Hierarchical Storage Management) integration (think tape backup!)
User and group quotas
Lustre Features
Pluggable Network Request Scheduler
Two queues: one global queue and one per-object queue
Orders the execution of I/O requests that belong to the same data object by offset, keeping them as close together as possible to reduce disk seeks (see the sketch after this list)
RDMA support (Remote Direct Memory Access)
High availability
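The scheduling idea behind the per-object queues can be illustrated with a small sketch (conceptual only, not Lustre's actual NRS code): pending requests are grouped by the object they target, and each object's requests are serviced in ascending offset order.

```python
from collections import defaultdict

# Illustrative sketch of offset-ordered scheduling per data object.
# This is NOT Lustre's NRS implementation, only the underlying idea:
# group pending I/O requests by target object, then service each
# object's requests in ascending offset order to reduce disk seeks.

def order_requests(requests):
    """requests: list of (object_id, offset, size) tuples."""
    per_object = defaultdict(list)
    for obj_id, offset, size in requests:
        per_object[obj_id].append((offset, size))

    schedule = []
    for obj_id, reqs in per_object.items():
        for offset, size in sorted(reqs):          # ascending offset per object
            schedule.append((obj_id, offset, size))
    return schedule

pending = [("obj-7", 4096, 1024), ("obj-3", 0, 4096),
           ("obj-7", 0, 1024), ("obj-3", 8192, 4096)]
print(order_requests(pending))
```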
Lustre Features
I/O routing between networks
Multiple backend storage formats (ldiskfs and zfs)
Storage pools
CPU partitions
Recovery features
Lustre Components
MDS (Metadata Server): Manages filenames and directories, file stripe locations, locking, ACLs, etc.
MDT (Metadata Target): Block device used by the MDS to store metadata information
OSS (Object Storage Server): Handles I/O requests for file data
OST (Object Storage Target): Block device used by the OSS to store file data
MGS (Management Server): Stores configuration information for one or more Lustre file systems
MGT (Management Target): Block device used by the MGS
LNET (Lustre Networking) Transport Layer
Provides the underlying communication infrastructure
Is an abstraction over the underlying network types:
TCP/IP
InfiniBand
Cray high-speed interconnects
Allows fine-grained control of data flow
Lustre File Striping
Basic properties:
stripe_count: the number of OSTs to stripe across
stripe_size: how much data is written to each OST before moving to the next
Can be customized per file or per directory (see the example after this list)
The first stripe_size bytes are written to the first OST, the second stripe_size bytes to the second OST, and so on.
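As a concrete illustration of customizing the layout, the sketch below shells out to the standard lfs utility; the directory path and the chosen values are placeholder assumptions, and it requires a mounted Lustre file system with a Lustre client installed.

```python
import subprocess

# Hypothetical directory on a mounted Lustre file system.
target_dir = "/lustre/project/output"

# Set a default layout on the directory: stripe across 4 OSTs,
# writing 1 MiB to each OST before moving to the next.
subprocess.run(["lfs", "setstripe", "-c", "4", "-S", "1M", target_dir],
               check=True)

# Inspect the resulting layout (stripe count, stripe size, OST indices).
subprocess.run(["lfs", "getstripe", target_dir], check=True)
```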
I/O Flow in Lustre
I/O request is sent to the MDS
The MDS responds with:
Which OSTs are used
What is the stripe size of the file
etc
Client calculates which OST holds the data (see the sketch below)
Client directly contacts the appropriate OST to read/write data
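A back-of-the-envelope version of the calculation the client performs, assuming the simple round-robin layout described above (illustrative only, not Lustre client code):

```python
# Simplified round-robin striping math (illustrative, not Lustre client code).
# Given a file striped over `stripe_count` OSTs with `stripe_size` bytes per
# stripe, compute which stripe index and which OST (relative to the file's
# OST list) hold a given byte offset.

def locate(offset, stripe_size, stripe_count):
    stripe_index = offset // stripe_size       # how many full stripes precede this offset
    ost_index = stripe_index % stripe_count    # round-robin over the file's OSTs
    offset_in_stripe = offset % stripe_size
    return ost_index, stripe_index, offset_in_stripe

# Example: 1 MiB stripes across 4 OSTs; byte 5 MiB + 10 falls in stripe 5,
# which lands on the second OST (index 1) of the file's layout.
print(locate(5 * 2**20 + 10, stripe_size=2**20, stripe_count=4))
```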
I/O Recommendation
Avoid over-striping
Having more stripes does not mean faster access
For file sizes of O(1GB), stripe_count = 1 is recommended
Avoid under-striping
Very large files with low stripe counts can fill up an OST
Low stripe count can cause contention if many clients are reading/writing to separate portions of the same files
Avoid small I/O requests
Know your application's I/O pattern
CephFS
Presented as a paper at OSDI (Operating Systems Design and Implementation) in 2006
Funded by DOE
Open source (LGPL 2.1)
Motivation
As the amount of stored data and the number of read/write requests increase, the workload on the metadata servers also increases.
Increased cost of reliability
Limited scalability and performance
Design assumptions
Large systems are inevitably built incrementally
Node failures are the norm rather than the exception
Quality and character of workloads are constantly shifting over time
Key System Overview
Decoupled Data and Metadata
Dynamic Distributed Metadata Management
Reliable Autonomic Distributed Object Storage
Decoupled Data and Metadata
Metadata operations (open, rename, etc) are managed by MDS cluster
I/O requests managed by OSD (Object Storage Device) cluster
There is no allocation list mapping files to specific object names on the OSDs
No lookup table (reduces the workload on the MDS)
Stripe objects' names are generated using a data distribution function (see the sketch after this list)
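A toy sketch of why no lookup table is needed (this is not Ceph's actual CRUSH function): the object name is derived deterministically from the file's inode and stripe index, and a hash of that name picks the OSD, so any client can compute an object's location independently. The inode value, name format, and cluster size below are made-up examples.

```python
import hashlib

# Toy illustration of predictable object naming and placement.
# Not Ceph's CRUSH algorithm; it only shows why no per-file allocation
# list is needed: the object name is derived from (inode, stripe index),
# and a deterministic hash maps that name onto an OSD.

NUM_OSDS = 8  # assumed cluster size for the example

def object_name(inode, stripe_index):
    return f"{inode:x}.{stripe_index:08x}"

def place(name, num_osds=NUM_OSDS):
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % num_osds

name = object_name(inode=0x10000003d2, stripe_index=2)
print(name, "->", "osd.%d" % place(name))
```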
Dynamic Distributed Metadata Management
Previous work shows that metadata operations make up as much as half of typical file system workloads
Dynamic Subtree Partitioning distributes the file system directory hierarchy among the MDS nodes (scales up to hundreds)
Dynamic partitioning of the file system allows the workload to adapt based on current access patterns (see the sketch below)
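A highly simplified sketch of the partitioning idea (conceptual only, not the Ceph MDS implementation): subtrees are assigned to metadata servers, per-subtree access counts are tracked, and a hot subtree migrates from the busiest server to the least-loaded one when the imbalance grows. The paths, counters, and threshold are made up for illustration.

```python
# Highly simplified sketch of dynamic subtree partitioning
# (conceptual only, not Ceph's MDS code): subtrees are assigned to
# metadata servers, and the hottest subtree on the busiest server is
# migrated to the least-loaded server when the imbalance grows.

assignment = {"/home": 0, "/projects": 1, "/scratch": 1}    # subtree -> MDS id
access_count = {"/home": 120, "/projects": 900, "/scratch": 880}

def load(mds_id):
    return sum(c for s, c in access_count.items() if assignment[s] == mds_id)

def rebalance(num_mds=2, threshold=1.5):
    busiest = max(range(num_mds), key=load)
    idlest = min(range(num_mds), key=load)
    if load(idlest) == 0 or load(busiest) / max(load(idlest), 1) > threshold:
        hottest = max((s for s in assignment if assignment[s] == busiest),
                      key=lambda s: access_count[s])
        assignment[hottest] = idlest    # migrate the hot subtree

rebalance()
print(assignment)   # /projects moves from MDS 1 to MDS 0
```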
Reliable Autonomic Distributed Object Storage
The following are delegated to the OSDs:
Data migration
Replication
Failure detection
Failure recovery
Leverages the OSDs' computing resources to manage themselves
BeeGFS
In-house effort from Fraunhofer ITWM
Institute for Industrial Mathematics
Software simulation of mathematical models
First beta available in 2007
Spinoff company: ThinkParQ
Open source (GPL v2)
Goals
Maximum performance and scalability
High level of flexibility
Robustness and ease of use
Main Services
Management service
Storage service
Metadata service
Client service
Management Service
Lightweight
Monitoring and administrative tools
GUI
Metadata Service
Stores information about
File/directory paths (abstract)
Permissions, ownership
Location (stripe pattern)
Scale-out service
POSIX compliance
Each MS instance is responsible for an exclusive fraction of the global namespace
Storage targets are prioritized based on available storage capacity
Support storage pools
Storage Service
Scale-out service
Each instance can handle multiple storage targets
Utilizes all memory on storage server for caching purposes
Stripe size and the number of targets per file are managed by the metadata service (can use defaults or be customized)
One chunk file per storage target per file
Reliability and Fault-tolerance
Buddy Mirroring
Storage servers pair up storage targets (within one server or across servers) and replicate the targets' contents (sketched below)
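A minimal sketch of the buddy-group idea (conceptual, not BeeGFS code): targets are paired into buddy groups, and every chunk write is applied to both the primary target and its mirror buddy. The target and group names below are invented for the example.

```python
# Conceptual sketch of buddy mirroring (not BeeGFS code): storage targets
# are paired into buddy groups, and each chunk write is applied to both
# the primary target and its mirror buddy, so either target can fail
# without data loss.

buddy_groups = {
    "group-1": ("target-101", "target-201"),   # primary, mirror (different servers)
    "group-2": ("target-102", "target-202"),
}

storage = {t: {} for pair in buddy_groups.values() for t in pair}

def write_chunk(group, chunk_id, data):
    primary, mirror = buddy_groups[group]
    storage[primary][chunk_id] = data     # write to the primary target
    storage[mirror][chunk_id] = data      # replicate to the buddy target

write_chunk("group-1", "file42.chunk0", b"hello")
print(storage["target-101"], storage["target-201"])
```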
Other Features
BeeOND (BeeGFS On Demand)
Quickly spin up a dynamic BeeGFS system directly on storage devices of compute nodes
Cloud Integration
Available on AWS and Microsoft Azure
So many PFS, what to choose?
Example:
http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
(2014)
Is it free?
Does it require high-end hardware?
How are the metadata servers implemented?
How is the support system?
How is the reliability of the system?
How is the scalability of the system?
How is the storage performance?
How is the logging facility?
How is the performance with regard to different network infrastructures?
How is the performance for small file I/O?
How is the performance on interactive commands (ls, du, find, grep)?
How are the administrative tools?
Why Lustre/Ceph/OrangeFS were not chosen
Lustre:
Fragile to software stack and driver changes
Lustre was removed from the Linux kernel (https://lkml.org/lkml/2018/6/16/126)
Ceph:
Tested in 2013
Not ready for production use
OrangeFS (PVFS):
Too little traffic and too few mentions on the Internet
Clemson is also moving away from OrangeFS