Introduction to LustreFS, BeeGFS, and CephFS
Linh B. Ngo
Lustre
Started as a Carnegie Mellon research project in 1999
Lustre = Linux + cluster
Funding came from the Advanced Simulation and Computing Program (DOE/NNSA - National Nuclear Security Administration)
Acquired by Sun Microsystems in 2008
Lustre
Massively parallel distributed file system
Thousands of clients
Large capacities (55PB at LLNL)
High bandwidth (1.5TB/s at ORNL)
POSIX compliance
Open source (GPLv2)
Used by many of the TOP500 supercomputers
Lustre Features
File striping across disks and servers
Multiple metadata servers
Online file system checking
HSM (Hierarchical Storage Management) integration (think tape backup!)
User and group quotas
Lustre Features
Pluggable Network Request Scheduler
Two queues: one global queue and one per-object queue
Orders the execution of I/O requests that belong to the same data object by offset, keeping them as close together as possible to reduce disk seeks (see the sketch after this list)
RDMA support (Remote Direct Memory Access)
High availability
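The scheduling idea behind the per-object queues can be illustrated with a small sketch (conceptual only, not Lustre's actual NRS code): pending requests are grouped by the object they target, and each object's requests are serviced in ascending offset order.

```python
from collections import defaultdict

# Illustrative sketch of offset-ordered scheduling per data object.
# This is NOT Lustre's NRS implementation, only the underlying idea:
# group pending I/O requests by target object, then service each
# object's requests in ascending offset order to reduce disk seeks.

def order_requests(requests):
    """requests: list of (object_id, offset, size) tuples."""
    per_object = defaultdict(list)
    for obj_id, offset, size in requests:
        per_object[obj_id].append((offset, size))

    schedule = []
    for obj_id, reqs in per_object.items():
        for offset, size in sorted(reqs):          # ascending offset per object
            schedule.append((obj_id, offset, size))
    return schedule

pending = [("obj-7", 4096, 1024), ("obj-3", 0, 4096),
           ("obj-7", 0, 1024), ("obj-3", 8192, 4096)]
print(order_requests(pending))
```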
Lustre Features
I/O routing between networks
Multiple backend storage formats (ldiskfs and zfs)
Storage pools
CPU partitions
Recovery features
Lustre Components
MDS (Metadata Server): Manages filenames and directories, file stripe locations, locking, ACLs, etc.
MDT (Metadata Target): Block device used by the MDS to store metadata information
OSS (Object Storage Server): Handles I/O requests for file data
OST (Object Storage Target): Block device used by the OSS to store file data
MGS (Management Server): Stores configuration information for one or more Lustre file systems
MGT (Management Target): Block device used by the MGS
LNET (Lustre Networking) Transport Layer
Provides the underlying communication infrastructure
Is an abstraction over the underlying network types:
TCP/IP
InfiniBand
Cray high-speed interconnects
Allows fine-grained control of data flow
Lustre File Striping
Basic properties:
stripe_count: the number of OSTs to stripe across
stripe_size: how much data is written to each OST before moving to the next
Can be customized per file or per directory (see the example after this list)
The first stripe_size bytes are written to the first OST, the second stripe_size bytes to the second OST, and so on.
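As a concrete illustration of customizing the layout, the sketch below shells out to the standard lfs utility; the directory path and the chosen values are placeholder assumptions, and it requires a mounted Lustre file system with a Lustre client installed.

```python
import subprocess

# Hypothetical directory on a mounted Lustre file system.
target_dir = "/lustre/project/output"

# Set a default layout on the directory: stripe across 4 OSTs,
# writing 1 MiB to each OST before moving to the next.
subprocess.run(["lfs", "setstripe", "-c", "4", "-S", "1M", target_dir],
               check=True)

# Inspect the resulting layout (stripe count, stripe size, OST indices).
subprocess.run(["lfs", "getstripe", target_dir], check=True)
```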
I/O Flow in Lustre
I/O request is sent to the MDS
The MDS responds with:
Which OSTs are used
What is the stripe size of the file
etc
Client calculates which OST holds the data (see the sketch below)
Client directly contacts the appropriate OST to read/write data
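A back-of-the-envelope version of the calculation the client performs, assuming the simple round-robin layout described above (illustrative only, not Lustre client code):

```python
# Simplified round-robin striping math (illustrative, not Lustre client code).
# Given a file striped over `stripe_count` OSTs with `stripe_size` bytes per
# stripe, compute which stripe index and which OST (relative to the file's
# OST list) hold a given byte offset.

def locate(offset, stripe_size, stripe_count):
    stripe_index = offset // stripe_size       # how many full stripes precede this offset
    ost_index = stripe_index % stripe_count    # round-robin over the file's OSTs
    offset_in_stripe = offset % stripe_size
    return ost_index, stripe_index, offset_in_stripe

# Example: 1 MiB stripes across 4 OSTs; byte 5 MiB + 10 falls in stripe 5,
# which lands on the second OST (index 1) of the file's layout.
print(locate(5 * 2**20 + 10, stripe_size=2**20, stripe_count=4))
```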
I/O Recommendation
Avoid over-striping
Having more stripes does not mean faster access
For file sizes of O(1GB), stripe_count = 1 is recommended
Avoid under-striping
Very large files with low stripe counts can fill up an OST
Low stripe count can cause contention if many clients are reading/writing to separate portions of the same files
Avoid small I/O requests
Know your application's I/O pattern
CephFS
Presented as a paper at OSDI (Operating Systems Design and Implementation) in 2006
Funded by DOE
Open source (LGPL 2.1)
Motivation
As the amount of stored data and the number of read/write requests increase, the workload on the metadata servers also increases.
Increased cost of reliability
Limited scalability and performance
Design assumptions
Large systems are inevitably built incrementally
Node failures are the norm rather than the exception
Quality and character of workloads are constantly shifting over time
Key System Overview
Decoupled Data and Metadata
Dynamic Distributed Metadata Management
Reliable Autonomic Distributed Object Storage
Decoupled Data and Metadata
Metadata operations (open, rename, etc) are managed by MDS cluster
I/O requests managed by OSD (Object Storage Device) cluster
There is no allocation list mapping files to specific object names on the OSDs
No lookup table (reduces the workload on the MDS)
Stripe objects' names are generated using a data distribution function (see the sketch after this list)
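A toy sketch of why no lookup table is needed (this is not Ceph's actual CRUSH function): the object name is derived deterministically from the file's inode and stripe index, and a hash of that name picks the OSD, so any client can compute an object's location independently. The inode value, name format, and cluster size below are made-up examples.

```python
import hashlib

# Toy illustration of predictable object naming and placement.
# Not Ceph's CRUSH algorithm; it only shows why no per-file allocation
# list is needed: the object name is derived from (inode, stripe index),
# and a deterministic hash maps that name onto an OSD.

NUM_OSDS = 8  # assumed cluster size for the example

def object_name(inode, stripe_index):
    return f"{inode:x}.{stripe_index:08x}"

def place(name, num_osds=NUM_OSDS):
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % num_osds

name = object_name(inode=0x10000003d2, stripe_index=2)
print(name, "->", "osd.%d" % place(name))
```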
Dynamic Distributed Metadata Management
Previous work shows that metadata operations make up as much as half of typical file system workloads
Dynamic Subtree Partitioning distributes the file system directory hierarchy among the MDS nodes (scales up to hundreds)
Dynamic partitioning of the file system allows the workload to adapt based on current access patterns (see the sketch below)
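A highly simplified sketch of the partitioning idea (conceptual only, not the Ceph MDS implementation): subtrees are assigned to metadata servers, per-subtree access counts are tracked, and a hot subtree migrates from the busiest server to the least-loaded one when the imbalance grows. The paths, counters, and threshold are made up for illustration.

```python
# Highly simplified sketch of dynamic subtree partitioning
# (conceptual only, not Ceph's MDS code): subtrees are assigned to
# metadata servers, and the hottest subtree on the busiest server is
# migrated to the least-loaded server when the imbalance grows.

assignment = {"/home": 0, "/projects": 1, "/scratch": 1}    # subtree -> MDS id
access_count = {"/home": 120, "/projects": 900, "/scratch": 880}

def load(mds_id):
    return sum(c for s, c in access_count.items() if assignment[s] == mds_id)

def rebalance(num_mds=2, threshold=1.5):
    busiest = max(range(num_mds), key=load)
    idlest = min(range(num_mds), key=load)
    if load(idlest) == 0 or load(busiest) / max(load(idlest), 1) > threshold:
        hottest = max((s for s in assignment if assignment[s] == busiest),
                      key=lambda s: access_count[s])
        assignment[hottest] = idlest    # migrate the hot subtree

rebalance()
print(assignment)   # /projects moves from MDS 1 to MDS 0
```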
Reliable Autonomic Distributed Object Storage
The following are delegated to the OSDs:
Data migration
Replication
Failure detection
Failure recovery
Leverages the OSDs' computing resources to manage themselves
BeeGFS
In-house effort from Fraunhofer ITWM
Institute for Industrial Mathematics
Software simulation of mathematical models
First beta available in 2007
Spinoff company: ThinkParQ
Open source (GPL v2)
Goals
Maximum performance and scalability
High level of flexibility
Robustness and ease of use
Main Services
Management service
Storage service
Metadata service
Client service
Management Service
Lightweight
Monitoring and administrative tools
GUI
Metadata Service
Stores information about
File/directory paths (abstract)
Permissions, ownership
Location (stripe pattern)
Scale-out service
POSIX compliance
Each MS instance is responsible for an exclusive fraction of the global namespace
Storage targets are prioritized based on available storage capacity
Support storage pools
Storage Service
Scale-out service
Each instance can handle multiple storage targets
Utilizes all memory on storage server for caching purposes
Stripe size and the number of targets per file are managed by the metadata service (can use defaults or be customized)
One chunk file per storage target per file
Reliability and Fault-tolerance
Buddy Mirroring
Storage servers pair up storage targets (within one server or across servers) and replicate the targets' contents (sketched below)
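A minimal sketch of the buddy-group idea (conceptual, not BeeGFS code): targets are paired into buddy groups, and every chunk write is applied to both the primary target and its mirror buddy. The target and group names below are invented for the example.

```python
# Conceptual sketch of buddy mirroring (not BeeGFS code): storage targets
# are paired into buddy groups, and each chunk write is applied to both
# the primary target and its mirror buddy, so either target can fail
# without data loss.

buddy_groups = {
    "group-1": ("target-101", "target-201"),   # primary, mirror (different servers)
    "group-2": ("target-102", "target-202"),
}

storage = {t: {} for pair in buddy_groups.values() for t in pair}

def write_chunk(group, chunk_id, data):
    primary, mirror = buddy_groups[group]
    storage[primary][chunk_id] = data     # write to the primary target
    storage[mirror][chunk_id] = data      # replicate to the buddy target

write_chunk("group-1", "file42.chunk0", b"hello")
print(storage["target-101"], storage["target-201"])
```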
Other Features
BeeOND (BeeGFS On Demand)
Quickly spin up a dynamic BeeGFS system directly on storage devices of compute nodes
Cloud Integration
Available on AWS and Microsoft Azure
So many PFS, what to choose?
Example:
http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
(2014)
Is it free?
Does it require high-end hardware?
How are the metadata servers implemented?
How is the support system?
How is the reliability of the system?
How is the scalability of the system?
How is the storage performance?
How is the logging facility?
How is the performance with regard to different network infrastructures?
How is the performance for small file I/O?
How is the performance on interactive commands (ls, du, find, grep)?
How are the administrative tools?
Why Lustre/Ceph/OrangeFS were not chosen
Lustre:
Fragile to software stack and driver changes
Lustre was removed from the Linux kernel (https://lkml.org/lkml/2018/6/16/126)
Ceph:
Tested in 2013
Not ready for production use
OrangeFS (PVFS):
Too little traffic and too few mentions on the Internet
Clemson is also moving away from OrangeFS