Distributed File Systems

Linh B. Ngo

How do we arrange read/write accesses for processes running on computers that are part of a computing cluster?

Networked File System

  • Allow transparent access to files stored on a remote disk

Clustered File System

  • Allow transparent access to files stored on a large set of disks, which could be distributed across multiple computers

Parallel File System

  • Enable parallel access to files

Networked File System

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. (1985, June). Design and implementation of the Sun network filesystem. In Proceedings of the Summer USENIX conference (pp. 119-130)

  • Sun Network Filesystem Protocol (NFS)
  • Current version: NFSv4.2
  • Design Goals
    • Machine and operating system independence
    • Crash recovery
    • Transparent access
    • UNIX semantics maintained on client
    • Reasonable performance (target 80% as fast as local disk)

NFS Design:

  • NFS Protocol
  • Server side
  • Client side

NFS Protocol:

  • Remote Procedure Call mechanism
  • Stateless protocol
  • Transport independence (initially implemented on top of UDP/IP)
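Because the protocol is stateless, every request must carry all the information the server needs: the file handle, offset, and count. A minimal sketch of this idea (illustrative names only, not the real NFS wire protocol):

```python
# Minimal sketch of a stateless read: the server keeps no open-file table
# and no cursor, so a client can simply retry any request after a server
# crash. The handle value and FILES table are illustrative stand-ins.

FILES = {0xBEEF: b"The quick brown fox"}  # handle -> file bytes

def nfs_read(handle: int, offset: int, count: int) -> bytes:
    # The request is self-describing: no per-client state on the server.
    return FILES[handle][offset:offset + count]

print(nfs_read(0xBEEF, 0, 3))   # b'The'
print(nfs_read(0xBEEF, 4, 5))   # b'quick'
```

Contrast this with a stateful design, where the server would have to track which clients hold which files open and reconstruct that state after a crash.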

NFS Server:

  • Must commit modifications to stable storage before returning results
  • Generation number in inode and filesystem id in superblock
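The filesystem id, inode number, and generation number together form the opaque file handle; a generation mismatch lets the server detect stale handles after an inode is freed and reused. A sketch of this mechanism (all names are illustrative):

```python
# Sketch of stale-handle detection via generation numbers. The Server class
# is an illustrative stand-in, not real NFS server code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Handle:
    fsid: int        # filesystem id (from the superblock)
    inode: int       # inode number
    generation: int  # bumped each time the inode is reused

class Server:
    def __init__(self):
        self.inodes = {}  # inode number -> (generation, data)

    def create(self, fsid: int, inode: int, data: bytes) -> Handle:
        gen = self.inodes.get(inode, (0, b""))[0] + 1
        self.inodes[inode] = (gen, data)
        return Handle(fsid, inode, gen)

    def read(self, h: Handle) -> bytes:
        gen, data = self.inodes[h.inode]
        if gen != h.generation:
            raise ValueError("ESTALE: handle refers to a deleted file")
        return data

srv = Server()
old = srv.create(fsid=1, inode=42, data=b"first file")
srv.create(fsid=1, inode=42, data=b"reused inode")  # inode 42 reused
try:
    srv.read(old)            # old handle is now stale
except ValueError as e:
    print(e)
```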

NFS Client:

  • Additional virtual file system interface in the Linux kernel
  • Attach remote file system via mount

Clustered File System

  • Additional middleware layers allow the tasks of a file system server to be distributed among a cluster of computers
  • Example: The Zettabyte File System by Sun Microsystem

Bonwick, Jeff, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. "The zettabyte file system." In Proc. of the 2nd Usenix Conference on File and Storage Technologies, vol. 215. 2003.

"One of the most striking design principles in modern file systems is the one-to-one association between a file system and a particular storage device (or portion thereof). Volume managers do virtualize the underlying storage to some degree, but in the end, a file system is still assigned to some particular range of blocks of the logical storage device. This is counterintuitive because a file system is intended to virtualize physical storage, and yet there remains a fixed binding between a logical namespace and a specific device (logical or physical, they both look the same to the user)."

Design Principles:

  • Simple administration: simplify and automate administration of storage to a much greater degree
  • Pooled storage: decouple file systems from physical storage with allocation being done on the pooled storage side rather than the file system side
  • Dynamic file system size
  • Always consistent on-disk data
  • Immense capacity (Prediction in 2003: 16 Exabyte datasets to appear in 10.5 years)
  • Error detection and correction
  • Integration of volume manager
  • Excellent performance
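The pooled-storage principle can be illustrated with a toy allocator: file systems draw blocks from a shared pool instead of owning a fixed device range, so their sizes grow and shrink dynamically. The classes below are illustrative, not the ZFS implementation:

```python
# Sketch of pooled storage: allocation happens on the pool side, and any
# file system can claim blocks another one has released.
class FileSystem:
    def __init__(self, name: str):
        self.name, self.used = name, 0

class StoragePool:
    def __init__(self, total_blocks: int):
        self.free = total_blocks

    def allocate(self, fs: FileSystem, nblocks: int):
        if nblocks > self.free:
            raise MemoryError("pool exhausted")
        self.free -= nblocks
        fs.used += nblocks

    def release(self, fs: FileSystem, nblocks: int):
        self.free += nblocks
        fs.used -= nblocks

pool = StoragePool(total_blocks=1000)
home, var = FileSystem("home"), FileSystem("var")
pool.allocate(home, 600)   # no pre-assigned size per file system
pool.allocate(var, 300)
pool.release(home, 500)    # freed blocks return to the pool...
pool.allocate(var, 500)    # ...and any file system can claim them
print(home.used, var.used, pool.free)   # 100 800 100
```

This is the "dynamic file system size" principle in miniature: neither `home` nor `var` is bound to a fixed range of blocks.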

Parallel File Systems

Ross, Robert, Philip Carns, and David Metheny. "Parallel file systems." In Data Engineering, pp. 143-168. Springer, Boston, MA, 2009.

Fundamental Design Concepts

  • Single namespace, including the file and directory hierarchy
  • Actual data are distributed over storage servers
    • Only large files are split up into contiguous data regions
  • Metadata regarding the namespace and data distribution are stored in one of two ways:
    • Dedicated metadata servers (PVFS)
    • Distributed across storage servers (CephFS)

Parallel file access mechanisms

  • Shared-file (N-to-1): A single file is created, and all application tasks write to that file (usually to disjoint regions)
    • Increased usability: only one file is needed
    • Can create lock contention and reduce performance
  • File-per-process (N-to-N): Each application task creates a separate file, and writes only to that file.
    • Avoids lock contention
    • Can create massive amount of small files
    • Complicates application restart with a different number of tasks
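The two access patterns can be contrasted with plain POSIX-style I/O (a real HPC code would use MPI-IO; here the tasks are simulated sequentially, and all paths and sizes are illustrative):

```python
# Sketch: N-to-1 (shared file, disjoint regions) vs. N-to-N (file per task).
import os
import tempfile

NTASKS, CHUNK = 4, 8
workdir = tempfile.mkdtemp()

# N-to-1: every task writes a disjoint region of one shared file.
shared = os.path.join(workdir, "output.dat")
with open(shared, "wb") as f:
    f.truncate(NTASKS * CHUNK)             # pre-size the shared file
for rank in range(NTASKS):
    with open(shared, "r+b") as f:
        f.seek(rank * CHUNK)               # disjoint region per task
        f.write(bytes([rank]) * CHUNK)

# N-to-N: every task writes its own file.
for rank in range(NTASKS):
    with open(os.path.join(workdir, f"output.{rank}.dat"), "wb") as f:
        f.write(bytes([rank]) * CHUNK)

print(os.path.getsize(shared))   # 32: one file holding all regions
print(len(os.listdir(workdir)))  # 5: the shared file plus 4 per-task files
```

With four tasks the file count difference is trivial; with hundreds of thousands of tasks, the N-to-N pattern's flood of small files becomes a metadata-server burden, which is exactly the trade-off listed above.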

Data Distribution in Parallel File Systems

  • Original File: Sequence of Bytes
  • The sequence of bytes is converted into a sequence of offsets (each offset can cover multiple bytes)
  • Offsets are mapped to objects
    • Not necessarily an ordered mapping
    • Reversible, so clients can contact the specific PFS server holding specific data content
  • Objects are distributed across PFS servers
    • Information about where the objects are is stored at the metadata server
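A common concrete instance of this mapping is striping: byte offsets are grouped into fixed-size stripe units ("objects" below), which are placed on servers round-robin. Because the mapping is pure arithmetic, it is reversible, and a client can compute which server holds any byte without a lookup. The parameters here are illustrative:

```python
# Sketch of a striped, round-robin offset-to-object mapping.
STRIPE_UNIT = 64 * 1024   # bytes per object (illustrative size)
NSERVERS = 4

def byte_to_object(offset: int) -> tuple[int, int, int]:
    obj = offset // STRIPE_UNIT   # which object the byte falls in
    server = obj % NSERVERS       # round-robin placement across servers
    within = offset % STRIPE_UNIT # offset inside that object
    return server, obj, within

def object_to_byte(obj: int, within: int) -> int:
    # Reverse mapping: recover the file offset from (object, within-object)
    return obj * STRIPE_UNIT + within

server, obj, within = byte_to_object(300_000)
print(server, obj, within)                     # 0 4 37856
assert object_to_byte(obj, within) == 300_000  # the mapping is reversible
```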

Object Placement

  • Round-robin is a reasonable default solution
  • Works consistently on most systems
  • The default solution in GPFS, Lustre, and PVFS
  • Potential scalability issues with massive numbers of file servers and very large files; mitigations:
    • Two-dimensional distribution
    • Limit the number of servers per file
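The second mitigation can be sketched as capping the stripe width: instead of striping every file over all servers, placement limits each file to a small window of servers starting from the file's first server. The numbers below are illustrative:

```python
# Sketch of limiting the number of servers per file. With thousands of
# servers, touching all of them for every file hurts small files, so each
# file is striped over a bounded, wrapping window instead.
NSERVERS = 1024
MAX_STRIPE_WIDTH = 8   # servers per file, instead of all 1024

def servers_for_file(start_server: int, width: int = MAX_STRIPE_WIDTH):
    # Round-robin over a small window rather than the whole cluster.
    return [(start_server + i) % NSERVERS for i in range(width)]

print(servers_for_file(1020))  # wraps: [1020, 1021, 1022, 1023, 0, 1, 2, 3]
```

Lustre exposes this knob directly as a per-file stripe count; the window-based placement above is just one simple way to choose the subset.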

Design Challenges

  • Performance
    • How well the file system interfaces with applications
  • Consistency Semantics
  • Interoperability:
    • POSIX/UNIX
    • MPI/IO
  • Fault Tolerance:
    • Amplified by the PFS's multiple storage devices and longer I/O path
  • Management Tools