Distributed File Systems

Linh B. Ngo

How do we arrange read/write accesses for processes running on computers that are part of a computing cluster?

Networked File System

  • Allow transparent access to files stored on a remote disk

Clustered File System

  • Allow transparent access to files stored on a large set of disks, which could be distributed across multiple computers

Parallel File System

  • Enable parallel access to files

Networked File System

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. (1985, June). Design and implementation of the Sun network filesystem. In Proceedings of the Summer USENIX conference (pp. 119-130)

  • Sun Network Filesystem Protocol (NFS)
  • Current version: NFSv4.2
  • Design Goals
    • Machine and operating system independence
    • Crash recovery
    • Transparent access
    • UNIX semantics maintained on client
    • Reasonable performance (target 80% as fast as local disk)

NFS Design:

  • NFS Protocol
  • Server side
  • Client side

NFS Protocol:

  • Remote Procedure Call mechanism
  • Stateless protocol
  • Transport independence (initially implemented on top of UDP/IP)
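Because the protocol is stateless, every request must carry all the information the server needs: the file handle, offset, and count. A minimal sketch of this idea (illustrative names only, not the real NFS wire protocol):

```python
# Minimal sketch of a stateless read: the server keeps no open-file table
# and no cursor, so a client can simply retry any request after a server
# crash. The handle value and FILES table are illustrative stand-ins.

FILES = {0xBEEF: b"The quick brown fox"}  # handle -> file bytes

def nfs_read(handle: int, offset: int, count: int) -> bytes:
    # The request is self-describing: no per-client state on the server.
    return FILES[handle][offset:offset + count]

print(nfs_read(0xBEEF, 0, 3))   # b'The'
print(nfs_read(0xBEEF, 4, 5))   # b'quick'
```

Contrast this with a stateful design, where the server would have to track which clients hold which files open and reconstruct that state after a crash.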

NFS Server:

  • Must commit modifications to stable storage before returning results
  • Generation number in inode and filesystem id in superblock
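The filesystem id, inode number, and generation number together form the opaque file handle; a generation mismatch lets the server detect stale handles after an inode is freed and reused. A sketch of this mechanism (all names are illustrative):

```python
# Sketch of stale-handle detection via generation numbers. The Server class
# is an illustrative stand-in, not real NFS server code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Handle:
    fsid: int        # filesystem id (from the superblock)
    inode: int       # inode number
    generation: int  # bumped each time the inode is reused

class Server:
    def __init__(self):
        self.inodes = {}  # inode number -> (generation, data)

    def create(self, fsid: int, inode: int, data: bytes) -> Handle:
        gen = self.inodes.get(inode, (0, b""))[0] + 1
        self.inodes[inode] = (gen, data)
        return Handle(fsid, inode, gen)

    def read(self, h: Handle) -> bytes:
        gen, data = self.inodes[h.inode]
        if gen != h.generation:
            raise ValueError("ESTALE: handle refers to a deleted file")
        return data

srv = Server()
old = srv.create(fsid=1, inode=42, data=b"first file")
srv.create(fsid=1, inode=42, data=b"reused inode")  # inode 42 reused
try:
    srv.read(old)            # old handle is now stale
except ValueError as e:
    print(e)
```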

NFS Client:

  • Additional virtual file system interface in the Linux kernel
  • Attach remote file system via mount

Clustered File System

  • Additional middleware layers allow the tasks of a file system server to be distributed among a cluster of computers
  • Example: The Zettabyte File System by Sun Microsystem

Bonwick, Jeff, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. "The zettabyte file system." In Proc. of the 2nd Usenix Conference on File and Storage Technologies, vol. 215. 2003.

"One of the most striking design principles in modern file systems is the one-to-one association between a file system and a particular storage device (or portion thereof). Volume managers do virtualize the underlying storage to some degree, but in the end, a file system is still assigned to some particular range of blocks of the logical storage device. This is counterintuitive because a file system is intended to virtualize physical storage, and yet there remains a fixed binding between a logical namespace and a specific device (logical or physical, they both look the same to the user)."

Design Principles:

  • Simple administration: simplify and automate administration of storage to a much greater degree
  • Pooled storage: decouple file systems from physical storage with allocation being done on the pooled storage side rather than the file system side
  • Dynamic file system size
  • Always consistent on-disk data
  • Immense capacity (Prediction in 2003: 16 Exabyte datasets to appear in 10.5 years)
  • Error detection and correction
  • Integration of volume manager
  • Excellent performance
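The pooled-storage principle can be illustrated with a toy allocator: file systems draw blocks from a shared pool instead of owning a fixed device range, so their sizes grow and shrink dynamically. The classes below are illustrative, not the ZFS implementation:

```python
# Sketch of pooled storage: allocation happens on the pool side, and any
# file system can claim blocks another one has released.
class FileSystem:
    def __init__(self, name: str):
        self.name, self.used = name, 0

class StoragePool:
    def __init__(self, total_blocks: int):
        self.free = total_blocks

    def allocate(self, fs: FileSystem, nblocks: int):
        if nblocks > self.free:
            raise MemoryError("pool exhausted")
        self.free -= nblocks
        fs.used += nblocks

    def release(self, fs: FileSystem, nblocks: int):
        self.free += nblocks
        fs.used -= nblocks

pool = StoragePool(total_blocks=1000)
home, var = FileSystem("home"), FileSystem("var")
pool.allocate(home, 600)   # no pre-assigned size per file system
pool.allocate(var, 300)
pool.release(home, 500)    # freed blocks return to the pool...
pool.allocate(var, 500)    # ...and any file system can claim them
print(home.used, var.used, pool.free)   # 100 800 100
```

This is the "dynamic file system size" principle in miniature: neither `home` nor `var` is bound to a fixed range of blocks.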

Parallel File Systems

Ross, Robert, Philip Carns, and David Metheny. "Parallel file systems." In Data Engineering, pp. 143-168. Springer, Boston, MA, 2009.

Fundamental Design Concepts

  • Single namespace, including the file and directory hierarchy
  • Actual data are distributed over storage servers
    • Only large files are split up into contiguous data regions
  • Metadata regarding the namespace and data distribution are stored in one of two ways:
    • Dedicated metadata servers (PVFS)
    • Distributed across storage servers (CephFS)

Parallel file access mechanisms

  • Shared-file (N-to-1): A single file is created, and all application tasks write to that file (usually to disjoint regions)
    • Increased usability: only one file is needed
    • Can create lock contention and reduce performance
  • File-per-process (N-to-N): Each application task creates a separate file, and writes only to that file.
    • Avoids lock contention
    • Can create massive amount of small files
    • Complicates application restart with a different number of tasks
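The two access patterns can be contrasted with plain POSIX-style I/O (a real HPC code would use MPI-IO; here the tasks are simulated sequentially, and all paths and sizes are illustrative):

```python
# Sketch: N-to-1 (shared file, disjoint regions) vs. N-to-N (file per task).
import os
import tempfile

NTASKS, CHUNK = 4, 8
workdir = tempfile.mkdtemp()

# N-to-1: every task writes a disjoint region of one shared file.
shared = os.path.join(workdir, "output.dat")
with open(shared, "wb") as f:
    f.truncate(NTASKS * CHUNK)             # pre-size the shared file
for rank in range(NTASKS):
    with open(shared, "r+b") as f:
        f.seek(rank * CHUNK)               # disjoint region per task
        f.write(bytes([rank]) * CHUNK)

# N-to-N: every task writes its own file.
for rank in range(NTASKS):
    with open(os.path.join(workdir, f"output.{rank}.dat"), "wb") as f:
        f.write(bytes([rank]) * CHUNK)

print(os.path.getsize(shared))   # 32: one file holding all regions
print(len(os.listdir(workdir)))  # 5: the shared file plus 4 per-task files
```

With four tasks the file count difference is trivial; with hundreds of thousands of tasks, the N-to-N pattern's flood of small files becomes a metadata-server burden, which is exactly the trade-off listed above.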

Data Distribution in Parallel File Systems

  • Original File: Sequence of Bytes
  • The sequence of bytes is converted into a sequence of offsets (each offset can cover multiple bytes)
  • Offsets are mapped to objects
    • Not necessarily an ordered mapping
    • Reversible, so clients can contact the specific PFS server holding specific data content
  • Objects are distributed across PFS servers
    • Information about where the objects are is stored at the metadata server
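A common concrete instance of this mapping is striping: byte offsets are grouped into fixed-size stripe units ("objects" below), which are placed on servers round-robin. Because the mapping is pure arithmetic, it is reversible, and a client can compute which server holds any byte without a lookup. The parameters here are illustrative:

```python
# Sketch of a striped, round-robin offset-to-object mapping.
STRIPE_UNIT = 64 * 1024   # bytes per object (illustrative size)
NSERVERS = 4

def byte_to_object(offset: int) -> tuple[int, int, int]:
    obj = offset // STRIPE_UNIT   # which object the byte falls in
    server = obj % NSERVERS       # round-robin placement across servers
    within = offset % STRIPE_UNIT # offset inside that object
    return server, obj, within

def object_to_byte(obj: int, within: int) -> int:
    # Reverse mapping: recover the file offset from (object, within-object)
    return obj * STRIPE_UNIT + within

server, obj, within = byte_to_object(300_000)
print(server, obj, within)                     # 0 4 37856
assert object_to_byte(obj, within) == 300_000  # the mapping is reversible
```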

Object Placement

  • Round-robin is a reasonable default solution
  • Works consistently on most systems
  • The default solution in GPFS, Lustre, and PVFS
  • Potential scalability issues with massive numbers of file servers and very large files; mitigations:
    • Two-dimensional distribution
    • Limit the number of servers per file
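The second mitigation can be sketched as capping the stripe width: instead of striping every file over all servers, placement limits each file to a small window of servers starting from the file's first server. The numbers below are illustrative:

```python
# Sketch of limiting the number of servers per file. With thousands of
# servers, touching all of them for every file hurts small files, so each
# file is striped over a bounded, wrapping window instead.
NSERVERS = 1024
MAX_STRIPE_WIDTH = 8   # servers per file, instead of all 1024

def servers_for_file(start_server: int, width: int = MAX_STRIPE_WIDTH):
    # Round-robin over a small window rather than the whole cluster.
    return [(start_server + i) % NSERVERS for i in range(width)]

print(servers_for_file(1020))  # wraps: [1020, 1021, 1022, 1023, 0, 1, 2, 3]
```

Lustre exposes this knob directly as a per-file stripe count; the window-based placement above is just one simple way to choose the subset.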

Design Challenges

  • Performance
    • How well the file system interfaces with applications
  • Consistency Semantics
  • Interoperability:
    • POSIX/UNIX
    • MPI/IO
  • Fault Tolerance:
    • Amplified by the PFS's multiple storage devices and longer I/O path
  • Management Tools