Introduction to PVFS, a Parallel File System for Linux

Linh B. Ngo

Why not NFS?

The issue is with NFS's centralized storage model:

  • Scalability: The increasing number of computing nodes will overwhelm the performance capacity of the NFS server.
  • Availability: If the NFS server goes down, all processing nodes will have to wait. This issue becomes more severe as the number of compute nodes increases.

(One of) The Solution(s): Parallel Virtual File System

  • Research and development began in 1993
  • Funded through a NASA grant to study the I/O patterns of parallel programs
    • This was also about the time when NASA funded the seminal work of Dr. Thomas Sterling on how to develop an inexpensive supercomputer.
  • First version released in 1998/1999
  • Second version released in 2005
  • Omnibond, a Clemson spin-off company, was formed to sell support and services (all versions of PVFS are open source)
  • The name changed to OrangeFS in 2011
    • OrangeFS (the main branch) is supported by the Omnibond/Clemson team
    • The Blue branch is supported by Argonne National Laboratory for an IBM Blue Gene supercomputer

Timeliness

  • Thomas Sterling's seminal paper on the design of a Beowulf cluster
    • Large cluster consisting of inexpensive COTS (commodity off-the-shelf) machines
  • Open source Linux operating system
  • Standardization of MPI as a tool for large-scale programming
  • An open source parallel file system was the only thing that was missing

Goals

  • High performance
    • High concurrent read/write bandwidth from multiple processes or threads to a common file
    • Multiple APIs: native PVFS, UNIX/POSIX, and MPI-IO
    • Common UNIX shell commands must work
    • Applications developed for UNIX must be able to access PVFS files without recompiling (see the sketch after this list)
  • Robust and scalable
  • Easy to install and use
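
One of these goals is that a program written against ordinary UNIX I/O keeps working once a PVFS volume is mounted (the kernel support described later provides the mount). Below is a minimal sketch, assuming a hypothetical mount point /mnt/pvfs and file name; the calls themselves are plain POSIX with nothing PVFS-specific.

    /* An unmodified UNIX program reading a file that happens to live on a
     * mounted PVFS volume. The path is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        int fd = open("/mnt/pvfs/data/input.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);   /* copy the file to stdout */
        close(fd);
        return 0;
    }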

Parallel Virtual File System (PVFS)

  • Parallel: Data are physically stored on multiple independent machines with separate network connections
  • Virtual: There exists a set of virtual interfaces (maintained through daemon processes) between the physical storage and a user-space abstraction.
  • File System: Through this abstraction, users can store and retrieve data using common file access methods applicable to traditional file systems

Parallel Virtual File System (PVFS)

  • Unlike the centralized model of NFS, PVFS has N servers making portions of a file available to the tasks of a parallel application running on multiple processing nodes over a network
  • The aggregate bandwidth exceeds that of a single machine
  • This works in a manner similar to how a RAID0 disk array works

System Architecture

  • I/O Nodes: store the actual file data and are connected to the disks holding the physical bytes. Each file is striped across the disks on the I/O nodes
  • Management Node: handles metadata operations
  • Compute Nodes: where parallel applications are executed. These applications interact with PVFS via APIs (native PVFS, MPI-IO, or UNIX/POSIX I/O); see the MPI-IO sketch after this list
  • A physical node may perform more than one role
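
From the compute nodes, a shared file can be read through MPI-IO with each task pulling its own slice. Below is a minimal sketch, assuming a hypothetical file path and a fixed 1 MiB block per rank; it illustrates the MPI-IO route listed above, not a PVFS-specific interface.

    /* Each MPI rank reads a disjoint, rank-sized block of a shared file. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset block = 1 << 20;        /* 1 MiB per rank (illustrative) */
        char *buf = malloc(block);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/data/input.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, rank * block, buf, (int)block, MPI_BYTE,
                         MPI_STATUS_IGNORE);     /* independent read at this rank's offset */
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }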

PVFS I/O Nodes

  • By spreading data across multiple I/O Nodes, applications have multiple paths to data through the network and through the multiple disks on which data is stored.
  • Significantly reduces potential bottlenecks in the I/O path.
  • Significantly increases the total potential bandwidth for multiple clients.

PVFS Software Components

  • There are four major components:
    • Metadata server (mgr)
    • I/O server (iod)
    • API (native PVFS, MPI-IO, POSIX)
    • PVFS Linux kernel support
  • The first two components are daemons which run on nodes in the cluster

Metadata Server

  • Manages metadata for PVFS files
  • Version 1: a single manager daemon is responsible for storing and accessing all metadata
  • Current: Manager servers are distributed across I/O nodes
  • Metadata (sketched after this list):
    • Permissions
    • Ownership
    • Physical distribution on I/O nodes
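
A hypothetical sketch of what a per-file metadata record holds, combining the fields listed above with the distribution parameters described on the next slide; the struct and field names are illustrative only, not PVFS's actual format.

    #include <sys/types.h>
    #include <stdint.h>

    struct pvfs_file_meta {
        mode_t   mode;         /* permissions                                  */
        uid_t    uid;          /* owning user                                  */
        gid_t    gid;          /* owning group                                 */
        uint32_t base_node;    /* first I/O node holding the file's data       */
        uint32_t node_count;   /* number of I/O nodes the file is striped over */
        uint64_t stripe_size;  /* bytes written to one node before moving on   */
    };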

File distribution

  • Data in a file is striped across a set of I/O nodes in order to facilitate parallel access
  • The specifics of this striping (distribution) process can be described with three metadata parameters
    • Base I/O node number
    • Number of I/O nodes
    • Stripe size
  • These parameters, together with an ordering of the I/O nodes for the file system, allow the file distribution to be completely specified (see the sketch below)
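
To make these parameters concrete, here is a small sketch that maps a logical byte offset in a file to the I/O node and node-local offset holding it, assuming simple round-robin striping; it illustrates the scheme above rather than PVFS's actual code.

    #include <stdint.h>
    #include <stdio.h>

    struct stripe_params {
        uint32_t base_node;    /* first I/O node used by the file         */
        uint32_t node_count;   /* number of I/O nodes the file spans      */
        uint64_t stripe_size;  /* bytes per stripe unit                   */
        uint32_t total_nodes;  /* I/O nodes in the file system's ordering */
    };

    /* Map a logical file offset to (I/O node, offset within that node's part). */
    static void locate(const struct stripe_params *p, uint64_t offset,
                       uint32_t *node, uint64_t *local_off)
    {
        uint64_t stripe_idx = offset / p->stripe_size;   /* which stripe unit */
        uint64_t in_stripe  = offset % p->stripe_size;   /* offset inside it  */
        *node = (p->base_node + stripe_idx % p->node_count) % p->total_nodes;
        *local_off = (stripe_idx / p->node_count) * p->stripe_size + in_stripe;
    }

    int main(void)
    {
        struct stripe_params p = { .base_node = 2, .node_count = 4,
                                   .stripe_size = 65536, .total_nodes = 8 };
        uint32_t node;
        uint64_t local;
        locate(&p, 300000, &node, &local);               /* byte 300000 of the file */
        printf("node %u, local offset %llu\n", node, (unsigned long long)local);
        return 0;
    }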

The I/O Server (IOD)

  • Handles storing and retrieving the file data kept on the local disks connected to the node
  • When application processes access a PVFS file, the PVFS manager informs them of the locations of the I/O daemons
  • The processes then establish connections with the I/O daemons directly

PVFS native API (libpvfs)

  • Provides users with a means to transparently access PVFS servers
  • Handles the scatter/gather operations necessary to move data between user buffers and PVFS servers (see the sketch after this list)
  • Handles communications related to metadata
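
To illustrate the scatter step mentioned above, here is a sketch of how one logical read of count bytes at a given offset could be split into per-I/O-node chunks under round-robin striping (base I/O node assumed to be 0 for brevity); the real library also issues the network requests and gathers the replies into the user's buffer, so this is an illustration rather than libpvfs's actual code.

    #include <stdint.h>
    #include <stdio.h>

    struct chunk {
        uint32_t node;        /* I/O node that holds this piece       */
        uint64_t local_off;   /* offset within that node's local file */
        uint64_t len;         /* number of bytes in this piece        */
    };

    /* Split one logical request into per-node chunks under round-robin
     * striping over node_count nodes with stripe bytes per unit. */
    static int split_request(uint64_t offset, uint64_t count,
                             uint32_t node_count, uint64_t stripe,
                             struct chunk *out, int max_chunks)
    {
        int n = 0;
        while (count > 0 && n < max_chunks) {
            uint64_t stripe_idx = offset / stripe;
            uint64_t in_stripe  = offset % stripe;
            uint64_t len = stripe - in_stripe;       /* rest of this stripe unit */
            if (len > count)
                len = count;
            out[n].node      = (uint32_t)(stripe_idx % node_count);
            out[n].local_off = (stripe_idx / node_count) * stripe + in_stripe;
            out[n].len       = len;
            n++;
            offset += len;
            count  -= len;
        }
        return n;                                    /* number of chunks produced */
    }

    int main(void)
    {
        struct chunk chunks[16];
        int n = split_request(100000, 200000, 4, 65536, chunks, 16);
        for (int i = 0; i < n; i++)
            printf("node %u: local offset %llu, %llu bytes\n", chunks[i].node,
                   (unsigned long long)chunks[i].local_off,
                   (unsigned long long)chunks[i].len);
        return 0;
    }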

PVFS Linux Kernel Support

  • Provides the functionality necessary to mount PVFS file systems on Linux nodes
  • Enables existing programs to access PVFS without any modification

Conclusion

  • Pros:
    • Better performance than NFS
    • Better scalability (both performance and storage size)
    • Ready and easy to use
    • Optimized for large amounts of reading and writing on a small number of files
  • Cons:
    • Multiple points of failure
    • Not good for interactive work

Hands-on

  • Open a Linux terminal
  • SSH to 130.127.132.226 (student/goram)
    • ssh student@130.127.132.226
  • Run the following commands
    $ cd /scratch
    $ ls -l /scratch/ml-20m
    $ pvfs2-stat ml-20m/ratings.csv
    $ pvfs2-viewdist -f ml-20m/ratings.csv

References

Ligon III, Walter B., and Robert B. Ross. "An overview of the parallel virtual file system." In Proceedings of the 1999 Extreme Linux Workshop. 1999.