Introduction to PVFS, a Parallel File System for Linux

Linh B. Ngo

Why not NFS?

The issue is with NFS's centralized storage model:

  • Scalability: The increasing number of computing nodes will overwhelm the performance capacity of the NFS server.
  • Availability: If the NFS server goes down, all processing nodes will have to wait. This issue becomes more severe as the number of compute nodes increases.

(One of) The Solution(s): Parallel Virtual File System

  • Research and development began in 1993
  • Funded through a NASA grant to study the I/O patterns of parallel programs
    • This was also about the time when NASA funded the seminal work of Dr. Thomas Sterling on how to develop an inexpensive supercomputer.
  • First version released in 1998/1999
  • Second version released in 2005
  • Omnibond, a Clemson spin-off company, was formed to sell support and services (all versions of PVFS are open source)
  • The name changed to OrangeFS in 2011
    • OrangeFS (the main branch) is supported by the Omnibond/Clemson team
    • The Blue branch is supported by Argonne National Laboratory for an IBM Blue Gene supercomputer

Timeliness

  • Thomas Sterling's seminal paper on the design of a Beowulf cluster
    • Large cluster consisting of inexpensive COTS (commodity off-the-shelf) machines
  • Open source Linux operating system
  • Standardization of MPI as a tool for large-scale programming
  • An open source parallel file system was the only thing that was missing

Goals

  • High performance
    • High concurrent read/write bandwidth from multiple processes or threads to a common file
    • Multiple APIs: native PVFS, UNIX/POSIX, and MPI-IO
    • Common UNIX shell commands must work
    • Applications developed for UNIX must be able to access PVFS files without recompiling (see the sketch after this list)
  • Robust and scalable
  • Easy to install and use
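
One of these goals is that a program written against ordinary UNIX I/O keeps working once a PVFS volume is mounted (the kernel support described later provides the mount). Below is a minimal sketch, assuming a hypothetical mount point /mnt/pvfs and file name; the calls themselves are plain POSIX with nothing PVFS-specific.

    /* An unmodified UNIX program reading a file that happens to live on a
     * mounted PVFS volume. The path is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        int fd = open("/mnt/pvfs/data/input.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);   /* copy the file to stdout */
        close(fd);
        return 0;
    }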

Parallel Virtual File System (PVFS)

  • Parallel: Data are physically stored on multiple independent machines with separate network connections
  • Virtual: There exists a set of virtual interfaces (maintained through daemon processes) between the physical storage and a user-space abstraction.
  • File System: Through this abstraction, users can store and retrieve data using common file access methods applicable to traditional file systems

Parallel Virtual File System (PVFS)

  • Unlike the centralized model of NFS, PVFS has N servers making portions of a file available to the tasks of a parallel application running on multiple processing nodes over a network
  • The aggregate bandwidth exceeds that of a single machine
  • This works in a manner similar to how a RAID0 disk array works

System Architecture

  • I/O Nodes: store the actual file data and are connected to the disks holding the physical bytes. Each file is striped across the disks on the I/O nodes
  • Management Node: handles metadata operations
  • Compute Nodes: where parallel applications are executed. These applications interact with PVFS via APIs (native PVFS, MPI-IO, or UNIX/POSIX I/O); see the MPI-IO sketch after this list
  • A physical node may perform more than one role
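
From the compute nodes, a shared file can be read through MPI-IO with each task pulling its own slice. Below is a minimal sketch, assuming a hypothetical file path and a fixed 1 MiB block per rank; it illustrates the MPI-IO route listed above, not a PVFS-specific interface.

    /* Each MPI rank reads a disjoint, rank-sized block of a shared file. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset block = 1 << 20;        /* 1 MiB per rank (illustrative) */
        char *buf = malloc(block);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/data/input.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, rank * block, buf, (int)block, MPI_BYTE,
                         MPI_STATUS_IGNORE);     /* independent read at this rank's offset */
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }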

PVFS I/O Nodes

  • By spreading data across multiple I/O Nodes, applications have multiple paths to data through the network and through the multiple disks on which data is stored.
  • Significantly reduces potential bottlenecks in the I/O path.
  • Significantly increases the total potential bandwidth for multiple clients.

PVFS Software Components

  • There are four major components:
    • Metadata server (mgr)
    • I/O server (iod)
    • API (native PVFS, MPI-IO, POSIX)
    • PVFS Linux kernel support
  • The first two components are daemons which run on nodes in the cluster

Metadata Server

  • Manages metadata for PVFS files
  • Version 1: a single manager daemon is responsible for storing and accessing all metadata
  • Current: Manager servers are distributed across I/O nodes
  • Metadata (sketched after this list):
    • Permissions
    • Ownership
    • Physical distribution on I/O nodes
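
A hypothetical sketch of what a per-file metadata record holds, combining the fields listed above with the distribution parameters described on the next slide; the struct and field names are illustrative only, not PVFS's actual format.

    #include <sys/types.h>
    #include <stdint.h>

    struct pvfs_file_meta {
        mode_t   mode;         /* permissions                                  */
        uid_t    uid;          /* owning user                                  */
        gid_t    gid;          /* owning group                                 */
        uint32_t base_node;    /* first I/O node holding the file's data       */
        uint32_t node_count;   /* number of I/O nodes the file is striped over */
        uint64_t stripe_size;  /* bytes written to one node before moving on   */
    };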

File distribution

  • Data in a file is striped across a set of I/O nodes in order to facilitate parallel access
  • The specifics of this striping (distribution) process can be described with three metadata parameters
    • Base I/O node number
    • Number of I/O nodes
    • Stripe size
  • These parameters, together with an ordering of the I/O nodes for the file system, allow the file distribution to be completely specified (see the sketch below)
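
To make these parameters concrete, here is a small sketch that maps a logical byte offset in a file to the I/O node and node-local offset holding it, assuming simple round-robin striping; it illustrates the scheme above rather than PVFS's actual code.

    #include <stdint.h>
    #include <stdio.h>

    struct stripe_params {
        uint32_t base_node;    /* first I/O node used by the file         */
        uint32_t node_count;   /* number of I/O nodes the file spans      */
        uint64_t stripe_size;  /* bytes per stripe unit                   */
        uint32_t total_nodes;  /* I/O nodes in the file system's ordering */
    };

    /* Map a logical file offset to (I/O node, offset within that node's part). */
    static void locate(const struct stripe_params *p, uint64_t offset,
                       uint32_t *node, uint64_t *local_off)
    {
        uint64_t stripe_idx = offset / p->stripe_size;   /* which stripe unit */
        uint64_t in_stripe  = offset % p->stripe_size;   /* offset inside it  */
        *node = (p->base_node + stripe_idx % p->node_count) % p->total_nodes;
        *local_off = (stripe_idx / p->node_count) * p->stripe_size + in_stripe;
    }

    int main(void)
    {
        struct stripe_params p = { .base_node = 2, .node_count = 4,
                                   .stripe_size = 65536, .total_nodes = 8 };
        uint32_t node;
        uint64_t local;
        locate(&p, 300000, &node, &local);               /* byte 300000 of the file */
        printf("node %u, local offset %llu\n", node, (unsigned long long)local);
        return 0;
    }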

The I/O Server (IOD)

  • Handles storing and retrieving the file data kept on the local disks connected to the node
  • When application processes access a PVFS file, the PVFS manager informs them of the locations of the I/O daemons
  • The processes then establish connections with the I/O daemons directly

PVFS native API (libpvfs)

  • Provides users with a means to transparently access PVFS servers
  • Handles the scatter/gather operations necessary to move data between user buffers and PVFS servers (see the sketch after this list)
  • Handles communications related to metadata
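
To illustrate the scatter step mentioned above, here is a sketch of how one logical read of count bytes at a given offset could be split into per-I/O-node chunks under round-robin striping (base I/O node assumed to be 0 for brevity); the real library also issues the network requests and gathers the replies into the user's buffer, so this is an illustration rather than libpvfs's actual code.

    #include <stdint.h>
    #include <stdio.h>

    struct chunk {
        uint32_t node;        /* I/O node that holds this piece       */
        uint64_t local_off;   /* offset within that node's local file */
        uint64_t len;         /* number of bytes in this piece        */
    };

    /* Split one logical request into per-node chunks under round-robin
     * striping over node_count nodes with stripe bytes per unit. */
    static int split_request(uint64_t offset, uint64_t count,
                             uint32_t node_count, uint64_t stripe,
                             struct chunk *out, int max_chunks)
    {
        int n = 0;
        while (count > 0 && n < max_chunks) {
            uint64_t stripe_idx = offset / stripe;
            uint64_t in_stripe  = offset % stripe;
            uint64_t len = stripe - in_stripe;       /* rest of this stripe unit */
            if (len > count)
                len = count;
            out[n].node      = (uint32_t)(stripe_idx % node_count);
            out[n].local_off = (stripe_idx / node_count) * stripe + in_stripe;
            out[n].len       = len;
            n++;
            offset += len;
            count  -= len;
        }
        return n;                                    /* number of chunks produced */
    }

    int main(void)
    {
        struct chunk chunks[16];
        int n = split_request(100000, 200000, 4, 65536, chunks, 16);
        for (int i = 0; i < n; i++)
            printf("node %u: local offset %llu, %llu bytes\n", chunks[i].node,
                   (unsigned long long)chunks[i].local_off,
                   (unsigned long long)chunks[i].len);
        return 0;
    }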

PVFS Linux Kernel Support

  • Provides the functionality necessary to mount PVFS file systems on Linux nodes
  • Enables existing programs to access PVFS without any modification

Conclusion

  • Pros:
    • Better performance than NFS
    • Better scalability (both performance and storage size)
    • Ready and easy to use
    • Optimized for large amounts of reading and writing on a small number of files
  • Cons:
    • Multiple points of failure
    • Not good for interactive work

Hands-on

  • Open a Linux terminal
  • SSH to 130.127.132.226 (student/goram)
    • ssh student@130.127.132.226
  • Run the following commands
    $ cd /scratch
    $ ls -l /scratch/ml-20m
    $ pvfs2-stat ml-20m/ratings.csv
    $ pvfs2-viewdist -f ml-20m/ratings.csv

References

Ligon III, Walter B., and Robert B. Ross. "An overview of the parallel virtual file system." In Proceedings of the 1999 Extreme Linux Workshop. 1999.