Files and file systems

The following overview relies heavily on materials freely available on the web. Follow the provided links to specific pages for more information/background.

Files contain data

We've already discussed what 'data' is, and that storing the bits it consists of persistently requires a physical medium. Writing a string of bits to a medium is only one part of the picture: you also need to be able to find the data again when you need it!

This is where files and file systems come in.

What is required of a data container?

What sort of meta-information is required in order to locate and access data that has been stored onto a medium? The fundamental unit of data containerisation is the file, and these are some of its constituents:

File type
- is this a 'regular' file, e.g., an image, text document, application
- or perhaps a special one, such as a directory a.k.a. folder that may act as a container for other files?
Owner and group owner of the file.
- depending on OS, 'ownership' of data can be strongly (Unix) or weakly (Windows) enforced
Permissions on the file
- which user(s) may read, write or execute the file?
Date and time of creation, last read and change.
File size
An address defining the actual location of the file data.
- just like RAM has an address space, so do physical media, e.g. Head 7, Track 38, Sector 230, ...
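
Most of the items above can be inspected from Python. The following is a minimal sketch, assuming only the standard library; the helper name describe and the throw-away file example.txt are made up for illustration. Note that the physical address of the data is deliberately not exposed: the OS keeps that detail to itself.

```python
import os
import stat
import tempfile
import time


def file_type(mode):
    """Translate the mode bits reported by the OS into a readable file type."""
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def describe(path):
    """Return some of the metadata the file system keeps for `path`."""
    info = os.stat(path)
    return {
        "type": file_type(info.st_mode),
        "permissions": stat.filemode(info.st_mode),  # e.g. '-rw-r--r--'
        "owner_id": info.st_uid,   # numeric user id (always 0 on Windows)
        "group_id": info.st_gid,   # numeric group id (always 0 on Windows)
        "size_bytes": info.st_size,
        "last_changed": time.ctime(info.st_mtime),
        "last_read": time.ctime(info.st_atime),
    }


# Create a small throw-away file so that the example is self-contained
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")
    file_info = describe(path)
    folder_info = describe(folder)

for key, value in file_info.items():
    print(f"{key:>14}: {value}")
```

How much of this is meaningful depends on the OS: on Windows, for instance, the owner and group ids are reported as 0 and the permission string carries less information than on Unix-like systems.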

Reference

Types of files: Textual or binary?

In the next notebook we will learn how to determine the type of a file. Roughly speaking, 'regular' files can only be one of two types: textual or binary. The former is human-readable, the latter is not.

By 'textual' we mean any data that is stored in files using a character encoding suited for our brains. This could mean a novel (e.g., for language analysis), but more generally textual file formats are used as the container for both unstructured and structured data (the Twitter firehose is an example of the latter on a massive scale).

Comma-Separated Values (CSV)

One of the most used textual data formats is CSV. It is suitable for data that can be tabulated into rows and columns (think: Excel). More generally, CSV is a suitable container for data in a relational database. Amongst (many, many) organisations, the WHO shares health-related datasets in CSV format.

The name CSV refers to the default field separator character: the comma (,). You need to be aware of this, e.g., if using the comma as the decimal separator in floating point numbers (Danes beware). The field separator can most often be freely chosen to be another character, such as the semicolon or the tab character; files using tabs are often saved with the .tsv extension.
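
To see why the separator matters, here is a minimal sketch using Python's built-in csv module. The names and heights are invented for illustration, and the data is kept in memory rather than in a file.

```python
import csv
import io

# A decimal comma collides with a comma separator: "1,68" is split in two
naive = next(csv.reader(io.StringIO("Anna,1,68")))
print(naive)  # three fields instead of the intended two

# One way out: use another separator, here the semicolon
text = "name;height_m\nAnna;1,68\nBo;1,82\n"
rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))

# The decimal comma still has to be converted before the value is a number
heights = [float(row["height_m"].replace(",", ".")) for row in rows]

# Write the same table back out with tabs as separators (TSV)
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["name", "height_m"])
for row, height in zip(rows, heights):
    writer.writerow([row["name"], height])
tsv_text = buffer.getvalue()
print(tsv_text)
```

The same delimiter argument is what you would pass when reading a real file opened with open(..., newline="").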

Binary data formats

File types such as mp3, jpeg, zip, doc, mat, etc. are associated with a decoding key that enables reading the payload of the file correctly. Depending on the file format, the decoding information may be saved in the file itself, according to a documented order of how bytes in the file are to be interpreted.

Another common strategy for binary data storage is to save only measurement data in a binary file, whereas meta-information on the data structure is saved into a header file that is often textual (e.g., EEG header files contain information on the sampling rate, number of channels, etc.).
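
The header-plus-payload idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the header is written as JSON, the field names are invented, and the samples are stored as little-endian 32-bit floats. Real formats (EEG or otherwise) define their own layouts.

```python
import json
import struct

# Hypothetical recording: 2 channels, 3 time points, stored frame by frame
header_text = json.dumps(
    {"sampling_rate_hz": 250, "n_channels": 2, "sample_format": "<f"}
)
samples = [0.5, -1.0, 1.5, 2.0, -2.5, 3.0]

# The payload is nothing but numbers packed back to back: 6 x 4 bytes
payload = struct.pack("<6f", *samples)
print(len(payload), "bytes:", payload.hex(" "))

# Reading: the header tells us how to interpret the anonymous bytes
header = json.loads(header_text)
bytes_per_sample = struct.calcsize(header["sample_format"])
n_values = len(payload) // bytes_per_sample
values = struct.unpack(f"<{n_values}f", payload)

# Regroup the flat sequence into one tuple per time point
n_channels = header["n_channels"]
frames = [values[i:i + n_channels] for i in range(0, n_values, n_channels)]
print(frames)
```

Without the header there would be no way to tell whether the 24 bytes hold six floats, twelve 16-bit integers or 24 characters.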

Demo: raw bytes manipulations
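
As a stand-in for the live demo, the following sketch shows a few raw byte manipulations in plain Python. The example string and number are arbitrary choices.

```python
# Text becomes bytes through a character encoding
text = "Æble"
raw = text.encode("utf-8")
print(len(text), "characters,", len(raw), "bytes")  # 'Æ' needs two bytes
print(raw.hex(" "))

# The same integer can be laid out in two byte orders
number = 1000  # 0x03E8
little = number.to_bytes(2, "little")
big = number.to_bytes(2, "big")
print(little.hex(" "), "vs", big.hex(" "))

# Reading bytes back with the wrong byte order gives a different number
wrong = int.from_bytes(little, "big")
print(wrong)

# bytes objects are immutable; a bytearray can be edited in place
buf = bytearray(b"cat")
buf[0] ^= 0x20  # flip one bit: ASCII 'c' (0x63) becomes 'C' (0x43)
print(buf.decode("ascii"))
```

The point of the exercise: a file is only a sequence of bytes, and the same bytes mean different things depending on the rules used to interpret them.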

Files are organised into file systems

In computing, a file system or filesystem is used to control how data is stored and retrieved. Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into pieces and giving each piece a name, the information is easily isolated and identified. Taking its name from the way paper-based information systems are named, each group of data is called a "file". The structure and logic rules used to manage the groups of information and their names is called a "file system". (Wikipedia)

The OS takes care of interacting with the file system when the user issues a command to, e.g., read from or write to a file. The user can therefore remain agnostic as to the physical location of the data on the storage medium.

Types of file systems

Many different types of file systems exist, each with their use cases.

Disk file system

This is the most common type a user will be confronted with. As the name implies, disk file systems are used on the storage media physically connected to a computer, such as the internal hard drive or a USB thumb drive (or other removable medium). Examples of disk file systems include

NTFS (Windows)
HFS+ (Mac OS Extended)
ext4 (Linux)
FAT/exFAT
ISO 9660 (CD/DVD disk)

The common scenario in which a user will experience file system-related problems is when trying to use a USB thumb drive initialised on one OS (e.g. Windows) on another (e.g. Mac): unless the OS includes 'drivers' for a specific file system, it will not be able to read the data on it. In some cases, reading may be possible, but not writing.

Hint: Use the exFAT file system for USB drives: this makes it possible to read from and write to them on both Windows and Mac hosts. Recent Linux distributions can also use exFAT, although older ones may need an additional driver package.

Initialisation of a disk file system: partitioning and formatting

Consider an 'empty' 16 GB USB drive, in which all bits are zero (this is an extremely unlikely scenario, but the point is: it has 'nothing' on it). The first thing to do is to define partitions on the physical disk. Consider these the hard boundaries outside which files cannot be written. On personal computers, it is common to have a single large partition for the OS and all user files. However, several 'hidden' partitions typically exist too, related to the initial startup (boot) process of the OS, and the possibility to 'recover' from certain unwanted situations.

Once a physical disk is partitioned, it needs to be formatted: this is the point at which a file system is initialised and the partition can be referred to by a name (e.g. "C:" on Windows or "Macintosh HD" on Mac).

Network file system

A common data science scenario is that of a centralised server infrastructure on which data is stored. Although at the server level physical disks are partitioned and file systems initialised on them just as on our PCs, we the users cannot directly access the remote disks. Instead, the server makes some specific location (e.g. a directory) on its drives available on the local network. This is done using one of many access protocols: usually a user needs to authenticate herself by providing a valid username and password before access is granted.

From the point of view of access, network file systems have changed the way people use devices. Using one of the many (free and non-free) 'syncing' services, we can have access to a file regardless of which device we are currently using. Privacy and data-security issues are only beginning to be considered.

In the context of working on data, e.g. in health science, the large size of many datasets necessitates the use of high-capacity servers for performing analyses. Another reason for the need for remote data access and processing is related to data security: patient data must never be compromised by moving it out of secure server environments. Our data is therefore often non-local, which may have implications for how we can interact with it. For example, whereas data can be read into memory from a modern SSD at rates of hundreds of megabytes per second, transfer over a network may be two orders of magnitude slower.
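
To get a feeling for what two orders of magnitude means in practice, here is a back-of-the-envelope calculation. The dataset size and both transfer rates are assumed round numbers chosen for illustration, not measurements.

```python
def transfer_time_s(size_gb, rate_mb_per_s):
    """Seconds needed to move `size_gb` gigabytes at `rate_mb_per_s` MB/s."""
    return size_gb * 1000 / rate_mb_per_s


dataset_gb = 50        # assumed dataset size
local_rate = 500       # MB/s, assumed rate for a local SSD
network_rate = 5       # MB/s, assumed rate: two orders of magnitude slower

local = transfer_time_s(dataset_gb, local_rate)
remote = transfer_time_s(dataset_gb, network_rate)

print(f"local SSD: {local / 60:.1f} minutes")
print(f"network:   {remote / 3600:.1f} hours")
```

Under these assumptions, a read that takes less than two minutes locally takes close to three hours over the network, which is one reason analyses are often run on the server where the data lives.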

Exercises

If you didn't already...

Make a CSV-file

Open the text editor of your choice. Why is MS Word not a text editor? In case you haven't already, you should consider an alternative to the Win/Mac defaults. Google has no shortage of 'best of' selections in this category, take your pick. To highlight two common options, you may consider:

Atom (multi-OS, OSS; note that GitHub officially discontinued it in December 2022)
Notepad++ (Windows-only, free software)

Create a new file with the names, birth years and birth places of your parents, as well as any siblings or children.

use one line per person
separate the data fields using a comma
name the data columns on the first line of the file
save the file with the extension .csv

The file you created above should appear in the Files-tab of JupyterLab. Check that the format of the file is correct by double-clicking on it: a new view to it should open, showing you the contents in tabular form.

Partitions and file systems

Find out how your hard drive is partitioned and what file system the main system drive is formatted with.

Windows

Open Control Panel -> System and Security -> Administrative Tools, right-click on Computer Management and select Run as Administrator.
Under Storage, select Disk Management

Mac

The Disk Utility app will not show you hidden partitions (typical of macOS: if you know enough to ask this question, you can also handle the fact that no graphical utilities exist to view them!). Instead, you need to open the Utilities -> Terminal app, and into the window, type:

diskutil list

For more details on the disk(s) and partition(s), you can say e.g.:

diskutil info disk1

Linux

sudo fdisk -l

Yes, even the VM operates with the concept of partitions.
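
Regardless of OS, Python's standard library can report how large the file system holding a given path is, and how much of it is in use. This is only a rough sketch: it does not list partitions or name the file system type, for which the OS tools above are still needed.

```python
import os
import shutil

# The root of the current drive: '/' on Mac and Linux, e.g. 'C:\\' on Windows
root = os.path.abspath(os.sep)

usage = shutil.disk_usage(root)
gb = 1000 ** 3  # bytes per gigabyte

print(f"file system holding {root}")
print(f"  total: {usage.total / gb:8.1f} GB")
print(f"  used:  {usage.used / gb:8.1f} GB")
print(f"  free:  {usage.free / gb:8.1f} GB")
```

Compare the total reported here with the partition sizes listed by the OS tool: they should agree for the partition on which the path resides.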

Network file system: Dropbox

If you have the Dropbox desktop extension installed, you can place a file in a folder on your PC, after which that file will be accessible from other devices you have access to. Try to 'follow the bits' when you do this; what is physically going on?