NumPy Serialisation and I/O

In this notebook we focus on NumPy's built-in support for serialisation and I/O. In other words, we will learn how to save and load NumPy ndarray objects in the native (binary) format for easy sharing. We will also see how NumPy can load data from external files.


In [1]:
import numpy as np

Comma-separated values (CSV)

A very common format for data files is comma-separated values (CSV), or a related format such as TSV (tab-separated values).

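To read data from such files into NumPy arrays we can use the numpy.genfromtxt function. The file used below is whitespace-delimited, which genfromtxt handles by default; for CSV or TSV files, pass the delimiter argument.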


In [2]:
# In Jupyter, lines starting with ! are run as shell commands
!head stockholm_td_adj.dat


Year Month Day T_6 T12 T18 Valid 
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1

In [3]:
np.genfromtxt?

In [4]:
st_temperatures = np.genfromtxt('stockholm_td_adj.dat', 
                                skip_header=1)

In [5]:
st_temperatures.shape


Out[5]:
(77431, 7)
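
As a small self-contained illustration of the same function on actual comma-separated input, here is a minimal sketch. The CSV text is invented for the example and read from an in-memory buffer instead of a file; the missing last field is there only to show how genfromtxt fills gaps in float columns with nan.

```python
import io
import numpy as np

# Hypothetical CSV content, invented for this example
csv_text = """year,month,day,temp
1800,1,1,-6.1
1800,1,2,-15.4
1800,1,3,
"""

# delimiter="," switches from whitespace splitting to comma splitting;
# skip_header=1 drops the line with the column names
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)            # three rows, four columns
print(np.isnan(data[2, 3]))  # the empty field was filled with nan
```

Passing names=True instead of skip_header=1 would use the header line as field names and return a structured array.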

DIY

Let's play a bit with the loaded st_temperatures data, combining fancy indexing with boolean masks (i.e. defining conditions to select a subset of the data) and some very simple statistics.

For example:


In [6]:
st_temperatures[:10, ]


Out[6]:
array([[ 1.80e+03,  1.00e+00,  1.00e+00, -6.10e+00, -6.10e+00, -6.10e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  2.00e+00, -1.54e+01, -1.54e+01, -1.54e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  3.00e+00, -1.50e+01, -1.50e+01, -1.50e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  4.00e+00, -1.93e+01, -1.93e+01, -1.93e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  5.00e+00, -1.68e+01, -1.68e+01, -1.68e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  6.00e+00, -1.14e+01, -1.14e+01, -1.14e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  7.00e+00, -7.60e+00, -7.60e+00, -7.60e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  8.00e+00, -7.10e+00, -7.10e+00, -7.10e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  9.00e+00, -1.01e+01, -1.01e+01, -1.01e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  1.00e+01, -9.50e+00, -9.50e+00, -9.50e+00,
         1.00e+00]])

In [7]:
st_temperatures.dtype


Out[7]:
dtype('float64')

In [8]:
## Calculate which years, and how many, we have in our data
years = np.unique(st_temperatures[:, 0]).astype(int)
years, len(years)


Out[8]:
(array([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810,
        1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821,
        1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832,
        1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843,
        1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854,
        1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865,
        1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876,
        1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887,
        1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898,
        1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909,
        1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920,
        1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931,
        1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942,
        1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953,
        1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964,
        1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975,
        1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,
        1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997,
        1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
        2009, 2010, 2011]), 212)

In [10]:
years.min(), years.max()


Out[10]:
(1800, 2011)

In [11]:
!head stockholm_td_adj.dat


Year Month Day T_6 T12 T18 Valid 
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1

In [12]:
mask_year = st_temperatures[:, 0] == 1984

In [24]:
mask_feb = st_temperatures[:, 1] == 2

In [25]:
mask_feb.shape


Out[25]:
(77431,)

In [26]:
mask_year.dtype


Out[26]:
dtype('bool')

In [27]:
type(mask_year)


Out[27]:
numpy.ndarray

In [28]:
## Calculate the mean midday (T12) temperature for February 1984
feb_noon_temps = st_temperatures[(mask_year & mask_feb), 4]

In [29]:
type(feb_noon_temps)


Out[29]:
numpy.ndarray

In [30]:
feb_noon_temps.dtype


Out[30]:
dtype('float64')

In [31]:
feb_noon_temps.mean()


Out[31]:
-1.7344827586206901
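
The mask-and-select pattern used above can be checked by hand on a tiny table. The sketch below uses invented values laid out in the same column order as the data file (year, month, day, T_6, T12, T18, valid); the names toy, is_1984 and is_feb are made up for the illustration.

```python
import numpy as np

# Toy rows: year, month, day, T_6, T12, T18, valid (values invented)
toy = np.array([
    [1984, 1, 31, -3.0, -1.0, -2.0, 1],
    [1984, 2,  1, -4.0, -2.0, -3.0, 1],
    [1984, 2,  2, -6.0, -4.0, -5.0, 1],
    [1985, 2,  1, -8.0, -6.0, -7.0, 1],
])

is_1984 = toy[:, 0] == 1984   # boolean array, one entry per row
is_feb = toy[:, 1] == 2

# & combines the masks element-wise; Python's `and` would raise here
selected = toy[is_1984 & is_feb, 4]

print(selected)         # the two T12 values for February 1984
print(selected.mean())  # -3.0
```

When the comparisons are written inline, each one needs its own parentheses, for example toy[(toy[:, 0] == 1984) & (toy[:, 1] == 2), 4], because & binds more tightly than ==.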

In [21]:
## ....

NumPy's native file format

  • Useful when storing and reading back NumPy array data.

  • Use the functions np.save and np.load:

np.save


In [22]:
np.save("st_temperatures.npy", st_temperatures)

See also:

  • np.savez : save several NumPy arrays into one single file
  • np.savez_compressed
  • np.savetxt
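
The functions listed above can be sketched as follows. This is a minimal round trip under assumed names: the arrays x and y and the file names are invented, and everything is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile
import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # Several arrays in one compressed .npz archive; keyword names become keys
    npz_path = os.path.join(tmp, "arrays.npz")
    np.savez_compressed(npz_path, x=x, y=y)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        x_back = archive["x"]
        y_back = archive["y"]

    # Plain-text output: human readable, but dtype and shape are not stored
    txt_path = os.path.join(tmp, "x.txt")
    np.savetxt(txt_path, x, fmt="%d", delimiter=",")
    x_text = np.loadtxt(txt_path, delimiter=",", dtype=int)

print(names)
print(np.array_equal(x, x_back), np.array_equal(x, x_text))
```

np.savez has the same interface as np.savez_compressed but stores the arrays uncompressed.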

np.load


In [23]:
T = np.load("st_temperatures.npy")
print(T.shape, T.dtype)


(77431, 7) float64

NumPy for MATLAB Users (really?)

If you are a MATLAB® user, I recommend reading NumPy for MATLAB Users.

NumPy arrays can also be saved to and loaded from native MATLAB® files through scipy.io, as shown later in this notebook.


The Matrix Array Type

In addition to the numpy.ndarray type, NumPy also supports a specialised, always two-dimensional array type called numpy.matrix.

This type was introduced for API and programming compatibility with MATLAB®. The NumPy documentation no longer recommends it for new code and advises using regular arrays with the @ operator instead.

Note: The most relevant feature of this array type is that the standard arithmetic operators +, -, * follow matrix algebra, as they would in MATLAB®.


In [2]:
from numpy import matrix

In [3]:
a = np.arange(0, 5)
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])

In [4]:
a


Out[4]:
array([0, 1, 2, 3, 4])

In [5]:
A


Out[5]:
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [6]:
M = matrix(A)
v = matrix(a).T # make it a column vector

In [7]:
a


Out[7]:
array([0, 1, 2, 3, 4])

In [8]:
M * M


Out[8]:
matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [9]:
A @ A  # @ operator equivalent to np.dot(A, A)


Out[9]:
array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [10]:
# Element-wise multiplication with plain ndarrays
A * A


Out[10]:
array([[   0,    1,    4,    9,   16],
       [ 100,  121,  144,  169,  196],
       [ 400,  441,  484,  529,  576],
       [ 900,  961, 1024, 1089, 1156],
       [1600, 1681, 1764, 1849, 1936]])

In [11]:
M * v


Out[11]:
matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])

In [12]:
A * a


Out[12]:
array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])

In [13]:
# inner product
v.T * v


Out[13]:
matrix([[30]])

In [14]:
# with matrix objects, standard matrix algebra applies
v + M*v


Out[14]:
matrix([[ 30],
        [131],
        [232],
        [333],
        [434]])
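
For comparison, the same results can be obtained with plain ndarray objects and the @ operator. The sketch below rebuilds a and A as defined earlier; the names col, Av, inner and total are introduced only for this illustration.

```python
import numpy as np

a = np.arange(0, 5)
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

col = a.reshape(-1, 1)   # explicit column vector, shape (5, 1)

Av = A @ col             # matrix-vector product, same values as M * v
inner = a @ a            # inner product of two 1-D arrays, a plain scalar
total = col + A @ col    # same values as v + M*v

print(Av.ravel())
print(inner)
print(total.ravel())
```

The one visible difference is the inner product: with 1-D arrays it is a scalar, whereas v.T * v on matrix objects returns a 1x1 matrix.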

If we try to add, subtract or multiply objects with incompatible shapes we get an error:


In [15]:
v_incompat = matrix(list(range(1, 7))).T

In [16]:
M.shape, v_incompat.shape


Out[16]:
((5, 5), (6, 1))

In [17]:
M * v_incompat


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-bd00a30033f6> in <module>
----> 1 M * v_incompat

~/anaconda3/envs/numpy-euroscipy/lib/python3.7/site-packages/numpy/matrixlib/defmatrix.py in __mul__(self, other)
    218         if isinstance(other, (N.ndarray, list, tuple)) :
    219             # This promotes 1-D vectors to row vectors
--> 220             return N.dot(self, asmatrix(other))
    221         if isscalar(other) or not hasattr(other, '__rmul__') :
    222             return N.dot(self, other)

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)
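
Plain arrays raise the same exception type when the inner dimensions do not match. A minimal sketch of guarding against it follows; can_multiply is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

M = np.ones((5, 5))
v_bad = np.ones((6, 1))


def can_multiply(left, right):
    """Return True when two 2-D arrays have matching inner dimensions."""
    return left.shape[1] == right.shape[0]


try:
    M @ v_bad
    failed = False
except ValueError as err:
    # The message names the mismatching dimensions
    failed = True
    print("shape mismatch:", err)

print(can_multiply(M, v_bad))
print(can_multiply(M, np.ones((5, 1))))
```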

See also the related functions: inner, outer, cross, kron, tensordot.

Try for example help(np.inner).
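
A quick tour of these functions on two small vectors may help. The vectors u and w are chosen arbitrarily for the example.

```python
import numpy as np

u = np.array([1, 2, 3])
w = np.array([4, 5, 6])

inner_uw = np.inner(u, w)              # sum of element-wise products
outer_uw = np.outer(u, w)              # 3x3 table of all pairwise products
cross_uw = np.cross(u, w)              # vector perpendicular to u and w
kron_blocks = np.kron(np.eye(2, dtype=int), np.array([[1, 2]]))
tdot_uw = np.tensordot(u, w, axes=1)   # contracts one axis, same as inner here

print(inner_uw)     # 32
print(outer_uw.shape)
print(cross_uw)     # [-3  6 -3]
print(kron_blocks)  # [[1 2 0 0], [0 0 1 2]]
print(tdot_uw)      # 32
```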


Loading and Saving .mat files

Let's create a numpy.ndarray object:


In [21]:
A = np.random.rand(10000, 300, 50)  # note: 150 million float64 values (about 1.2 GB); this may take a while

In [22]:
A


Out[22]:
array([[[0.30788845, 0.60569692, 0.74159203, ..., 0.99513856,
         0.86615676, 0.65581839],
        [0.29972906, 0.1727805 , 0.73877596, ..., 0.57321798,
         0.52657155, 0.15148499],
        [0.91677054, 0.30289045, 0.47086303, ..., 0.91076997,
         0.15659756, 0.74502433],
        ...,
        [0.16246413, 0.57601666, 0.64519549, ..., 0.04166688,
         0.71115738, 0.75984878],
        [0.99626814, 0.89529207, 0.89520696, ..., 0.927474  ,
         0.46998733, 0.809978  ],
        [0.52545775, 0.42922203, 0.40999633, ..., 0.7497839 ,
         0.26582518, 0.68821719]],

       [[0.93763072, 0.68660253, 0.03060252, ..., 0.08489496,
         0.3368953 , 0.0040575 ],
        [0.17680589, 0.44922269, 0.32552186, ..., 0.49081397,
         0.7718607 , 0.91216332],
        [0.48935017, 0.28293444, 0.57762148, ..., 0.64988995,
         0.96036063, 0.62395338],
        ...,
        [0.77554755, 0.23174591, 0.80126054, ..., 0.34982511,
         0.13648038, 0.63953428],
        [0.4502637 , 0.74376194, 0.47531237, ..., 0.94077276,
         0.64544446, 0.20241967],
        [0.65158873, 0.93520847, 0.1153165 , ..., 0.92607143,
         0.42194542, 0.49231582]],

       [[0.60652634, 0.55707594, 0.7861307 , ..., 0.49618863,
         0.26073645, 0.57230289],
        [0.33445447, 0.51254754, 0.89760192, ..., 0.20161607,
         0.54935607, 0.97355349],
        [0.82742407, 0.13811956, 0.77549593, ..., 0.97417726,
         0.75828111, 0.20726388],
        ...,
        [0.89885131, 0.95168761, 0.04908857, ..., 0.26560786,
         0.19828306, 0.34056713],
        [0.37462286, 0.00294645, 0.46417234, ..., 0.98287275,
         0.63560479, 0.37498829],
        [0.80824186, 0.77414402, 0.27137252, ..., 0.97397635,
         0.73792667, 0.47235421]],

       ...,

       [[0.79534194, 0.19495982, 0.69419483, ..., 0.98484659,
         0.07524489, 0.35898295],
        [0.75246125, 0.1448565 , 0.31596133, ..., 0.97989236,
         0.66466035, 0.09253075],
        [0.13218267, 0.24674062, 0.93687433, ..., 0.26530807,
         0.64653497, 0.25848279],
        ...,
        [0.01839164, 0.4127106 , 0.36428583, ..., 0.97212349,
         0.867556  , 0.58971199],
        [0.49075206, 0.80264193, 0.82420669, ..., 0.13249282,
         0.70465219, 0.97575252],
        [0.2735621 , 0.37780973, 0.19581884, ..., 0.55415141,
         0.33630774, 0.62376131]],

       [[0.95740591, 0.6409855 , 0.29668168, ..., 0.85582114,
         0.02653775, 0.07433918],
        [0.97968508, 0.7192658 , 0.96627464, ..., 0.25708965,
         0.60037787, 0.8001345 ],
        [0.98598865, 0.7660025 , 0.05743886, ..., 0.84864957,
         0.5717346 , 0.48107095],
        ...,
        [0.04048004, 0.24279597, 0.43556563, ..., 0.74962769,
         0.71872639, 0.08429666],
        [0.09697323, 0.51034331, 0.6199531 , ..., 0.95157892,
         0.52082535, 0.36331146],
        [0.91967882, 0.47842183, 0.55403126, ..., 0.99053768,
         0.68606411, 0.4186365 ]],

       [[0.83101977, 0.7800826 , 0.52552153, ..., 0.45411436,
         0.96688267, 0.14787061],
        [0.76365986, 0.97841123, 0.99583821, ..., 0.96043423,
         0.72406206, 0.97100977],
        [0.92772653, 0.01373546, 0.59448744, ..., 0.64587074,
         0.13641851, 0.40625453],
        ...,
        [0.24169963, 0.22511255, 0.85599095, ..., 0.75448232,
         0.42633244, 0.31373371],
        [0.28480721, 0.83815003, 0.77828307, ..., 0.52597019,
         0.88834579, 0.09847287],
        [0.32613764, 0.67313394, 0.82862416, ..., 0.87137257,
         0.13503096, 0.0888404 ]]])

Introducing SciPy (ecosystem)

scipy.io


In [20]:
from scipy import io as spio

NumPy $\mapsto$ MATLAB : scipy.io.savemat


In [23]:
spio.savemat('numpy_to.mat', {'A': A}, oned_as='row')  # savemat expects a dictionary

MATLAB $\mapsto$ NumPy: scipy.io.loadmat


In [24]:
data_dictionary = spio.loadmat('numpy_to.mat')

In [25]:
list(data_dictionary.keys())


Out[25]:
['__header__', '__version__', '__globals__', 'A']

In [26]:
data_dictionary['A']


Out[26]:
array([[[0.30788845, 0.60569692, 0.74159203, ..., 0.99513856,
         0.86615676, 0.65581839],
        [0.29972906, 0.1727805 , 0.73877596, ..., 0.57321798,
         0.52657155, 0.15148499],
        [0.91677054, 0.30289045, 0.47086303, ..., 0.91076997,
         0.15659756, 0.74502433],
        ...,
        [0.16246413, 0.57601666, 0.64519549, ..., 0.04166688,
         0.71115738, 0.75984878],
        [0.99626814, 0.89529207, 0.89520696, ..., 0.927474  ,
         0.46998733, 0.809978  ],
        [0.52545775, 0.42922203, 0.40999633, ..., 0.7497839 ,
         0.26582518, 0.68821719]],

       [[0.93763072, 0.68660253, 0.03060252, ..., 0.08489496,
         0.3368953 , 0.0040575 ],
        [0.17680589, 0.44922269, 0.32552186, ..., 0.49081397,
         0.7718607 , 0.91216332],
        [0.48935017, 0.28293444, 0.57762148, ..., 0.64988995,
         0.96036063, 0.62395338],
        ...,
        [0.77554755, 0.23174591, 0.80126054, ..., 0.34982511,
         0.13648038, 0.63953428],
        [0.4502637 , 0.74376194, 0.47531237, ..., 0.94077276,
         0.64544446, 0.20241967],
        [0.65158873, 0.93520847, 0.1153165 , ..., 0.92607143,
         0.42194542, 0.49231582]],

       [[0.60652634, 0.55707594, 0.7861307 , ..., 0.49618863,
         0.26073645, 0.57230289],
        [0.33445447, 0.51254754, 0.89760192, ..., 0.20161607,
         0.54935607, 0.97355349],
        [0.82742407, 0.13811956, 0.77549593, ..., 0.97417726,
         0.75828111, 0.20726388],
        ...,
        [0.89885131, 0.95168761, 0.04908857, ..., 0.26560786,
         0.19828306, 0.34056713],
        [0.37462286, 0.00294645, 0.46417234, ..., 0.98287275,
         0.63560479, 0.37498829],
        [0.80824186, 0.77414402, 0.27137252, ..., 0.97397635,
         0.73792667, 0.47235421]],

       ...,

       [[0.79534194, 0.19495982, 0.69419483, ..., 0.98484659,
         0.07524489, 0.35898295],
        [0.75246125, 0.1448565 , 0.31596133, ..., 0.97989236,
         0.66466035, 0.09253075],
        [0.13218267, 0.24674062, 0.93687433, ..., 0.26530807,
         0.64653497, 0.25848279],
        ...,
        [0.01839164, 0.4127106 , 0.36428583, ..., 0.97212349,
         0.867556  , 0.58971199],
        [0.49075206, 0.80264193, 0.82420669, ..., 0.13249282,
         0.70465219, 0.97575252],
        [0.2735621 , 0.37780973, 0.19581884, ..., 0.55415141,
         0.33630774, 0.62376131]],

       [[0.95740591, 0.6409855 , 0.29668168, ..., 0.85582114,
         0.02653775, 0.07433918],
        [0.97968508, 0.7192658 , 0.96627464, ..., 0.25708965,
         0.60037787, 0.8001345 ],
        [0.98598865, 0.7660025 , 0.05743886, ..., 0.84864957,
         0.5717346 , 0.48107095],
        ...,
        [0.04048004, 0.24279597, 0.43556563, ..., 0.74962769,
         0.71872639, 0.08429666],
        [0.09697323, 0.51034331, 0.6199531 , ..., 0.95157892,
         0.52082535, 0.36331146],
        [0.91967882, 0.47842183, 0.55403126, ..., 0.99053768,
         0.68606411, 0.4186365 ]],

       [[0.83101977, 0.7800826 , 0.52552153, ..., 0.45411436,
         0.96688267, 0.14787061],
        [0.76365986, 0.97841123, 0.99583821, ..., 0.96043423,
         0.72406206, 0.97100977],
        [0.92772653, 0.01373546, 0.59448744, ..., 0.64587074,
         0.13641851, 0.40625453],
        ...,
        [0.24169963, 0.22511255, 0.85599095, ..., 0.75448232,
         0.42633244, 0.31373371],
        [0.28480721, 0.83815003, 0.77828307, ..., 0.52597019,
         0.88834579, 0.09847287],
        [0.32613764, 0.67313394, 0.82862416, ..., 0.87137257,
         0.13503096, 0.0888404 ]]])

In [27]:
A_load = data_dictionary['A']

In [28]:
np.all(A == A_load)


Out[28]:
True

In [30]:
type(A_load)


Out[30]:
numpy.ndarray
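
The same round trip can be tried on a much smaller scale. The following is a minimal sketch, assuming SciPy is installed; the array names small and vec and the file name are invented, and the file goes to a temporary directory. It also shows what oned_as='row' does to a 1-D array.

```python
import os
import tempfile
import numpy as np
from scipy import io as spio

small = np.arange(12, dtype=float).reshape(3, 4)
vec = np.array([1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.mat")
    # Dictionary keys become the MATLAB variable names
    spio.savemat(path, {"small": small, "vec": vec}, oned_as="row")
    loaded = spio.loadmat(path)

# loadmat adds metadata entries whose names start with a double underscore
user_keys = sorted(k for k in loaded if not k.startswith("__"))

print(user_keys)
print(loaded["small"].shape)  # unchanged: (3, 4)
print(loaded["vec"].shape)    # 1-D input comes back as a (1, 3) row
```

MATLAB has no 1-D arrays, so vec is stored as a 1x3 row and stays two-dimensional when loaded back; use .ravel() to recover a 1-D array.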