Zarr is a new storage format whose simple, well-designed specification makes large datasets easily accessible to distributed computing. In a Zarr dataset, arrays are divided into chunks and compressed. The individual chunks can be stored as files on a filesystem or as objects in a cloud storage bucket, and the metadata are stored in lightweight .json files, so Zarr works well on both local filesystems and cloud-based object stores. Existing datasets can easily be converted to Zarr with xarray's to_zarr method.
In this example we show how to read ERA5 data in Zarr format from the S3 bucket shared by Planet OS. Once the data is read in, we demonstrate a few simple operations on it.
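To make the chunking idea concrete, here is a minimal sketch of how an array shape and a chunk shape determine the set of chunks that a store has to hold. The shape and chunk sizes are made up for illustration and are not the actual layout of the ERA5 stores; the helper names chunk_grid and chunk_keys are hypothetical, and the dot-separated keys follow the default style of the Zarr v2 format.

```python
import math
from itertools import product

def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

def chunk_keys(shape, chunks):
    """Dot-separated chunk keys in the style of Zarr v2 stores, e.g. '0.1.2'."""
    grid = chunk_grid(shape, chunks)
    return [".".join(map(str, idx)) for idx in product(*(range(n) for n in grid))]

# Hypothetical array: 744 hourly steps on a 721 x 1440 grid (assumed values)
shape = (744, 721, 1440)
chunks = (372, 150, 150)

grid = chunk_grid(shape, chunks)
keys = chunk_keys(shape, chunks)
print(grid)
print(len(keys), keys[:3])
```

Each key corresponds to one compressed chunk, stored as a separate file or object, which is why different workers can read different parts of the array independently.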
In [1]:
%matplotlib notebook
import xarray as xr
import datetime
import numpy as np
from dask.distributed import LocalCluster, Client
import s3fs
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import boto3
First we look into the zarr folder of the era5-pds bucket to find out which variables are available. Assuming that the same variables are available for every year, we list the contents of one arbitrarily chosen year and month.
In [2]:
bucket = 'era5-pds'
# The prefix must end with a trailing slash
prefix = 'zarr/2008/01/data/'
client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
# Each common prefix corresponds to one variable's .zarr store
for o in result.get('CommonPrefixes', []):
    print(o.get('Prefix'))
In [3]:
client = Client()
client
Out[3]:
In [4]:
fs = s3fs.S3FileSystem(anon=False)
Here we define some helper functions to read in the Zarr data, one month at a time.
In [5]:
def inc_mon(indate):
    # Return the first day of the month following indate
    if indate.month < 12:
        return datetime.datetime(indate.year, indate.month+1, 1)
    else:
        return datetime.datetime(indate.year+1, 1, 1)

def gen_d_range(start, end):
    # List of monthly time steps from start to end (inclusive)
    rr = []
    while start <= end:
        rr.append(start)
        start = inc_mon(start)
    return rr

def get_z(dtime, var):
    # Open the store of one variable for the year and month of dtime
    f_zarr = 'era5-pds/zarr/{year}/{month:02d}/data/{var}.zarr/'.format(year=dtime.year, month=dtime.month, var=var)
    return xr.open_zarr(s3fs.S3Map(f_zarr, s3=fs))

def gen_zarr_range(start, end, var):
    # One lazily opened dataset per month in the requested range
    return [get_z(tt, var) for tt in gen_d_range(start, end)]
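The month-stepping and path-building logic can be checked offline, without touching S3. The sketch below is a stand-alone variant of the helpers above: month_starts and store_path are hypothetical names, the path template is the one used in get_z, and unlike gen_d_range it normalises the start date to the first day of its month.

```python
import datetime

def month_starts(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    current = datetime.datetime(start.year, start.month, 1)
    while current <= end:
        yield current
        # Step to the first day of the next month, rolling over in December
        if current.month < 12:
            current = datetime.datetime(current.year, current.month + 1, 1)
        else:
            current = datetime.datetime(current.year + 1, 1, 1)

def store_path(dtime, var):
    """Build the per-month store path using the same template as get_z."""
    return 'era5-pds/zarr/{year}/{month:02d}/data/{var}.zarr/'.format(
        year=dtime.year, month=dtime.month, var=var)

paths = [store_path(d, 'air_temperature_at_2_metres')
         for d in month_starts(datetime.datetime(2018, 11, 1),
                               datetime.datetime(2019, 2, 1))]
for p in paths:
    print(p)
```

Because every month is a separate store, a long time range simply becomes a list of paths that are opened one by one and concatenated afterwards.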
This is where we read in the data. We need to define the time range and variable name. In this example, we also choose to select only the area over Australia.
In [6]:
%%time
tmp_a = gen_zarr_range(datetime.datetime(1979,1,1), datetime.datetime(2020,3,31),'air_temperature_at_2_metres')
tmp_all = xr.concat(tmp_a, dim='time0')
# Select the area over Australia and convert from kelvin to degrees Celsius
tmp = tmp_all.air_temperature_at_2_metres.sel(lon=slice(110,160),lat=slice(-10,-45)) - 273.15
Here we read in another variable, this time for a single month only, because we need it only for masking.
In [ ]:
sea_data = gen_zarr_range(datetime.datetime(2018,1,1), datetime.datetime(2018,1,1),'sea_surface_temperature')
sea_data_all = xr.concat(sea_data, dim='time0').sea_surface_temperature.sel(lon=slice(110,160),lat=slice(-10,-45))
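We use the sea surface temperature data to build a sea-land mask: the field has no valid values (NaN) over land, so testing the first time step for NaN gives a mask that is True over land and False over the sea.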
In [ ]:
sea_data_all0 = sea_data_all[0].values
mask = np.isnan(sea_data_all0)
Next we mask out the data over the sea. To find the average temperatures over land, the values over the ocean have to be excluded.
In [ ]:
tmp_masked = tmp.where(mask)
tmp_mean = tmp_masked.mean('time0').compute()
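The masking step can be illustrated on a tiny example with plain NumPy. All values below are made up; the sketch only assumes, as in the cells above, that sea surface temperature is NaN over land.

```python
import numpy as np

# Made-up 2 x 3 sea surface temperature field; NaN marks land cells
sst = np.array([[290.0, np.nan, np.nan],
                [291.0, 292.0, np.nan]])
land_mask = np.isnan(sst)  # True over land, False over sea

# Made-up air temperature with two time steps on the same grid
t2m = np.array([[[10.0, 20.0, 30.0],
                 [11.0, 12.0, 40.0]],
                [[12.0, 22.0, 32.0],
                 [13.0, 14.0, 42.0]]])

# Keep land values and set sea values to NaN (the 2-D mask broadcasts over time)
t2m_land = np.where(land_mask, t2m, np.nan)

# Time mean per grid cell, then the mean over the land cells only
time_mean = t2m_land.mean(axis=0)
land_mean = np.nanmean(time_mean)
print(time_mean)
print(land_mean)
```

Sea cells stay NaN in the time mean, so they drop out of any later NaN-aware average and do not appear in the plot.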
Now we plot the average temperature over Australia for the whole period read in above (January 1979 to March 2020). This time we use only xarray's plotting tools.
In [ ]:
ax = plt.axes(projection=ccrs.Orthographic(130, -20))
tmp_mean.plot.contourf(ax=ax, transform=ccrs.PlateCarree())
ax.set_global()
ax.coastlines();
plt.draw()
Now we compute the yearly average temperature over the Australian land area.
In [ ]:
yearly_tmp_AU = tmp_masked.groupby('time0.year').mean('time0').mean(dim=['lon','lat'])
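As a rough illustration of what the grouping by year does, here is a minimal pure-Python sketch that groups timestamped values by calendar year and averages each group. The samples are made up, and the sketch starts from values that are already area means, which is a simplification of the cell above.

```python
import datetime
from collections import defaultdict

# Made-up (timestamp, area-mean temperature) samples
samples = [
    (datetime.datetime(2017, 1, 1, 0), 28.0),
    (datetime.datetime(2017, 7, 1, 0), 14.0),
    (datetime.datetime(2018, 1, 1, 0), 30.0),
    (datetime.datetime(2018, 7, 1, 0), 16.0),
]

# Collect the values of each calendar year
by_year = defaultdict(list)
for stamp, value in samples:
    by_year[stamp.year].append(value)

# Average every group, keeping the years in ascending order
yearly_mean = {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}
print(yearly_mean)
```

Note that a year with incomplete data, such as 2020 in the time range read in above (January to March only), is averaged over fewer months and is not directly comparable with the full years.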
In [ ]:
f, ax = plt.subplots(1, 1)
yearly_tmp_AU.plot.line();
plt.draw()
In conclusion, this was a simple example of how to use Zarr data. We read in more than 40 years of global data at hourly temporal resolution, then performed a few operations on it, such as selecting only the area of interest and computing averages. Zarr makes processing large amounts of data much faster than it used to be.