The datamover Python module provides high-level functionality to transfer files between different file resources (local, GitHub, S3 buckets and FTP servers). To activate this functionality, import the library:
In [1]:
import datamover as dm
The module provides a class for each of these file resources, and each class defines the methods download_file and list_files:
In [2]:
print(dm.LocalConnector.list_files, dm.LocalConnector.download_file)
print(dm.S3Connector.list_files, dm.S3Connector.download_file)
print(dm.FTPConnector.list_files, dm.FTPConnector.download_file)
print(dm.GithubConnector.list_files, dm.GithubConnector.download_file)
As the S3 bucket is an essential part of the Enram data infrastructure, an additional class S3EnramHandler is available, providing the functions required to handle the Enram bucket. As this class inherits from S3Connector, the connector methods are available in the handler as well:
In [3]:
# print available methods:
print([method for method in dir(dm.S3EnramHandler) if not method.startswith("_")])
In order to transfer files, Transporter classes are available to define specific transfers. Currently, LocalToS3 and BaltradToS3 are defined, managing the file transfer to the Enram S3 bucket from a local file resource and from the Baltrad file server, respectively:
In [4]:
print(dm.LocalToS3, dm.BaltradToS3)
From an Enram file management perspective, the S3EnramHandler and the transporter classes are the most relevant, as explained in the following sections.
Access rights to the S3 instance are managed indirectly, using the ~/.aws/credentials file and by assigning the proper rights to the user in the AWS console. When the user rights are configured and the proper policy is attached, connecting to the S3 bucket from the datamover package works as follows:
In [5]:
s3 = dm.S3Connector("lw-enram")  # the S3EnramHandler is instantiated analogously
In [6]:
s3.bucket_name
Out[6]:
Functions are provided to support file checks and file listings on the S3 bucket:
In [7]:
# check if a file(path) already exists on the S3 bucket:
s3.key_exists('cz/brd/2017/04/09/23/czbrd_vp_20170409230000.h5')
Out[7]:
As list_files returns a generator, different options are available to get an overview of the files:
In [10]:
set(s3.list_files(path='cz/brd/2016/09/23/00')) # using set
Out[10]:
In [12]:
for filepath in s3.list_files(path='cz/brd/2016/09/23/00'):
    print(filepath)
    # do something...
As datamover is just a thin layer around the boto3 package, the other boto3 S3 client methods remain available:
In [13]:
print([method for method in dir(s3.s3client) if not method.startswith("_")])
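For example, a single file can be downloaded with the underlying boto3 client directly. A minimal sketch, reusing the example key from above; the local target file name is illustrative:
In [ ]:
# Use the underlying boto3 S3 client directly to download a single file
s3.s3client.download_file(Bucket="lw-enram",
                          Key="cz/brd/2017/04/09/23/czbrd_vp_20170409230000.h5",
                          Filename="czbrd_vp_20170409230000.h5")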
In order to access the Baltrad server, a credentials file (creds.py) is required, defining the variables URL, LOGIN and PASSWORD:
In [14]:
from creds import URL, LOGIN, PASSWORD
The transporter class BaltradToS3 supports the file transfer:
In [17]:
btos = dm.BaltradToS3(URL, LOGIN, PASSWORD, "lw-enram", profile_name="lw-enram")
The necessity of the profile_name argument depends on your AWS account setup. If your default profile has the appropriate policy rights (as is the case for the EC2 instance running the daily cron job), the AWS package will automatically use the default credentials and you do not need to specify a profile.
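In that case, the transporter can be set up without the profile_name argument. A minimal sketch (shown commented out, as the btos instance above is used in the remainder of this notebook):
In [ ]:
# Alternative: rely on the default AWS profile when no profile_name is given
# btos = dm.BaltradToS3(URL, LOGIN, PASSWORD, "lw-enram")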
Files are transferred with the transfer method. It is possible to limit the scope of the transfer by defining a name match string, and you can decide whether or not to overwrite existing files on the S3 bucket. Furthermore, for testing purposes, a limit option is provided, as well as an option to print the transfers to stdout:
In [18]:
# transfer files with _vp_ in the name, overwriting existing files and limiting the transferred files to 5:
btos.transfer(name_match="_vp_", overwrite=True,
              limit=5, verbose=True)
The results of the transfer are logged in the attributes btos.transferred and btos.stalled. A combined report can be written to the file log_file_transfer. The transfertype option provides the user the ability to add a custom message to the transfer header:
In [19]:
btos.transferred
Out[19]:
In [20]:
btos.report(reset_file=True, transfertype="Baltrad to S3")
The log is written to a file log_file_transfer:
In [21]:
!cat log_file_transfer
The transporter classes provide direct access to the individual connectors of the transfer, analogous to using the connector classes directly:
In [23]:
btos.s3.key_exists('de/boo/2018/01/13/00/deboo_vp_20180113T0015Z_0x5.h5') # S3 check for existing file
Out[23]:
In [25]:
set(btos.s3.list_files(path='de/boo/2018/01/13/00')) # S3 file listing
Out[25]:
In [26]:
set(btos.ftp.list_files(name_match="deboo_vp_20180113T0015Z_0x5.h5")) # ftp file listing
Out[26]:
The transfer from a local file directory to the S3 bucket works similarly, with the same transfer method:
In [27]:
ltos = dm.LocalToS3(filepath="../example_data/", bucket_name="lw-enram",
                    profile_name="lw-enram")
In [ ]:
ltos.transfer(name_match="_vp_", overwrite=False,
              limit=5, verbose=True)
In [ ]:
ltos.transferred
The S3EnramHandler class provides some additional functions to support the Enram infrastructure:
In [32]:
s3enram = dm.S3EnramHandler("lw-enram", profile_name="lw-enram")  # connect to the S3 client
The data coverage for each radar can be derived for multiple temporal intervals: day | month | year. For the heatmap on the repository, the daily counts are used:
In [33]:
# Rerun file list overview to extract the current coverage
coverage_day, _ = s3enram.count_enram_coverage(level='day')
Note the use of _ to ignore the second output of the function, which contains the information on the most recent available file for each radar.
As an example, derive the number of files available on April 7th 2017 for the French radar tra (frtra):
In [34]:
coverage_day['frtra 2017-04-07']
Out[34]:
The same aggregation function can be used for monthly and yearly counts:
In [35]:
coverage_year, _ = s3enram.count_enram_coverage(level='year')
For example, derive the yearly counts for the Belgian radars:
In [36]:
{k:v for k,v in coverage_year.items() if k.startswith("be")}
Out[36]:
The counts can also be exported to a CSV file, using a general datamover utility function:
In [37]:
with open("my_coverage_filename.csv", 'w') as outfile:
dm.coverage_to_csv(outfile, coverage_year)
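The resulting CSV file can be inspected directly from the notebook (file name as chosen in the previous cell):
In [ ]:
!head my_coverage_filename.csv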
The most recent files for each radar can be extracted using the same function:
In [38]:
_, most_recent_file = s3enram.count_enram_coverage(level='month')
In [39]:
most_recent_file
Out[39]:
and saved to a file as well:
In [41]:
with open("radars_latest.csv", 'w') as outfile:
dm.most_recent_to_csv(outfile, most_recent_file)
In [42]:
!head radars_latest.csv
As downloading the individual .h5 files from the website would be cumbersome, two options are available for easier data access:
The preparation and creation of these zip folders is supported by the S3EnramHandler class, using the create_zip_version function. As input, the function takes either a Counter with keys/counts or a list of keys from which the monthly counts will be derived.
Using a list of keys, the relevant month/radar combinations are updated:
In [ ]:
keyset = ['bewid_vp_20161120233000.h5',
          'bewid_vp_20161120233500.h5',
          'bewid_vp_20161120234000.h5',
          'bewid_vp_20161120234500.h5',
          'bewid_vp_20161120235000.h5',
          'bewid_vp_20161120235500.h5',
          'bejab_vp_20161120235000.h5']
In [ ]:
s3enram.create_zip_version(keyset)
A typical use case is updating the zip files for those files that were transferred during a transfer operation, e.g. btos.transferred:
In [ ]:
s3enram.create_zip_version(btos.transferred)
Other options are possible, e.g. updating the zip files for a specific radar:
In [ ]:
import os

country = "be"
radar = "wid"
keyset = []
for key in s3enram.list_files(path="/".join([country, radar])):
    keyset.append(os.path.split(key)[1])
s3enram.create_zip_version(keyset)
An entire update can be done by using the available coverage on a monthly or daily level (commented out, as this is a large operation):
In [ ]:
# coverage_month, _ = s3enram.count_enram_coverage(level="month")
# s3enram.create_zip_version(coverage_month)
An essential element in the file and folder handling is that the (sub)folder information is inherent to the file name itself:
Parsing the file name for metadata, e.g. dkrom_vp_20170114231500.h5, makes the country and radar codes, the data type and the timestamp available. The name parsing is provided by the parse_filename function:
In [ ]:
dm.parse_filename("dkrom_vp_20170114231500.h5")
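As an illustration of this convention, a minimal sketch (not part of datamover, and assuming the older <country><radar>_<datatype>_<YYYYMMDDHHMMSS>.h5 naming pattern) that derives the S3 (sub)folder path from the file name alone:
In [ ]:
# Hypothetical helper, for illustration only: rebuild the S3 (sub)folder path
# from a file name following the <country><radar>_<datatype>_<timestamp>.h5 pattern
def path_from_filename(filename):
    countryradar, _, timestamp = filename.split("_")
    country, radar = countryradar[:2], countryradar[2:]
    year, month, day, hour = (timestamp[:4], timestamp[4:6],
                              timestamp[6:8], timestamp[8:10])
    return "/".join([country, radar, year, month, day, hour])

path_from_filename("dkrom_vp_20170114231500.h5")  # -> 'dk/rom/2017/01/14/23'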