The datamover Python module provides high-level functionality to transfer files between different file resources (local, GitHub, S3 buckets and FTP servers). To activate this functionality, import the library:
In [1]:
import datamover as dm
The module provides a class for each of these file resources, and each class defines the methods download_file and list_files:
In [2]:
print(dm.LocalConnector.list_files, dm.LocalConnector.download_file)
print(dm.S3Connector.list_files, dm.S3Connector.download_file)
print(dm.FTPConnector.list_files, dm.FTPConnector.download_file)
print(dm.GithubConnector.list_files, dm.GithubConnector.download_file)
As the S3 bucket is an essential part of the Enram data infrastructure, an additional class S3EnramHandler is available, providing the functions required to handle the Enram bucket. As this class inherits from S3Connector, the connector methods are available in the handler as well:
In [3]:
# print available methods:
print([method for method in dir(dm.S3EnramHandler) if not method.startswith("_")])
In order to transfer files, Transporter classes are available to define specific transfers. Currently, LocalToS3 and BaltradToS3 are defined, managing the file transfer to the Enram S3 bucket from a local file resource and from the Baltrad file server, respectively:
In [4]:
print(dm.LocalToS3, dm.BaltradToS3)
From an Enram file management perspective, the S3EnramHandler and the transporter classes are the most relevant, as explained in the following sections.
Access rights to the S3 instance are managed indirectly, using the ~/.aws/credentials file and by assigning the proper rights to the user in the AWS console. When the user rights are configured and the proper policy is attached, connecting to the S3 bucket from the datamover package works as follows:
In [5]:
s3 = dm.S3Connector("lw-enram")  # the S3EnramHandler is instantiated analogously
In [6]:
s3.bucket_name
Out[6]:
Functions are provided to support file checks and file listings on the S3 bucket:
In [7]:
# check if a file(path) already exists on the S3 bucket:
s3.key_exists('cz/brd/2017/04/09/23/czbrd_vp_20170409230000.h5')
Out[7]:
As list_files returns a generator, different options are available to get an overview of the files:
In [10]:
set(s3.list_files(path='cz/brd/2016/09/23/00')) # using set
Out[10]:
In [12]:
for filepath in s3.list_files(path='cz/brd/2016/09/23/00'):
    print(filepath)
    # do something...
As datamover is just a thin layer around the boto3 package, the other boto3 S3 client methods remain available:
In [13]:
print([method for method in dir(s3.s3client) if not method.startswith("_")])
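For example, a single file can be downloaded with the underlying boto3 client directly. A minimal sketch, reusing the example key from above; the local target file name is illustrative:
In [ ]:
# Use the underlying boto3 S3 client directly to download a single file
s3.s3client.download_file(Bucket="lw-enram",
                          Key="cz/brd/2017/04/09/23/czbrd_vp_20170409230000.h5",
                          Filename="czbrd_vp_20170409230000.h5")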
In order to access the Baltrad server, a credentials file (creds.py) is required, defining the variables URL, LOGIN and PASSWORD:
In [14]:
from creds import URL, LOGIN, PASSWORD
The transporter class BaltradToS3 supports the file transfer:
In [17]:
btos = dm.BaltradToS3(URL, LOGIN, PASSWORD, "lw-enram", profile_name="lw-enram")
The necessity of the profile_name argument depends on your AWS account setup. If your default profile has the appropriate policy rights (as is the case for the EC2 instance running the daily cron job), the AWS package will automatically use the default credentials and you do not need to specify a profile.
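In that case, the transporter can be set up without the profile_name argument. A minimal sketch (shown commented out, as the btos instance above is used in the remainder of this notebook):
In [ ]:
# Alternative: rely on the default AWS profile when no profile_name is given
# btos = dm.BaltradToS3(URL, LOGIN, PASSWORD, "lw-enram")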
Files are transferred with the transfer method. It is possible to limit the scope of the transfer by defining a name match string, and you can decide whether or not to overwrite existing files on the S3 bucket. Furthermore, for testing purposes, a limit option is provided, as well as an option to print the transfers to stdout:
In [18]:
# transfer files with _vp_ in the name, overwriting existing files and limiting the transferred files to 5:
btos.transfer(name_match="_vp_", overwrite=True,
              limit=5, verbose=True)
The results of the transfer are logged in the attributes btos.transferred and btos.stalled. A combined report can be written to the file log_file_transfer. The transfertype option provides the user the ability to add a custom message to the transfer header:
In [19]:
btos.transferred
Out[19]:
In [20]:
btos.report(reset_file=True, transfertype="Baltrad to S3")
The log is written to a file log_file_transfer:
In [21]:
!cat log_file_transfer
The transporter classes provide direct access to the individual connectors of the transfer, analogous to using the connector classes directly:
In [23]:
btos.s3.key_exists('de/boo/2018/01/13/00/deboo_vp_20180113T0015Z_0x5.h5') # S3 check for existing file
Out[23]:
In [25]:
set(btos.s3.list_files(path='de/boo/2018/01/13/00')) # S3 file listing
Out[25]:
In [26]:
set(btos.ftp.list_files(name_match="deboo_vp_20180113T0015Z_0x5.h5")) # ftp file listing
Out[26]:
The transfer from a local file directory to the S3 bucket works similarly, with the same transfer method:
In [27]:
ltos = dm.LocalToS3(filepath="../example_data/", bucket_name="lw-enram",
                    profile_name="lw-enram")
In [ ]:
ltos.transfer(name_match="_vp_", overwrite=False,
              limit=5, verbose=True)
In [ ]:
ltos.transferred
The S3EnramHandler class provides some additional functions to support the Enram infrastructure:
In [32]:
s3enram = dm.S3EnramHandler("lw-enram", profile_name="lw-enram")  # connect to the S3 client
The data coverage for each radar can be derived for multiple temporal intervals: day | month | year. For the heatmap on the repository, the daily counts are used:
In [33]:
# Rerun file list overview to extract the current coverage
coverage_day, _ = s3enram.count_enram_coverage(level='day')
Note the use of _ to ignore the second output of the function, which contains the information on the most recent available file for each radar.
As an example, derive the number of files available on April 7th 2017 for the French radar tra (frtra):
In [34]:
coverage_day['frtra 2017-04-07']
Out[34]:
The same aggregation function can be used for monthly and yearly counts:
In [35]:
coverage_year, _ = s3enram.count_enram_coverage(level='year')
For example, derive the yearly counts for the Belgian radars:
In [36]:
{k:v for k,v in coverage_year.items() if k.startswith("be")}
Out[36]:
The counts can also be exported to a CSV file, using a general datamover utility function:
In [37]:
with open("my_coverage_filename.csv", 'w') as outfile:
dm.coverage_to_csv(outfile, coverage_year)
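The resulting CSV file can be inspected directly from the notebook (file name as chosen in the previous cell):
In [ ]:
!head my_coverage_filename.csv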
The most recent files for each radar can be extracted using the same function:
In [38]:
_, most_recent_file = s3enram.count_enram_coverage(level='month')
In [39]:
most_recent_file
Out[39]:
and saved to a file as well:
In [41]:
with open("radars_latest.csv", 'w') as outfile:
dm.most_recent_to_csv(outfile, most_recent_file)
In [42]:
!head radars_latest.csv
As downloading the individual .h5 files from the website would be cumbersome, two options are available for easier data access:
The preparation and creation of these zip folders is supported by the S3EnramHandler class, using the create_zip_version function. As input, the function takes either a Counter with keys/counts or a list of keys from which the monthly counts will be derived.
Using a list of keys, the relevant month/radar combinations are updated:
In [ ]:
keyset = ['bewid_vp_20161120233000.h5',
          'bewid_vp_20161120233500.h5',
          'bewid_vp_20161120234000.h5',
          'bewid_vp_20161120234500.h5',
          'bewid_vp_20161120235000.h5',
          'bewid_vp_20161120235500.h5',
          'bejab_vp_20161120235000.h5']
In [ ]:
s3enram.create_zip_version(keyset)
A typical use case is updating the zip files for those files that were transferred during a transfer operation, e.g. btos.transferred:
In [ ]:
s3enram.create_zip_version(btos.transferred)
Other options are possible, e.g. updating the zip files for a specific radar:
In [ ]:
import os

country = "be"
radar = "wid"
keyset = []
for key in s3enram.list_files(path="/".join([country, radar])):
    keyset.append(os.path.split(key)[1])
s3enram.create_zip_version(keyset)
An entire update can be done by using the available coverage on a monthly or daily level (commented out, as this is a large operation):
In [ ]:
# coverage_month, _ = s3enram.count_enram_coverage(level="month")
# s3enram.create_zip_version(coverage_month)
An essential element in the file and folder handling is that the (sub)folder information is inherent to the file name itself:
Parsing the file name for metadata, e.g. dkrom_vp_20170114231500.h5, makes the country and radar codes, the data type and the timestamp available. The name parsing is provided by the parse_filename function:
In [ ]:
dm.parse_filename("dkrom_vp_20170114231500.h5")
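As an illustration of this convention, a minimal sketch (not part of datamover, and assuming the older <country><radar>_<datatype>_<YYYYMMDDHHMMSS>.h5 naming pattern) that derives the S3 (sub)folder path from the file name alone:
In [ ]:
# Hypothetical helper, for illustration only: rebuild the S3 (sub)folder path
# from a file name following the <country><radar>_<datatype>_<timestamp>.h5 pattern
def path_from_filename(filename):
    countryradar, _, timestamp = filename.split("_")
    country, radar = countryradar[:2], countryradar[2:]
    year, month, day, hour = (timestamp[:4], timestamp[4:6],
                              timestamp[6:8], timestamp[8:10])
    return "/".join([country, radar, year, month, day, hour])

path_from_filename("dkrom_vp_20170114231500.h5")  # -> 'dk/rom/2017/01/14/23'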