Problem 12.1. HDF.

This problem will give you a chance to practice what you have learned in lesson 1 about saving a DataFrame to an HDF file.


In [ ]:
import numpy as np
import pandas as pd

You should use columns 9-14 of 2001.csv: UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, and AirTime.


In [ ]:
ucs = list(range(8, 14)) # the count starts at 0
cnms = ['UniqueCarrier',
        'FlightNum',
        'TailNum',
        'ActualElapsedTime',
        'CRSElapsedTime',
        'AirTime']
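As a sketch of how ucs and cnms come together, here is one way (not the only way) to select and relabel the columns, using a tiny in-memory stand-in for 2001.csv; the 14-column fake_csv below is hypothetical illustration data, not the real file.

```python
import io
import pandas as pd

ucs = list(range(8, 14))  # the count starts at 0
cnms = ['UniqueCarrier', 'FlightNum', 'TailNum',
        'ActualElapsedTime', 'CRSElapsedTime', 'AirTime']

# A hypothetical 14-column CSV; the second data row is missing AirTime.
fake_csv = io.StringIO(
    'c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13\n'
    'x,x,x,x,x,x,x,x,US,123,N123,100,95,80\n'
    'x,x,x,x,x,x,x,x,AA,456,N456,200,210,NA\n'
)

df = pd.read_csv(fake_csv, usecols=ucs)  # keep only columns 8-13
df.columns = cnms                        # relabel with the friendly names
df = df.dropna()                         # drop rows with any missing value

print(df.shape)  # (1, 6): the row with the missing AirTime is gone
```

The same usecols/dropna pattern carries over to the real file; only the first argument to read_csv changes.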

Function: csv_to_hdf()

Write a function that takes three strings: the path to the CSV file, the path to the HDF file, and the table name.

  • Use the six columns listed in ucs and cnms.
  • Don't forget that there may be missing values. Drop every row that has a missing value in any of its columns.
  • After you use pandas.read_csv() to create a DataFrame, use pandas.DataFrame.info() or pandas.DataFrame.dtypes to check the data types in the DataFrame. If you didn't specify which data types should be used, Pandas has likely inferred the largest type for each column (e.g. int64 or float64). You should downcast each column to the smallest data type that can safely hold its values.

    To do this, use pandas.DataFrame.describe() to check the minimum and maximum values of each column. Compare them with the ranges of each data type. You can find this information in the docs, e.g. the NumPy data types page, or use numpy.iinfo() for integers and numpy.finfo() for floats. For example, to find the minimum and maximum values that an 8-bit (one-byte) unsigned integer can hold, run

    print(np.iinfo(np.uint8))
    

    which prints out

    Machine parameters for uint8
    ---------------------------------------------------------------------
    min = 0
    max = 255
    ---------------------------------------------------------------------
    
  • The function should take three strings. The first string is the file path and/or name of the CSV file, e.g. /data/airline/2001.csv. The second string is the file path and/or name of the HDF file your function creates, e.g. /data/airline/w12p1.h5. The third string is the key used to access the table in the HDF store, i.e. the string "table" you would pass as the key argument in

    store_path = '/data/airline/w12p1.h5'
    df = pd.read_hdf(store_path, key='table')
    

    You can list the keys with

    with pd.HDFStore(store_path) as store:
        print(store.keys())
    

    which should print out

    ['/table']
    
  • In the end, when I ran

    csv_path = '/data/airline/2001.csv'
    store_path = '/data/airline/w12p1.h5'
    table_name = 'table'
    
    csv_to_hdf(csv_path, store_path, table_name)
    !ls -lah $store_path
    

    I got

    -rw-r--r-- 1 root root 144M Apr  9 04:35 /data/airline/w12p1.h5
    

    So your HDF file should be no larger than 144M.
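The downcasting step above can be sketched as follows. The Series here is hypothetical stand-in data; the point is the pattern of comparing a column's range against numpy.iinfo() before calling astype().

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 95, 80, 255])  # hypothetical elapsed-time-like values
lo, hi = s.min(), s.max()

# uint16 holds 0..65535, which comfortably covers this column's range.
info = np.iinfo(np.uint16)
assert info.min <= lo and hi <= info.max

s_small = s.astype(np.uint16)  # downcast from the default int64
print(s_small.dtype)  # uint16
```

Repeating this check per column (and picking floats via numpy.finfo() where needed) is what keeps the resulting HDF file small.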


In [ ]:
def csv_to_hdf(file_path, store_path, table_name):
    '''
    Takes three strings. Returns None if successful.
    
    Parameters
    ----------
    file_path: A str. The file path and/or name of the CSV file, e.g. '/data/airline/2001.csv'.
    store_path: A str. The string you would use as the first argument in `pandas.read_hdf()`.
    table_name: A str. The string you would use in the `key` argument of `pandas.read_hdf()`.
    
    Returns
    -------
    None.
    '''
    
    #### your code goes here ####
    
    return None

Run the following code cell to test your function.


In [ ]:
csv_path = '/data/airline/2001.csv' # edit the path if necessary
store_path = '/data/airline/w12p1.h5' # edit if you want
table_name = 'table' # edit if you want

# make sure that we are starting from scratch
!rm -f $store_path

# test the function
csv_to_hdf(csv_path, store_path, table_name) # edit the file path if necessary

# check results
!ls -lah $store_path

In [ ]: