Historical Daily Temperature Analysis

Objective

If there is scientific evidence of extreme fluctuations in our weather patterns due to human impact on the environment, then we should be able to identify factual examples of an increase in the frequency of extreme temperatures.

There has been a great deal of discussion around climate change and global warming. Since NOAA has made their data public, let us explore the data ourselves and see what insights we can discover.

  1. How many weather stations are there in the US?
  2. For US weather stations, what is the average number of years of record keeping?
  3. For each US weather station, on each day of the year, identify the frequency at which daily High and Low temperature records are broken.
  4. Does the historical frequency of daily temperature records (High or Low) in the US provide statistical evidence of dramatic climate change?
  5. What is the average life-span of a daily temperature record (High or Low) in the US?

This analytical notebook is a component of a package of notebooks. The package is intended to serve as an exercise in the applicability of IPython/Jupyter Notebooks to public weather data for DIY Analytics.

WARNING: This notebook requires a minimum of 12GB of free disk space to Run All cells for all processing phases. The time required to run all phases is **1 hr 50 mins**. The notebook supports two approaches for data generation (CSV and HDF). It is recommended that you choose to run either Approach 1 or Approach 2 and avoid running all cells.

Assumptions

  1. We will observe and report everything in Fahrenheit.
  2. The data we extract from NOAA may be worth republishing, since deriving it requires extensive data munging.

Data

The Global Historical Climatology Network (GHCN) - Daily dataset integrates daily climate observations from approximately 30 different data sources. Over 25,000 worldwide weather stations are regularly updated with observations from within roughly the last month. The dataset is also routinely reconstructed (usually every week) from its roughly 30 data sources to ensure that GHCN-Daily is generally in sync with its growing list of constituent sources. During this process, quality assurance checks are applied to the full dataset. Where possible, GHCN-Daily station data are also updated daily from a variety of data streams. Station values for each daily update also undergo a suite of quality checks.

This notebook was developed using a March 16, 2015 snapshot of USA-Only daily temperature readings from the Global Historical Climatology Network. The emphasis herein is on the generation of human readable data. Data exploration and analytical exercises are deferred to separate but related notebooks.

Ideal datasets

NOAA's National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and providing public access to the USA's climate and historical weather data and information. Since weather is something that can be observed at varying intervals, the process used by NCDC is the best that we have, yet it is far from ideal. Optimally, weather metrics should be observed, streamed, stored and analyzed in real time. Such an approach could offer the data as a service associated with a data lake.

Data lakes that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.

Access to cloud-based data services that front-end a data lake would help to reduce the possibility of human error and divorce us from downstream processing that alters the data from its native state.

Available datasets

NOAA NCDC provides public FTP access to the GHCN-Daily dataset, which contains a file for each US weather station. Each file contains historical daily data for the given station since that station began to observe and record. Here are some details about the available data:

  • The data is delivered in a text-based, fixed-width, machine-readable format.
  • Version 3 of the GHCN-Daily dataset was released in September 2012.
  • Changes to the processing system associated with the Version 3 release also allowed for updates to occur 7 days a week rather than only on most weekdays.
  • Version 3 contains station-based measurements from well over 90,000 land-based stations worldwide, about two thirds of which are for precipitation measurement only.
  • Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth.
Citation Information
  • GHCN-Daily journal article: Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910.
  • Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), [Version 3.20-upd-2015031605], NOAA National Climatic Data Center [March 16, 2015].

Getting Started

Project Setup

This project comprises several notebooks that address various stages of the analytical process. A common theme for the project is the enablement of reproducible research. This notebook will focus on the creation of new datasets that will be used for downstream analytics.

Analytical Workbench

This notebook is compatible with Project Jupyter.

Execution of this notebook depends on one or more features found in [IBM Knowledge Anyhow Workbench (KAWB)](https://marketplace.ibmcloud.com/apps/2149#!overview). To request a free trial account, visit us at https://knowledgeanyhow.org. You can, however, load it and view it on nbviewer or in IPython / Jupyter.

In [ ]:
# Import special helper functions for the IBM Knowledge Anyhow Workbench.
import kawb
Workarea Folder Structure

The project will be composed of several file artifacts. This project's file subfolder structure is:

 noaa_hdta_*.ipynb                         - Notebooks
 noaa-hdta/data/ghcnd_hcn.tar.gz           - Obtained from NCDC FTP site.
 noaa-hdta/data/usa-daily/15mar2015/*.dly  - Daily weather station files
 noaa-hdta/data/hdf5/15mar2015/*.h5        - Hierarchical Data Format files 
 noaa-hdta/data/derived/15mar2015/*.csv    - Comma-delimited files

Notes:

  1. The initial project research used the 15-March-2015 snapshot of the GHCN Dataset. The folder structure allows for other snapshots to be used.
  2. This notebook can be used to generate both Hierarchical Data Format and comma delimited files. It is recommended to pick one or the other as disk space requirements can be as large as:

    • HDF5: 8GB
    • CSV: 4GB
In IBM Knowledge Anyhow Workbench, the user workarea is located under the "resources" folder.

In [ ]:
%%bash
# Create folder structure
mkdir -p noaa-hdta/data noaa-hdta/data/usa-daily/15mar2015
mkdir -p noaa-hdta/data/hdf5/15mar2015
mkdir -p noaa-hdta/data/derived/15mar2015/missing
mkdir -p noaa-hdta/data/derived/15mar2015/summaries
mkdir -p noaa-hdta/data/derived/15mar2015/raw
mkdir -p noaa-hdta/data/derived/15mar2015/station_details
# List all project related files and folders
ls -laR noaa-*
Tags

KAWB allows all files to be tagged for project organization and search. This project will use the following tags.

  • noaa_data: Used to tag data files (.dly, .h5, .csv)
  • noaa_hdta: Used to tag project notebooks (.ipynb)

The following inline code can be used throughout the project to tag project files.


In [ ]:
import glob

data_tagdetail = ['noaa_data',
                  ['/resources/noaa-hdta/data/',
                   '/resources/noaa-hdta/data/usa-daily/15mar2015/',
                   '/resources/noaa-hdta/data/hdf5/15mar2015/',
                   '/resources/noaa-hdta/data/derived/15mar2015/'
                  ]]
nb_tagdetail = ['noaa_hdta',['/resources/noaa_hdta_*.ipynb']]

def tag_files(tagdetail):
    pathnames = tagdetail[1]
    for path in pathnames:
        for fname in glob.glob(path):
            kawb.set_tag(fname, tagdetail[0])

# Tag Project files
tag_files(data_tagdetail)
tag_files(nb_tagdetail)

Technical Awareness

If the intended use of this notebook is to generate Hierarchical Data Formatted files, then the user of this notebook should be familiar with the Hierarchical Data Format (HDF5) and the h5py Python library (see Dependencies below).

Obtain the data

  1. Copy the URL below into a new browser tab and then download the ghcnd_hcn.tar.gz file.
     ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily
  2. Upload the same file to your workbench.
  3. Move and unpack the tarball into the designated project folders
     %%bash
     mv /resources/ghcnd_hcn.tar.gz /resources/noaa-hdta/data/ghcnd_hcn.tar.gz
     cd /resources/noaa-hdta/data
     tar -xzf ghcnd_hcn.tar.gz
     mv ./ghcnd_hcn/*.dly ./usa-daily/15mar2015/
     rm -R ghcnd_hcn
     ls -la
In IBM Knowledge Anyhow Workbench, you can drag/drop the file on your workbench browser tab to simplify the uploading process.

In [ ]:
%%bash
# Provide the inline code for obtaining the data. See step 3 above.

Dependencies

If the intended use of this notebook is to generate Hierarchical Data Formatted files, then this notebook requires the installation of the following software dependencies:

    $ pip install h5py

In [ ]:
# Provide the inline code necessary for loading any required libraries
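# A minimal sketch of what this cell might contain. Assumption: the project
# uses pandas and NumPy alongside h5py; only h5py is strictly required for
# the HDF5 approach (see the pip command above).
import glob

import numpy as np
import pandas as pd

import h5py  # needed only for Approach 2 (HDF5)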

Tidy Data Check

Upon review of the NOAA NCDC GHCN Dataset, the data fails to meet the requirements of Tidy Data. A dataset is tidy if rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Messy data is any other arrangement of the data.

In the case of the GHCN Dataset, we are presented with datasets that contain observations for each day in a month for a given year. Each ".dly" file contains data for one station. The name of the file corresponds to a station's identification code. Each record in a file contains one month of daily data for a single element type, and a file may include more than 20 different element types. The variables on each line include the following:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------

A more detailed interpretation of the format of the data is outlined in readme.txt here ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily.

Sample Preview

The variables of interest to this project have the following definitions:

  • ID is the station identification code. See "ghcnd-stations.txt" for a complete list of stations and their metadata.
  • YEAR is the year of the record.
  • MONTH is the month of the record.
  • ELEMENT is the element type. There are more than 20 possible elements. The five core elements are:
    • PRCP = Precipitation (tenths of mm)
    • SNOW = Snowfall (mm)
    • SNWD = Snow depth (mm)
    • TMAX = Maximum temperature (tenths of degrees C)
    • TMIN = Minimum temperature (tenths of degrees C)
  • VALUE(n) is element value on the nth day of the month (missing = -9999).
  • NAPAD(n) denotes the flag fields for the nth day (MFLAGn, QFLAGn, SFLAGn), which are not applicable to this project.

where n denotes the day of the month (1-31). If the month has fewer than 31 days, then the remaining variables are set to missing (e.g., for April, VALUE31 = -9999, NAPAD31 = {MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank}).

Here is a snippet depicting how the data is represented:

USC00011084201409TMAX  350  H  350  H  344  H  339  H  306  H  333  H  328  H  339  H  339  H  322  H  339  H  339  H  339  H  333  H  339  H  333  H  339  H  328  H  322  H  328  H  283  H  317  H  317  H  272  H  283  H  272  H  272  H  272  H-9999   -9999   -9999   
USC00011084201409TMIN  217  H  217  H  228  H  222  H  217  H  217  H  222  H  233  H  233  H  228  H  222  H  222  H  217  H  211  H  217  H  217  H  211  H  206  H  200  H  189  H  172  H  178  H  122  H  139  H  144  H  139  H  161  H  206  H-9999   -9999   -9999   
USC00011084201409TOBS  217  H  256  H  233  H  222  H  217  H  233  H  239  H  239  H  233  H  278  H  294  H  256  H  250  H  228  H  222  H  222  H  211  H  206  H  211  H  194  H  217  H  194  H  139  H  161  H  144  H  194  H  217  H  228  H-9999   -9999   -9999   
USC00011084201409PRCP    0  H    0  H    0  H   13  H   25  H    8  H    0  H    0  H    0  H    0  H    0  H    0  H    0  H    0  H    0  H   25  H  178  H    0  H    0  H   56  H    0  H    0  H    0  H    0  H    0  H    0  H    0  H    0  H-9999   -9999   -9999
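To make the fixed-width layout concrete, here is a minimal parsing sketch. It is not the project's csvtools/hdftools code; the function name is illustrative and only the columns documented above are used.

    def parse_dly_line(line):
        """Split one GHCN-Daily record into its fields using the column offsets above."""
        station_id = line[0:11]
        year = int(line[11:15])
        month = int(line[15:17])
        element = line[17:21]          # e.g., TMAX, TMIN, PRCP
        daily_values = []
        for day in range(1, 32):
            start = 21 + 8 * (day - 1)          # VALUE1 occupies columns 22-26
            value = int(line[start:start + 5])  # -9999 indicates a missing reading
            daily_values.append((day, value))
        return station_id, year, month, element, daily_values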

Remove noise

The NOAA NCDC GHCN Dataset includes -9999 as an indicator of missing observation data. We will take this into consideration as we transform the data into a usable format for the project.
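For example, a small sketch (the column names are illustrative) of masking the -9999 sentinel with pandas:

    import numpy as np
    import pandas as pd

    raw = pd.DataFrame({'Day': [1, 2, 3], 'ValueTenthsC': [350, -9999, 344]})
    raw['ValueTenthsC'] = raw['ValueTenthsC'].replace(-9999, np.nan)  # mark missing readings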

Data Processing Decision

Before we can attempt to answer the questions driving this project, we must first map the data into a more reasonable format. As a result, the focus of this notebook will be the creation of new datasets that can be consumed by other notebooks in the project.

Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data.

This data munging endeavor is an undertaking unto itself, since our goal is to extract new information from the machine-readable content and store it in a human-readable format. Essentially, we seek to explode (decompress) the data for general consumption and normalize it into a relational model that may be informative to users.

Desired Data Schema

Historical Daily Summary

The goal here is to capture summary information about a given day throughout history at a specific weather station in the US. This dataset contains 365 rows where each row depicts the aggregated low and high temperature data for a specific day throughout the history of the weather station.

  • Station ID: Name of the US Weather Station.
  • Month: Month of the observations.
  • Day: Day of the observations.
  • FirstYearOfRecord: First year that this weather station started collecting data for this day in history.
  • TMin: Current record low temperature (F) for this day in history.
  • TMinRecordYear: Year in which the current record low temperature (F) occurred.
  • TMax: Current record high temperature (F) for this day in history.
  • TMaxRecordYear: Year in which the current record high temperature occurred.
  • CurTMinMaxDelta: Difference in degrees F between the record high and low for this day in history.
  • CurTMinRecordDur: Lifespan of the current low record temperature.
  • CurTMaxRecordDur: Lifespan of the current high record temperature.
  • MaxDurTMinRecord: Maximum years a low temperature record was held for this day in history.
  • MinDurTMinRecord: Minimum years a low temperature record was held for this day in history.
  • MaxDurTMaxRecord: Maximum years a high temperature record was held for this day in history.
  • MinDurTMaxRecord: Minimum years a high temperature record was held for this day in history.
  • TMinRecordCount: Total number of TMin records set on this day (does not include the first reading, since that may not be a record).
  • TMaxRecordCount: Total number of TMax records set on this day (does not include the first reading, since that may not be a record).

Historical Daily Detail

The goal here is to capture details for each year that a record has changed for a specific weather station in the US. During the processing of the Historical Daily Summary dataset, we will log each occurrence of a new temperature record. Information in this file can be used to drill down into and/or validate the summary file.

  • Station ID: Name of the US Weather Station.
  • Year: Year of the observation.
  • Month: Month of the observation.
  • Day: Day of the observation.
  • Type: Type of temperature record (Low = TMin, High = TMax).
  • OldTemp: Temperature (F) prior to change.
  • NewTemp: New temperature (F) record for this day.
  • TDelta: Delta between old and new temperatures.

Historical Daily Missing Record Detail

The goal here is to capture details pertaining to missing data. Each record in this dataset represents a day in history that a specific weather station in the USA failed to observe a temperature reading.

  • Station ID: Name of the US Weather Station.
  • Year: Year of the missing observation.
  • Month: Month of the missing observation.
  • Day: Day of the missing observation.
  • Type: Type of temperature missing (Low = TMin, High = TMax).

Historical Raw Detail

The goal here is to capture raw daily details. Each record in this dataset represents a specific temperature observation on a day in history for a specific weather station.

  • Station ID: Name of the US Weather Station.
  • Year: Year of the observation.
  • Month: Month of the observation.
  • Day: Day of the observation.
  • Type: Type of temperature reading (Low = TMin, High = TMax).
  • FahrenheitTemp: Fahrenheit Temperature.
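For reference, the column names above can be collected into illustrative header definitions (a sketch; the project's tools modules may name the columns differently):

    SUMMARY_COLUMNS = ['StationID', 'Month', 'Day', 'FirstYearOfRecord',
                       'TMin', 'TMinRecordYear', 'TMax', 'TMaxRecordYear',
                       'CurTMinMaxDelta', 'CurTMinRecordDur', 'CurTMaxRecordDur',
                       'MaxDurTMinRecord', 'MinDurTMinRecord',
                       'MaxDurTMaxRecord', 'MinDurTMaxRecord',
                       'TMinRecordCount', 'TMaxRecordCount']
    DETAIL_COLUMNS = ['StationID', 'Year', 'Month', 'Day', 'Type',
                      'OldTemp', 'NewTemp', 'TDelta']
    MISSING_COLUMNS = ['StationID', 'Year', 'Month', 'Day', 'Type']
    RAW_COLUMNS = ['StationID', 'Year', 'Month', 'Day', 'Type', 'FahrenheitTemp']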

Derived Datasets

While this notebook is focused on daily temperature data, we could imagine future work associated with other observation types like snow accumulations and precipitation. Therefore, the format we choose to capture and store our desired data should also allow us to organize and append future datasets.

The HDF5 Python Library provides support for the standard Hierarchical Data Format. This library will allow us to:

  1. Create, save and publish derived datasets for reuse.
  2. Organize our datasets into groups (folders).
  3. Create datasets that can be easily converted to/from dataframes.

However, HDF5 files can be very large, which could be a problem if we want to share the data. Alternatively, we could store the information in new collections of CSV files, where each .csv file contains weather-station-specific content for one of our target schemas.
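A minimal sketch of the two storage options using pandas (assumptions: pandas is available, its HDF5 writer additionally requires the PyTables package, and the file keys and column values are illustrative):

    import pandas as pd

    summary = pd.DataFrame([{'StationID': 'USC00011084', 'Month': 9, 'Day': 1,
                             'TMax': 95.0, 'TMin': 71.6}])

    # Approach 1: one comma-delimited file per station and schema
    summary.to_csv('/resources/noaa-hdta/data/derived/15mar2015/summaries/USC00011084.csv',
                   index=False)

    # Approach 2: one HDF5 file per schema, holding one dataset (key) per station
    summary.to_hdf('/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5',
                   key='USC00011084', mode='a')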

Extract, Transform and Load (ETL)

The focus of this notebook is to perform the compute-intensive activities that explode the text-based, fixed-width, machine-readable format into the derived data schemas we have specified.

Gather Phase 1

The purpose of this phase will be to do the following:

  • For each daily data file (*.dly) provided by NOAA:
    • For each record where ELEMENT == TMAX or TMIN
      • Identify missing daily temperature readings and write them to the missing dataset.
      • Convert each daily temperature reading from Celsius to Fahrenheit and write it to the raw dataset (a sketch of this per-record work follows this list).
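A hedged sketch of that per-record work (this is not the noaa_run_phase1_approach1 implementation; the output layout and helper names are illustrative):

    import csv
    import glob

    def tenths_c_to_f(tenths_c):
        """Convert a GHCN value in tenths of degrees Celsius to degrees Fahrenheit."""
        return (tenths_c / 10.0) * 9.0 / 5.0 + 32.0

    def gather_phase1(dly_glob, raw_csv, missing_csv):
        with open(raw_csv, 'w', newline='') as raw_out, \
             open(missing_csv, 'w', newline='') as miss_out:
            raw_writer, miss_writer = csv.writer(raw_out), csv.writer(miss_out)
            for path in glob.glob(dly_glob):
                with open(path) as dly:
                    for line in dly:
                        element = line[17:21]
                        if element not in ('TMAX', 'TMIN'):
                            continue
                        station = line[0:11]
                        year, month = int(line[11:15]), int(line[15:17])
                        for day in range(1, 32):
                            value = int(line[21 + 8 * (day - 1):26 + 8 * (day - 1)])
                            if value == -9999:
                                # A real implementation would also skip days that do
                                # not exist in the month before logging as missing.
                                miss_writer.writerow([station, year, month, day, element])
                            else:
                                raw_writer.writerow([station, year, month, day, element,
                                                     tenths_c_to_f(value)])

    # Example call using the layout from this notebook:
    # gather_phase1('/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',
    #               'raw_details.csv', 'missing_details.csv')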

Approach 1: Comma delimited files

Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 1. This will take about 20 minutes to process the 1218 or more weather station files. You should expect to see output like this:

>> Processing file 0: USC00207812
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>>   Elapsed file execution time 0:00:00
>> Processing file 1: USC00164700
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
.
.
.
>> Processing file 1217: USC00200230
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>>   Elapsed file execution time 0:00:00
>>      Processing Complete.
>>   Elapsed corpus execution time 0:19:37
[IBM Knowledge Anyhow Workbench (KAWB)](https://marketplace.ibmcloud.com/apps/2149#!overview) provides support for importable notebooks that address the challenges of code reuse. It also supports the concept of code injection from reusable cookbooks. For more details, please refer to the Share and Reuse tutorials that come with your instance of KAWB.

In [ ]:
import mywb.noaa_hdta_etl_csv_tools as csvtools

In [ ]:
csvtools.help()

%inject csvtools.noaa_run_phase1_approach1


In [ ]:
# Approach 1 Content Layout for Gather Phases 1 and 2
htda_approach1_content_layout = {
    'Content_Version': '15mar2015',
    'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',
    'Raw_Details': '/resources/noaa-hdta/data/derived/15mar2015/raw',
    'Missing_Details': '/resources/noaa-hdta/data/derived/15mar2015/missing',
    'Station_Summary': '/resources/noaa-hdta/data/derived/15mar2015/summaries', 
    'Station_Details': '/resources/noaa-hdta/data/derived/15mar2015/station_details',
    }

In [ ]:
# Run Gather Phase 1 for all 1218 files using approach 1 (CSV)
csvtools.noaa_run_phase1_approach1(htda_approach1_content_layout)
Observe Output

You can compute the disk usage of your Gather Phase 1 results.

96M /resources/noaa-hdta/data/derived/15mar2015/missing
24M /resources/noaa-hdta/data/derived/15mar2015/summaries
3.2G    /resources/noaa-hdta/data/derived/15mar2015/raw

In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/

Approach 2: HDF5 files

Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 1. It will take less than 20 minutes to process the 1218 or more weather station files. Note: You will need to have room for about **6.5GB**. You should expect to see output like this:

>> Processing file 0: USC00207812
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>>   Elapsed file execution time 0:00:00
>> Processing file 1: USC00164700
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
.
.
.
>> Processing file 1217: USC00200230
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>>   Elapsed file execution time 0:00:00
>>      Processing Complete.
>>   Elapsed corpus execution time 0:17:43

In [ ]:
import mywb.noaa_hdta_etl_hdf_tools as hdftools

In [ ]:
hdftools.help()

%inject hdftools.noaa_run_phase1_approach2


In [ ]:
# Approach 2 Content Layout for Gather Phases 1 and 2
htda_approach2_content_layout = {
    'Content_Version': '15mar2015',
    'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',
    'Raw_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5',
    'Missing_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5',
    'Station_Summary': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5', 
    'Station_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_details.h5',
    }

In [ ]:
# Run Gather Phase 1 for all 1218 files using approach 2 (HDF5)
hdftools.noaa_run_phase1_approach2(htda_approach2_content_layout)

In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/hdf5/15mar2015/

Gather Phase 2

The purpose of this phase will be to do the following:

  • For each daily data file (*.dly) provided by NOAA:

    • For each record where ELEMENT == TMAX or TMIN
      • Identify missing daily temperature readings and write them to the missing dataset.
      • Convert each daily temperature reading from Celsius to Fahrenheit and write it to the raw dataset.
  • For each raw dataset per weather station that was generated in Gather Phase 1:

    • For each tuple in a dataset, identify when a new temperature record (TMIN or TMAX) has occurred, as sketched after this list:
      • Create a Station Detail tuple and add to list of tuples
      • Update the Summary Detail list of tuples for this day in history for this station
    • Store the lists to disk
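A hedged sketch of the record-detection step described above (illustrative only; the project's csvtools/hdftools modules perform the full summary bookkeeping):

    def track_records(raw_rows):
        """raw_rows: (year, month, day, element, temp_f) tuples for one station,
        ordered by year. Yields a Station Detail tuple whenever a record is broken."""
        current = {}  # (month, day, element) -> (record_temp, record_year)
        for year, month, day, element, temp_f in raw_rows:
            key = (month, day, element)
            if key not in current:
                current[key] = (temp_f, year)  # first reading; may not be a record
                continue
            record_temp, record_year = current[key]
            is_new_record = (temp_f > record_temp if element == 'TMAX'
                             else temp_f < record_temp)
            if is_new_record:
                # The lifespan of the record being replaced is year - record_year.
                yield (year, month, day, element, record_temp, temp_f,
                       temp_f - record_temp)
                current[key] = (temp_f, year)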

Approach 1: Comma delimited files

Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 2. This will take about 40 minutes to process the 1218 or more raw weather station files. You should expect to see output like this:

Processing dataset 0 - 1218: USC00011084
Processing dataset 1 - 1218: USC00012813
.
.
.
Processing dataset 1216: USC00130133
Processing dataset 1217: USC00204090
>> Processing Complete.
>>   Elapsed corpus execution time 0:38:47

In [ ]:
# Decide if we need to generate station detail files.
csvtools.noaa_run_phase2_approach1.help()

In [ ]:
# Run Gather Phase 2 for all 1218 raw files using approach 1 (CSV)
csvtools.noaa_run_phase2_approach1(htda_approach1_content_layout, create_details=True)
Observe Output

You can compute the disk usage of your Gather Phase 2 results.

96M /resources/noaa-hdta/data/derived/15mar2015/missing
24M /resources/noaa-hdta/data/derived/15mar2015/summaries
3.2G    /resources/noaa-hdta/data/derived/15mar2015/raw
129M    /resources/noaa-hdta/data/derived/15mar2015/station_details
3.4G    /resources/noaa-hdta/data/derived/15mar2015/

In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/

Approach 2: HDF5 files

Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 2. This will take about 30 minutes to process the 1218 or more raw weather station files. Note: You will need to have room for about **6.5GB**. You should expect to see output like this:

Fetching keys for type = raw_detail
>> Fetch Complete.
>>   Elapsed key-fetch execution time 0:00:09
Processing dataset 0 - 1218: USC00011084
Processing dataset 1 - 1218: USC00012813
.
.
.
Processing dataset 1216 - 1218: USW00094794
Processing dataset 1217 - 1218: USW00094967
>> Processing Complete.
>>   Elapsed corpus execution time 0:28:48

%inject hdftools.noaa_run_phase2_approach2

noaa_run_phase2_approach2

Takes a dictionary of project folder details to drive the processing of Gather Phase 2 Approach 2 using HDF files.

Disk Storage Requirements

  • This function creates a Station Summaries dataset that requires ~2GB of free space.
  • This function can also create a Station Details dataset. If you require this dataset to be generated, modify the call to noaa_run_phase2_approach2() with create_details=True. You will need additional free space to support this feature. Estimated requirement: **5GB**
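For example, to also generate the optional Station Details dataset as described above, the call in the following cell would become:

    hdftools.noaa_run_phase2_approach2(htda_approach2_content_layout, create_details=True)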

In [ ]:
# Run Gather Phase 2 for all 1218 files using approach 2 (HDF)
hdftools.noaa_run_phase2_approach2(htda_approach2_content_layout)
Observe Output

You can compute the disk usage of your Gather Phase 2 results.

HDF File Usage (Phases 1 & 2) - Per File and Total
4.9G    /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5
1.4G    /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5
1.3G    /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5
7.4G    /resources/noaa-hdta/data/hdf5/15mar2015/

In [ ]:
%%bash
# Compute size of output folders
echo "HDF File Usage (Phases 1 & 2) - Per File and Total"
du -ah /resources/noaa-hdta/data/hdf5/15mar2015/


Summary

This notebook provides two approaches to creating human-readable datasets for historical daily temperature analytics. This analytical notebook is a component of a package of notebooks. The tasks addressed herein focused on data munging activities that produce a desired set of datasets for several predefined schemas. These datasets can now be used in other package notebooks for data exploration activities.

This notebook has embraced the concepts of reproducible research and can be shared with others so that they can recreate the data locally.

Future Investigations

  1. Fix the ordering of the record lifespan calculations. See code and comments for clarity.
  2. Fix FTP link so that it can be embedded into notebook.
  3. Explore multi-level importable notebooks to allow the tools files to share common code.
This notebook was created using [IBM Knowledge Anyhow Workbench](https://knowledgeanyhow.org). To learn more, visit us at https://knowledgeanyhow.org.