If there is scientific evidence of extreme fluctuations in our weather patterns due to human impact on the environment, then we should be able to identify factual examples of an increase in the frequency of extreme temperatures.
There has been a great deal of discussion around climate change and global warming. Since NOAA has made its data public, let us explore the data ourselves and see what insights we can discover.
This analytical notebook is a component of a package of notebooks. The package is intended to serve as an exercise in the applicability of IPython/Jupyter Notebooks to public weather data for DIY analytics.
The Global Historical Climatology Network (GHCN) - Daily dataset integrates daily climate observations from approximately 30 different data sources. Over 25,000 worldwide weather stations are regularly updated with observations from within roughly the last month. The dataset is routinely reconstructed (usually every week) from those constituent sources to ensure that GHCN-Daily stays in sync with its growing list of sources, and quality assurance checks are applied to the full dataset during this process. Where possible, GHCN-Daily station data are also updated daily from a variety of data streams, and these daily updates undergo their own suite of quality checks.
NOAA's National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and providing public access to the USA's climate and historical weather data and information. Because weather can only be observed at intervals, the process used by NCDC is the best we have, yet it is far from ideal. Optimally, weather metrics would be observed, streamed, stored, and analyzed in real time. Such an approach could offer the data as a service backed by a data lake.
Data lakes that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.
Access to cloud-based data services that front-end a data lake would help reduce the possibility of human error and divorce us from downstream processing that alters the data from its native state.
NOAA NCDC provides public FTP access to the GHCN-Daily dataset, which contains a file for each US weather station. Each file contains historical daily data for the given station since that station began to observe and record. Here are some details about the available data:
This project comprises several notebooks that address various stages of the analytical process. A common theme for the project is the enablement of reproducible research. This notebook will focus on the creation of new datasets that will be used for downstream analytics.
This notebook is compatible with Project Jupyter.
In [ ]:
# Import special helper functions for the IBM Knowledge Anyhow Workbench.
import kawb
The project consists of several file artifacts. This project's file subfolder structure is:
noaa_hdta_*.ipynb - Notebooks
noaa-hdta/data/ghcnd_hcn.tar.gz - Obtained from NCDC FTP site.
noaa-hdta/data/usa-daily/15mar2015/*.dly - Daily weather station files
noaa-hdta/data/hdf5/15mar2015/*.h5 - Hierarchical Data Format files
noaa-hdta/data/derived/15mar2015/*.csv - Comma delimited files
Notes:
This notebook can be used to generate either Hierarchical Data Format (HDF5) files or comma delimited files. It is recommended to pick one approach or the other, as disk space requirements can run to several gigabytes (see the folder-size listings later in this notebook).
In [ ]:
%%bash
# Create folder structure
mkdir -p noaa-hdta/data noaa-hdta/data/usa-daily/15mar2015
mkdir -p noaa-hdta/data/hdf5/15mar2015
mkdir -p noaa-hdta/data/derived/15mar2015/missing
mkdir -p noaa-hdta/data/derived/15mar2015/summaries
mkdir -p noaa-hdta/data/derived/15mar2015/raw
mkdir -p noaa-hdta/data/derived/15mar2015/station_details
# List all project related files and folders
ls -laR noaa-*
KAWB allows all files to be tagged for project organization and search. This project will use the following tags:
noaa_data : Used to tag data files (.dly, .h5, .csv)
noaa_hdta : Used to tag project notebooks (.ipynb)
The following inline code can be used throughout the project to tag project files.
In [ ]:
import glob

data_tagdetail = ['noaa_data',
                  ['/resources/noaa-hdta/data/',
                   '/resources/noaa-hdta/data/usa-daily/15mar2015/',
                   '/resources/noaa-hdta/data/hdf5/15mar2015/',
                   '/resources/noaa-hdta/data/derived/15mar2015/']]
nb_tagdetail = ['noaa_hdta', ['/resources/noaa_hdta_*.ipynb']]

def tag_files(tagdetail):
    tag, pathnames = tagdetail
    for path in pathnames:
        # Append a wildcard to directory paths so the files inside are matched,
        # not the directory itself.
        pattern = path + '*' if path.endswith('/') else path
        for fname in glob.glob(pattern):
            kawb.set_tag(fname, tag)

# Tag project files
tag_files(data_tagdetail)
tag_files(nb_tagdetail)
Download the ghcnd_hcn.tar.gz file from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily and upload it to the workbench as /resources/ghcnd_hcn.tar.gz. The cell below then moves the archive to noaa-hdta/data/ghcnd_hcn.tar.gz and extracts the daily station (.dly) files into the usa-daily/15mar2015 folder.
%%bash
mv /resources/ghcnd_hcn.tar.gz /resources/noaa-hdta/data/ghcnd_hcn.tar.gz
cd /resources/noaa-hdta/data
tar -xzf ghcnd_hcn.tar.gz
mv ./ghcnd_hcn/*.dly ./usa-daily/15mar2015/
rm -R ghcnd_hcn
ls -la
In [ ]:
%%bash
# Provide the inline code for obtaining the data (see the download instructions above).
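As one possible way to fill in this step, the sketch below fetches the archive with Python's standard library. It assumes the ghcnd_hcn.tar.gz file is still published under the FTP directory referenced above; adjust the URL or use your preferred download tool if not.
In [ ]:
# Sketch only: download the GHCN-Daily HCN archive into /resources so the
# bash cell above can move and extract it. Assumes Python 3, network access,
# and that the FTP path below (assumed, not confirmed here) is still valid.
import urllib.request

GHCN_URL = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_hcn.tar.gz'
urllib.request.urlretrieve(GHCN_URL, '/resources/ghcnd_hcn.tar.gz')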
In [ ]:
# Provide the inline code necessary for loading any required libraries
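As a starting point, a minimal set of imports for the munging work in this notebook might look like the following; this is an assumed set, and the actual requirements depend on which approach (CSV or HDF5) you run.
In [ ]:
# An assumed, minimal set of libraries for the munging steps in this notebook.
# Adjust to match the approach you choose (pandas with PyTables is one common
# way to read and write the HDF5 files used in Approach 2).
import os
import glob
import datetime

import numpy as np
import pandas as pd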
Upon review of the NOAA NCDC GHCN dataset, the data fails to meet the requirements of Tidy Data. A dataset is tidy if rows, columns and tables are matched up with observations, variables and types. In tidy data:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
In the case of the GHCN dataset, we are presented with files that contain observations for each day in a month for a given year. Each ".dly" file contains data for one station, and the name of the file corresponds to that station's identification code. Each record in a file contains one month of daily data for a single element (of which there are more than 20 types, e.g., TMAX, TMIN, PRCP). The variables on each line include the following:
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
------------------------------
A more detailed description of the data format is provided in the readme.txt file available at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily.
The variables of interest to this project have the following definitions:
ID : station identification code.
YEAR / MONTH : year and month of the record.
ELEMENT : element type; this project uses TMAX (maximum temperature) and TMIN (minimum temperature), both reported in tenths of degrees Celsius.
VALUEn : value of the element for the nth day of the month.
MFLAGn / QFLAGn / SFLAGn : measurement, quality, and source flags for the nth day.
where n denotes the day of the month (1-31). If the month has fewer than 31 days, the remaining values are set to missing (e.g., for April, VALUE31 = -9999 and MFLAG31, QFLAG31, and SFLAG31 are blank).
Here is a snippet depicting how the data is represented:
USC00011084201409TMAX 350 H 350 H 344 H 339 H 306 H 333 H 328 H 339 H 339 H 322 H 339 H 339 H 339 H 333 H 339 H 333 H 339 H 328 H 322 H 328 H 283 H 317 H 317 H 272 H 283 H 272 H 272 H 272 H-9999 -9999 -9999
USC00011084201409TMIN 217 H 217 H 228 H 222 H 217 H 217 H 222 H 233 H 233 H 228 H 222 H 222 H 217 H 211 H 217 H 217 H 211 H 206 H 200 H 189 H 172 H 178 H 122 H 139 H 144 H 139 H 161 H 206 H-9999 -9999 -9999
USC00011084201409TOBS 217 H 256 H 233 H 222 H 217 H 233 H 239 H 239 H 233 H 278 H 294 H 256 H 250 H 228 H 222 H 222 H 211 H 206 H 211 H 194 H 217 H 194 H 139 H 161 H 144 H 194 H 217 H 228 H-9999 -9999 -9999
USC00011084201409PRCP 0 H 0 H 0 H 13 H 25 H 8 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 25 H 178 H 0 H 0 H 56 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H-9999 -9999 -9999
The NOAA NCDC GHCN dataset includes -9999 as an indicator of missing observation data. We will take this into consideration as we transform the data into a usable format for the project.
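To make the fixed-width layout concrete, here is an illustrative sketch (not the project's ETL code) that parses a single .dly line into per-day TMIN/TMAX readings, skips the -9999 sentinel, and converts from the tenths of degrees Celsius used by GHCN-Daily to the Fahrenheit values our schemas call for:
In [ ]:
# Illustrative parser for one GHCN-Daily (.dly) record, based on the fixed-width
# layout shown above. This is a sketch, not the project's actual ETL code.
def parse_dly_line(line):
    """Yield (station, year, month, day, element, fahrenheit) for TMIN/TMAX values."""
    station = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]
    if element not in ('TMIN', 'TMAX'):
        return
    for day in range(1, 32):
        # VALUE1 occupies columns 22-26; each subsequent day block is 8 characters wide.
        offset = 21 + (day - 1) * 8
        value = int(line[offset:offset + 5])
        if value == -9999:              # missing observation sentinel
            continue
        celsius = value / 10.0          # GHCN-Daily stores temperatures in tenths of degrees C
        fahrenheit = celsius * 9.0 / 5.0 + 32.0
        yield (station, year, month, day, element, fahrenheit)

For example, the first value on the TMAX sample line above (350) works out to 35.0 degrees C, or 95 degrees F, for 1 September 2014.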
Before we can attempt to answer the questions driving this project, we must first map the data into a more reasonable format. As a result, the focus of this notebook will be the creation of new datasets that can be consumed by other notebooks in the project.
Data munging, or data wrangling, is loosely "the process of manually converting or mapping data from one raw form into another format that allows for more convenient consumption of the data."
This data munging endeavor is an undertaking unto itself, since our goal here is to extract new information from the machine-readable content and store it in a human-readable format. Essentially, we seek to explode (decompress) the data for general consumption and normalize it into a relational model that may be informative to users.
The goal here is to capture summary information about a given day throughout history at a specific weather station in the US. This dataset contains 365 rows where each row depicts the aggregated low and high temperature data for a specific day throughout the history of the weather station.
Column | Description |
---|---|
Station ID | Name of the US Weather Station |
Month | Month of the observations |
Day | Day of the observations |
FirstYearOfRecord | First year that this weather station started collecting data for this day in history. |
TMin | Current record low temperature (F) for this day in history. |
TMinRecordYear | Year in which current record low temperature (F) occurred. |
TMax | Current record high temperature (F) for this day in history. |
TMaxRecordYear | Year in which current record high temperature occurred. |
CurTMinMaxDelta | Difference in degrees F between record high and low records for this day in history. |
CurTMinRecordDur | Lifespan of the current low record temperature. |
CurTMaxRecordDur | Lifespan of the current high record temperature. |
MaxDurTMinRecord | Maximum years a low temperature record was held for this day in history. |
MinDurTMinRecord | Minimum years a low temperature record was held for this day in history. |
MaxDurTMaxRecord | Maximum years a high temperature record was held for this day in history. |
MinDurTMaxRecord | Minimum years a high temperature record was held for this day in history. |
TMinRecordCount | Total number of TMin records set on this day (does not include first since that may not be a record). |
TMaxRecordCount | Total number of TMax records set on this day (does not include first since that may not be a record). |
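As a rough sketch of how the record-tracking fields above could be derived (illustrative only, not the csvtools/hdftools implementation), one can replay a station's observations in year order for each calendar day and note whenever the running record changes:
In [ ]:
# Sketch: track record highs per (month, day) for one station from
# (year, month, day, temp_f) TMAX tuples. Illustrative only.
def summarize_record_highs(observations):
    records = {}   # (month, day) -> summary fields
    for year, month, day, temp_f in sorted(observations):
        key = (month, day)
        if key not in records:
            # The first observation seeds the record but is not counted as one.
            records[key] = {'FirstYearOfRecord': year, 'TMax': temp_f,
                            'TMaxRecordYear': year, 'TMaxRecordCount': 0}
        elif temp_f > records[key]['TMax']:
            records[key]['TMax'] = temp_f
            records[key]['TMaxRecordYear'] = year
            records[key]['TMaxRecordCount'] += 1
    return records

The TMin fields, the record durations, and the record-change log described next follow the same pattern, tracking minima and the years between record changes.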
The goal here is to capture details for each year that a record has changed for a specific weather station in the US. During the processing of the Historical Daily Summary dataset, we will log each occurrence of a new temperature record. Information in this file can be used to drill-down into and/or validate the summary file.
Column | Description |
---|---|
Station ID | Name of the US Weather Station |
Year | Year of the observation |
Month | Month of the observation |
Day | Day of the observation |
Type | Type of temperature record (Low = TMin, High = TMax) |
OldTemp | Temperature (F) prior to change. |
NewTemp | New temperature (F) record for this day. |
TDelta | Delta between old and new temperatures. |
The goal here is to capture details pertaining to missing data. Each record in this dataset represents a day in history that a specific weather station in the USA failed to observe a temperature reading.
Column | Description |
---|---|
Station ID | Name of the US Weather Station |
Year | Year of the missing observation |
Month | Month of the missing observation |
Day | Day of the missing observation |
Type | Type of temperature missing (Low = TMin, High = TMax) |
The goal here is to capture raw daily details. Each record in this dataset represents a specific temperature observation on a day in history at a specific weather station.
Column | Description |
---|---|
Station ID | Name of the US Weather Station |
Year | Year of the observation |
Month | Month of the observation |
Day | Day of the observation |
Type | Type of temperature reading (Low = TMin, High = TMax) |
FahrenheitTemp | Fahrenheit Temperature |
While this notebook is focused on daily temperature data, we could imagine future work associated with other observation types like snow accumulations and precipitation. Therefore, the format we choose to capture and store our desired data should also allow us to organize and append future datasets.
The HDF5 Python Library provides support for the standard Hierarchical Data Format (HDF5). This library will allow us to store multiple large datasets in a single file, organize them hierarchically (for example, by schema and station), and append new datasets, such as future observation types, over time.
However, HDF5 files can be very large, which could be a problem if we want to share the data. Alternatively, we could store the information in new collections of CSV files, where each .csv file contains weather-station-specific content for one of our target schemas.
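As a sketch of the trade-off (assuming pandas with PyTables as the HDF5 backend; the project's actual tooling lives in the csvtools and hdftools modules used below), the same station-level DataFrame can either be appended to a single HDF5 store or written out as one CSV per station:
In [ ]:
import pandas as pd

# A tiny illustrative frame in the "raw daily details" shape (values are made up).
df = pd.DataFrame({'StationID': ['USC00011084'], 'Year': [2014], 'Month': [9],
                   'Day': [1], 'Type': ['TMax'], 'FahrenheitTemp': [95.0]})

# Approach 2: append to a single HDF5 store; the 'raw_detail/<station>' key naming is assumed here.
with pd.HDFStore('/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5') as store:
    store.append('raw_detail/USC00011084', df)

# Approach 1: one CSV file per station -- easier to share, but spread across ~1,218 files.
df.to_csv('/resources/noaa-hdta/data/derived/15mar2015/raw/USC00011084.csv', index=False)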
The purpose of this phase is to process each NOAA daily station file (*.dly) and extract its content into per-station datasets that the next phase can consume.
Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 1. This will take about 20 minutes to process the 1218 or more weather station files. You should expect to see output like this:
>> Processing file 0: USC00207812
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Elapsed file execution time 0:00:00
>> Processing file 1: USC00164700
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
.
.
.
>> Processing file 1217: USC00200230
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Elapsed file execution time 0:00:00
>> Processing Complete.
>> Elapsed corpus execution time 0:19:37
In [ ]:
import mywb.noaa_hdta_etl_csv_tools as csvtools
In [ ]:
csvtools.help()
%inject csvtools.noaa_run_phase1_approach1
In [ ]:
# Approach 1 Content Layout for Gather Phases 1 and 2
htda_approach1_content_layout = {
'Content_Version': '15mar2015',
'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',
'Raw_Details': '/resources/noaa-hdta/data/derived/15mar2015/raw',
'Missing_Details': '/resources/noaa-hdta/data/derived/15mar2015/missing',
'Station_Summary': '/resources/noaa-hdta/data/derived/15mar2015/summaries',
'Station_Details': '/resources/noaa-hdta/data/derived/15mar2015/station_details',
}
In [ ]:
# Run Gather Phase 1 for all 1218 files using approach 1 (CSV)
csvtools.noaa_run_phase1_approach1(htda_approach1_content_layout)
In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/
Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 1. It will take less than 20 minutes to process the 1218 or more weather station files. Note: You will need to have room for about **6.5GB**. You should expect to see output like this:
>> Processing file 0: USC00207812
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.
>> Elapsed file execution time 0:00:00
>> Processing file 1: USC00164700
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.
.
.
.
>> Processing file 1217: USC00200230
Extracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.
>> Elapsed file execution time 0:00:00
>> Processing Complete.
>> Elapsed corpus execution time 0:17:43
In [ ]:
import mywb.noaa_hdta_etl_hdf_tools as hdftools
In [ ]:
hdftools.help()
%inject hdftools.noaa_run_phase1_approach2
In [ ]:
# Approach 2 Content Layout for Gather Phases 1 and 2
htda_approach2_content_layout = {
'Content_Version': '15mar2015',
'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',
'Raw_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5',
'Missing_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5',
'Station_Summary': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5',
'Station_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_details.h5',
}
In [ ]:
# Run Gather Phase 1 for all 1218 files using approach 2 (HDF5)
hdftools.noaa_run_phase1_approach2(htda_approach2_content_layout)
In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/hdf5/15mar2015/
The purpose of this phase is as follows: for each per-station raw dataset generated in Gather Phase 1 from the NOAA daily data files (*.dly), compute the historical daily summary and record-change details, and optionally the station detail dataset.
Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 2. This will take about 40 minutes to process the 1218 or more raw weather station files. You should expect to see output like this:
Processing dataset 0 - 1218: USC00011084
Processing dataset 1 - 1218: USC00012813
.
.
.
Processing dataset 1216: USC00130133
Processing dataset 1217: USC00204090
>> Processing Complete.
>> Elapsed corpus execution time 0:38:47
In [ ]:
# Decide if we need to generate station detail files.
csvtools.noaa_run_phase2_approach1.help()
In [ ]:
# Run Gather Phase 2 for all 1218 raw files using approach 1 (CSV)
csvtools.noaa_run_phase2_approach1(htda_approach1_content_layout, create_details=True)
You can compute the disk usage of your Gather Phase 2 results.
96M /resources/noaa-hdta/data/derived/15mar2015/missing
24M /resources/noaa-hdta/data/derived/15mar2015/summaries
3.2G /resources/noaa-hdta/data/derived/15mar2015/raw
129M /resources/noaa-hdta/data/derived/15mar2015/station_details
3.4G /resources/noaa-hdta/data/derived/15mar2015/
In [ ]:
%%bash
# Compute size of output folders
du -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/
Use the code below to lay out your target project environment (if it should differ from what is described herein) and then run the process for Gather Phase 2. This will take about 30 minutes to process the 1218 or more raw weather station files. Note: You will need to have room for about **6.5GB**. You should expect to see output like this:
Fetching keys for type = raw_detail
>> Fetch Complete.
>> Elapsed key-fetch execution time 0:00:09
Processing dataset 0 - 1218: USC00011084
Processing dataset 1 - 1218: USC00012813
.
.
.
Processing dataset 1216 - 1218: USW00094794
Processing dataset 1217 - 1218: USW00094967
>> Processing Complete.
>> Elapsed corpus execution time 0:28:48
%inject hdftools.noaa_run_phase2_approach2
Takes a dictionary of project folder details to drive the processing of Gather Phase 2 Approach 2 using HDF files.
To also generate station detail files, call noaa_run_phase2_approach2() with create_details=True. You will need additional free space to support this feature. Estimated requirement: **5GB**
In [ ]:
# Run Gather Phase 2 for all 1218 files using approach 2 (HDF)
hdftools.noaa_run_phase2_approach2(htda_approach2_content_layout)
You can compute the disk usage of your Gather Phase 2 results.
HDF File Usage (Phases 1 & 2) - Per File and Total
4.9G /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5
1.4G /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5
1.3G /resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5
7.4G /resources/noaa-hdta/data/hdf5/15mar2015/
In [ ]:
%%bash
# Compute size of output folders
echo "HDF File Usage (Phases 1 & 2) - Per File and Total"
du -ah /resources/noaa-hdta/data/hdf5/15mar2015/
This notebook provides two approaches to creating human-readable datasets for historical daily temperature analytics. It is one component of a package of notebooks. The tasks addressed herein focused on data munging activities to produce the desired datasets for several predefined schemas. These datasets can now be used in the other notebooks of the package for data exploration activities.
This notebook has embraced the concepts of reproducible research and can be shared with others so that they can recreate the data locally.