Web-site Demo

http://myweatherproject.s3-website-us-east-1.amazonaws.com/

If you are viewing this on GitHub, here are samples of the index web-page and the Chicago city web-page:

index.html
chicago.html

Architecture

EC2

Hourly cronjob (meets API data restrictions)
weather_api.py

Weather API

Obtains data from Weather Underground
Creates a list of tuples: (city, current_weather), for speed layer
Ships raw data to Firehose

Speed Layer

Pulls in web-site HTML as String
Updates Current Weather for each City using Regular Expressions
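
The regular-expression update can be sketched roughly as follows. This is only an illustration: the `<td id="current-...">` markup and the function name `update_current_weather` are assumptions, since the real page structure is not shown here. The returned count corresponds to the "number of cities updated" reported in the completion e-mail.

```python
import re


def update_current_weather(html, updates):
    """Replace the current-weather text for each city in the page HTML.

    Assumes (hypothetically) that each city's value sits in a tag such as
    <td id="current-Chicago">...</td>; the real markup may differ.
    Returns the new HTML and the number of cities that were updated.
    """
    updated = 0
    for city, current_weather in updates:
        pattern = re.compile(
            r'(<td id="current-' + re.escape(city) + r'">)(.*?)(</td>)',
            re.DOTALL,
        )
        # A function replacement keeps backslashes in the weather text literal.
        html, n = pattern.subn(
            lambda m, w=current_weather: m.group(1) + w + m.group(3),
            html,
            count=1,
        )
        updated += n
    return html, updated


# Sample data: a list of (city, current_weather) tuples, as produced by the Weather API step.
page = '<tr><td>Chicago</td><td id="current-Chicago">70F Clear</td></tr>'
new_page, count = update_current_weather(
    page, [("Chicago", "65F Rain"), ("Boston", "50F Fog")]
)
```

Here Boston is not on the sample page, so only one city is counted as updated.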

E-mail

Sends a completion e-mail once the Weather API and Speed Layer steps are complete
Indicates number of cities updated (expecting all)

Kinesis

Firehose

Packages raw data from all cities together into a single file
Ships raw data to S3

S3

Raw Data

Stores raw data

Web Host

Hosts web-site

EMR Spark Cluster

Normalize

city: Data about the city
nearby: Nearby locations for each city
cityDay: Data about the city on the given date
weather: Weather data for the given city, date, and time
forecast: Weather forecast for the city, retrieved at the given date and time, for the forecast date and forecast time

path: All S3 paths that have been loaded into the tables
stats: Output from analyze job discussed below

Normalize Process

Hourly cronjob following EC2 API job
weather_normalize.py
Load each table from parquet*
Check S3 for any/all new files that are not in "path" table
For each new file:
- Normalize the file's data
- Add filepath "source" data for each record (track lineage)
- Append new data to full tables
Enforce keys (see below)
Write back to parquet
Send Job Completion E-mail

* A self-healing mechanism recreates the tables from raw data if issues are encountered with the parquet files. This was used during development but hasn't been needed in production.
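
The "check S3 for new files" step boils down to a set difference between the keys in the bucket and the keys already recorded in the "path" table. A minimal sketch, assuming both are available as plain lists of strings (the function name and the sample key names are hypothetical):

```python
def find_new_files(s3_keys, loaded_paths):
    """Return S3 keys not yet recorded in the "path" table, in sorted order."""
    already_loaded = set(loaded_paths)
    return sorted(key for key in s3_keys if key not in already_loaded)


# Hypothetical key names, for illustration only.
s3_keys = ["raw/2017/03/01/10", "raw/2017/03/01/11", "raw/2017/03/01/12"]
loaded_paths = ["raw/2017/03/01/10"]
new_files = find_new_files(s3_keys, loaded_paths)
```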

Forecast Problem/Solution

Problem - can't explode multiple columns
Solution - switch to RDD

DataFrame:
City, Date, Time, [forecast date/times], [forecast temperatures], [forecast humidity], [ ]...

RDD:
Zip:
City, Date, Time, zip(forecast date/times, forecast temps, hum etc.)
City, Date, Time, [(dt, temp, hum, ...), (dt, temp, hum, ...), (dt, temp, hum, ...), ...]

Reshape:
[(city, date, time, dt, temp, hum, ...), (city, date, time, dt, temp, hum, ...), ...]

FlatMap:
(city, date, time, dt, temp, hum, ...)

Switch Back to DF
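
The zip, reshape, and flatMap steps can be illustrated in plain Python. This is a sketch, not the code in weather_normalize.py: the function name `explode_forecast`, the three forecast columns, and the sample rows are assumptions. With Spark, the same function would be passed to `rdd.flatMap`.

```python
def explode_forecast(row):
    """Turn one wide forecast row into one tuple per forecast hour.

    row = (city, date, time, [forecast datetimes], [temps], [humidities])
    """
    city, date, time, fc_times, fc_temps, fc_hums = row
    # Zip: pair up the parallel arrays element by element.
    zipped = zip(fc_times, fc_temps, fc_hums)
    # Reshape: prefix every zipped element with the row's key columns.
    return [(city, date, time, dt, temp, hum) for dt, temp, hum in zipped]


# Made-up sample rows in the wide DataFrame layout.
rows = [
    ("Chicago", "2017-03-01", "11:00", ["12:00", "13:00"], [40, 42], [55, 50]),
    ("Boston", "2017-03-01", "11:00", ["12:00"], [35], [60]),
]
# FlatMap: with Spark this would be rdd.flatMap(explode_forecast); a plain
# Python comprehension does the same flattening here.
flat = [record for row in rows for record in explode_forecast(row)]
```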

Enforce Keys

I noticed that Weather Underground sent me two different historical temperatures for the same city/day (they differed by 1 degree).
If I simply append the new data, Weather Underground's data may violate my keys.
To enforce my keys, I keep only the most recent data provided by Weather Underground for each key.
Because I track the data lineage (source) of each record, I can accomplish this as follows:

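    # Assumes an active SparkSession named `spark` and a registered view cityDay2V
    spark.sql('''
        select *
        from
            (select *
            ,row_number() over(partition by city, date order by source desc) as rk
            from cityDay2V) ranked
        where rk = 1
    ''').drop('rk')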

I enforce keys for every table.

Analyze

Hourly cronjob following the Web Update job (discussed first because the web update uses the previously analyzed data)
weather_analyze.py
Load tables from Parquet
Join the actual weather that occurred back onto the previous forecasts that were made
Truncate minutes and join on the hour (reasonable since most data arrived between xx:00 and xx:02)
Calculate the number of hours between forecast and actual weather (call it "forecast hours")
- For example, if at 11:00 we forecast the weather for 2:00, the forecast hours are 3
Calculate the difference between the forecast weather features and the actual weather features
Calculate counts, means, and standard deviations for the differences by "forecast hours"
Write Stats to Parquet
Send Job Completion E-mail
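
The "forecast hours" calculation and the grouped statistics can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, the difference is taken as forecast minus actual, and the sample standard deviation is used. The actual job does this on Spark DataFrames.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def forecast_hours(made_at, valid_at):
    """Whole hours between when a forecast was made and the hour it targets."""
    # Truncate minutes so that 11:02 is treated as 11:00.
    made = made_at.replace(minute=0, second=0, microsecond=0)
    valid = valid_at.replace(minute=0, second=0, microsecond=0)
    return int((valid - made).total_seconds() // 3600)


def error_stats(pairs):
    """pairs: (made_at, valid_at, forecast_temp, actual_temp) tuples.

    Returns {forecast_hours: (count, mean_diff, std_diff)}.
    """
    diffs = defaultdict(list)
    for made_at, valid_at, forecast_temp, actual_temp in pairs:
        # Assumed sign convention: forecast minus actual.
        diffs[forecast_hours(made_at, valid_at)].append(forecast_temp - actual_temp)
    return {
        h: (len(d), mean(d), stdev(d) if len(d) > 1 else 0.0)
        for h, d in diffs.items()
    }


# Made-up sample data: two 3-hour forecasts and one 1-hour forecast.
pairs = [
    (datetime(2017, 3, 1, 11, 1), datetime(2017, 3, 1, 14, 0), 42, 40),
    (datetime(2017, 3, 2, 11, 0), datetime(2017, 3, 2, 14, 2), 38, 40),
    (datetime(2017, 3, 1, 11, 0), datetime(2017, 3, 1, 12, 0), 41, 41),
]
stats = error_stats(pairs)
```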

Web Update

Hourly cronjob following Normalize process
weather_report.py
Load tables from Parquet
Phase 1: Preprocess DataFrames to filter on the Current Data needed and Cache the smaller tables
Phase 2: For each city:
- Query all the different tables for the current data for each section of the html report
- Create city web-page html using both strings and pandas DataFrame.to_html()
- Create plots by joining stats with forecast, calculating confidence intervals, using DataFrame.plot(), and saving each image to S3
Create index and error web-pages.
Send Job Completion E-mail

Hourly E-mails

Appendix - Big Data System Properties