Homework 3

You will have to submit the following two completed ipython notebooks for this homework.

  1. PCA_analysis
  2. Reconstruction

In [2]:
%pylab inline
data_dir = "./Data/Weather/"


Populating the interactive namespace from numpy and matplotlib

Downloading Pickled data from S3

If STAT.pickle is not in the directory, get it using the following command


In [3]:
!curl -o $data_dir/STAT.pickle http://mas-dse-open.s3.amazonaws.com/Weather/STAT.pickle


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 77.8M  100 77.8M    0     0  2278k      0  0:00:35  0:00:35 --:--:-- 4389k

Get the statistics from the Pickle File


In [4]:
import pickle
STAT,STAT_description=pickle.load(open(data_dir+'/STAT.pickle','r'))

In [5]:
STAT.keys()


Out[5]:
['TMIN', 'TOBS', 'TMAX', 'SNOW', 'SNWD', 'PRCP']

In [6]:
STAT_description


Out[6]:
[('SortedVals',
  'Sample of values',
  'vector whose length varies between measurements'),
 ('UnDef',
  'sample of number of undefs per row',
  'vector whose length varies between measurements'),
 ('mean', 'mean value', ()),
 ('std', 'std', ()),
 ('low100', 'bottom 1%', ()),
 ('high100', 'top 1%', ()),
 ('low1000', 'bottom 0.1%', ()),
 ('high1000', 'top 0.1%', ()),
 ('E', 'Sum of values per day', (365,)),
 ('NE', 'count of values per day', (365,)),
 ('Mean', 'E/NE', (365,)),
 ('O', 'Sum of outer products', (365, 365)),
 ('NO', 'counts for outer products', (365, 365)),
 ('Cov', 'O/NO', (365, 365)),
 ('Var', 'The variance per day = diagonal of Cov', (365,)),
 ('eigval', 'PCA eigen-values', (365,)),
 ('eigvec', 'PCA eigen-vectors', (365, 365))]

In [7]:
Scalars=['mean','std','low1000','low100','high100','high1000']
for meas in STAT.keys():
    !grep $meas './Data/Weather/ghcnd-readme.txt'
    S=STAT[meas]
    for scalar in Scalars:
        print '%s:%f'%(scalar,S[scalar]),
    print


           TMIN = Minimum temperature (tenths of degrees C)
mean:44.531018 std:109.933072 low1000:-350.000000 low100:-245.000000 high100:239.000000 high1000:278.000000
 	   TOBS = Temperature at the time of observation (tenths of degrees C)
mean:113.582223 std:119.255224 low1000:-267.000000 low100:-167.000000 high100:339.000000 high1000:389.000000
           TMAX = Maximum temperature (tenths of degrees C)
mean:175.823101 std:123.742076 low1000:-233.000000 low100:-122.000000 high100:383.000000 high1000:433.000000
   	   SNOW = Snowfall (mm)
mean:2.126900 std:24.160393 low1000:0.000000 low100:0.000000 high100:76.000000 high1000:254.000000
	   SNWD = Snow depth (mm)
mean:21.454498 std:123.727039 low1000:0.000000 low100:0.000000 high100:508.000000 high1000:1676.000000
           PRCP = Precipitation (tenths of mm)
mean:24.277398 std:100.174134 low1000:0.000000 low100:0.000000 high100:356.000000 high1000:830.000000

In [ ]:

Script for plotting yearly plots


In [8]:
def YearlyPlots(T,ttl='',yl='',xl='',y=None,x=None,size=(10,7), c=None):
    yearday=[i for i in range(1,366)]
    fig=figure(1,figsize=size,dpi=300)
    if shape(T)[0] != 365:
        raise ValueError("First dimension of T should be 365. Shape(T)="+str(shape(T)))
    if c is not None:
        plot_date(yearday,T, '-',color=c);
    else:
        plot_date(yearday,T, '-', );
    # rotate and align the tick labels so they look better
    #fig.autofmt_xdate()
    plt.gca().xaxis.set_major_formatter( DateFormatter('%b') )
    ylabel(yl)
    xlabel(xl)
    if y is not None:
        ylim(y)
    if x is not None:
        xlim(x)
    grid()
    title(ttl)

Plot the following 3 plots for each measurement:

  1. A histogram from the sample values (from SortedVals) restricted between low100 and high100 (By which we mean that any value larger or equal to low100 and smaller or equal to high100 is included).
  2. Plot of mean and mean $\pm$ std
  3. Number of measurements recorded each day

In [9]:
figure(figsize=(15,30))
offset=1
for meas in STAT.keys():
    subplot(6,3,offset)
    offset+=1
    
    S=STAT[meas]
    pyplot.hist(S['SortedVals'],bins=np.arange((S['low100']), (S['high100']), 5))
    
    subplot(6,3,offset)
    offset+=1
    ## Your code for mean and mean +- std
    std = sqrt(STAT[meas]["Var"])
    mean_y = STAT[meas]['Mean']
    std_plus_y = mean_y + std
    std_minus_y = mean_y - std
    YearlyPlots(mean_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='r')
    YearlyPlots(std_plus_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='b')
    YearlyPlots(std_minus_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='b')
    
    
    subplot(6,3,offset)
    offset+=1
    YearlyPlots(STAT[meas]['NE'], ttl="counts")


Plot the Number of measurements recorded each day for TMAX


In [15]:
YearlyPlots(STAT["TMAX"]['NE'], ttl="counts", c='black')


Extra Credit

  • Can you figure out what is the reason for these lower counts (especially at the beginning and end of the year and also the sudden dip at the end of each month)? Is it restricted to a subset of the stations? Suggest a way to remove this effect.

  • Can you Explain the counts per day for "SNWD" ?

Provide your explanation in new markdown cells appended after this cell. Support your explanation using code cells and graphs. If you need more data that is available only in the full dataset in the cloud but not in the data you have downloaded, contact your TA.

Plot the following 3 plots for each measurement:

  1. The percentage of variance explained by top-k eigen vectors for k between 1 to 9
  2. Plot of mean and mean $\pm$ std
  3. Plot of top 3 eigenvectors

In [ ]:


In [ ]:
figure(figsize=(15,30))
offset=1
for meas in STAT.keys():
    subplot(6,3,offset)
    offset+=1
    ## Your code for percentage of variance explained
    S=STAT[meas]
    tvar = S['eigval'].sum()
    explained_variance_ratio_ = S['eigval']/tvar
    pyplot.plot(np.cumsum(explained_variance_ratio_[:10]))
    
    
    
    subplot(6,3,offset)
    offset+=1
    ## Your code for mean and mean +- std
    std = sqrt(STAT[meas]["Var"])
    mean_y = STAT[meas]['Mean']
    std_plus_y = mean_y + std
    std_minus_y = mean_y - std
    YearlyPlots(mean_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='r')
    YearlyPlots(std_plus_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='b')
    YearlyPlots(std_minus_y,ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='b')
                
                
    subplot(6,3,offset)
    offset+=1
    ## Your code for top-3 eigenvectors
    YearlyPlots(S['eigvec'][:,0], ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='r')
    YearlyPlots(S['eigvec'][:,1], ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='g')
    YearlyPlots(S['eigvec'][:,2], ttl=meas+' mean +-std',yl='',xl='',y=None,x=None,size=(10,7), c='b')

In [ ]: