Exploratory data analysis: temperature data

7/30/2017 eda-temp-data.ipynb

Set up



In [46]:

    
import os
from urllib.request import urlretrieve
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')



In [2]:

    
#https://stackoverflow.com/questions/11936967/text-file-parsing-with-python
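def clean_data(filename):
    # Convert the whitespace-separated .txt download into a two-column CSV.
    # The with-block closes both files even if an error occurs part-way.
    with open(filename + '.txt') as inputfile, open(filename + '.csv', 'w') as outputfile:
        outputfile.write('Date,Temp\n')
        # Skip the header line; blank out the 99999.9 missing-value marker.
        for line in inputfile.readlines()[1:]:
            outputfile.write(','.join(line.split()).replace('99999.9', '') + '\n')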
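
To make the cleaning step concrete, here is a minimal sketch of what it does to a single row. The two sample rows are made up for illustration (the exact layout of the downloaded file is an assumption), and `clean_line` is a hypothetical helper that repeats the split/join/replace expression from `clean_data` above.

```python
# Hypothetical whitespace-separated rows; 99999.9 is the missing-value
# marker that clean_data() blanks out.
sample_lines = ['19180101    20.2\n', '19180102 99999.9\n']


def clean_line(line):
    # Same transformation clean_data() applies to each row:
    # collapse whitespace into commas, then drop the missing-value marker.
    return ','.join(line.split()).replace('99999.9', '')


cleaned = [clean_line(line) for line in sample_lines]
print(cleaned)
```

A row with the missing-value marker ends up with an empty Temp field, which `pd.read_csv` then reads as NaN.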



In [3]:

    
def get_data(url, filename, force=False):
    if force or not os.path.exists(filename + '.txt'): 
        urlretrieve(url, filename + '.txt')
    if force or not os.path.exists(filename + '.csv'):
        clean_data(filename)

Get data



In [4]:

    
#http://www.bom.gov.au/climate/change/acorn-sat/#tabs=Data-and-networks
maxURL = 'http://www.bom.gov.au/climate/change/acorn/sat/data/acorn.sat.maxT.094029.daily.txt'
maxFile = 'hobart-max'
get_data(maxURL, maxFile)
data = pd.read_csv('hobart-max.csv', index_col='Date', parse_dates=True)

Examining numerical data



In [5]:

    
data.shape









    Out[5]:





(36219, 1)



In [6]:

    
data.head()



In [7]:

    
data.describe()









    Out[7]:







  
    
      
               Temp
count  36180.000000
mean      17.491993
std        4.950748
min        4.700000
25%       13.800000
50%       17.100000
75%       20.400000
max       41.800000



In [8]:

    
# measures of variability

# variance - average squared deviation from the mean (pandas uses the sample variance, ddof=1)
print(data.var())

# standard deviation - square root of variance
print(data.std())









    



Temp    24.509904
dtype: float64
Temp    4.950748
dtype: float64
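
Variance is the average squared deviation from the mean, and the standard deviation is its square root. The sketch below works this through by hand on a small made-up sample (not the Hobart data) and compares the result with pandas, which divides by n - 1 by default.

```python
import numpy as np
import pandas as pd

# Small made-up sample, not the Hobart data.
temps = pd.Series([12.0, 15.0, 17.0, 20.0, 26.0])

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
deviations = temps - temps.mean()
manual_var = (deviations ** 2).sum() / (len(temps) - 1)

# Standard deviation is the square root of the variance.
manual_std = np.sqrt(manual_var)

print(manual_var, temps.var())
print(manual_std, temps.std())
```

Both lines should print matching pairs, since `Series.var()` and `Series.std()` use the same n - 1 divisor.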



In [9]:

    
def apply_common(title=''):
    #ax.set_ylim(-5,45)
    ax.set_title(title)
    ax.set_xlabel('Date')
    ax.set_ylabel('Temperature (°C)')
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)



In [10]:

    
ax = data.plot()
apply_common('All data')



In [11]:

    
plt.scatter(data.index, data['Temp'], marker='.')
plt.show()



In [12]:

    
#distribution - unimodal, right-skewed
data.hist()









    Out[12]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000025A8BF24208>]], dtype=object)



In [13]:

    
# summarizes a data set using five statistics while also plotting unusual observations
# box is middle 50% of data, line in box is median
# total length of the box is the interquartile range (IQR)
# whiskers reach the most extreme observations within 1.5 * IQR of the box
# observations beyond the whiskers are plotted as outliers

filtered_data = data.dropna()
boxplot_data = [filtered_data['Temp']]
plt.boxplot(boxplot_data)
plt.xticks([1], ['max temp'])
plt.show()
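
The quantities a box plot draws can also be computed directly. This is a minimal sketch on made-up values (not the Hobart data), assuming the common 1.5 * IQR rule for the whiskers; the variable names are illustrative only.

```python
import numpy as np

# Made-up values, not the Hobart data; 40.0 is deliberately extreme.
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 40.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers stop at the most extreme observations inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers end at actual data points, not at the fences themselves.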

Examining categorical data

Bar plot, segmented bar plot, standardized segmented bar plot, mosaic plot, pie chart



In [14]:

    
data['Day'] = data.index.dayofweek



In [15]:

    
data.head()



In [16]:

    
data.describe()









    Out[16]:







  
    
      
               Temp           Day
count  36180.000000  36219.000000
mean      17.491993      2.999945
std        4.950748      2.000028
min        4.700000      0.000000
25%       13.800000      1.000000
50%       17.100000      3.000000
75%       20.400000      5.000000
max       41.800000      6.000000



In [23]:

    
# group rows by day of week; printing shows only the lazy groupby object
x = data.groupby(data.Day)
print(x)
#plt.bar(x,7)
#plt.show()









    



<pandas.core.groupby.DataFrameGroupBy object at 0x0000025ABDEF7320>
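
A groupby object like the one printed above holds no numbers until it is aggregated, so it cannot be passed to a bar plot directly. As a minimal sketch, assuming the goal is a bar plot of observation counts per day of week, the counts can be built first. The `demo` frame is made up for illustration and is not the Hobart data.

```python
import pandas as pd

# Made-up daily series covering two full weeks; not the Hobart data.
dates = pd.date_range('1918-01-01', periods=14, freq='D')
demo = pd.DataFrame({'Temp': range(14)}, index=dates)

# Monday=0 ... Sunday=6, the same encoding as the Day column.
demo['Day'] = demo.index.dayofweek

# Count observations in each category, ordered by day number.
counts = demo['Day'].value_counts().sort_index()
print(counts)

# In the notebook, counts.plot(kind='bar') would draw the bar plot.
```

Because day of week is encoded as an integer, `describe()` reports a mean and standard deviation for it, but counts per category are the more meaningful summary.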

Measures of center



In [43]:

    
print('Mean is {0}'.format(data.Temp.mean()))









    



Mean is 17.49199281370915



In [44]:

    
print('Median is {0}'.format(data.Temp.median()))









    



Median is 17.1



In [45]:

    
print('Mode is {0}'.format(data.Temp.mode()))









    



Mode is 0    20.2
dtype: float64

Measures of spread



In [42]:

    
print('Range is {0} to {1}'.format(data.Temp.min(), data.Temp.max()))









    



Range is 4.7 to 41.8



In [52]:

    
print('IQR is {0}'.format(stats.iqr(data.Temp.dropna())))









    



IQR is 6.599999999999998



In [53]:

    
print('Variance is {0}'.format(data.Temp.var()))









    



Variance is 24.509903820395614



In [54]:

    
print('Standard deviation is {0}'.format(data.Temp.std()))









    



Standard deviation is 4.950747804160056

Measures of shape



In [55]:

    
data.Temp.hist()









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x25abf8fd4e0>

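Symmetry: Distribution is unimodal and roughly symmetric. Unimodal does not by itself mean normal.

Skewness: Distribution is very mildly right (positively) skewed, i.e., the tail extends to the right, toward larger values. Consistent with this, the mean (17.49) is greater than the median (17.1).

Kurtosis: Hard to judge from the histogram alone. The positive kurtosistest statistic below points to heavier tails than a normal distribution (leptokurtic) rather than platykurtic.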
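
Skewness and kurtosis can also be measured directly instead of being read off the histogram. The sketch below uses synthetic samples (not the Hobart data) to show the sign conventions of `scipy.stats.skew` and `scipy.stats.kurtosis`; the sample sizes and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples, not the Hobart data.
right_skewed = rng.exponential(size=10_000)   # long tail to the right
flat = rng.uniform(size=10_000)               # flatter than a normal curve

# skew > 0 means a right tail; skew < 0 means a left tail.
skew_right = stats.skew(right_skewed)

# kurtosis() returns excess kurtosis by default (normal = 0):
# positive is leptokurtic, negative is platykurtic.
kurt_right = stats.kurtosis(right_skewed)
kurt_flat = stats.kurtosis(flat)

print(skew_right, kurt_right, kurt_flat)
print(right_skewed.mean() > np.median(right_skewed))
```

Applying the same two calls to `data.Temp.dropna()` would give the sample skewness and excess kurtosis for the temperature series.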



In [62]:

    
stats.normaltest(data.Temp.dropna())









    Out[62]:





NormaltestResult(statistic=2665.5752415225479, pvalue=0.0)



In [61]:

    
stats.skewtest(data.Temp.dropna())









    Out[61]:





SkewtestResult(statistic=47.542811993468227, pvalue=0.0)



In [59]:

    
stats.kurtosistest(data.Temp.dropna())









    Out[59]:





KurtosistestResult(statistic=20.130977851964417, pvalue=3.9512027615088447e-90)



Out[6] (data.head()):

            Temp
Date
1918-01-01  20.2
1918-01-02  20.0
1918-01-03  28.2
1918-01-04  19.8
1918-01-05  20.2

Out[15] (data.head()):

            Temp  Day
Date
1918-01-01  20.2    1
1918-01-02  20.0    2
1918-01-03  28.2    3
1918-01-04  19.8    4
1918-01-05  20.2    5
