Exploratory data analysis: temperature data

7/30/2017 eda-temp-data.ipynb

Set up


In [46]:
import os
from urllib.request import urlretrieve
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')

In [2]:
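#https://stackoverflow.com/questions/11936967/text-file-parsing-with-python
def clean_data(filename):
    # Convert the whitespace-delimited .txt file into a two-column .csv file.
    # The with block closes both files even if an error occurs part-way through.
    with open(filename + '.txt') as inputfile, open(filename + '.csv', 'w') as outputfile:
        outputfile.write('Date,Temp\n')
        next(inputfile, None)  # skip the header line of the source file
        for line in inputfile:
            # blank out the 99999.9 placeholder so pandas reads the field as NaN
            outputfile.write(','.join(line.split()).replace('99999.9', '') + '\n')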

In [3]:
def get_data(url, filename, force=False):
    if force or not os.path.exists(filename + '.txt'): 
        urlretrieve(url, filename + '.txt')
    if force or not os.path.exists(filename + '.csv'):
        clean_data(filename)
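As a rough illustration of what clean_data does to each row, the sketch below applies the same split/join/replace steps to two made-up lines. The helper name clean_line and the sample rows are assumptions for illustration only; the layout of the real file may differ.

```python
# Hypothetical helper mirroring the per-line transform inside clean_data.
def clean_line(line, missing='99999.9'):
    # Collapse runs of whitespace into single commas, then blank out the
    # placeholder value so a CSV reader sees an empty field.
    return ','.join(line.split()).replace(missing, '')

# Made-up sample rows (assumed layout: date, then temperature).
sample = ['19180101    20.2\n', '19180102 99999.9\n']
cleaned = [clean_line(row) for row in sample]
print(cleaned)
```

An empty field after the comma is what lets pd.read_csv treat the reading as missing rather than as a very large temperature.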

Get data


In [4]:
#http://www.bom.gov.au/climate/change/acorn-sat/#tabs=Data-and-networks
maxURL = 'http://www.bom.gov.au/climate/change/acorn/sat/data/acorn.sat.maxT.094029.daily.txt'
maxFile = 'hobart-max'
get_data(maxURL, maxFile)
data = pd.read_csv(maxFile + '.csv', index_col='Date', parse_dates=True)

Examining numerical data


In [5]:
data.shape


Out[5]:
(36219, 1)

In [6]:
data.head()


Out[6]:
Temp
Date
1918-01-01 20.2
1918-01-02 20.0
1918-01-03 28.2
1918-01-04 19.8
1918-01-05 20.2

In [7]:
data.describe()


Out[7]:
Temp
count 36180.000000
mean 17.491993
std 4.950748
min 4.700000
25% 13.800000
50% 17.100000
75% 20.400000
max 41.800000

In [8]:
# measures of variability

# variance - average squared deviation from the mean (pandas divides by n - 1)
print(data.var())

# standard deviation - square root of variance
print(data.std())


Temp    24.509904
dtype: float64
Temp    4.950748
dtype: float64
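To make the relationship between the two numbers concrete, here is a minimal sketch that computes the sample variance and standard deviation by hand for the five readings shown by data.head() above, using only the standard library. It assumes the n - 1 divisor, which is the pandas default for var() and std().

```python
import math
import statistics

# First five readings shown by data.head() above.
temps = [20.2, 20.0, 28.2, 19.8, 20.2]

mean = sum(temps) / len(temps)
squared_devs = [(t - mean) ** 2 for t in temps]

# Sample variance: sum of squared deviations divided by n - 1.
sample_var = sum(squared_devs) / (len(temps) - 1)
# Standard deviation: square root of the variance, back in degrees.
sample_std = math.sqrt(sample_var)

# The standard library should agree with the manual calculation.
assert math.isclose(sample_var, statistics.variance(temps))
assert math.isclose(sample_std, statistics.stdev(temps))
print(sample_var, sample_std)
```

Because the variance is in squared units, the standard deviation is usually the easier of the two to read against the temperatures themselves.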

In [9]:
def apply_common(title=''):
    #ax.set_ylim(-5,45)
    ax.set_title(title)
    ax.set_xlabel('Date')
    ax.set_ylabel('Temperature (°C)')
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

In [10]:
ax = data.plot()
apply_common('All data')



In [11]:
plt.scatter(data.index, data['Temp'], marker='.')
plt.show()



In [12]:
#distribution - unimodal, right-skewed
data.hist()


Out[12]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000025A8BF24208>]], dtype=object)

In [13]:
# a box plot summarizes a data set using five statistics while also plotting unusual observations
# the box spans the middle 50% of the data; the line inside the box is the median
# the length of the box is the interquartile range (IQR)
# whiskers extend to the most extreme observations within 1.5 * IQR of the box
# observations beyond the whiskers are plotted as outliers

filtered_data = data.dropna()
boxplot_data = [filtered_data['Temp']]
plt.boxplot(boxplot_data)
plt.xticks([1], ['max temp'])
plt.show()
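The quantities a box plot draws can be worked out by hand. The sketch below does so for the five readings from data.head(), using only the standard library; it assumes the 1.5 * IQR whisker rule, which is the matplotlib default, and linear interpolation between sorted values for the quartiles.

```python
import statistics

# First five readings from data.head(), sorted for readability.
temps = sorted([20.2, 20.0, 28.2, 19.8, 20.2])

# method='inclusive' interpolates linearly between the sorted values.
q1, median, q3 = statistics.quantiles(temps, n=4, method='inclusive')
iqr = q3 - q1

# Fences: anything outside them is drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in temps if t < lower_fence or t > upper_fence]

# Whiskers stop at the most extreme observations still inside the fences.
inside = [t for t in temps if lower_fence <= t <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print('quartiles:', q1, median, q3)
print('whiskers:', whisker_low, whisker_high)
print('outliers:', outliers)
```

With so few points the box is very narrow, so the single warm day falls outside the upper fence and is flagged as an outlier.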


Examining categorical data

Bar plot, segmented bar plot, standardized segmented bar plot, mosaic plot, pie chart.


In [14]:
data['Day'] = data.index.dayofweek

In [15]:
data.head()


Out[15]:
Temp Day
Date
1918-01-01 20.2 1
1918-01-02 20.0 2
1918-01-03 28.2 3
1918-01-04 19.8 4
1918-01-05 20.2 5

In [16]:
data.describe()


Out[16]:
Temp Day
count 36180.000000 36219.000000
mean 17.491993 2.999945
std 4.950748 2.000028
min 4.700000 0.000000
25% 13.800000 1.000000
50% 17.100000 3.000000
75% 20.400000 5.000000
max 41.800000 6.000000

In [23]:
x = data.groupby(data.Day)
print(x)
#plt.bar(x,7)
#plt.show()


<pandas.core.groupby.DataFrameGroupBy object at 0x0000025ABDEF7320>
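A bar plot of a categorical variable needs one count per category, not the group object itself. As a minimal sketch using only the standard library, the code below counts observations per day of the week over a made-up two-week window starting on the first date in the data set and prints a text bar for each day. The window length and the names list are assumptions for illustration; in pandas, data['Day'].value_counts() gives the per-category counts.

```python
from collections import Counter
from datetime import date, timedelta

# Two weeks of consecutive dates starting at the first date in the data set.
start = date(1918, 1, 1)
days = [start + timedelta(days=i) for i in range(14)]

# weekday() uses the same convention as pandas dayofweek: Monday=0 ... Sunday=6.
counts = Counter(d.weekday() for d in days)

names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for day in range(7):
    # One '#' per observation: a text stand-in for the height of each bar.
    print(names[day], '#' * counts[day])
```

Over a long daily record every weekday occurs almost equally often, so such a bar plot would be nearly flat; comparing a statistic such as the mean temperature per weekday would be the more informative variant.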

Measures of center


In [43]:
print('Mean is {0}'.format(data.Temp.mean()))


Mean is 17.49199281370915

In [44]:
print('Median is {0}'.format(data.Temp.median()))


Median is 17.1

In [45]:
print('Mode is {0}'.format(data.Temp.mode()))


Mode is 0    20.2
dtype: float64
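The mode prints as a Series rather than a single number because several values can tie for most frequent, and pandas returns all of them. A small standard-library sketch of the three measures, using the five readings from data.head() plus a made-up tied sample:

```python
import statistics

# First five readings from data.head().
temps = [20.2, 20.0, 28.2, 19.8, 20.2]

mean = statistics.mean(temps)
median = statistics.median(temps)
# multimode returns every value that ties for most frequent.
modes = statistics.multimode(temps)
print(mean, median, modes)

# Made-up sample with a tie: both values are modes.
tied_modes = statistics.multimode([1, 1, 2, 2])
print(tied_modes)
```

The single hot day pulls the mean above the median here, which is the same pattern the full data set shows on a smaller scale.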

Measures of spread


In [42]:
print('Range is {0} to {1}'.format(data.Temp.min(), data.Temp.max()))


Range is 4.7 to 41.8

In [52]:
print('IQR is {0}'.format(stats.iqr(data.Temp.dropna())))


IQR is 6.599999999999998

In [53]:
print('Variance is {0}'.format(data.Temp.var()))


Variance is 24.509903820395614

In [54]:
print('Standard deviation is {0}'.format(data.Temp.std()))


Standard deviation is 4.950747804160056

Measures of shape


In [55]:
data.Temp.hist()


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x25abf8fd4e0>

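Symmetry: Distribution is unimodal and roughly bell-shaped, but not exactly normal (the normality test below rejects it).

Skewness: Distribution is very mildly right, or positive, skewed (i.e., the tail is to the right, towards larger numbers). Mean > median.

Kurtosis: Distribution appears mildly leptokurtic rather than platykurtic: the positive kurtosistest statistic below indicates heavier tails than a normal distribution.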
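The signs of skewness and kurtosis can be checked directly from the central moments. The sketch below is a plain-Python version of the usual moment-based definitions (skewness as m3 / m2^1.5, excess kurtosis as m4 / m2^2 - 3); the function names and both samples are made up for illustration.

```python
def central_moment(values, k):
    # k-th central moment: mean of (value - mean) ** k.
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Positive when the long tail is to the right.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Zero for a normal distribution; positive = leptokurtic, negative = platykurtic.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # one large value stretches the right tail
flat = [1, 2, 3, 4, 5]                     # evenly spread, no tails to speak of

print(skewness(right_tailed), excess_kurtosis(right_tailed))
print(skewness(flat), excess_kurtosis(flat))
```

The right-tailed sample comes out positively skewed and leptokurtic, while the evenly spread sample has zero skewness and negative excess kurtosis, i.e. it is platykurtic.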


In [62]:
stats.normaltest(data.Temp.dropna())


Out[62]:
NormaltestResult(statistic=2665.5752415225479, pvalue=0.0)

In [61]:
stats.skewtest(data.Temp.dropna())


Out[61]:
SkewtestResult(statistic=47.542811993468227, pvalue=0.0)

In [59]:
stats.kurtosistest(data.Temp.dropna())


Out[59]:
KurtosistestResult(statistic=20.130977851964417, pvalue=3.9512027615088447e-90)
