In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
from sklearn.externals.six import StringIO
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
RANDOM_SEED = 9
In [2]:
data = pd.read_csv('./asset/Daily_Weather_Observations.csv', sep=',')
print(data.shape)
data.head()
Out[2]:
Heading | Meaning | Units |
---|---|---|
Day | Day of the week | first two letters |
Temps_min | Minimum temperature in the 24 hours to 9am. | degrees Celsius |
Temps_max | Maximum temperature in the 24 hours from 9am. | degrees Celsius |
Rain | Precipitation (rainfall) in the 24 hours to 9am. | millimetres |
Evap | Class A pan evaporation in the 24 hours to 9am | millimetres |
Sun_hours | Bright sunshine in the 24 hours to midnight | hours |
Max_wind_dir | Direction of strongest gust in the 24 hours to midnight | 16 compass points |
Max_wind_spd | Speed of strongest wind gust in the 24 hours to midnight | kilometres per hour |
Max_wind_time | Time of strongest wind gust | local time hh:mm |
Temp_at_9am | Temperature at 9 am | degrees Celsius |
RH_at_9am | Relative humidity at 9 am | percent |
CLD_at_9am | Fraction of sky obscured by cloud at 9 am | eighths |
Wind_dir_at_9am | Wind direction averaged over 10 minutes prior to 9 am | compass points |
Wind_spd_at_9am | Wind speed averaged over 10 minutes prior to 9 am | kilometres per hour |
MSLP_at_9am | Atmospheric pressure reduced to mean sea level at 9 am | hectopascals |
Temp_at_3pm | Temperature at 3 pm | degrees Celsius |
RH_at_3pm | Relative humidity at 3 pm | percent |
CLD_at_3pm | Fraction of sky obscured by cloud at 3 pm | eighths |
Wind_dir_at_3pm | Wind direction averaged over 10 minutes prior to 3 pm | compass points |
Wind_spd_at_3pm | Wind speed averaged over 10 minutes prior to 3 pm | kilometres per hour |
MSLP_at_3pm | Atmospheric pressure reduced to mean sea level at 3 pm | hectopascals |
We have noticed that the Sun_hours values are missing in some rows. We want to predict these missing Sun_hours values.
We firstly seperate those rows with missing Sun_hours from the original dataset.
In [3]:
data_missing_sun_hours = data[pd.isnull(data['Sun_hours'])]
data_missing_sun_hours
Out[3]:
In [4]:
data = data[pd.notnull(data['Sun_hours'])]
print(data.shape)
We firstly category Sun_hours into three levels: High(>10), Med(>5 and <=10), and Low(<=5)
In [5]:
labels = ['Low','Med','High']
data['Sun_level'] = pd.cut(data.Sun_hours, [-1,5,10,25], labels=labels)
data[['Sun_hours','Sun_level']].head()
Out[5]:
From time to time, observations will not be available, for a variety of reasons. We need to handle missing values before training a classifier.
In [6]:
# output all rows with missing values
data[data.isnull().any(axis=1)]
Out[6]:
For CLD_at_9am, Max_wind_dir, Max_wind_spd and Max_wind_dir, we simply drop the rows with missing values
In [7]:
data = data.dropna(subset = ['CLD_at_9am', 'Max_wind_dir', 'Max_wind_spd', 'Max_wind_dir'])
print(data.shape)
According to the Bureau of Meteorology (http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml), sometimes when the evaporation are missing, the next value given has been accumulated over several days rather than the normal one day. Therefore, we will not only drop the rows with missing Evap data, but also the rows below them.
In [8]:
bitmap1 = data.Evap.notnull()
bitmap2 = bitmap1.shift(1)
bitmap2[0] = True
data = data[bitmap1 & bitmap2]
print(data.shape)
In [9]:
def corr(data):
c = data.corr()
df = c.Sun_level_num.to_frame()
df['abs'] = abs(df['Sun_level_num'])
df.sort_values(by = 'abs', ascending=False)
print(df.sort_values(by = 'abs', ascending=False))
In [10]:
# we need to convert labels (string) into numeric to get the correlation
labels = [0,1,2]
data['Sun_level_num'] = pd.cut(data.Sun_hours, [-1,5,10,25], labels=labels)
data[['Sun_level_num']] = data[['Sun_level_num']].apply(pd.to_numeric)
corr(data)
# c = data.corr()
# c.sort_values(by='Sun_level_num', ascending=True)['Sun_level_num']
If you have domain knowledge, you will know that the difference between max and min temperature might be highly correlated to sun hours. Let us add such column.
In [11]:
data['Temps_diff'] = data['Temps_max'] - data['Temps_min']
corr(data)
Let's try the features with higher correlations to Sun_level
In [12]:
feature_list = ['CLD_at_9am', 'CLD_at_3pm', 'RH_at_3pm', 'RH_at_9am', 'Temps_min', 'Temps_diff', 'Month']
We preserve 80% of the data as training data, and the rest will be used to tune the classifier.
In [13]:
train, test = train_test_split(data, test_size = 0.2)
Now we generate features
In [14]:
X_train = train[feature_list]
X_test = test[feature_list]
X_train.head()
Out[14]:
And we use Sun_level as labels
In [15]:
y_train = train.Sun_level
y_test = test.Sun_level
With features and labels, we can then train our decision tree
In [16]:
clf = DecisionTreeClassifier(criterion = "gini")
dtree = clf.fit(X_train, y_train)
Firstly let's see the accuracy of the decision tree on the training dataset
In [17]:
dtree.score(X_train,y_train) # output the accuracy of the trained decision tree
Out[17]:
Let's see the accuracy on the testing dataset
In [18]:
dtree.score(X_test,y_test)
Out[18]:
We can also get the importance of each feature
In [19]:
feature = pd.DataFrame({'name': pd.Series(feature_list),
'importance': pd.Series(dtree.feature_importances_)})
feature.sort_values(by = 'importance', ascending = False)
Out[19]:
Apparently the above tree is overfitting. One way to deal with it is to change the maximum height of the decision tree.
In [20]:
def experiment(train, test, features, depth=5):
X_train = train[features]
y_train = train.Sun_level
clf = DecisionTreeClassifier(criterion = "gini", max_depth = depth, random_state = RANDOM_SEED)
dtree = clf.fit(X_train, y_train)
err_training = dtree.score(X_train,y_train)
X_test = test[features]
y_test = test.Sun_level
err_testing = dtree.score(X_test,y_test)
err_diff = err_training - err_testing
print('{}, {}, {}'.format(err_training, err_testing, err_diff))
return err_training, err_testing
def evaluate(features, repeat_times = 10, depth = 5):
print('features: {}'.format(features))
print('max_depth: {}\n'.format(depth))
total_err_training = 0
total_err_testing = 0
for i in range(repeat_times):
train, test = train_test_split(data, test_size = 0.2, random_state = RANDOM_SEED + i)
err_training, err_testing = experiment(train, test, features, depth)
total_err_training += err_training
total_err_testing += err_testing
print('==============')
print('avg. training error:\t{}'.format(total_err_training/repeat_times))
print('avg. testing error:\t{}'.format(total_err_testing/repeat_times))
print('avg. difference:\t{}'.format((total_err_training - total_err_testing)/repeat_times))
print('============================')
In [21]:
feature_list = ['CLD_at_9am', 'CLD_at_3pm', 'RH_at_3pm', 'RH_at_9am', 'Temps_min', 'Temps_diff', 'Month']
evaluate(feature_list, 10, 6)
In [22]:
evaluate(feature_list, 10, 5)
In [23]:
evaluate(feature_list, 10, 4)
In [24]:
evaluate(feature_list, 10, 3)
In [25]:
X = data[feature_list]
y = data.Sun_level
clf = DecisionTreeClassifier(criterion = "gini", max_depth = 4, random_state = RANDOM_SEED)
dtree = clf.fit(X, y)
In [26]:
data_missing_sun_hours['Temps_diff'] = data_missing_sun_hours['Temps_max'] - data_missing_sun_hours['Temps_min']
In [27]:
X_pred = data_missing_sun_hours[feature_list]
data_missing_sun_hours['Sun_level_pred'] = dtree.predict(X_pred)
data_missing_sun_hours
Out[27]:
Although we cannot remember what happened on these 7 days (can you?), it seems the predictions are probably correct. E.g.,
Rain = 61.0 (mm)
. Note that the model didn't use the rain
feature at all! You need to install pydotplus (use pip install pydotplus
) and graphviz on your computer to get the following code run
In [28]:
import pydotplus
import sys
# sys.path.append("C:\\Program Files (x86)\\Graphviz2.38\bin")
dotfile = StringIO()
export_graphviz(dtree, out_file = dotfile, feature_names = X_train.columns)
graph = pydotplus.graph_from_dot_data(dotfile.getvalue())
Image(graph.create_png())
You can generate a pdf file if you want
In [ ]:
graph.write_pdf("./asset/dtree.pdf")