Classification of Weather Data
using scikit-learn
Daily Weather Data Analysis
In this notebook, we will use scikit-learn to perform decision-tree-based classification of weather data.
Importing the Necessary Libraries
In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Creating a Pandas DataFrame from a CSV file
In [2]:
data = pd.read_csv('./weather/daily_weather.csv')
Daily Weather Data Description
In [ ]:
data.columns
Each row in daily_weather.csv captures weather data for a separate day.
Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements taken in the morning, plus one measurement, namely relative humidity, taken in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, where the label is derived from the afternoon measurement of relative humidity.
Each row, or sample, consists of the following variables:
In [ ]:
data
In [ ]:
data[data.isnull().any(axis=1)]
Data Cleaning Steps
We will not need the 'number' column, which only numbers each row, so we can delete it.
In [ ]:
del data['number']
Now let's drop null values using the pandas dropna function.
In [ ]:
before_rows = data.shape[0]
print(before_rows)
In [ ]:
data = data.dropna()
In [ ]:
after_rows = data.shape[0]
print(after_rows)
How many rows were dropped during cleaning?
In [ ]:
before_rows - after_rows
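The cleaning steps above can be tried on a small example first. This is a minimal sketch on a hypothetical toy DataFrame (the values below are invented for illustration and are not taken from daily_weather.csv); it shows that the rows selected by isnull().any(axis=1) are the same rows that dropna() removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "air_temp_9am": [60.1, np.nan, 71.3, 65.0],
    "relative_humidity_3pm": [30.2, 18.5, np.nan, 22.0],
})

# Rows that contain at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# dropna() with default arguments removes every row that has any missing value.
toy_clean = toy.dropna()

toy_dropped = toy.shape[0] - toy_clean.shape[0]
print("rows with nulls:", len(rows_with_nulls))
print("rows dropped:", toy_dropped)
```

Distinct variable names (toy, toy_clean) are used so that running this cell does not overwrite the notebook's data.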
Convert to a Classification Task
In [ ]:
clean_data = data.copy()
# Label is 1 when the 3pm relative humidity is above 24.99 (a non-low-humidity day), otherwise 0.
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
print(clean_data['high_humidity_label'])
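To see how the thresholding behaves at the boundary, here is a small sketch on hypothetical humidity values (not taken from the dataset). The comparison is strict, so a reading of exactly 24.99 is labeled 0.

```python
import pandas as pd

# Hypothetical 3pm humidity readings, invented for illustration only.
humidity_demo = pd.Series([10.0, 24.99, 25.0, 80.5], name="relative_humidity_3pm")

# The comparison yields booleans; multiplying by 1 turns them into 0/1 integers.
label_demo = (humidity_demo > 24.99) * 1

print(label_demo.tolist())
```

Calling .astype(int) on the boolean Series is an equivalent way to get the 0/1 labels.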
Target is stored in 'y'.
In [ ]:
y=clean_data[['high_humidity_label']].copy()
#y
In [ ]:
clean_data['relative_humidity_3pm'].head()
In [ ]:
y.head()
Use 9am Sensor Signals as Features to Predict Humidity at 3pm
In [ ]:
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
'rain_duration_9am']
In [ ]:
X = clean_data[morning_features].copy()
In [ ]:
X.columns
In [ ]:
y.columns
Perform Test and Train split
In the training phase, the learning algorithm uses the training data to adjust the model’s parameters to minimize errors. At the end of the training phase, you get the trained model.
In the testing phase, the trained model is applied to test data. Test data is kept separate from the training data and has not been seen by the model before. The model is then evaluated on how well it performs on the test data. The goal in building a classifier model is to have the model perform well on the training data as well as on the test data.
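The two phases can be illustrated with a minimal sketch on synthetic data. The dataset here comes from scikit-learn's make_classification and is an assumption made purely for illustration; it has nothing to do with the weather data. The sketch fits a small tree on the training portion and reports accuracy on both portions, so the two numbers can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 300 samples, 8 features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out roughly one third of the samples for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

# Training phase: the tree is fit on the training portion only.
demo_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
demo_tree.fit(X_tr, y_tr)

# Testing phase: evaluate on data the tree has not seen.
train_acc = accuracy_score(y_tr, demo_tree.predict(X_tr))
test_acc = accuracy_score(y_te, demo_tree.predict(X_te))
print("train accuracy:", round(train_acc, 3))
print("test accuracy:", round(test_acc, 3))
```

A training accuracy far above the test accuracy would suggest the model has fit the training data too closely.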
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
In [ ]:
#type(X_train)
#type(X_test)
#type(y_train)
#type(y_test)
#X_train.head()
#y_train.describe()
Fit on Train Set
In [ ]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier.fit(X_train, y_train)
In [ ]:
type(humidity_classifier)
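A fitted tree can also be inspected as text. The sketch below uses scikit-learn's export_text on a tree trained on synthetic data; the data and the feature names f0 to f3 are hypothetical placeholders, not the weather features. The same call could in principle be applied to humidity_classifier with morning_features as the feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and placeholder feature names (assumptions for illustration).
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
syn_names = ["f0", "f1", "f2", "f3"]

# max_leaf_nodes caps how many leaves the tree may grow.
small_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
small_tree.fit(X_syn, y_syn)

# Render the learned split rules as indented text.
rules = export_text(small_tree, feature_names=syn_names)
print(rules)
print("number of leaves:", small_tree.get_n_leaves())
```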
Predict on Test Set
In [ ]:
predictions = humidity_classifier.predict(X_test)
In [ ]:
predictions[:10]
In [ ]:
y_test['high_humidity_label'][:10]
Measure Accuracy of the Classifier
In [ ]:
accuracy_score(y_true = y_test, y_pred = predictions)
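Accuracy is the fraction of predictions that match the true labels. The following sketch checks this on a small hypothetical pair of label arrays (invented for illustration) by computing the fraction by hand and comparing it with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, invented for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 1])
y_pred_demo = np.array([1, 0, 0, 1, 0, 0])

# Fraction of positions where prediction and truth agree (4 of 6 here).
manual_acc = (y_true_demo == y_pred_demo).mean()

# The library function computes the same quantity.
library_acc = accuracy_score(y_true=y_true_demo, y_pred=y_pred_demo)

print(manual_acc, library_acc)
```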