Clustering with scikit-learn
In this notebook, we will learn how to perform k-means clustering using scikit-learn in Python.
We will use cluster analysis to build a big-picture model of the weather at a local station from minute-granularity data. The dataset contains on the order of millions of records. How do we create 12 clusters out of them?
NOTE: The dataset we will use is in a large CSV file called minute_weather.csv. Please download it into the weather directory in your Week-7-MachineLearning folder. The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg
Importing the Necessary Libraries
In [3]:
    
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import python_utils
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
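# parallel_coordinates lives in pandas.plotting; the old pandas.tools.plotting path was removed from pandas
from pandas.plotting import parallel_coordinates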
%matplotlib inline
    
Creating a Pandas DataFrame from a CSV file
In [4]:
    
data = pd.read_csv('./weather/minute_weather.csv')
    
Minute Weather Data Description
As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.
Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:
In [5]:
    
data.shape
    
    Out[5]:
In [6]:
    
data.head()
    
    Out[6]:
Data Sampling
The dataset has many rows, so let us downsample by keeping every 10th row (the rows whose rowID is a multiple of 10).
In [7]:
    
sampled_df = data[(data['rowID'] % 10) == 0]
sampled_df.shape
    
    Out[7]:
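To make the modulo filter concrete, here is a minimal sketch on a toy DataFrame. The names toy, toy_sampled, and toy_every_10th are made up for illustration, and the toy data is not the weather data. It assumes rowID is a sequential integer starting at 0, in which case filtering on rowID and slicing by position select the same rows.

```python
import pandas as pd

# Toy stand-in for the weather data: 100 rows with a sequential rowID (assumption).
toy = pd.DataFrame({
    "rowID": range(100),
    "air_temp": [60.0 + i * 0.1 for i in range(100)],
})

# Keep only rows whose rowID is a multiple of 10, as in the cell above.
toy_sampled = toy[(toy["rowID"] % 10) == 0]

# Positional alternative: every 10th row, regardless of the rowID values.
toy_every_10th = toy.iloc[::10]

print(toy_sampled.shape)
print(toy_sampled["rowID"].tolist())
```

If rowID had gaps or did not start at 0, the two selections could differ, so the modulo filter is the one that matches the notebook.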
Statistics
In [8]:
    
sampled_df.describe().transpose()
    
    Out[8]:
In [9]:
    
sampled_df[sampled_df['rain_accumulation'] == 0].shape
    
    Out[9]:
In [10]:
    
sampled_df[sampled_df['rain_duration'] == 0].shape
    
    Out[10]:
Drop the rain_accumulation and rain_duration Columns, Then Drop Rows with Missing Values
In [11]:
    
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']
    
In [12]:
    
rows_before = sampled_df.shape[0]
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]
    
How many rows did we drop?
In [13]:
    
rows_before - rows_after
    
    Out[13]:
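The difference between dropping a column and dropping rows can be sketched on a small, made-up DataFrame. The values and the name toy below are illustrative only. The sketch also counts missing values per column before calling dropna, which is one way to see where the dropped rows come from.

```python
import numpy as np
import pandas as pd

# Made-up data: two rows each have one missing value.
toy = pd.DataFrame({
    "air_temp": [60.1, np.nan, 61.3, 62.0],
    "relative_humidity": [40.0, 42.5, np.nan, 39.0],
    "rain_duration": [0.0, 0.0, 0.0, 0.0],
})

# Dropping a column removes the feature but keeps every row.
toy = toy.drop(columns=["rain_duration"])

# Count missing values per remaining column.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
rows_before = toy.shape[0]
toy = toy.dropna()
rows_after = toy.shape[0]

print(missing_per_column.to_dict())
print(rows_before - rows_after)
```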
In [14]:
    
sampled_df.columns
    
    Out[14]:
Select Features of Interest for Clustering
In [15]:
    
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']
    
In [16]:
    
select_df = sampled_df[features]
    
In [17]:
    
select_df.columns
    
    Out[17]:
In [18]:
    
select_df
    
    Out[18]:
Scale the Features using StandardScaler
In [19]:
    
X = StandardScaler().fit_transform(select_df)
X
    
    Out[19]:
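As a rough illustration of what StandardScaler does, the sketch below compares its output with a manual computation on a tiny, made-up array: each column has its mean subtracted and is divided by its population standard deviation. The array toy and its values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 resembles a pressure, column 1 a temperature.
toy = np.array([
    [1000.0, 50.0],
    [1010.0, 60.0],
    [1020.0, 70.0],
    [1030.0, 80.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that k-means relies on merely because of its units.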
Use k-Means Clustering
In [20]:
    
kmeans = KMeans(n_clusters=12)
model = kmeans.fit(X)
print("model\n", model)
    
    
What are the centers of the 12 clusters we formed?
In [21]:
    
centers = model.cluster_centers_
centers
    
    Out[21]:
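The centers above are expressed in scaled units, because the model was fit on X. One possible way to read them in the original units is to keep the fitted scaler object and call its inverse_transform method. The sketch below shows this on synthetic data; the two blobs, the variable names, and the choice of n_clusters=2 and random_state=0 are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: two well-separated groups of (temperature, humidity) pairs.
rng = np.random.default_rng(0)
cool_humid = rng.normal(loc=[55.0, 80.0], scale=1.0, size=(50, 2))
warm_dry = rng.normal(loc=[85.0, 20.0], scale=1.0, size=(50, 2))
toy = np.vstack([cool_humid, warm_dry])

# Keep a reference to the fitted scaler so the scaling can be undone later.
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

# Fix random_state so the run is reproducible.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# Map the cluster centers from scaled units back to the original units.
centers_original_units = scaler.inverse_transform(toy_model.cluster_centers_)
print(np.round(centers_original_units, 1))
```

Note that the notebook cell calls StandardScaler().fit_transform(select_df) without keeping the scaler, so applying this idea there would require storing the scaler in a variable first.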
Plots
Let us first create some utility functions which will help us in plotting graphs:
In [22]:
    
# Function that creates a DataFrame with a column for Cluster Number
def pd_centers(featuresUsed, centers):
	colNames = list(featuresUsed)
	colNames.append('prediction')
	# Zip with a column called 'prediction' (index)
	Z = [np.append(A, index) for index, A in enumerate(centers)]
	# Convert to pandas data frame for plotting
	P = pd.DataFrame(Z, columns=colNames)
	P['prediction'] = P['prediction'].astype(int)
	return P
    
In [23]:
    
# Function that creates Parallel Plots
def parallel_plot(data):
	my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
	plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+3])
	parallel_coordinates(data, 'prediction', color = my_colors, marker='o')
    
In [24]:
    
P = pd_centers(features, centers)
P
    
    Out[24]:
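For comparison, the same kind of table can be built without appending the index to each row. The sketch below is an alternative to pd_centers, not a replacement for it; the function name centers_to_frame and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up inputs: three cluster centers over two scaled features.
toy_features = ["air_temp", "relative_humidity"]
toy_centers = np.array([
    [-1.2, 0.9],
    [0.3, -0.4],
    [1.1, -1.5],
])

def centers_to_frame(features_used, centers):
    # One row per center, one column per feature.
    frame = pd.DataFrame(centers, columns=list(features_used))
    # The row position doubles as the cluster number.
    frame["prediction"] = np.arange(len(centers))
    return frame

toy_P = centers_to_frame(toy_features, toy_centers)
print(toy_P)
```

Because the cluster number is added as its own integer column, there is no need to cast it back from float afterwards.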
In [25]:
    
parallel_plot(P[P['relative_humidity'] < -0.5])
    
    
    
In [26]:
    
parallel_plot(P[P['air_temp'] > 0.5])
    
    
    
In [27]:
    
parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])
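The three plots above each pass a filtered subset of the centers to parallel_plot. The sketch below shows, on a made-up table of centers, which rows each kind of boolean filter selects. The values in toy_P are invented for illustration; since the features are standardized, a value of 0.5 means half a standard deviation above the mean.

```python
import pandas as pd

# Made-up centers in scaled units.
toy_P = pd.DataFrame({
    "air_temp": [-1.2, 0.8, 0.1, -0.3],
    "relative_humidity": [0.9, -1.1, 0.7, -0.2],
    "prediction": [0, 1, 2, 3],
})

# Dry clusters: humidity well below average.
dry = toy_P[toy_P["relative_humidity"] < -0.5]

# Warm clusters: temperature well above average.
warm = toy_P[toy_P["air_temp"] > 0.5]

# Humid clusters that are not warm: both conditions must hold.
# Each condition needs its own parentheses because & binds tighter than comparisons.
cool_humid = toy_P[(toy_P["relative_humidity"] > 0.5) & (toy_P["air_temp"] < 0.5)]

print(dry["prediction"].tolist())
print(warm["prediction"].tolist())
print(cool_humid["prediction"].tolist())
```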
    
    
    