Data source: https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, cluster, metrics
%matplotlib inline
In [2]:
df = pd.read_csv("https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true")
df.head()
Out[2]:
Proportion of male/female profiles in the dataset
In [3]:
df.gender.value_counts()/len(df)
Out[3]:
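Note: value_counts drops missing values by default, so if any gender entries are NaN the proportions above will not sum to 1. A minimal alternative (a sketch, using the same df) that counts NaN as its own category:

df.gender.value_counts(dropna=False, normalize=True)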
In [4]:
features = df.columns[4:]
In [5]:
X = df[features] * 1.0  # multiply by 1.0 to cast the counts to float
a = X.values.flatten()
fig, axes = plt.subplots(2, 1, figsize = (8, 6))
axes[0].hist(a, bins = 50, log = True);
axes[0].set_title("Histogram of frequencies")
axes[1].boxplot(a, vert = False);
axes[1].set_title("Boxplot of frequencies")
plt.tight_layout()
In [6]:
a.shape
Out[6]:
How many of the count values are greater than 20?
In [7]:
a[a>20].shape
Out[7]:
Clip the count values to a maximum of 50.
In [8]:
X_clipped = np.clip(X.values, a_min=0, a_max=50)
plt.hist(X_clipped.flatten(), log=True, bins = 50);
Before applying KMeans, scale the values to the [0, 1] range.
In [9]:
scaler = preprocessing.MinMaxScaler()
X_std = scaler.fit_transform(X_clipped)
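Strictly speaking, MinMaxScaler rescales each feature to [0, 1] rather than standardizing it. If z-score standardization is preferred instead, a drop-in sketch (std_scaler and X_std_alt are illustrative names, not used below):

std_scaler = preprocessing.StandardScaler()
X_std_alt = std_scaler.fit_transform(X_clipped)  # zero mean, unit variance per feature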
Set the number of clusters to k=5.
In [10]:
k = 5
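The notebook fixes k=5; a quick elbow check can sanity-check that choice. A sketch using KMeans's inertia_ (the within-cluster sum of squared distances); the variable names here are illustrative:

# Inertia for a range of candidate k values; look for the "elbow" in the curve
inertias = []
for n in range(2, 11):
    km = cluster.KMeans(n_clusters=n, random_state=1).fit(X_std)
    inertias.append(km.inertia_)
plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia");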
In [11]:
kmeans = cluster.KMeans(n_clusters=k, random_state=1)
kmeans.fit(X_std)
Out[11]:
Predict the cluster for each point using the fitted KMeans model.
In [12]:
y_pred = kmeans.predict(X_std)
Centroids of the clusters in the original scale (using scaler.inverse_transform).
In [13]:
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
centroids.T
Out[13]:
For the first cluster, find the top 10 most dominant features by magnitude.
In [14]:
centroids.iloc[0].sort_values(ascending = False)[:10]
Out[14]:
For the first cluster, music, god, dance, hair, etc. are the dominant features. Let's look at the dominant features in the other clusters.
In [15]:
centroids.iloc[1].sort_values(ascending = False)[:10]
Out[15]:
In [16]:
centroids.iloc[2].sort_values(ascending = False)[:10]
Out[16]:
In [17]:
centroids.iloc[3].sort_values(ascending = False)[:10]
Out[17]:
In [18]:
centroids.iloc[4].sort_values(ascending = False)[:10]
Out[18]:
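The five cells above can be condensed into one loop over the centroids; a sketch that prints the same top-10 listings:

# Top 10 dominant features for every cluster in one pass
for i in range(k):
    print("Cluster %d:" % i)
    print(centroids.iloc[i].sort_values(ascending = False)[:10])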
Since music and god appear in the top 10 of every cluster, we can drop these features and retry the clustering.
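A sketch of that retry, assuming the shared columns are named "music" and "god" as in the listings above (the reduced_* names are illustrative):

# Drop the features common to all clusters and re-run the same pipeline
reduced_features = [f for f in features if f not in ("music", "god")]
X_reduced = np.clip(df[reduced_features].values * 1.0, a_min=0, a_max=50)
X_reduced_std = preprocessing.MinMaxScaler().fit_transform(X_reduced)
kmeans_reduced = cluster.KMeans(n_clusters=k, random_state=1).fit(X_reduced_std)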
Find the density of each cluster: calculate the average distance between each point and its closest centroid.
In [19]:
df["cluster"] = y_pred
distances = np.zeros(len(y_pred))
for i in range(k):
    center = kmeans.cluster_centers_[i]
    distances[y_pred == i] = metrics.euclidean_distances(
        X_std[y_pred == i], center.reshape(1, -1)).squeeze()
df["distance"] = distances
df.sample(10)
Out[19]:
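Equivalently, the per-cluster loop can be replaced with kmeans.transform, which returns each point's distance to every centroid; the row-wise minimum is then the distance to the assigned (closest) centroid. A sketch (distances_alt is an illustrative name):

# Distance of each point to its closest centroid, without an explicit loop
distances_alt = kmeans.transform(X_std).min(axis=1)
np.allclose(distances_alt, distances)  # expected: True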
In [20]:
df.pivot_table("distance", "cluster", "gender", aggfunc="mean")
Out[20]:
Let's find anomalous profiles based on the distance of each profile from its centroid. Using the box-and-whisker method, identify the outliers.
In [21]:
def find_outliers(a):
    # Whiskers at 1.5 * IQR beyond the quartiles, capped at the data range
    q1, q2, q3 = np.percentile(a, [25, 50, 75])
    iqr = q3 - q1
    lower_whisker = max(q1 - 1.5 * iqr, np.min(a))
    upper_whisker = min(q3 + 1.5 * iqr, np.max(a))
    is_outlier = (a < lower_whisker) | (a > upper_whisker)
    return is_outlier
In [22]:
anomalies = df[find_outliers(df.distance)]
anomalies.shape, df.shape
Out[22]:
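To eyeball the flagged profiles, sort them by distance so the most extreme come first (a sketch):

anomalies.sort_values("distance", ascending = False).head(10)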
In [23]:
df.distance.plot.hist(bins = 50, log = True)
Out[23]:
Apply a dendrogram (hierarchical clustering).
In [24]:
from scipy.cluster.hierarchy import linkage, dendrogram
In [25]:
plt.figure(figsize = (15, 10))
row_clusters = linkage(X_std, method="complete", metric="euclidean")
f = dendrogram(row_clusters, p = 5, truncate_mode="level")
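scipy's linkage only builds the tree; to get flat labels comparable to KMeans, one option (a sketch, not part of the original analysis) is sklearn's AgglomerativeClustering with the same complete linkage, compared via the adjusted Rand index:

# Cut the hierarchy into k flat clusters and compare with the KMeans labels
agg = cluster.AgglomerativeClustering(n_clusters=k, linkage="complete")
y_agg = agg.fit_predict(X_std)
metrics.adjusted_rand_score(y_pred, y_agg)  # similarity of the two labelings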