Data source: https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, cluster, metrics
%matplotlib inline
In [2]:
df = pd.read_csv("https://github.com/abulbasar/data/blob/master/snsdata.csv?raw=true")
df.head()
Out[2]:
Proportion of male/female profiles in the dataset
In [3]:
df.gender.value_counts()/len(df)
Out[3]:
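Note: value_counts drops missing values by default, so if any gender entries are NaN the proportions above will not sum to 1. A minimal alternative (a sketch, using the same df) that counts NaN as its own category:

df.gender.value_counts(dropna=False, normalize=True)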
In [4]:
features = df.columns[4:]
In [5]:
X = df[features] * 1.0  # multiply by 1.0 to cast the counts to float
a = X.values.flatten()
fig, axes = plt.subplots(2, 1, figsize = (8, 6))
axes[0].hist(a, bins = 50, log = True);
axes[0].set_title("Histogram of frequencies")
axes[1].boxplot(a, vert = False);
axes[1].set_title("Boxplot of frequencies")
plt.tight_layout()
In [6]:
a.shape
Out[6]:
How many of the count values are greater than 20?
In [7]:
a[a>20].shape
Out[7]:
Clip the count values to a maximum of 50.
In [8]:
X_clipped = np.clip(X.values, a_min=0, a_max=50)
plt.hist(X_clipped.flatten(), log=True, bins = 50);
Before applying KMeans, scale the values to the [0, 1] range.
In [9]:
scaler = preprocessing.MinMaxScaler()
X_std = scaler.fit_transform(X_clipped)
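Strictly speaking, MinMaxScaler rescales each feature to [0, 1] rather than standardizing it. If z-score standardization is preferred instead, a drop-in sketch (std_scaler and X_std_alt are illustrative names, not used below):

std_scaler = preprocessing.StandardScaler()
X_std_alt = std_scaler.fit_transform(X_clipped)  # zero mean, unit variance per feature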
Set the number of clusters to k=5.
In [10]:
k = 5
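The notebook fixes k=5; a quick elbow check can sanity-check that choice. A sketch using KMeans's inertia_ (the within-cluster sum of squared distances); the variable names here are illustrative:

# Inertia for a range of candidate k values; look for the "elbow" in the curve
inertias = []
for n in range(2, 11):
    km = cluster.KMeans(n_clusters=n, random_state=1).fit(X_std)
    inertias.append(km.inertia_)
plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia");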
In [11]:
kmeans = cluster.KMeans(n_clusters=k, random_state=1)
kmeans.fit(X_std)
Out[11]:
Predict the cluster for each point using the fitted KMeans model.
In [12]:
y_pred = kmeans.predict(X_std)
Centroids of the clusters in the original scale (using scaler.inverse_transform).
In [13]:
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
centroids.T
Out[13]:
For the first cluster, find the top 10 most dominant features by magnitude.
In [14]:
centroids.iloc[0].sort_values(ascending = False)[:10]
Out[14]:
For the first cluster, music, god, dance, hair, etc. are the dominant features. Let's look at the dominant features in the other clusters.
In [15]:
centroids.iloc[1].sort_values(ascending = False)[:10]
Out[15]:
In [16]:
centroids.iloc[2].sort_values(ascending = False)[:10]
Out[16]:
In [17]:
centroids.iloc[3].sort_values(ascending = False)[:10]
Out[17]:
In [18]:
centroids.iloc[4].sort_values(ascending = False)[:10]
Out[18]:
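The five cells above can be condensed into one loop over the centroids; a sketch that prints the same top-10 listings:

# Top 10 dominant features for every cluster in one pass
for i in range(k):
    print("Cluster %d:" % i)
    print(centroids.iloc[i].sort_values(ascending = False)[:10])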
Since music and god appear in the top 10 of every cluster, we can drop these features and retry the clustering.
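A sketch of that retry, assuming the shared columns are named "music" and "god" as in the listings above (the reduced_* names are illustrative):

# Drop the features common to all clusters and re-run the same pipeline
reduced_features = [f for f in features if f not in ("music", "god")]
X_reduced = np.clip(df[reduced_features].values * 1.0, a_min=0, a_max=50)
X_reduced_std = preprocessing.MinMaxScaler().fit_transform(X_reduced)
kmeans_reduced = cluster.KMeans(n_clusters=k, random_state=1).fit(X_reduced_std)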
Find the density of each cluster: calculate the average distance between each point and its closest centroid.
In [19]:
df["cluster"] = y_pred
distances = np.zeros(len(y_pred))
for i in range(k):
    center = kmeans.cluster_centers_[i]
    distances[y_pred == i] = metrics.euclidean_distances(
        X_std[y_pred == i], center.reshape(1, -1)).squeeze()
df["distance"] = distances
df.sample(10)
Out[19]:
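Equivalently, the per-cluster loop can be replaced with kmeans.transform, which returns each point's distance to every centroid; the row-wise minimum is then the distance to the assigned (closest) centroid. A sketch (distances_alt is an illustrative name):

# Distance of each point to its closest centroid, without an explicit loop
distances_alt = kmeans.transform(X_std).min(axis=1)
np.allclose(distances_alt, distances)  # expected: True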
In [20]:
df.pivot_table("distance", "cluster", "gender", aggfunc="mean")
Out[20]:
Let's find anomalous profiles based on the distance of each profile from its centroid. Using the box-and-whisker method, identify the outliers.
In [21]:
def find_outliers(a):
    # Whiskers at 1.5 * IQR beyond the quartiles, capped at the data range
    q1, q2, q3 = np.percentile(a, [25, 50, 75])
    iqr = q3 - q1
    lower_whisker = max(q1 - 1.5 * iqr, np.min(a))
    upper_whisker = min(q3 + 1.5 * iqr, np.max(a))
    is_outlier = (a < lower_whisker) | (a > upper_whisker)
    return is_outlier
In [22]:
anomalies = df[find_outliers(df.distance)]
anomalies.shape, df.shape
Out[22]:
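To eyeball the flagged profiles, sort them by distance so the most extreme come first (a sketch):

anomalies.sort_values("distance", ascending = False).head(10)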
In [23]:
df.distance.plot.hist(bins = 50, log = True)
Out[23]:
Apply a dendrogram (hierarchical clustering).
In [24]:
from scipy.cluster.hierarchy import linkage, dendrogram
In [25]:
plt.figure(figsize = (15, 10))
row_clusters = linkage(X_std, method="complete", metric="euclidean")
f = dendrogram(row_clusters, p = 5, truncate_mode="level")
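scipy's linkage only builds the tree; to get flat labels comparable to KMeans, one option (a sketch, not part of the original analysis) is sklearn's AgglomerativeClustering with the same complete linkage, compared via the adjusted Rand index:

# Cut the hierarchy into k flat clusters and compare with the KMeans labels
agg = cluster.AgglomerativeClustering(n_clusters=k, linkage="complete")
y_agg = agg.fit_predict(X_std)
metrics.adjusted_rand_score(y_pred, y_agg)  # similarity of the two labelings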