When it comes to consuming real-time content, Twitter is the place to be. Be it sending out 'Tweets' in real time, discovering the latest online 'Trends' anywhere, or striking up a 'conversation' with anyone, Twitter does it all. In fact, Twitter's management wrote this in their letter to shareholders last year:

We’re focused now on what Twitter does best: live. Twitter is live: live commentary, live connections, live conversations. Whether it’s breaking news, entertainment, sports, or everyday topics, hearing about and watching a live event unfold is the fastest way to understand the power of Twitter.

Twitter has always been considered a “second screen” for what’s happening in the world and we believe we can become the first screen for everything that’s happening now. And by doing so, we believe we can build the planet’s largest daily connected audience. A connected audience is one that watches together, and can talk with one another in real-time. It’s what Twitter has provided for close to 10 years, and it’s what we will continue to drive in the future.

Embedded in a Twitter user's social-graph* is a wealth of information on the user's likes and interests. Unlike Facebook or LinkedIn, the magic of Twitter is in its 'Follow' structure - anyone can follow anyone without the two knowing each other. This directed social-graph, when methodically summarized, can reveal the most influential/central friends in the network and also help personalize/enrich one's Twitter experience by unearthing the implicit clusters in the network.

*Social-graph: [User, 1st degree Connections, and the links between]

In this notebook, we'll look into the following:

  1. Extract the ego-network of 'self' using the Twitter API
  2. Identify the influential nodes using the 'PageRank' formulation
  3. Identify the implicit clusters formed
  4. Recommend new friends to follow on the basis of a cluster of interest

Note: This study is limited to the ego network: the focal node ("ego": here the self-node), the nodes to whom the ego is directly connected ("alters"), plus the ties, if any, among the alters.
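For intuition, here's a minimal toy sketch (using networkx, which isn't used elsewhere in this notebook; the node names are made up) of what an ego network captures on a small directed graph:

import networkx as nx

# Toy directed graph: 'ego' follows a and b; a follows b; b follows c
G = nx.DiGraph([('ego', 'a'), ('ego', 'b'), ('a', 'b'), ('b', 'c')])

ego_net = nx.ego_graph(G, 'ego')    # ego + alters + ties among the alters
print(sorted(ego_net.edges()))      # [('a', 'b'), ('ego', 'a'), ('ego', 'b')]
# ('b', 'c') is excluded: c is not directly connected to ego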

1. Extract social-graph structure using the Twitter API


In [1]:
from collections import Counter, defaultdict
from datetime import datetime
from sklearn.decomposition import PCA

import csv
import matplotlib.pyplot as plt
import numpy as np
import os.path
import pandas as pd
import re
import seaborn as sns;  sns.set()
import time
import twitter

%matplotlib inline
plt.rcParams['figure.figsize'] = (10,7)

import warnings
warnings.filterwarnings("ignore")
#print twitter.__path__  

import random
random.seed(1000)

In [2]:
# Pandas data-frame print formatting
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))


Out[2]:

In [3]:
self_screen_name = 'bala_io'            # Self

# Keep appending data
fof_filename     = "edges.csv"          # 'Alters' and their Source->Sink Edges. 
cache_filename   = "cache.csv"          # local cache of (TwitterId, FullName, UserName)

# One-time-use files
binaryMap_filename = "binaryMap.csv"   # Directed graph. Adjacencies as 0/1, row-follows-column
cluster_filename   = "results.csv"

In [4]:
# Twitter auth. https://dev.twitter.com/oauth/overview/application-owner-access-tokens

with open("../../passwd/credentials.txt", "r") as f:
    reader = csv.reader(f )
    login_dict = {line[0]: line[1]                       
                    for line in reader}        

api = twitter.Api(consumer_key=login_dict.get('consumer_key') ,
                  consumer_secret=login_dict.get('consumer_secret'),
                  access_token_key=login_dict.get('access_token_key'),
                  access_token_secret=login_dict.get('access_token_secret'))
api


Out[4]:
<twitter.api.Api at 0x7f1c2f6bb898>

In [5]:
# 'Self' and Friends of Self

self_node    =  api.GetUser(screen_name = self_screen_name)    
self_node_id =  str(self_node.id)                      # Twitter Id of Self

friends_of_self = api.GetFriendIDs(user_id = self_node_id, 
                                   screen_name = self_screen_name , 
                                   stringify_ids = True)
index = [self_node_id] + friends_of_self

In [6]:
# GetFriendIDs() API call is rate-limited at 15 req / 15 min 
# https://dev.twitter.com/rest/public/rate-limiting

# For each of the list of nodes, fetch the list of nodes it follows, append to file

def update_FoF_File(fileName, to_fetch_list):
    
    with open(fileName, 'a') as f:
        apiReqCount = 0
        
        for node in to_fetch_list:
            friends_of_node = api.GetFriendIDs(user_id = node,  stringify_ids = True) 
            row = ','.join([str(i) for i in [node] +  friends_of_node ]) + "\n"
            f.write(row) 
            
            apiReqCount += 1
            
            if (apiReqCount == 15): 
                apiReqCount = 0
                
                print("Off to Sleep :)")
                time.sleep(15*60 + 10)

In [7]:
# Parse FoF file and return the list of nodes whose source->sink edges are already on file

def getFinishList(fileName):
        
    if not os.path.isfile(fileName):
        return [] 
    with open(fileName, 'r') as f:
        return [ row.strip().split(',')[0] for row in f ]  # 1st entry is a user

In [8]:
# Ego-network as adjacency-matrix
# Parses FoF file in the order of index, creates list of adjacencies as 0 | 1
# Writes to file. Adjacency-matrix in Row_follows_Column format

def updateBinaryMapFile(fof_filename, binaryMap_filename, index):
    
    with open(fof_filename, "r") as f:
        stripped_f = (line.replace('\r', '') for line in f)
        reader = csv.reader(stripped_f)
        fof_dict = {line[0]: line[1:] for line in reader 
                    if line[0] in index}   # dict of node: nodes it follows
        if self_node_id not in fof_dict:
            fof_dict[self_node_id] = index[1:]             # dict of Self
    
    bool_list = []
    
    for user in index:
        user_friends = set( fof_dict[user] )  
        bool_row = [item in user_friends for item in index]  # for each, fill T/F 
        bool_list.append(bool_row)
    
    int_nparray = np.array(bool_list) + 0                    # Bool to int

    binaryMap_rfc = pd.DataFrame(data = int_nparray, columns= index, index = index)
    binaryMap_rfc.to_csv(binaryMap_filename)

In [9]:
# For list of Ids, fetch Profile details. If not in Offline file, make an API call
# Returns ['UserName', 'FullName', 'Followers_count', 'Friends_count', 'Location', 'Created_at']
# UsersLookup API 100 Ids/request

def lookup_in_cache(friendsIdsList):

    cache, delta_cache = pd.DataFrame(), pd.DataFrame()
    UserNameList, namesList     = [], []
    followers_count, friends_count = [], []
    location, created_at           = [], []
    
    if os.path.isfile(cache_filename):
        cache = pd.read_csv(cache_filename, skip_blank_lines=True,
                        dtype={'Ids':str, 'Friends_count':int, 'Followers_count':int})
        cache.set_index('Ids', inplace=True)
        to_fetch_list = list ( set (friendsIdsList) - set(cache.index) )
    else :        
        to_fetch_list = friendsIdsList   
    
    i = 0
    while (i * 100 < len(to_fetch_list)):  # process in batches of 100
        print("... Cache-Miss for " + str(len(to_fetch_list)) +  " nodes. Updating Cache...")
        low, high = i * 100, min( len(to_fetch_list), (i+1)*100 )  # UsersLookup api
        twitterObjectsList = api.UsersLookup(user_id = to_fetch_list[low:high])    
        
        temp = zip(*[( tempObject.screen_name,        #ScreenName
                      tempObject.name,                #Name
                      tempObject.followers_count,     #Followers
                      tempObject.friends_count,       #Friends
                      tempObject.location,            #Location
                      tempObject.created_at           #CreatedAt
                     ) for tempObject in twitterObjectsList])
        temp = list(temp)
        
        UserNameList    += list(temp[0])
        namesList       += list(temp[1])
        followers_count += list(temp[2])
        friends_count   += list(temp[3])
        location        += list(temp[4])
        created_at      += list(temp[5])
        
        i = i + 1       
    
    if len(to_fetch_list) > 0:
        delta_cache  = pd.DataFrame({'UserName':UserNameList,
                                     'FullName':namesList,
                                     'Ids': to_fetch_list,
                                     'Followers':followers_count,
                                     'Friends': friends_count,
                                     'Location':location,
                                     'Created':created_at})
        
        delta_cache['Created'] = delta_cache['Created'].apply(lambda x: 
                                                datetime.strptime(
                                                    re.sub(r"\+[0-9]* ", "",x),'%c').
                                                strftime("%b-%Y"))
    
        delta_cache.set_index('Ids', inplace=True, drop = True)
        cache = cache.append(delta_cache)
        cache.to_csv(cache_filename)        

    return cache.loc[friendsIdsList]

In [10]:
# Display cluster-wise most-influential users, for the given clustering algo

def top_nodes_in_cluster(df, cluster_algo, n_clusters):
    dummy_df = pd.DataFrame()

    for i in range(n_clusters):
        nodes_in_cluster = list( df [df[cluster_algo] == i ]['FullName'] )     
        if len(nodes_in_cluster) >= 10:            # show only clusters of size > 10        
            col_name           = str(i) + " : " + str(len(nodes_in_cluster)) + " Ids"
            dummy_df[col_name] = nodes_in_cluster[:10]      
    
    return dummy_df

In [11]:
# identify 20 friends to follow after aggregating friends followed by top 50% in list

def discover_Friends_toFollow(ids_of_interest, friend_list, prop = .5, count = 20):    
    
    ids_of_interest  = ids_of_interest[:int(len(ids_of_interest) * prop)]
    
    if self_node_id in ids_of_interest:
        ids_of_interest.remove(self_node_id)

    print("'Who-To-Follow' reco after looking at %3d friends' friends:" %(len(ids_of_interest)))
    
    with open(fof_filename) as f:
        reader = csv.reader(f)
        fof_dict = {row[0]: row[1:] for row in reader}  # dict of node: nodes it follows

    friendsToFollow = []
    for node_id in ids_of_interest:
        friendsToFollow += list (set(fof_dict[str(node_id)])  - set(friend_list) )

    friendsToFollow = Counter(friendsToFollow).most_common(count)    
    tuples_list = list(zip(*friendsToFollow) )
           
    topFriendsToFollowDF = pd.DataFrame()
    topFriendsToFollowDF['Ids'] = list(tuples_list[0])
    topFriendsToFollowDF['Freq'] = list(tuples_list[1])
    topFriendsToFollowDF.set_index('Ids', drop = True, inplace = True)              
    index = topFriendsToFollowDF.index
    
    topFriendsToFollowDF = topFriendsToFollowDF.merge(lookup_in_cache(index), copy = False,
                              left_index = True, right_index = True)
    
    return topFriendsToFollowDF

In [12]:
# For the list of nodes I follow, fetch their friends-list

fof_finish_list   = getFinishList(fof_filename )        # Completed nodes
fof_to_fetch_list = list ( set(friends_of_self) - set(fof_finish_list) )  # Pending nodes

print( str(len(fof_to_fetch_list)) + " out of " + str(len(index)  - 1) + 
      " Friends details to be fetched")


1 out of 376 Friends details to be fetched

In [13]:
# For the remaining nodes, populate their details in fof_file
update_FoF_File(fof_filename, fof_to_fetch_list) 

# Build the adjacency matrix in terms of 0 and 1 (if there is an edge)
updateBinaryMapFile(fof_filename, binaryMap_filename, index)

In [14]:
# Read adj-matrix into df. Cell MxN is 1 iff node in Mth row follows node in Nth column 
binaryMap_rfc = pd.read_csv(binaryMap_filename, skip_blank_lines=True, index_col = 0)
print(binaryMap_rfc.shape)


outlinks_count = binaryMap_rfc.sum(axis = 1)   # horizontal-sum to count outlinks
inlinks_count = binaryMap_rfc.sum(axis = 0)   # vertical-sum to count inlinks


(377, 377)

In [15]:
# Histogram of number of OutLinks per node, within ego-network
sns.distplot(outlinks_count, bins = len(index)//10, kde=False);



In [16]:
# Histogram of number of InLinks per node, within ego-network
sns.distplot(inlinks_count, bins = len(index)//10, kde=False);


2. Identify influential friends using the 'PageRank' formulation

From the adjacency matrix generated above, we can construct a column-stochastic matrix (also called a transition matrix) such that a column with m outlinks holds the value 1/m in each of the corresponding m cells. Conceptually, the value in cell (a, b) gives the probability of a random surfer at node b jumping to node a.
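To make this concrete, here's a toy 3-node sketch (hypothetical nodes a, b, c; not part of the actual pipeline) of turning a row-follows-column adjacency matrix into a column-stochastic matrix:

import pandas as pd

# Toy adjacency, row-follows-column: a follows b & c; b follows a; c follows b
adj_rfc = pd.DataFrame([[0, 1, 1],
                        [1, 0, 0],
                        [0, 1, 0]],
                       index=list('abc'), columns=list('abc'))

adj_cfr = adj_rfc.T                   # transpose: each column now lists a node's outlinks
M = adj_cfr / adj_cfr.sum(axis=0)     # normalize each column to sum to 1
print(M)
# Column 'a' holds 0.5 against rows 'b' and 'c': a random surfer at node a
# jumps to b or c with probability 1/2 each.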


In [17]:
binaryMap_cfr      = binaryMap_rfc.transpose()            # column-values: Outlinks
binaryMap_cfr_norm = binaryMap_cfr / binaryMap_cfr.sum(axis = 0)
colStochMatrix = np.matrix( binaryMap_cfr_norm.fillna(0)) # column-stochastic-matrix

Initialize the PageRank vector such that all the nodes have an equal PageRank score, adding up to 1.


In [18]:
pageRankVector = np.matrix([1.0/len(index)] *  len(index)) # initialize page-rank-vector 
pageRankVector = pageRankVector.transpose()                # transpose to column-vector

On applying the above transition-matrix transformation iteratively to the PageRank vector, the vector will eventually converge such that:

Matrix · Vector = Vector

Equivalently, this is an eigenvector formulation, with the PageRank vector being the principal eigenvector of the matrix, corresponding to eigenvalue 1 (since M is a column-stochastic matrix, its principal eigenvalue is 1).

Here we use the Power Iteration method to solve for the PageRank vector. In order to handle nodes that have zero out-links (dead ends) and nodes caught in periodic loops (spider traps) in the ego-network, we introduce some randomness through the parameter beta, such that a random surfer at any node takes a random jump instead of following an out-link roughly once every 6-7 moves (1 - beta = 0.15). Concretely, each iteration computes v ← beta·M·v and then redistributes the leaked rank uniformly, adding (1 - S)/n to every entry, where S is the sum of the damped vector and n the number of nodes.
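As a sanity check on why this works, here's a toy verification (a hedged sketch on a made-up 3-node matrix with no dead ends, so teleportation isn't needed) that power iteration recovers the same vector as an explicit eigen-decomposition:

import numpy as np

# Toy column-stochastic matrix: every column sums to 1
M = np.array([[0. , 0.5, 0.5],
              [0.5, 0. , 0.5],
              [0.5, 0.5, 0. ]])

v = np.full(3, 1.0/3)                 # uniform initial PageRank vector
for _ in range(100):                  # power iteration: v <- M.v
    v = M @ v

w, vecs = np.linalg.eig(M)
principal = vecs[:, np.argmax(w.real)].real
principal /= principal.sum()          # rescale the eigenvector so scores sum to 1

print(np.round(v, 4), np.round(principal, 4))    # both ~[0.3333 0.3333 0.3333]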


In [19]:
# PageRank algo: Power Iteration to solve Markov transition matrix 
# refer this     : http://setosa.io/blog/2014/07/26/markov-chains/index.html

beta = 0.85
epsilon = 999
iteration = 0
while epsilon > (1.0/(10**16)):
    pageRankVectorUpdating = colStochMatrix * pageRankVector * beta 
    
    # re-insert leaked page-ranks
    S = np.array(pageRankVectorUpdating).sum()                      
    pageRankVectorUpdated = pageRankVectorUpdating + (
        1 - S) * (1.0/len(index)) * np.ones_like(pageRankVector)
    
    # compute the squared-difference and check for convergence
    error = np.array(pageRankVectorUpdated - pageRankVector)
    epsilon = np.sqrt((error *  error).sum())   
    iteration = iteration + 1
    pageRankVector = pageRankVectorUpdated    
    
print( "Sum of Page-Rank Scores: " + str(pageRankVector.sum()) + 
      "\nConverged in " + str(iteration) + " iterations")


Sum of Page-Rank Scores: 1.0
Converged in 58 iterations

In [20]:
# Collect the results

results_df = pd.DataFrame()
results_df['Ids'], results_df['PageRank'] = index, pageRankVector
results_df['Inlinks'], results_df['Outlinks'] = list(inlinks_count), list(outlinks_count)
results_df = results_df.set_index('Ids', drop = True )

results_df = results_df.merge(lookup_in_cache(index), copy = False,
                              left_index = True, right_index = True)
results_df = results_df[['PageRank','UserName', 'FullName', 'Inlinks' , 'Outlinks',
                         'Followers','Friends', 'Location', 'Created' ]]


... Cache-Miss for 1 nodes. Updating Cache...

2a. Nodes with high PageRank scores


In [21]:
results_df.fillna('').sort_values(by = 'PageRank', ascending =False).set_index('FullName').head(10)


Out[21]:
PageRank UserName Inlinks Outlinks Followers Friends Location Created
FullName
Elon Musk 0.014158 elonmusk 201 4 8845530 41 Boring Jun-2009
Bill Gates 0.011921 BillGates 168 16 34940895 183 Seattle, WA Jun-2009
Marc Andreessen 0.011725 pmarca 192 24 645085 13952 Menlo Park, CA May-2007
Edward Snowden 0.010114 Snowden 97 0 3150406 1 Dec-2014
Tim O'Reilly 0.009973 timoreilly 178 132 1987340 1709 Oakland, CA Mar-2007
WIRED 0.009110 WIRED 103 10 8974304 279 San Francisco/New York Mar-2007
Nate Silver 0.008281 NateSilver538 145 26 2276360 992 New York Aug-2008
Chris Dixon 0.008217 cdixon 162 184 545467 3054 CA & NYC Mar-2007
Fred Wilson 0.008190 fredwilson 158 75 617149 1160 New York City Mar-2007
Ev Williams 0.008037 ev 141 104 1979132 1781 San Francisco, CA, US Mar-2006

2b. Top 100 influential nodes in the Ego-Network


In [22]:
dummy_df = pd.DataFrame()
temp_df  = results_df.sort_values( by = 'PageRank', ascending =False)

for i in range(10):
    dummy_df[i] = list (temp_df [10*i : 10* i + 10]['FullName'])
dummy_df


Out[22]:
0 1 2 3 4 5 6 7 8 9
0 Elon Musk Reid Hoffman Sam Altman Tim Urban Microsoft Research Brian Chesky Tony Fadell John D. Cook David Sacks Clayton Christensen
1 Bill Gates Bill Gurley Steven Levy Joi Ito John Markoff Max Roser Matt Cutts Naval Ravikant Tim Berners-Lee GV
2 Marc Andreessen Eric Schmidt Chris Anderson Benedict Evans Neil deGrasse Tyson Bradley Horowitz Yann LeCun Tim Ferriss Recode Mark Suster
3 Edward Snowden Paul Graham Hilary Mason Brad Stone Charlie Rose Show 🍪Steven Sinofsky ॐ Atul Gawande John Lilly Sarah Lacy Drew Conway
4 Tim O'Reilly Vinod Khosla OpenAI NassimNicholasTaleb Bill Gross Gabe Rivera Jessica Lessin Dan Primack Edward Tufte Fei-Fei Li
5 WIRED Om Malik Steve Case Josh Kopelman danah boyd Walt Mossberg Google Research Jimmy Wales Sebastian Thrun Mike Bloomberg
6 Nate Silver TechCrunch Sundar Pichai DAVE MORIN Pierre Omidyar Steven Pinker Jonah Peretti Paul Kedrosky Josh Elman a16z
7 Chris Dixon Ben Horowitz Patrick Collison Hunter Walk Emily Chang Keith Rabois jason Startup L. Jackson Jeff Hammerbacher Kevin Weil
8 Fred Wilson Kara Swisher Max Levchin Y Combinator dj patil Andrew Ng EFF Kevin Kelly Bret Taylor Mike Bostock
9 Ev Williams Chris Sacca Nick Bilton Lessig Ben Thompson Glenn Greenwald Chamath Palihapitiya John Collison Paul Buchheit Jeff Jordan

2c. Histogram of PageRank scores

PageRank scores are scaled such that the nodes have an average score of 1. So the scores below give an idea of how influential the nodes are, relative to an average node.


In [23]:
pageRank_to_plot = len(index) * results_df["PageRank"]
sns.distplot(pageRank_to_plot, kde=False, rug=True, bins = len(index)//10);


A joint plot showing how the Inlinks and Outlinks of the nodes are distributed (within the ego-network):


In [24]:
sns.jointplot(x="Inlinks", y="Outlinks", data=results_df, kind = "kde");


3. Identify implicit clusters using Clustering algos

Typically, the number of clusters is chosen from a plot of the within-cluster sum of squared distances versus the number of clusters (see the sketch below). Here, for simplicity, we use a simple heuristic to fix the number of clusters in advance.
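For reference, a sketch of that elbow plot (not executed in this notebook; it reuses binaryMap_cfr from above, and inertia_ is sklearn's within-cluster sum of squared distances):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(2, 16)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(binaryMap_cfr).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker='o')   # look for the 'elbow' where the curve flattens
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squared distances');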


In [25]:
n_clusters = min( int( round(np.sqrt(len(index)/2)) ), 10 ) # not more than 10 clusters
print(n_clusters)


10

3a. K-Means clustering

K-Means is a point-assignment based clustering algorithm: we start with k points chosen at random as centroids, assign the remaining points to the k centroids using some distance measure (Euclidean / Cosine / Jaccard), then compute new centroids, re-assign the points, and repeat until the centroids converge; a toy sketch of this loop follows below.
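For intuition only, a minimal NumPy sketch of that assign-and-recompute loop with Euclidean distance (toy_kmeans is a hypothetical helper, and it assumes no cluster ever goes empty; the notebook itself relies on sklearn's KMeans below):

import numpy as np

def toy_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # k random points as centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centroids; stop once they no longer move
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels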

Here we are more interested in the clustering inherent in who-follows-a-node (network-driven) rather than in whom-a-node-follows (self-driven). So the approach is to represent the nodes as observations and whether each of the other nodes follows them (1 or 0) as features (one observation per row, one feature per column). Thus the input matrix must be such that the value in a cell indicates whether the node in the column follows the node in the row.
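A quick sanity check of that orientation (trivial by construction, since binaryMap_cfr was built as the transpose of the row-follows-column matrix):

# cell [row, col] of binaryMap_cfr is 1 iff the column-node follows the row-node
assert (binaryMap_cfr.values == binaryMap_rfc.values.T).all()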


In [26]:
from sklearn.cluster import KMeans
est = KMeans(max_iter = 100000, n_clusters = n_clusters, n_init = 200, init='k-means++')  
results_df['kmeans'] = est.fit_predict(binaryMap_cfr)

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'kmeans', n_clusters)


Out[26]:
0 : 45 Ids 1 : 67 Ids 2 : 39 Ids 3 : 28 Ids 4 : 21 Ids 5 : 17 Ids 6 : 15 Ids 7 : 24 Ids 8 : 18 Ids 9 : 103 Ids
0 Edward Snowden Tim Urban John D. Cook Hilary Mason OpenAI Elon Musk WIRED Mike Bostock NassimNicholasTaleb Charlie Rose Show
1 Nate Silver Tim Ferriss Lynn Cherny dj patil Microsoft Research Bill Gates Eric Schmidt Gilad Lotan Neil deGrasse Tyson Irene Au
2 Sundar Pichai GV Ben Hamner Jeff Hammerbacher Andrew Ng Marc Andreessen TechCrunch Jeffrey Heer Max Roser Steven Brooks
3 Patrick Collison Mike Bloomberg David Smith Drew Conway Yann LeCun Tim O'Reilly Steven Levy Martin Wattenberg Steven Pinker Udacity
4 Nick Bilton Garry Tan Fernando Perez Wes McKinney Google Research Chris Dixon Chris Anderson Andrew Gelman Glenn Greenwald briankrebs
5 Benedict Evans Joe Gebbia Andreas Mueller Monica Rogati Sebastian Thrun Fred Wilson Steve Case Fernanda Viégas Atul Gawande Christopher Soghoian
6 Brad Stone Yves Behar Kaggle Hadley Wickham Fei-Fei Li Ev Williams Joi Ito Steven Strogatz EFF IDEO
7 Josh Kopelman Ashlee Vance John Foreman Sean J. Taylor Andrej Karpathy Reid Hoffman Lessig Duncan Watts Kevin Kelly Stanford University
8 DAVE MORIN Dustin Moskovitz Jake VanderPlas Nathan Yau Shivon Zilis Bill Gurley John Markoff Kenneth Cukier Tim Berners-Lee Gary Marcus
9 Hunter Walk Balaji S. Srinivasan Scientific Python Peter Skomoroch Pedro Domingos Paul Graham Bill Gross Sinan Aral Edward Tufte Matías Duarte

3b. Spectral clustering

One problem in using the K-Means algorithm to cluster the given social-graph is the 'curse of dimensionality': at high dimensions (here ~400), metrics like 'Euclidean distance', and with them the notion of a cluster 'centre', carry little meaning, especially when the underlying clusters are non-convex.

On the other hand, spectral clustering attempts to partition the graph such that the number of edges connecting different components is minimized. Below are a sketch of the underlying recipe and the output from Spectral Clustering.
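Roughly, the classic recipe behind it looks like this hedged sketch (reusing binaryMap_cfr and n_clusters from above): symmetrize the directed edges, embed the nodes using the smallest eigenvectors of the normalized graph Laplacian, then run k-means in that embedding. sklearn's SpectralClustering below wraps an equivalent pipeline with a nearest-neighbors affinity.

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

A = binaryMap_cfr.values
W = ((A + A.T) > 0).astype(float)      # symmetrize: an edge in either direction connects two nodes
L = laplacian(W, normed=True)          # normalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
embedding = eigvecs[:, :n_clusters]    # coordinates from the k smallest eigenvectors
labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)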


In [27]:
from sklearn import cluster

spectral = cluster.SpectralClustering(n_clusters=n_clusters,  n_init = 500,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
spectral.fit(binaryMap_cfr)
results_df['spectral'] = spectral.labels_.astype(np.int)

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'spectral', n_clusters)


Out[27]:
0 : 34 Ids 2 : 19 Ids 3 : 27 Ids 4 : 55 Ids 5 : 11 Ids 6 : 10 Ids 7 : 187 Ids 8 : 14 Ids 9 : 13 Ids
0 dj patil Lynn Cherny Tim O'Reilly Nate Silver Hilary Mason Elon Musk Edward Snowden Andrew Ng Luke Wroblewski
1 Jeff Hammerbacher Ben Hamner Ev Williams Sundar Pichai Drew Conway Bill Gates WIRED Yann LeCun Matías Duarte
2 Monica Rogati Olivier Grisel Bill Gurley Patrick Collison Wes McKinney Marc Andreessen Eric Schmidt Fei-Fei Li Julie Zhuo
3 Peter Skomoroch Fernando Perez Om Malik Benedict Evans Hadley Wickham Chris Dixon Vinod Khosla Andrej Karpathy Jon Wiley
4 Michael E. Driscoll Andreas Mueller Kara Swisher Y Combinator Sean J. Taylor Fred Wilson TechCrunch Stanford NLP Group Daniel Burka
5 bradford cross Jake VanderPlas Chris Sacca Emily Chang Nathan Yau Reid Hoffman Chris Anderson John Platt Braden Kowitz
6 Jeremy Howard Scientific Python Steven Levy Ben Thompson John Myles White Paul Graham OpenAI Richard Google Design
7 Mike Olson Sebastian Raschka Nick Bilton Brian Chesky chris wiggins Ben Horowitz Steve Case Oren Etzioni Josh Brewer
8 Anthony Goldbloom Trey Causey Josh Kopelman 🍪Steven Sinofsky ॐ Ben Lorica 罗瑞卡 Sam Altman Tim Urban Ian Goodfellow Jake Knapp
9 Joe Hellerstein Joshua Bloom DAVE MORIN Tony Fadell Josh Wills Max Levchin Joi Ito Nando de Freitas John Zeratsky

Nodes mainly from the Deep Learning community have grouped into Cluster 8, the Python Machine Learning community into Cluster 2, the Design community into Cluster 9, and the general Data Science community into Clusters 0 and 5.

One smaller cluster (1) wasn't shown above.


In [29]:
results_df [results_df['spectral'].isin([1])].sort_values(by = 'PageRank', ascending =False).set_index('FullName').head()


Out[29]:
PageRank UserName Inlinks Outlinks Followers Friends Location Created kmeans spectral
FullName
JAMA 0.000658 JAMA_current 4 3 205154 559 Chicago, IL May-2009 9 1
FSI Stanford 0.000643 FSIStanford 5 5 6614 285 Stanford, CA Dec-2010 9 1
Adam Johnson 0.000638 AJInsight 4 26 27945 1232 New York Jul-2010 9 1
PNAS 0.000595 PNASNews 4 17 66720 1110 Washington, DC Feb-2011 9 1
IQSS @ Harvard 0.000558 IQSS 5 4 4241 360 Cambridge, MA May-2009 9 1

3c. Affinity Propagation clustering

Unlike K-Means or Spectral clustering, Affinity Propagation doesn't require the number of clusters to be estimated beforehand. The algorithm finds 'exemplars', i.e. members of the dataset that are representative of clusters, and builds the clusters around them.
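The cluster count is instead steered indirectly by the preference parameter (set to -50 below); here's a hedged sketch of probing its effect, with made-up candidate values:

from sklearn.cluster import AffinityPropagation

# Lower (more negative) preferences generally yield fewer exemplars
for pref in (-50, -200, -500):
    af_probe = AffinityPropagation(preference=pref).fit(binaryMap_cfr)
    print(pref, len(af_probe.cluster_centers_indices_))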


In [30]:
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(preference=-50).fit(binaryMap_cfr)
results_df['affinity'] = af.labels_
n_clusters_affinity = len(af.cluster_centers_indices_)

print(str(n_clusters_affinity) + " affinity clusters.")

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'affinity', n_clusters_affinity)


128 affinity clusters.
Out[30]:
2 : 87 Ids 6 : 34 Ids 32 : 23 Ids 38 : 11 Ids 41 : 11 Ids 111 : 42 Ids
0 Tim Urban Adam Grant Lynn Cherny John D. Cook Jeff Jordan Yves Behar
1 Charlie Rose Show Steven Strogatz Ben Hamner chris wiggins Dustin Moskovitz Ashlee Vance
2 Irene Au Tim Harford Olivier Grisel Jeremy Howard Balaji S. Srinivasan WikiLeaks
3 John Carmack Gary Marcus Fernando Perez Xavier Amatriain Horace Dediu Seth Godin
4 The Information DHH Andreas Mueller Anthony Goldbloom Jessica Livingston Tom Hulme
5 Paul Ford BJ Fogg Jake VanderPlas Kaggle David Marcus Udacity
6 briankrebs Joe Blitzstein Scientific Python John Foreman Frank Chen IDEO
7 Christopher Soghoian Brian Nosek Sebastian Raschka Edwin Chen Kleiner Perkins Scott Dadich
8 John Hagel Martin Varsavsky Trey Causey Mark Faridani Amir Efrati ReadWrite
9 Sarah Frier Russell Roberts Joshua Bloom Shane Conway Scott Kupor Peter Diamandis

3d. Principal Component Analysis

To handle the 'curse of dimensionality' problem inherent in high-dimensional data, PCA is generally used to represent the data in a smaller number of dimensions - this helps in visualizing the data as well as in speeding up the computation when running clustering algorithms.
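As a hedged sketch of checking how much signal those few dimensions retain, the cumulative explained-variance ratio can guide the choice of n_components (the 80% threshold here is an arbitrary example):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(binaryMap_cfr)    # fit all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# first k whose cumulative ratio crosses 0.80 (assumes the threshold is reached)
print(np.argmax(cumvar >= 0.80) + 1, "components retain 80% of the variance")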


In [31]:
pca = PCA(n_components=3)
Xproj = pca.fit_transform(binaryMap_cfr)

results_df['dim1'] = Xproj[:,0]
results_df['dim2'] = Xproj[:,1]
results_df['dim3'] = Xproj[:,2]

results_df = results_df.sort_values( by = 'PageRank', ascending =False)
results_df.to_csv(cluster_filename)

print("Explained-variance and Proportion of Explained-variance in 3 dimensions [dim1 dim2 dim3]")
print(pca.explained_variance_, pca.explained_variance_ratio_)


Explained-variance and Proportion of Explained-variance in 3 dimensions [dim1 dim2 dim3]
[ 7.09266174  3.16212677  1.21891396] [ 0.18363161  0.08186862  0.03155813]

A simpler visualization of the spectral-clustering outcome, rendered in 2 dimensions:


In [32]:
# Spectral clustering | Plot the ego-network in 2 dimensions

plt.figure(num=None, figsize=(20, 10), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(results_df['dim1'], results_df['dim2'], s = 100  ,c= results_df['spectral'], 
            alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar();


4. Recommend new friends to follow

Now that we have PageRank scores in place for the nodes in the social-graph, we can generate recommendations on the basis of the top-ranked nodes in the graph, e.g. get 20 recommendations after looking at the friends of the top PageRank-scoring nodes in my network.

4a. Reco: After looking at top nodes in full ego-network


In [33]:
discover_Friends_toFollow(ids_of_interest = index,
                          friend_list = index, 
                          prop = .5 , count =  20).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at 187 friends' friends:
Out[33]:
Freq Created Followers Friends Location UserName
FullName
Medium 67 May-2012 2188119 78 San Francisco, CA, US Medium
Barack Obama 67 Mar-2007 88892730 629775 Washington, DC BarackObama
Aaron Levie 52 Mar-2007 1626688 423 Palo Alto levie
Clay Shirky 51 May-2007 365919 749 Shanghai cshirky
Mitch Kapor 47 Mar-2007 116400 553 Oakland mkapor
The New York Times 45 Mar-2007 37377094 894 New York City nytimes
jack 45 Mar-2006 4055017 2791 California, USA jack
marissamayer 44 Nov-2008 1705305 350 San Francisco, CA marissamayer
Stewart Butterfield 43 Sep-2006 65193 2418 West coast stewart
Anil Dash 40 Dec-2006 613133 4684 NYC anildash
dick costolo 40 May-2007 1739649 681 San Francisco dickc
Chris Anderson 40 Nov-2008 184874 977 Berkeley chr1sa
The Economist 39 May-2007 20816240 159 London TheEconomist
President Obama 39 Jun-2013 15459612 79 Washington, D.C. POTUS44
Farhad Manjoo 🐣 39 Mar-2007 158093 3674 I'm right over here (it's not NYC) fmanjoo
John Maeda 38 Jul-2008 439273 21724 WordPressLandia johnmaeda
David Pogue 38 Oct-2007 1564476 1794 USA Pogue
Hanna Wallach 37 Sep-2012 7537 795 Brooklyn, NY hannawallach
Tim Cook 37 Jul-2013 4925005 56 Cupertino tim_cook
Hillary Clinton 37 Apr-2013 15590192 762 New York, NY HillaryClinton

Recommendations can also be made on the basis of a specific cluster outcome, e.g. nodes to follow on the basis of the top PageRank nodes in the 'Spectral Clustering::Data' clusters.

4b. Reco: After looking at Data-Science clusters (Spectral clustering)


In [34]:
favorite_cluster_df = results_df [results_df['spectral'].isin([0,2,5,8])]
favorite_cluster_list = list(favorite_cluster_df.index)

discover_Friends_toFollow(ids_of_interest = favorite_cluster_list, 
                          friend_list = index, 
                          prop = .5, count = 30).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at  39 friends' friends:
... Cache-Miss for 3 nodes. Updating Cache...
Out[34]:
Freq Created Followers Friends Location UserName
FullName
Pete Warden 21 May-2008 12142 1559 San Francisco, CA petewarden
Hanna Wallach 19 Sep-2012 7537 795 Brooklyn, NY hannawallach
Barack Obama 19 Mar-2007 88892730 629775 Washington, DC BarackObama
Alex Smola 19 Jun-2008 8768 48 Mountain View and Pittsburgh smolix
Hal Daumé III 18 Jun-2010 8702 1017 Washington DC haldaume3
Chris Diehl 17 Apr-2008 4302 999 San Francisco Bay Area, CA ChrisDiehl
Julie Steele 17 Mar-2008 4199 913 NYC and Providence jsteeleeditor
Joseph Adler 17 Apr-2008 1583 623 Silicon Valley jadler
joseph reisinger 17 Jul-2009 1933 686 現在地 josephreisinger
Daniel Tunkelang 16 Aug-2008 8234 20 Mountain View dtunkelang
KDnuggets 16 Feb-2009 84083 397 Brookline, MA, USA kdnuggets
Demis Hassabis 16 Jun-2013 69914 21 demishassabis
Ryan Adams 16 Jan-2015 10417 674 ryan_p_adams
Hugo Larochelle 16 Jun-2015 22563 350 hugo_larochelle
Oriol Vinyals 16 Oct-2015 18911 42 London, England OriolVinyalsML
Jack Clark 16 Oct-2009 18015 2506 San Francisco, CA jackclarkSF
Fernando Pereira 16 Oct-2010 8045 186 Menlo Park, CA, USA earnmyturns
Philip (flip) Kromer 15 Mar-2007 4773 1417 Austin, TX mrflip
Josh Reich 15 Dec-2008 5315 1278 Portland i2pi
Sean Gourley 15 Dec-2008 8062 1947 San Francisco, New Zealand sgourley
Doug Cutting 15 Jul-2007 22676 116 California, USA cutting
ML Hipster 15 Jul-2012 12401 0 Some local minima ML_Hipster
Panos Ipeirotis 14 Oct-2008 3988 259 New York, NY ipeirotis
Beau Cronin 14 Jun-2009 3705 291 Neck deep in AWS beaucronin
Florian Leibert 14 Jan-2009 11076 1741 San Francisco, CA flo
Mark Reid 14 Feb-2008 4625 718 Sunnyvale, CA mdreid
Claudia Perlich 14 Feb-2012 2732 105 NYC claudia_perlich
Alistair Croll 14 Mar-2007 22391 11537 Montreal, Quebec acroll
Joseph Turian 14 Dec-2007 2873 991 San Francisco turian
Neha Narkhede 14 Sep-2009 11158 678 Mountain View nehanarkhede

4c. Reco: After looking at the Design cluster (Spectral clustering)


In [35]:
discover_Friends_toFollow(ids_of_interest = list(results_df [results_df['spectral'] == 9].index), 
                          friend_list = index, 
                          prop = 1, count = 20).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at  13 friends' friends:
... Cache-Miss for 18 nodes. Updating Cache...
Out[35]:
Freq Created Followers Friends Location UserName
FullName
Nicholas Jitkoff 8 Jan-2007 4605 204 Palo Alto, CA alcor
John Maeda 8 Jul-2008 439273 21724 WordPressLandia johnmaeda
Joshua Porter 7 Mar-2007 33903 817 Newburyport, MA bokardo
Khoi Vinh 7 Dec-2006 347073 2347 New York, NY khoi
Brett Lider 7 Dec-2008 6953 947 San Francisco, California bl
Margaret Stewart 7 Mar-2007 12846 862 Bay Area, CA mags
Paul Stamatiou 📷 7 Jan-2007 33711 599 San Francisco, CA Stammy
GV Design 7 Aug-2011 17799 515 San Francisco, CA GVDesignTeam
Marc Hemeon 6 Jan-2009 16957 3322 Southern California hemeon
Michael Leggett 6 Mar-2007 7999 209 Seattle leggett
M.G. Siegler 6 Jan-2007 190837 996 San Francisco, CA mgsiegler
Jenna Bilotta 6 Apr-2009 12558 287 San Francisco, CA jenna
Noah Levin 6 Nov-2008 5295 848 New York, NY nlevin
Geoff Teehan 6 Oct-2008 22642 810 Palo Alto, CA gt
Jonathan Lee 6 Apr-2009 2771 610 hifromjonathan
Google 6 Feb-2009 18020558 204 Mountain View, CA Google
Paul Adams 6 Jan-2007 22637 387 Dublin Padday
dustin senos 6 Sep-2006 11846 721 West Coast dustin
Doug Bowman 6 Mar-2007 232398 500 San Diego stop
Katie Dill 6 Apr-2009 4751 413 San Francisco lil_dill