When it comes to consuming real-time content, Twitter is the place to be. Be it sending out 'Tweets' in real time, discovering the latest online 'Trends' anywhere, or striking up a 'conversation' with anyone, Twitter does it all. In fact, Twitter's management wrote this in their letter to shareholders last year:

We’re focused now on what Twitter does best: live. Twitter is live: live commentary, live connections, live conversations. Whether it’s breaking news, entertainment, sports, or everyday topics, hearing about and watching a live event unfold is the fastest way to understand the power of Twitter.

Twitter has always been considered a “second screen” for what’s happening in the world and we believe we can become the first screen for everything that’s happening now. And by doing so, we believe we can build the planet’s largest daily connected audience. A connected audience is one that watches together, and can talk with one another in real-time. It’s what Twitter has provided for close to 10 years, and it’s what we will continue to drive in the future.

Embedded in a Twitter user's social-graph* is a wealth of information on the user's likes and interests. Unlike Facebook or LinkedIn, the magic of Twitter is in its 'Follow' structure - anyone can follow anyone without the two knowing each other. This directed social-graph, when methodically summarized, can reveal the most influential/central friends in the network and also help personalize/enrich one's Twitter experience by unearthing the implicit clusters in the network.

*Social-graph: [User, 1st degree Connections, and the links between]

In this notebook, we'll look into the following:

  1. Extract the ego-network of 'self' using the Twitter API
  2. Identify the influential nodes using the 'PageRank' formulation
  3. Identify the implicit clusters formed
  4. Recommend new friends to follow on the basis of a cluster of interest

Note: This study is limited to the ego network: the focal node ("ego": here the self-node), the nodes to whom the ego is directly connected ("alters"), plus the ties, if any, among the alters.
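For intuition, here's a minimal toy sketch (using networkx, which isn't used elsewhere in this notebook; the node names are made up) of what an ego network captures on a small directed graph:

import networkx as nx

# Toy directed graph: 'ego' follows a and b; a follows b; b follows c
G = nx.DiGraph([('ego', 'a'), ('ego', 'b'), ('a', 'b'), ('b', 'c')])

ego_net = nx.ego_graph(G, 'ego')    # ego + alters + ties among the alters
print(sorted(ego_net.edges()))      # [('a', 'b'), ('ego', 'a'), ('ego', 'b')]
# ('b', 'c') is excluded: c is not directly connected to ego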

1. Extract social-graph structure using the Twitter API


In [1]:
from collections import Counter, defaultdict
from datetime import datetime
from sklearn.decomposition import PCA

import csv
import matplotlib.pyplot as plt
import numpy as np
import os.path
import pandas as pd
import re
import seaborn as sns;  sns.set()
import time
import twitter

%matplotlib inline
plt.rcParams['figure.figsize'] = (10,7)

import warnings
warnings.filterwarnings("ignore")
#print twitter.__path__  

import random
random.seed(1000)

In [2]:
# Pandas data-frame print formatting
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))


Out[2]:

In [3]:
self_screen_name = 'bala_io'            # Self

# Keep appending data
fof_filename     = "edges.csv"          # 'Alters' and their Source->Sink Edges. 
cache_filename   = "cache.csv"          # local cache of (TwitterId, FullName, UserName)

# One-time-use files
binaryMap_filename = "binaryMap.csv"   # Directed graph. Adjacencies as 0/1, row-follows-column
cluster_filename   = "results.csv"

In [4]:
# Twitter auth. https://dev.twitter.com/oauth/overview/application-owner-access-tokens

with open("../../passwd/credentials.txt", "r") as f:
    reader = csv.reader(f )
    login_dict = {line[0]: line[1]                       
                    for line in reader}        

api = twitter.Api(consumer_key=login_dict.get('consumer_key') ,
                  consumer_secret=login_dict.get('consumer_secret'),
                  access_token_key=login_dict.get('access_token_key'),
                  access_token_secret=login_dict.get('access_token_secret'))
api


Out[4]:
<twitter.api.Api at 0x7f1c2f6bb898>

In [5]:
# 'Self' and Friends of Self

self_node    =  api.GetUser(screen_name = self_screen_name)    
self_node_id =  str(self_node.id)                      # Twitter Id of Self

friends_of_self = api.GetFriendIDs(user_id = self_node_id, 
                                   screen_name = self_screen_name , 
                                   stringify_ids = True)
index = [self_node_id] + friends_of_self

In [6]:
# GetFriendIDs() API call is rate-limited at 15 req / 15 min 
# https://dev.twitter.com/rest/public/rate-limiting

# For each of the list of nodes, fetch the list of nodes it follows, append to file

def update_FoF_File(fileName, to_fetch_list):
    
    with open(fileName, 'a') as f:
        apiReqCount = 0
        
        for node in to_fetch_list:
            friends_of_node = api.GetFriendIDs(user_id = node,  stringify_ids = True) 
            row = ','.join([str(i) for i in [node] +  friends_of_node ]) + "\n"
            f.write(row) 
            
            apiReqCount += 1
            
            if (apiReqCount == 15): 
                apiReqCount = 0
                
                print("Off to Sleep :)")
                time.sleep(15*60 + 10)

In [7]:
# Parse FoF file and return the list of nodes whose source->sink edges are already on file

def getFinishList(fileName):
        
    if not os.path.isfile(fileName):
        return [] 
    with open(fileName, 'r') as f:
        return [ row.strip().split(',')[0] for row in f ]  # 1st entry is a user

In [8]:
# Ego-network as adjacency-matrix
# Parses FoF file in the order of index, creates list of adjacencies as 0 | 1
# Writes to file. Adjacency-matrix in Row_follows_Column format

def updateBinaryMapFile(fof_filename, binaryMap_filename, index):
    
    with open(fof_filename, "r") as f:
        stripped_f = (line.replace('\r', '') for line in f)
        reader = csv.reader(stripped_f)
        fof_dict = {line[0]: line[1:] for line in reader 
                    if line[0] in index}   # dict of node: nodes it follows
        if self_node_id not in fof_dict:
            fof_dict[self_node_id] = index[1:]             # dict of Self
    
    bool_list = []
    
    for user in index:
        user_friends = set( fof_dict[user] )  
        bool_row = [item in user_friends for item in index]  # for each, fill T/F 
        bool_list.append(bool_row)
    
    int_nparray = np.array(bool_list) + 0                    # Bool to int

    binaryMap_rfc = pd.DataFrame(data = int_nparray, columns= index, index = index)
    binaryMap_rfc.to_csv(binaryMap_filename)

In [9]:
# For list of Ids, fetch Profile details. If not in Offline file, make an API call
# Returns ['UserName', 'FullName', 'Followers_count', 'Friends_count', 'Location', 'Created_at']
# UsersLookup API 100 Ids/request

def lookup_in_cache(friendsIdsList):

    cache, delta_cache = pd.DataFrame(), pd.DataFrame()
    UserNameList, namesList     = [], []
    followers_count, friends_count = [], []
    location, created_at           = [], []
    
    if os.path.isfile(cache_filename):
        cache = pd.read_csv(cache_filename, skip_blank_lines=True,
                        dtype={'Ids':str, 'Friends_count':int, 'Followers_count':int})
        cache.set_index('Ids', inplace=True)
        to_fetch_list = list ( set (friendsIdsList) - set(cache.index) )
    else :        
        to_fetch_list = friendsIdsList   
    
    i = 0
    while (i * 100 < len(to_fetch_list)):  # process in batches of 100
        print("... Cache-Miss for " + str(len(to_fetch_list)) +  " nodes. Updating Cache...")
        low, high = i * 100, min( len(to_fetch_list), (i+1)*100 )  # UsersLookup api
        twitterObjectsList = api.UsersLookup(user_id = to_fetch_list[low:high])    
        
        temp = zip(*[( tempObject.screen_name,        #ScreenName
                      tempObject.name,                #Name
                      tempObject.followers_count,     #Followers
                      tempObject.friends_count,       #Friends
                      tempObject.location,            #Location
                      tempObject.created_at           #CreatedAt
                     ) for tempObject in twitterObjectsList])
        temp = list(temp)
        
        UserNameList    += list(temp[0])
        namesList       += list(temp[1])
        followers_count += list(temp[2])
        friends_count   += list(temp[3])
        location        += list(temp[4])
        created_at      += list(temp[5])
        
        i = i + 1       
    
    if len(to_fetch_list) > 0:
        delta_cache  = pd.DataFrame({'UserName':UserNameList,
                                     'FullName':namesList,
                                     'Ids': to_fetch_list,
                                     'Followers':followers_count,
                                     'Friends': friends_count,
                                     'Location':location,
                                     'Created':created_at})
        
        delta_cache['Created'] = delta_cache['Created'].apply(lambda x: 
                                                datetime.strptime(
                                                    re.sub(r"\+[0-9]* ", "",x),'%c').
                                                strftime("%b-%Y"))
    
        delta_cache.set_index('Ids', inplace=True, drop = True)
        cache = cache.append(delta_cache)
        cache.to_csv(cache_filename)        

    return cache.loc[friendsIdsList]

In [10]:
# Display cluster-wise most-influential users, for the given clustering algo

def top_nodes_in_cluster(df, cluster_algo, n_clusters):
    dummy_df = pd.DataFrame()

    for i in range(n_clusters):
        nodes_in_cluster = list( df [df[cluster_algo] == i ]['FullName'] )     
        if len(nodes_in_cluster) >= 10:            # show only clusters of size > 10        
            col_name           = str(i) + " : " + str(len(nodes_in_cluster)) + " Ids"
            dummy_df[col_name] = nodes_in_cluster[:10]      
    
    return dummy_df

In [11]:
# identify 20 friends to follow after aggregating friends followed by top 50% in list

def discover_Friends_toFollow(ids_of_interest, friend_list, prop = .5, count = 20):    
    
    ids_of_interest  = ids_of_interest[:int(len(ids_of_interest) * prop)]
    
    if self_node_id in ids_of_interest:
        ids_of_interest.remove(self_node_id)

    print("'Who-To-Follow' reco after looking at %3d friends' friends:" %(len(ids_of_interest)))
    
    with open(fof_filename) as f:
        reader = csv.reader(f)
        fof_dict = {row[0]: row[1:] for row in reader}  # dict of node: nodes it follows

    friendsToFollow = []
    for node_id in ids_of_interest:
        friendsToFollow += list (set(fof_dict[str(node_id)])  - set(friend_list) )

    friendsToFollow = Counter(friendsToFollow).most_common(count)    
    tuples_list = list(zip(*friendsToFollow) )
           
    topFriendsToFollowDF = pd.DataFrame()
    topFriendsToFollowDF['Ids'] = list(tuples_list[0])
    topFriendsToFollowDF['Freq'] = list(tuples_list[1])
    topFriendsToFollowDF.set_index('Ids', drop = True, inplace = True)              
    index = topFriendsToFollowDF.index
    
    topFriendsToFollowDF = topFriendsToFollowDF.merge(lookup_in_cache(index), copy = False,
                              left_index = True, right_index = True)
    
    return topFriendsToFollowDF

In [12]:
# For the list of nodes I follow, fetch their friends-list

fof_finish_list   = getFinishList(fof_filename )        # Completed nodes
fof_to_fetch_list = list ( set(friends_of_self) - set(fof_finish_list) )  # Pending nodes

print( str(len(fof_to_fetch_list)) + " out of " + str(len(index)  - 1) + 
      " Friends details to be fetched")


1 out of 376 Friends details to be fetched

In [13]:
# For the remaining nodes, populate their details in fof_file
update_FoF_File(fof_filename, fof_to_fetch_list) 

# Build the adjacency matrix in terms of 0 and 1 (if there is an edge)
updateBinaryMapFile(fof_filename, binaryMap_filename, index)

In [14]:
# Read adj-matrix into df. Cell MxN is 1 iff node in Mth row follows node in Nth column 
binaryMap_rfc = pd.read_csv(binaryMap_filename, skip_blank_lines=True, index_col = 0)
print(binaryMap_rfc.shape)


outlinks_count = binaryMap_rfc.sum(axis = 1)   # horizontal-sum to count outlinks
inlinks_count = binaryMap_rfc.sum(axis = 0)   # vertical-sum to count inlinks


(377, 377)

In [15]:
# Histogram of number of OutLinks per node, within ego-network
sns.distplot(outlinks_count, bins = len(index)//10, kde=False);



In [16]:
# Histogram of number of InLinks per node, within ego-network
sns.distplot(inlinks_count, bins = len(index)//10, kde=False);


2. Identify influential friends using the 'PageRank' formulation

From the adjacency matrix generated above, we can construct a column-stochastic matrix (also called a transition matrix) such that a column with m outlinks holds the value 1/m in each of the corresponding m cells. Conceptually, the value in cell (a, b) gives the probability of a random surfer at node b jumping to node a.
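To make this concrete, here's a toy 3-node sketch (hypothetical nodes a, b, c; not part of the actual pipeline) of turning a row-follows-column adjacency matrix into a column-stochastic matrix:

import pandas as pd

# Toy adjacency, row-follows-column: a follows b & c; b follows a; c follows b
adj_rfc = pd.DataFrame([[0, 1, 1],
                        [1, 0, 0],
                        [0, 1, 0]],
                       index=list('abc'), columns=list('abc'))

adj_cfr = adj_rfc.T                   # transpose: each column now lists a node's outlinks
M = adj_cfr / adj_cfr.sum(axis=0)     # normalize each column to sum to 1
print(M)
# Column 'a' holds 0.5 against rows 'b' and 'c': a random surfer at node a
# jumps to b or c with probability 1/2 each.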


In [17]:
binaryMap_cfr      = binaryMap_rfc.transpose()            # column-values: Outlinks
binaryMap_cfr_norm = binaryMap_cfr / binaryMap_cfr.sum(axis = 0)
colStochMatrix = np.matrix( binaryMap_cfr_norm.fillna(0)) # column-stochastic-matrix

Initialize the PageRank vector such that all the nodes have an equal PageRank score, adding up to 1.


In [18]:
pageRankVector = np.matrix([1.0/len(index)] *  len(index)) # initialize page-rank-vector 
pageRankVector = pageRankVector.transpose()                # transpose to column-vector

On applying the above transition-matrix transformation iteratively to the PageRank vector, the vector will eventually converge such that:

Matrix · Vector = Vector

Equivalently, this is an eigenvector formulation, with the PageRank vector being the principal eigenvector of the matrix, corresponding to eigenvalue 1 (since M is a column-stochastic matrix, its principal eigenvalue is 1).

Here we use the Power Iteration method to solve for the PageRank vector. In order to handle nodes that have zero out-links (dead ends) and nodes caught in periodic loops (spider traps) in the ego-network, we introduce some randomness through the parameter beta, such that a random surfer at any node takes a random jump instead of following an out-link roughly once every 6-7 moves (1 - beta = 0.15). Concretely, each iteration computes v ← beta·M·v and then redistributes the leaked rank uniformly, adding (1 - S)/n to every entry, where S is the sum of the damped vector and n the number of nodes.
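As a sanity check on why this works, here's a toy verification (a hedged sketch on a made-up 3-node matrix with no dead ends, so teleportation isn't needed) that power iteration recovers the same vector as an explicit eigen-decomposition:

import numpy as np

# Toy column-stochastic matrix: every column sums to 1
M = np.array([[0. , 0.5, 0.5],
              [0.5, 0. , 0.5],
              [0.5, 0.5, 0. ]])

v = np.full(3, 1.0/3)                 # uniform initial PageRank vector
for _ in range(100):                  # power iteration: v <- M.v
    v = M @ v

w, vecs = np.linalg.eig(M)
principal = vecs[:, np.argmax(w.real)].real
principal /= principal.sum()          # rescale the eigenvector so scores sum to 1

print(np.round(v, 4), np.round(principal, 4))    # both ~[0.3333 0.3333 0.3333]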


In [19]:
# PageRank algo: Power Iteration to solve Markov transition matrix 
# refer this     : http://setosa.io/blog/2014/07/26/markov-chains/index.html

beta = 0.85
epsilon = 999
iteration = 0
while epsilon > (1.0/(10**16)):
    pageRankVectorUpdating = colStochMatrix * pageRankVector * beta 
    
    # re-insert leaked page-ranks
    S = np.array(pageRankVectorUpdating).sum()                      
    pageRankVectorUpdated = pageRankVectorUpdating + (
        1 - S) * (1.0/len(index)) * np.ones_like(pageRankVector)
    
    # compute the squared-difference and check for convergence
    error = np.array(pageRankVectorUpdated - pageRankVector)
    epsilon = np.sqrt((error *  error).sum())   
    iteration = iteration + 1
    pageRankVector = pageRankVectorUpdated    
    
print( "Sum of Page-Rank Scores: " + str(pageRankVector.sum()) + 
      "\nConverged in " + str(iteration) + " iterations")


Sum of Page-Rank Scores: 1.0
Converged in 58 iterations

In [20]:
# Collect the results

results_df = pd.DataFrame()
results_df['Ids'], results_df['PageRank'] = index, pageRankVector
results_df['Inlinks'], results_df['Outlinks'] = list(inlinks_count), list(outlinks_count)
results_df = results_df.set_index('Ids', drop = True )

results_df = results_df.merge(lookup_in_cache(index), copy = False,
                              left_index = True, right_index = True)
results_df = results_df[['PageRank','UserName', 'FullName', 'Inlinks' , 'Outlinks',
                         'Followers','Friends', 'Location', 'Created' ]]


... Cache-Miss for 1 nodes. Updating Cache...

2a. Nodes with high PageRank scores


In [21]:
results_df.fillna('').sort_values(by = 'PageRank', ascending =False).set_index('FullName').head(10)


Out[21]:
PageRank UserName Inlinks Outlinks Followers Friends Location Created
FullName
Elon Musk 0.014158 elonmusk 201 4 8845530 41 Boring Jun-2009
Bill Gates 0.011921 BillGates 168 16 34940895 183 Seattle, WA Jun-2009
Marc Andreessen 0.011725 pmarca 192 24 645085 13952 Menlo Park, CA May-2007
Edward Snowden 0.010114 Snowden 97 0 3150406 1 Dec-2014
Tim O'Reilly 0.009973 timoreilly 178 132 1987340 1709 Oakland, CA Mar-2007
WIRED 0.009110 WIRED 103 10 8974304 279 San Francisco/New York Mar-2007
Nate Silver 0.008281 NateSilver538 145 26 2276360 992 New York Aug-2008
Chris Dixon 0.008217 cdixon 162 184 545467 3054 CA & NYC Mar-2007
Fred Wilson 0.008190 fredwilson 158 75 617149 1160 New York City Mar-2007
Ev Williams 0.008037 ev 141 104 1979132 1781 San Francisco, CA, US Mar-2006

2b. Top 100 influential nodes in the Ego-Network


In [22]:
dummy_df = pd.DataFrame()
temp_df  = results_df.sort_values( by = 'PageRank', ascending =False)

for i in range(10):
    dummy_df[i] = list (temp_df [10*i : 10* i + 10]['FullName'])
dummy_df


Out[22]:
0 1 2 3 4 5 6 7 8 9
0 Elon Musk Reid Hoffman Sam Altman Tim Urban Microsoft Research Brian Chesky Tony Fadell John D. Cook David Sacks Clayton Christensen
1 Bill Gates Bill Gurley Steven Levy Joi Ito John Markoff Max Roser Matt Cutts Naval Ravikant Tim Berners-Lee GV
2 Marc Andreessen Eric Schmidt Chris Anderson Benedict Evans Neil deGrasse Tyson Bradley Horowitz Yann LeCun Tim Ferriss Recode Mark Suster
3 Edward Snowden Paul Graham Hilary Mason Brad Stone Charlie Rose Show 🍪Steven Sinofsky ॐ Atul Gawande John Lilly Sarah Lacy Drew Conway
4 Tim O'Reilly Vinod Khosla OpenAI NassimNicholasTaleb Bill Gross Gabe Rivera Jessica Lessin Dan Primack Edward Tufte Fei-Fei Li
5 WIRED Om Malik Steve Case Josh Kopelman danah boyd Walt Mossberg Google Research Jimmy Wales Sebastian Thrun Mike Bloomberg
6 Nate Silver TechCrunch Sundar Pichai DAVE MORIN Pierre Omidyar Steven Pinker Jonah Peretti Paul Kedrosky Josh Elman a16z
7 Chris Dixon Ben Horowitz Patrick Collison Hunter Walk Emily Chang Keith Rabois jason Startup L. Jackson Jeff Hammerbacher Kevin Weil
8 Fred Wilson Kara Swisher Max Levchin Y Combinator dj patil Andrew Ng EFF Kevin Kelly Bret Taylor Mike Bostock
9 Ev Williams Chris Sacca Nick Bilton Lessig Ben Thompson Glenn Greenwald Chamath Palihapitiya John Collison Paul Buchheit Jeff Jordan

2c. Histogram of PageRank scores

PageRank scores are scaled such that the nodes have an average score of 1. So the scores below give an idea of how influential the nodes are, relative to an average node.


In [23]:
pageRank_to_plot = len(index) * results_df["PageRank"]
sns.distplot(pageRank_to_plot, kde=False, rug=True, bins = len(index)//10);


A joint plot showing how the Inlinks and Outlinks of the nodes are distributed (within the ego-network):


In [24]:
sns.jointplot(x="Inlinks", y="Outlinks", data=results_df, kind = "kde");


3. Identify implicit clusters using Clustering algos

Typically, the number of clusters is chosen from a plot of the within-cluster sum of squared distances versus the number of clusters (see the sketch below). Here, for simplicity, we use a simple heuristic to fix the number of clusters in advance.
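For reference, a sketch of that elbow plot (not executed in this notebook; it reuses binaryMap_cfr from above, and inertia_ is sklearn's within-cluster sum of squared distances):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(2, 16)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(binaryMap_cfr).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker='o')   # look for the 'elbow' where the curve flattens
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squared distances');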


In [25]:
n_clusters = min( int( round(np.sqrt(len(index)/2)) ), 10 ) # not more than 10 clusters
print(n_clusters)


10

3a. K-Means clustering

K-Means is a point-assignment based clustering algorithm: we start with k points chosen at random as centroids, assign the remaining points to the k centroids using some distance measure (Euclidean / Cosine / Jaccard), then compute new centroids, re-assign the points, and repeat until the centroids converge; a toy sketch of this loop follows below.
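For intuition only, a minimal NumPy sketch of that assign-and-recompute loop with Euclidean distance (toy_kmeans is a hypothetical helper, and it assumes no cluster ever goes empty; the notebook itself relies on sklearn's KMeans below):

import numpy as np

def toy_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # k random points as centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centroids; stop once they no longer move
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels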

Here we are more interested in the clustering inherent in who-follows-a-node (network-driven) rather than in whom-a-node-follows (self-driven). So the approach is to represent the nodes as observations and whether each of the other nodes follows them (1 or 0) as features (one observation per row, one feature per column). Thus the input matrix must be such that the value in a cell indicates whether the node in the column follows the node in the row.
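A quick sanity check of that orientation (trivial by construction, since binaryMap_cfr was built as the transpose of the row-follows-column matrix):

# cell [row, col] of binaryMap_cfr is 1 iff the column-node follows the row-node
assert (binaryMap_cfr.values == binaryMap_rfc.values.T).all()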


In [26]:
from sklearn.cluster import KMeans
est = KMeans(max_iter = 100000, n_clusters = n_clusters, n_init = 200, init='k-means++')  
results_df['kmeans'] = est.fit_predict(binaryMap_cfr)

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'kmeans', n_clusters)


Out[26]:
0 : 45 Ids 1 : 67 Ids 2 : 39 Ids 3 : 28 Ids 4 : 21 Ids 5 : 17 Ids 6 : 15 Ids 7 : 24 Ids 8 : 18 Ids 9 : 103 Ids
0 Edward Snowden Tim Urban John D. Cook Hilary Mason OpenAI Elon Musk WIRED Mike Bostock NassimNicholasTaleb Charlie Rose Show
1 Nate Silver Tim Ferriss Lynn Cherny dj patil Microsoft Research Bill Gates Eric Schmidt Gilad Lotan Neil deGrasse Tyson Irene Au
2 Sundar Pichai GV Ben Hamner Jeff Hammerbacher Andrew Ng Marc Andreessen TechCrunch Jeffrey Heer Max Roser Steven Brooks
3 Patrick Collison Mike Bloomberg David Smith Drew Conway Yann LeCun Tim O'Reilly Steven Levy Martin Wattenberg Steven Pinker Udacity
4 Nick Bilton Garry Tan Fernando Perez Wes McKinney Google Research Chris Dixon Chris Anderson Andrew Gelman Glenn Greenwald briankrebs
5 Benedict Evans Joe Gebbia Andreas Mueller Monica Rogati Sebastian Thrun Fred Wilson Steve Case Fernanda Viégas Atul Gawande Christopher Soghoian
6 Brad Stone Yves Behar Kaggle Hadley Wickham Fei-Fei Li Ev Williams Joi Ito Steven Strogatz EFF IDEO
7 Josh Kopelman Ashlee Vance John Foreman Sean J. Taylor Andrej Karpathy Reid Hoffman Lessig Duncan Watts Kevin Kelly Stanford University
8 DAVE MORIN Dustin Moskovitz Jake VanderPlas Nathan Yau Shivon Zilis Bill Gurley John Markoff Kenneth Cukier Tim Berners-Lee Gary Marcus
9 Hunter Walk Balaji S. Srinivasan Scientific Python Peter Skomoroch Pedro Domingos Paul Graham Bill Gross Sinan Aral Edward Tufte Matías Duarte

3b. Spectral clustering

One problem in using the K-Means algorithm to cluster the given social-graph is the 'curse of dimensionality': at high dimensions (here ~400), metrics like 'Euclidean distance', and with them the notion of a cluster 'centre', carry little meaning, especially when the underlying clusters are non-convex.

On the other hand, spectral clustering attempts to partition the graph such that the number of edges connecting different components is minimized. Below are a sketch of the underlying recipe and the output from Spectral Clustering.
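Roughly, the classic recipe behind it looks like this hedged sketch (reusing binaryMap_cfr and n_clusters from above): symmetrize the directed edges, embed the nodes using the smallest eigenvectors of the normalized graph Laplacian, then run k-means in that embedding. sklearn's SpectralClustering below wraps an equivalent pipeline with a nearest-neighbors affinity.

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

A = binaryMap_cfr.values
W = ((A + A.T) > 0).astype(float)      # symmetrize: an edge in either direction connects two nodes
L = laplacian(W, normed=True)          # normalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
embedding = eigvecs[:, :n_clusters]    # coordinates from the k smallest eigenvectors
labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)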


In [27]:
from sklearn import cluster

spectral = cluster.SpectralClustering(n_clusters=n_clusters,  n_init = 500,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
spectral.fit(binaryMap_cfr)
results_df['spectral'] = spectral.labels_.astype(np.int)

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'spectral', n_clusters)


Out[27]:
0 : 34 Ids 2 : 19 Ids 3 : 27 Ids 4 : 55 Ids 5 : 11 Ids 6 : 10 Ids 7 : 187 Ids 8 : 14 Ids 9 : 13 Ids
0 dj patil Lynn Cherny Tim O'Reilly Nate Silver Hilary Mason Elon Musk Edward Snowden Andrew Ng Luke Wroblewski
1 Jeff Hammerbacher Ben Hamner Ev Williams Sundar Pichai Drew Conway Bill Gates WIRED Yann LeCun Matías Duarte
2 Monica Rogati Olivier Grisel Bill Gurley Patrick Collison Wes McKinney Marc Andreessen Eric Schmidt Fei-Fei Li Julie Zhuo
3 Peter Skomoroch Fernando Perez Om Malik Benedict Evans Hadley Wickham Chris Dixon Vinod Khosla Andrej Karpathy Jon Wiley
4 Michael E. Driscoll Andreas Mueller Kara Swisher Y Combinator Sean J. Taylor Fred Wilson TechCrunch Stanford NLP Group Daniel Burka
5 bradford cross Jake VanderPlas Chris Sacca Emily Chang Nathan Yau Reid Hoffman Chris Anderson John Platt Braden Kowitz
6 Jeremy Howard Scientific Python Steven Levy Ben Thompson John Myles White Paul Graham OpenAI Richard Google Design
7 Mike Olson Sebastian Raschka Nick Bilton Brian Chesky chris wiggins Ben Horowitz Steve Case Oren Etzioni Josh Brewer
8 Anthony Goldbloom Trey Causey Josh Kopelman 🍪Steven Sinofsky ॐ Ben Lorica 罗瑞卡 Sam Altman Tim Urban Ian Goodfellow Jake Knapp
9 Joe Hellerstein Joshua Bloom DAVE MORIN Tony Fadell Josh Wills Max Levchin Joi Ito Nando de Freitas John Zeratsky

Nodes mainly from the Deep Learning community have grouped into Cluster 8, the Python Machine Learning community into Cluster 2, the Design community into Cluster 9, and the general Data Science community into Clusters 0 and 5.

One smaller cluster (1) wasn't shown above.


In [29]:
results_df [results_df['spectral'].isin([1])].sort_values(by = 'PageRank', ascending =False).set_index('FullName').head()


Out[29]:
PageRank UserName Inlinks Outlinks Followers Friends Location Created kmeans spectral
FullName
JAMA 0.000658 JAMA_current 4 3 205154 559 Chicago, IL May-2009 9 1
FSI Stanford 0.000643 FSIStanford 5 5 6614 285 Stanford, CA Dec-2010 9 1
Adam Johnson 0.000638 AJInsight 4 26 27945 1232 New York Jul-2010 9 1
PNAS 0.000595 PNASNews 4 17 66720 1110 Washington, DC Feb-2011 9 1
IQSS @ Harvard 0.000558 IQSS 5 4 4241 360 Cambridge, MA May-2009 9 1

3c. Affinity Propagation clustering

Unlike K-Means or Spectral clustering, Affinity Propagation doesn't require the number of clusters to be estimated beforehand. The algorithm finds 'exemplars', i.e. members of the dataset that are representative of clusters, and builds the clusters around them.
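The cluster count is instead steered indirectly by the preference parameter (set to -50 below); here's a hedged sketch of probing its effect, with made-up candidate values:

from sklearn.cluster import AffinityPropagation

# Lower (more negative) preferences generally yield fewer exemplars
for pref in (-50, -200, -500):
    af_probe = AffinityPropagation(preference=pref).fit(binaryMap_cfr)
    print(pref, len(af_probe.cluster_centers_indices_))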


In [30]:
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(preference=-50).fit(binaryMap_cfr)
results_df['affinity'] = af.labels_
n_clusters_affinity = len(af.cluster_centers_indices_)

print(str(n_clusters_affinity) + " affinity clusters.")

top_nodes_in_cluster(results_df.sort_values( by = 'PageRank', ascending =False), 
                     'affinity', n_clusters_affinity)


128 affinity clusters.
Out[30]:
2 : 87 Ids 6 : 34 Ids 32 : 23 Ids 38 : 11 Ids 41 : 11 Ids 111 : 42 Ids
0 Tim Urban Adam Grant Lynn Cherny John D. Cook Jeff Jordan Yves Behar
1 Charlie Rose Show Steven Strogatz Ben Hamner chris wiggins Dustin Moskovitz Ashlee Vance
2 Irene Au Tim Harford Olivier Grisel Jeremy Howard Balaji S. Srinivasan WikiLeaks
3 John Carmack Gary Marcus Fernando Perez Xavier Amatriain Horace Dediu Seth Godin
4 The Information DHH Andreas Mueller Anthony Goldbloom Jessica Livingston Tom Hulme
5 Paul Ford BJ Fogg Jake VanderPlas Kaggle David Marcus Udacity
6 briankrebs Joe Blitzstein Scientific Python John Foreman Frank Chen IDEO
7 Christopher Soghoian Brian Nosek Sebastian Raschka Edwin Chen Kleiner Perkins Scott Dadich
8 John Hagel Martin Varsavsky Trey Causey Mark Faridani Amir Efrati ReadWrite
9 Sarah Frier Russell Roberts Joshua Bloom Shane Conway Scott Kupor Peter Diamandis

3d. Principal Component Analysis

To handle the 'curse of dimensionality' problem inherent in high-dimensional data, PCA is generally used to represent the data in a smaller number of dimensions - this helps in visualizing the data as well as in speeding up the computation when running clustering algorithms.
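As a hedged sketch of checking how much signal those few dimensions retain, the cumulative explained-variance ratio can guide the choice of n_components (the 80% threshold here is an arbitrary example):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(binaryMap_cfr)    # fit all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# first k whose cumulative ratio crosses 0.80 (assumes the threshold is reached)
print(np.argmax(cumvar >= 0.80) + 1, "components retain 80% of the variance")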


In [31]:
pca = PCA(n_components=3)
Xproj = pca.fit_transform(binaryMap_cfr)

results_df['dim1'] = Xproj[:,0]
results_df['dim2'] = Xproj[:,1]
results_df['dim3'] = Xproj[:,2]

results_df = results_df.sort_values( by = 'PageRank', ascending =False)
results_df.to_csv(cluster_filename)

print("Explained-variance and Proportion of Explained-variance in 3 dimensions [dim1 dim2 dim3]")
print(pca.explained_variance_, pca.explained_variance_ratio_)


Explained-variance and Proportion of Explained-variance in 3 dimensions [dim1 dim2 dim3]
[ 7.09266174  3.16212677  1.21891396] [ 0.18363161  0.08186862  0.03155813]

A simpler visualization of the spectral-clustering outcome, rendered in 2 dimensions:


In [32]:
# Spectral clustering | Plot the ego-network in 2 dimensions

plt.figure(num=None, figsize=(20, 10), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(results_df['dim1'], results_df['dim2'], s = 100  ,c= results_df['spectral'], 
            alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar();


4. Recommend new friends to follow

Now that we have PageRank scores in place for the nodes in the social-graph, we can generate recommendations on the basis of the top-ranked nodes in the graph, e.g. get 20 recommendations after looking at the friends of the top PageRank-scoring nodes in my network.

4a. Reco: After looking at top nodes in full ego-network


In [33]:
discover_Friends_toFollow(ids_of_interest = index,
                          friend_list = index, 
                          prop = .5 , count =  20).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at 187 friends' friends:
Out[33]:
Freq Created Followers Friends Location UserName
FullName
Medium 67 May-2012 2188119 78 San Francisco, CA, US Medium
Barack Obama 67 Mar-2007 88892730 629775 Washington, DC BarackObama
Aaron Levie 52 Mar-2007 1626688 423 Palo Alto levie
Clay Shirky 51 May-2007 365919 749 Shanghai cshirky
Mitch Kapor 47 Mar-2007 116400 553 Oakland mkapor
The New York Times 45 Mar-2007 37377094 894 New York City nytimes
jack 45 Mar-2006 4055017 2791 California, USA jack
marissamayer 44 Nov-2008 1705305 350 San Francisco, CA marissamayer
Stewart Butterfield 43 Sep-2006 65193 2418 West coast stewart
Anil Dash 40 Dec-2006 613133 4684 NYC anildash
dick costolo 40 May-2007 1739649 681 San Francisco dickc
Chris Anderson 40 Nov-2008 184874 977 Berkeley chr1sa
The Economist 39 May-2007 20816240 159 London TheEconomist
President Obama 39 Jun-2013 15459612 79 Washington, D.C. POTUS44
Farhad Manjoo 🐣 39 Mar-2007 158093 3674 I'm right over here (it's not NYC) fmanjoo
John Maeda 38 Jul-2008 439273 21724 WordPressLandia johnmaeda
David Pogue 38 Oct-2007 1564476 1794 USA Pogue
Hanna Wallach 37 Sep-2012 7537 795 Brooklyn, NY hannawallach
Tim Cook 37 Jul-2013 4925005 56 Cupertino tim_cook
Hillary Clinton 37 Apr-2013 15590192 762 New York, NY HillaryClinton

Recommendations can also be made on the basis of a specific cluster outcome, e.g. nodes to follow on the basis of the top PageRank nodes in the 'Spectral Clustering::Data' clusters.

4b. Reco: After looking at Data-Science clusters (Spectral clustering)


In [34]:
favorite_cluster_df = results_df [results_df['spectral'].isin([0,2,5,8])]
favorite_cluster_list = list(favorite_cluster_df.index)

discover_Friends_toFollow(ids_of_interest = favorite_cluster_list, 
                          friend_list = index, 
                          prop = .5, count = 30).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at  39 friends' friends:
... Cache-Miss for 3 nodes. Updating Cache...
Out[34]:
Freq Created Followers Friends Location UserName
FullName
Pete Warden 21 May-2008 12142 1559 San Francisco, CA petewarden
Hanna Wallach 19 Sep-2012 7537 795 Brooklyn, NY hannawallach
Barack Obama 19 Mar-2007 88892730 629775 Washington, DC BarackObama
Alex Smola 19 Jun-2008 8768 48 Mountain View and Pittsburgh smolix
Hal Daumé III 18 Jun-2010 8702 1017 Washington DC haldaume3
Chris Diehl 17 Apr-2008 4302 999 San Francisco Bay Area, CA ChrisDiehl
Julie Steele 17 Mar-2008 4199 913 NYC and Providence jsteeleeditor
Joseph Adler 17 Apr-2008 1583 623 Silicon Valley jadler
joseph reisinger 17 Jul-2009 1933 686 現在地 josephreisinger
Daniel Tunkelang 16 Aug-2008 8234 20 Mountain View dtunkelang
KDnuggets 16 Feb-2009 84083 397 Brookline, MA, USA kdnuggets
Demis Hassabis 16 Jun-2013 69914 21 demishassabis
Ryan Adams 16 Jan-2015 10417 674 ryan_p_adams
Hugo Larochelle 16 Jun-2015 22563 350 hugo_larochelle
Oriol Vinyals 16 Oct-2015 18911 42 London, England OriolVinyalsML
Jack Clark 16 Oct-2009 18015 2506 San Francisco, CA jackclarkSF
Fernando Pereira 16 Oct-2010 8045 186 Menlo Park, CA, USA earnmyturns
Philip (flip) Kromer 15 Mar-2007 4773 1417 Austin, TX mrflip
Josh Reich 15 Dec-2008 5315 1278 Portland i2pi
Sean Gourley 15 Dec-2008 8062 1947 San Francisco, New Zealand sgourley
Doug Cutting 15 Jul-2007 22676 116 California, USA cutting
ML Hipster 15 Jul-2012 12401 0 Some local minima ML_Hipster
Panos Ipeirotis 14 Oct-2008 3988 259 New York, NY ipeirotis
Beau Cronin 14 Jun-2009 3705 291 Neck deep in AWS beaucronin
Florian Leibert 14 Jan-2009 11076 1741 San Francisco, CA flo
Mark Reid 14 Feb-2008 4625 718 Sunnyvale, CA mdreid
Claudia Perlich 14 Feb-2012 2732 105 NYC claudia_perlich
Alistair Croll 14 Mar-2007 22391 11537 Montreal, Quebec acroll
Joseph Turian 14 Dec-2007 2873 991 San Francisco turian
Neha Narkhede 14 Sep-2009 11158 678 Mountain View nehanarkhede

4c. Reco: After looking at the Design cluster (Spectral clustering)


In [35]:
discover_Friends_toFollow(ids_of_interest = list(results_df [results_df['spectral'] == 9].index), 
                          friend_list = index, 
                          prop = 1, count = 20).fillna('').set_index('FullName')


'Who-To-Follow' reco after looking at  13 friends' friends:
... Cache-Miss for 18 nodes. Updating Cache...
Out[35]:
Freq Created Followers Friends Location UserName
FullName
Nicholas Jitkoff 8 Jan-2007 4605 204 Palo Alto, CA alcor
John Maeda 8 Jul-2008 439273 21724 WordPressLandia johnmaeda
Joshua Porter 7 Mar-2007 33903 817 Newburyport, MA bokardo
Khoi Vinh 7 Dec-2006 347073 2347 New York, NY khoi
Brett Lider 7 Dec-2008 6953 947 San Francisco, California bl
Margaret Stewart 7 Mar-2007 12846 862 Bay Area, CA mags
Paul Stamatiou 📷 7 Jan-2007 33711 599 San Francisco, CA Stammy
GV Design 7 Aug-2011 17799 515 San Francisco, CA GVDesignTeam
Marc Hemeon 6 Jan-2009 16957 3322 Southern California hemeon
Michael Leggett 6 Mar-2007 7999 209 Seattle leggett
M.G. Siegler 6 Jan-2007 190837 996 San Francisco, CA mgsiegler
Jenna Bilotta 6 Apr-2009 12558 287 San Francisco, CA jenna
Noah Levin 6 Nov-2008 5295 848 New York, NY nlevin
Geoff Teehan 6 Oct-2008 22642 810 Palo Alto, CA gt
Jonathan Lee 6 Apr-2009 2771 610 hifromjonathan
Google 6 Feb-2009 18020558 204 Mountain View, CA Google
Paul Adams 6 Jan-2007 22637 387 Dublin Padday
dustin senos 6 Sep-2006 11846 721 West Coast dustin
Doug Bowman 6 Mar-2007 232398 500 San Diego stop
Katie Dill 6 Apr-2009 4751 413 San Francisco lil_dill