NTDS'17 assignment 1: Collect and Analyze a Twitter Network

Effrosyni Simou, EPFL LTS4

Objective of Exercise

The aim of this exercise is to learn how to create your own, real network using data collected from the Internet and then to discover some properties of the collected network.

Resources

You might want to have a look at the following resources before starting:

1. Collect a Twitter Network

In order to collect data from Twitter you will need to generate access tokens. To do this you will need to register a client application with Twitter. Once you are done you should have your tokens. You can now create a credentials.ini file as follows:

[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET

In this way you will have this information readily available to you.
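
As a quick sanity check of the file layout, the following minimal sketch parses the same structure from a string instead of a file. The name `SAMPLE_INI` and the placeholder values are illustrative only; with a real file you would call `read('credentials.ini')`, which silently skips a missing file and returns the list of files it actually parsed.

```python
import configparser

# Placeholder contents mirroring the credentials.ini layout shown above.
SAMPLE_INI = """
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)  # read_string parses text; read() takes filenames

# Every expected key should be present in the [twitter] section.
twitter_keys = dict(parser["twitter"])
print(sorted(twitter_keys))
```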


In [ ]:
%matplotlib inline

import random
import configparser
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint
import tweepy  # you will need to install tweepy first

In [ ]:
# Read the confidential tokens.
credentials = configparser.ConfigParser()
credentials.read('credentials.ini')

# Authentication
auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))

# Construct the API instance.
# wait_on_rate_limit makes tweepy wait when a rate limit is hit;
# wait_on_rate_limit_notify prints a notification when that happens.
# Note: the notify flag is a tweepy 3.x argument; drop it if your tweepy version rejects it.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

Now you are all set up to start collecting data from Twitter!

In this exercise we will construct a network with the following logic:

1) We will choose a user_id in Twitter to be our first node.

2) We will find some of the users who both follow user_id and are followed by user_id. From now on we will call such users "connections" of user_id. We will place these user ids in a list called first_nodes.

3) For every node in first_nodes we will then find some of the users who both follow and are followed by this node (i.e. the connections of this node). The user ids collected in this step will be placed in a list called second_nodes.

4) The collection of the ids of all nodes (aka Twitter users) that we have collected so far will be placed in a list called all_nodes.

5) Since we have only collected a subset of all possible "connections" for our nodes we have to check if there are any remaining inner connections that we have missed.

The entire network is to be organized in a dictionary whose keys are Twitter user ids (a number identifying each user in Twitter) and whose values are the lists of ids of that user's connections.
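
To make the target data structure concrete, here is a minimal sketch of steps 1 to 4 on made-up ids. The ids and the `sampled` lookup are hypothetical stand-ins for what the Twitter API would return; this is an illustration of the bookkeeping, not the expected solution.

```python
seed = 1                             # step 1: hypothetical id of the first node
network = {seed: [2, 3]}             # step 2: sampled connections of the seed
first_nodes = list(network[seed])

# Step 3: hypothetical sampled connections of each first-hop node.
sampled = {2: [1, 4], 3: [1, 5]}
known = {seed, *first_nodes}
second_nodes = []
for node in first_nodes:
    network[node] = sampled[node]
    for neighbour in sampled[node]:
        if neighbour not in known:   # skip the seed and first-hop nodes
            known.add(neighbour)
            second_nodes.append(neighbour)

# Step 4: every id collected so far, without duplicates.
all_nodes = [seed] + first_nodes + second_nodes
print(network, all_nodes)
```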

So, let us begin. The first thing you have to do is choose the node from which everything will start. I have chosen the Twitter account of Applied Machine Learning Days, which will take place in January 2018 at EPFL. You may change that if you wish, but please make sure that the user you choose has both followers and friends and allows you to access this data.


In [ ]:
user = 'appliedmldays '
user_id=api.get_user(user).id

In the following cell, write a function that takes the Twitter id of a user as an argument and returns a list with the ids of that user's connections. Take into account the case where a user does not allow you to access this information.

Reminder: By connections we mean users that are both followers and friends of a given user.
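
The core of the definition is an intersection of two id lists. The sketch below shows that logic in isolation, without calling Twitter. `AccessDenied`, `fetch_followers`, and `fetch_friends` are hypothetical names: the exception that tweepy raises for a protected account depends on the tweepy version, so substitute whatever your version raises.

```python
class AccessDenied(Exception):
    """Hypothetical stand-in for the error raised on a protected account."""


def mutual_ids(followers, friends):
    """Return the ids present in both lists, in the order of `followers`."""
    friend_set = set(friends)
    return [uid for uid in followers if uid in friend_set]


def connections_or_empty(fetch_followers, fetch_friends, user_id):
    """Return the mutual ids, or an empty list if the data is inaccessible."""
    try:
        return mutual_ids(fetch_followers(user_id), fetch_friends(user_id))
    except AccessDenied:
        return []


def protected(_user_id):
    """Simulates a user who does not allow access to this information."""
    raise AccessDenied()


print(mutual_ids([1, 2, 3, 4], [4, 2, 9]))
print(connections_or_empty(protected, protected, 7))
```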


In [ ]:
def find_connections(user_id):
    followers = []
    friends = []
    connections = []  # defined up front so the stub runs before you fill it in
    # your code here
    return connections

In [ ]:
first_connections = find_connections(user_id)
print('{} has {} connections'.format(user.strip(), len(first_connections)))

Collect your first_nodes and second_nodes and organize your collected nodes and their connections in the dictionary called network.

Hints:

  • Use random.choice([1,3,4]) to randomly choose a number in [1, 3, 4].
  • Use the append and remove methods to add and remove an element from a Python list.
  • The pop method removes and returns the last item in the list.
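
The hints above can be combined as in the following sketch, which draws a few distinct ids from a made-up list. The `candidates` values are hypothetical, and the fixed seed is there only to make the illustration repeatable.

```python
import random

random.seed(0)  # only to make this illustration repeatable

candidates = [11, 22, 33, 44, 55]  # hypothetical connection ids
how_many = 3

pool = list(candidates)  # work on a copy so the original list is untouched
sample = []
while pool and len(sample) < how_many:
    pick = random.choice(pool)  # choose one id at random
    pool.remove(pick)           # remove it so it cannot be picked twice
    sample.append(pick)

last = sample.pop()  # removes and returns the most recently appended id
print(sample, last, pool)
```

The standard library's `random.sample(candidates, how_many)` draws distinct items in one call and may be a simpler alternative when the list has at least `how_many` elements.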

In [ ]:
network = {}
first_nodes = []
second_nodes = []
how_many = 3  # This is the number of connections you are sampling.
              # Keep it small (e.g. 3) for development, larger later (e.g. 10).
# your code here

Be careful! Keep the how_many parameter small only while you are developing your code. To answer the questions, raise it to at least how_many=10. This will take a while to execute because of the API rate limit (plan your time accordingly). Also remember to submit your Jupyter notebook with the output shown for a large value of the how_many parameter.


In [ ]:
network[user_id] = first_nodes

In [ ]:
pprint(network)

In [ ]:
all_nodes=#your code here

In [ ]:
print('There are {} first hop nodes'.format(len(first_nodes)))
print('There are {} second hop nodes'.format(len(second_nodes)))
print('There are overall {} nodes in the collected network'.format(len(all_nodes)))

In [ ]:
for i in second_nodes:
    network[i]=[]

Find the inner connections between your collected nodes that you might have missed because you sampled the connections.
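
The idea can be sketched on toy data as follows. Here `full_connections` is a hypothetical lookup standing in for the complete list that a call such as find_connections would return for each node. An inner connection is any collected node that appears in that complete list but was not sampled.

```python
all_nodes = [1, 2, 3, 4]                       # hypothetical collected ids
network = {1: [2, 3], 2: [1], 3: [1], 4: []}   # sampled connections only

# Hypothetical complete connection lists for each collected node.
full_connections = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

collected = set(all_nodes)
for node in all_nodes:
    for other in full_connections[node]:
        # Keep only links between collected nodes that sampling missed.
        if other in collected and other not in network[node]:
            network[node].append(other)

print(network)
```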


In [ ]:
for i in range(0,len(all_nodes)):
    # your code here

2. Discover some of the properties of the collected network

2.1 Adjacency matrix

Congratulations! You have now created a dictionary that describes a real Twitter network! We now want to transform this dictionary into the adjacency (or weight) matrix that you learned about in your first class.
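
One possible way to do this conversion, shown on a toy dictionary, is to map each id to a row and column index first. The ids and the `index` helper below are illustrative assumptions, not part of the assignment template.

```python
import numpy as np

all_nodes = [10, 20, 30]                      # hypothetical ids
network = {10: [20, 30], 20: [10], 30: []}    # 30 does not list 10 back

# Map each user id to its row/column position in the matrix.
index = {node: i for i, node in enumerate(all_nodes)}

W = np.zeros((len(all_nodes), len(all_nodes)), dtype=int)
for node, neighbours in network.items():
    for other in neighbours:
        if other in index:                    # ignore ids outside all_nodes
            W[index[node], index[other]] = 1

print(W)
```

In this toy case the entry for (10, 30) is set but the entry for (30, 10) is not, so the resulting matrix is not symmetric.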


In [ ]:
W=np.zeros([len(all_nodes),len(all_nodes)], dtype=int)

In [ ]:
# your code here

Remember that the weight matrix of an undirected network should be symmetric. Check if it is:


In [ ]:
np.nonzero(W-W.transpose())

Question 1: It might happen that $W_{ij} \neq W_{ji}$ for some $(i,j)$. Explain why this might be the case.

Your answer here:

Make your weight matrix symmetric.


In [ ]:
# Make W symmetric
bigger = W.transpose() > W

In [ ]:
W = W - W*bigger + W.transpose()*bigger
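
To see what the two cells above do, the following sketch applies the same expression to a small asymmetric matrix (made-up values). Wherever the transposed entry is larger, it replaces the original one, so the result is the element-wise maximum of W and its transpose.

```python
import numpy as np

# Toy asymmetric matrix: entry (0, 2) is set but (2, 0) is not.
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 0]])

bigger = W.transpose() > W                        # True where W[j, i] > W[i, j]
W_sym = W - W * bigger + W.transpose() * bigger   # take the larger entry there

print(W_sym)
print(np.array_equal(W_sym, np.maximum(W, W.transpose())))
```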

Plot the weight matrix of your collected network.

Hint: use plt.spy() to visualize a matrix.


In [ ]:
# your code here
plt.title('Adjacency Matrix W')

Question 2: What is the maximum number of links $L_{max}$ in a network with $N$ nodes (where $N$ is the number of nodes in your collected network)? How many links $L$ are there in your collected network? Comment on how $L$ and $L_{max}$ compare.

Your answer here:

2.2 Degree distribution

Plot a histogram of the degree distribution.
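
For a symmetric 0/1 weight matrix, the degree of a node is the sum of its row. The sketch below illustrates this on a small made-up matrix and tabulates how many nodes have each degree, which is the information a histogram displays.

```python
import numpy as np

# Toy symmetric adjacency matrix with 4 nodes and 4 links.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

degrees = W.sum(axis=1)  # row sums give one degree per node

# Count how many nodes share each degree value.
values, counts = np.unique(degrees, return_counts=True)
print(degrees, values, counts)
```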


In [ ]:
p = # your code here

In [ ]:
plt.hist(p);

Question 3: Comment on the plot. What do you observe? Would you expect a similar degree distribution in the complete Twitter network?

Your answer here:

2.3 Average degree

Calculate the average degree of your collected network.
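
In an undirected network every link contributes to the degree of two nodes, so the average degree equals $2L/N$. The following sketch checks this identity on a toy symmetric matrix; the variable names are illustrative.

```python
import numpy as np

# Toy symmetric adjacency matrix with 4 nodes.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

num_nodes = W.shape[0]
num_links = int(W.sum()) // 2          # each link appears twice in a symmetric W
mean_degree = W.sum(axis=1).mean()     # average of the row sums

print(num_links, mean_degree, 2 * num_links / num_nodes)
```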


In [ ]:
d_avg = # your code here

2.4 Diameter of the collected network

Question 4: What is the diameter of the collected network? Please justify.

Your answer here:

2.5 Pruning the collected network

You might notice that some nodes have very few connections and hence our matrix is very sparse. Prune the collected network so that you keep only the nodes that have a degree that is greater than the average degree and plot the new adjacency matrix.
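
One way to prune, sketched here on a toy matrix, is to select the rows and columns of the nodes to keep with `np.ix_`. The matrix values and the names `keep` and `W_small` are illustrative assumptions. Note that degrees within the pruned matrix are generally lower than in the original one, because links to removed nodes disappear.

```python
import numpy as np

# Toy symmetric adjacency matrix with 5 nodes and 5 links.
W = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0]])

degrees = W.sum(axis=1)
keep = np.flatnonzero(degrees > degrees.mean())  # indices of well-connected nodes

# np.ix_ builds an index that selects the same rows and columns.
W_small = W[np.ix_(keep, keep)]
print(keep, W_small)
```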


In [ ]:
Wpruned = # your code here

In [ ]:
plt.spy(Wpruned, markersize=1)
plt.title('Pruned Adjacency Matrix W');