To collect data from Twitter you will need to generate access tokens. To do so, register a client application with Twitter. Once the application is registered you will have your tokens, and you can create a credentials.ini
file as follows:
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
This keeps the credentials readily available to the notebook without hard-coding them in your code.
In [ ]:
%matplotlib inline
import random
import configparser
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint
import tweepy # you will need to install tweepy first
In [ ]:
# Read the confidential token.
credentials = configparser.ConfigParser()
credentials.read('credentials.ini')
# Authentication
auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))
# Construct the API instance; wait when rate limited and print a notification.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
Now you are all set up to start collecting data from Twitter!
In this exercise we will construct a network with the following logic:
1) We will choose a user_id in Twitter to be our first node.
2) We will find (some of) the users who are both following user_id and being followed by user_id. From now on we will call such users "connections" of user_id. We will place these user ids in a list called first_nodes.
3) For every node in the list first_nodes we will then find (some of) the users who are following and being followed by this node (aka the connections of this node). The user ids collected in this step will be placed in a list called second_nodes.
4) The ids of all nodes (aka Twitter users) that we have collected so far will be placed in a list called all_nodes.
5) Since we have only collected a subset of all possible "connections" for our nodes, we have to check whether there are any remaining inner connections that we have missed.
The entire network will be organized in a dictionary whose keys are the Twitter ids of the users (the number identifying each user on Twitter) and whose values are the lists of ids of their connections, as in the toy example below.
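For illustration only, here is what the target structure might look like (all ids below are made up):
In [ ]:
# Hypothetical example of the final data structure: each key is a user id,
# each value is the list of ids of that user's connections.
toy_network = {
    1111: [2222, 3333],
    2222: [1111],
    3333: [1111, 2222],
}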
So, let us begin. The first thing you have to do is choose the node from which everything will start. I have chosen the Twitter account of the Applied Machine Learning Days, which will take place in January 2018 at EPFL. You may change it if you wish, but please make sure that the user you choose has both followers and friends and allows you to access this data.
In [ ]:
user = 'appliedmldays'
user_id = api.get_user(user).id
In the following cell, write a function that takes the Twitter id of a user as an argument and returns a list with the ids of their connections. Take into account the case where a user does not allow you to access this information.
Reminder: By connections we mean users that are both followers and friends of a given user.
In [ ]:
def find_connections(user_id):
    followers = []
    friends = []
    # your code here
    return connections
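If you get stuck, here is one possible sketch, assuming the tweepy 3.x methods followers_ids and friends_ids (each returns up to 5000 ids per call, which is plenty for this exercise) and the api object constructed above:
In [ ]:
# One possible sketch (not the only valid solution): a connection is a user
# that appears both in the followers and in the friends of user_id.
def find_connections(user_id):
    followers = []
    friends = []
    try:
        followers = api.followers_ids(user_id)  # users following user_id
        friends = api.friends_ids(user_id)      # users that user_id follows
    except tweepy.TweepError:
        # Protected or suspended accounts do not expose their social graph.
        pass
    connections = list(set(followers) & set(friends))
    return connections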
In [ ]:
first_connections = find_connections(user_id)
print('{} has {} connections'.format(user, len(first_connections)))
Collect your first_nodes and second_nodes and organize the collected nodes and their connections in the dictionary called network.
Hints:
- Use random.choice([1, 3, 4]) to randomly choose a number in [1, 3, 4].
- Use the .append and .remove methods to add an element to and remove an element from a Python list.
- The .pop method removes (and returns) the last item of a list.
In [ ]:
network = {}
first_nodes = []
second_nodes = []
how_many = 3  # This is the number of connections you are sampling.
# Keep it small (e.g. 3) for development, larger later (e.g. 10).
# your code here
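One hedged way to fill in the cell above, using a small helper (sample_connections, introduced here purely for illustration) that samples without replacement via random.choice and remove, as the hints suggest:
In [ ]:
# Possible sketch: sample up to how_many connections per node, without replacement.
def sample_connections(connections, how_many):
    pool = list(connections)  # copy, so the caller's list is not mutated
    sample = []
    for _ in range(min(how_many, len(pool))):
        chosen = random.choice(pool)
        pool.remove(chosen)
        sample.append(chosen)
    return sample

first_nodes = sample_connections(first_connections, how_many)
for node in first_nodes:
    conns = sample_connections(find_connections(node), how_many)
    network[node] = conns
    for c in conns:
        if c != user_id and c not in first_nodes and c not in second_nodes:
            second_nodes.append(c)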
Be careful! Keep a small value for the how_many
parameter while you are developing your code. To answer the questions, raise the value of this parameter to at least how_many=10
. This will take a while to execute because of the API rate limit (plan your time accordingly). Remember to submit your Jupyter notebook with the output shown for a large value of the how_many
parameter.
In [ ]:
network[user_id] = first_nodes
In [ ]:
pprint(network)
In [ ]:
all_nodes = # your code here
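One possibility, deduplicating in case the same user was collected in both hops:
In [ ]:
# Possible sketch: gather every collected id exactly once.
all_nodes = list(set([user_id] + first_nodes + second_nodes))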
In [ ]:
print('There are {} first hop nodes'.format(len(first_nodes)))
print('There are {} second hop nodes'.format(len(second_nodes)))
print('There are overall {} nodes in the collected network'.format(len(all_nodes)))
In [ ]:
for i in second_nodes:
    network[i] = []
Find the inner connections between your collected nodes that you might have missed because you sampled the connections.
In [ ]:
for i in range(0, len(all_nodes)):
    # your code here
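One possible sketch: fetch the full connection list of every collected node once more and record the links that point back into the collected set. Note that this costs one extra API call per node:
In [ ]:
# Possible sketch: recover links among collected nodes that sampling missed.
for i in range(0, len(all_nodes)):
    node = all_nodes[i]
    conns = find_connections(node)  # full (unsampled) connection list
    for c in conns:
        if c in all_nodes and c not in network[node]:
            network[node].append(c)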
Congratulations! You have now created a dictionary that describes a real Twitter network! We now want to transform this dictionary into the adjacency (or weight) matrix that you learned about in your first class.
In [ ]:
W = np.zeros([len(all_nodes), len(all_nodes)], dtype=int)
In [ ]:
# your code here
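One hedged way to fill W from the network dictionary (its symmetry is checked right below):
In [ ]:
# Possible sketch: W[i, j] = 1 when node j appears in the connections of node i.
for i, node in enumerate(all_nodes):
    for conn in network[node]:
        if conn in all_nodes:  # ignore any id we did not collect
            j = all_nodes.index(conn)
            W[i, j] = 1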
Remember that a weight matrix should be symmetric. Check if it is:
In [ ]:
np.nonzero(W-W.transpose())
Question 1: It might happen that $W_{ij} \neq W_{ji}$ for some pairs $(i,j)$, i.e. $W \neq W^{T}$. Explain why this might be the case.
Your answer here:
Impose your weight matrix to be symmetric.
In [ ]:
# Make W symmetric: for each pair, keep the larger of W[i, j] and W[j, i].
bigger = W.transpose() > W
In [ ]:
W = W - W*bigger + W.transpose()*bigger
Plot the weight matrix of your collected network.
Hint: use plt.spy() to visualize a matrix.
In [ ]:
# your code here
plt.title('Adjacency Matrix W')
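A minimal possibility, mirroring the pruned plot at the end of the notebook:
In [ ]:
# Possible sketch: mark the non-zero entries of W.
plt.spy(W, markersize=1)
plt.title('Adjacency Matrix W');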
Question 2: What is the maximum number of links $L_{max}$ in a network with $N$ nodes (where $N$ is the number of nodes in your collected network)? How many links $L$ are there in your collected network? Comment on how $L$ and $L_{max}$ compare.
Your answer here:
Plot a histogram of the degree distribution.
In [ ]:
p = # your code here
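One standard choice (a sketch, assuming W is the symmetric 0/1 matrix built above): the degree of node $i$ is the sum of row $i$ of W.
In [ ]:
# Possible sketch: the degree vector as row sums of the 0/1 matrix W.
p = W.sum(axis=1)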
In [ ]:
plt.hist(p);
Question 3: Comment on the plot. What do you observe? Would you expect a similar degree distribution in the complete Twitter network?
Your answer here:
Calculate the average degree of your collected network.
In [ ]:
d_avg = # your code here
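For instance, reusing the degree vector p from above:
In [ ]:
# Possible sketch: average degree = mean of the degree vector.
d_avg = p.mean()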
Question 4: What is the diameter of the collected network? Please justify.
Your answer here:
You might notice that some nodes have very few connections and hence our matrix is very sparse. Prune the collected network so that you keep only the nodes that have a degree that is greater than the average degree and plot the new adjacency matrix.
In [ ]:
Wpruned = # your code here
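One hedged possibility, reusing p and d_avg from above with boolean indexing:
In [ ]:
# Possible sketch: keep only rows/columns of nodes with above-average degree.
keep = p > d_avg
Wpruned = W[keep][:, keep]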
In [ ]:
plt.spy(Wpruned, markersize=1)
plt.title('Pruned Adjacency Matrix');