Facebook has been around for the past 10 years. Since then, it has grown to more than 1.3 billion active users; that is, 1 out of 7 people in the world has an active Facebook account. Although Facebook's demographics are getting older, keep in mind that other social networks popular among young people, such as WhatsApp and Instagram, also belong to Facebook Inc. But the importance of Facebook is not trivial, as it helped establish and expand the general practice of social network analysis.
A 2014 study of "Emotional Contagion Through Social Networks" manipulated the balance of positive and negative messages seen by 689,000 Facebook users. The researchers concluded that they had found "some of the first experimental evidence to support the controversial claims that emotions can spread throughout a network, [though] the effect sizes from the manipulations are small."
In this assignment you will explore a social network in Facebook to determine who the key players are and how they act.
The goal will be to analyze a real social network.
Your mission will be to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I recommend using a temporary notebook to work with the dataset and, once the code is ready and producing the expected output, copying and pasting it into this notebook. Once it is done and validated, copy the file somewhere safe, as the notebook will be the only file to be sent for evaluation. Of course, everything starts by downloading this notebook.
Deadline: April $7^{th}$.
Until very recently, any Facebook user could download his or her own Facebook ego-network. In January 2015, following its new platform policies, Facebook suspended the application that allowed it, making it really hard for users to extract their own data. For that reason, instead of using your very own social network for this assignment, we will be using an already provided and anonymized network. Not even half the fun, but still interesting to study. The data is available as an edge-list file in facebook_combined.txt.
In [1]:
# Nothing to change in this cell
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import requests
# Set some options
plt.rcParams['figure.figsize'] = (12.0, 6.0)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
In [2]:
facebook = nx...
facebook..., facebook...
Out[2]:
In [3]:
degrees = nx...
sorted(degrees.items(), key=lambda x: x[1], reverse=True)[:5]
Out[3]:
In [4]:
len(list(nx...))
Out[4]:
In [5]:
nx...
Out[5]:
In [6]:
nx...
Out[6]:
By looking at the numbers, it might seem like this Facebook graph is following a small-world model. If that is the case, there should be tightly knit cliques connected by weak ties. Let's see then the global clustering coefficient (transitivity).
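To make the metric concrete, here is a toy sketch (on a hypothetical four-node adjacency dict, not the assignment graph) of what transitivity measures: the fraction of connected triples that are closed into triangles.

```python
from itertools import combinations

# Hypothetical undirected graph as an adjacency dict
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

closed = 0
triples = 0
for node, neighbors in adj.items():
    for u, v in combinations(sorted(neighbors), 2):
        triples += 1        # a connected triple centered on `node`
        if v in adj[u]:
            closed += 1     # the triple is closed (part of a triangle)

# Each unique triangle appears as a closed triple for each of its 3 centers,
# so closed == 3 * (number of triangles) and transitivity is simply:
transitivity = closed / triples
print(transitivity)  # 0.6
```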
In [7]:
nx...
Out[7]:
In order to calculate the communities, we will use the Louvain community detection library. The main and only file you need is community.py, so download it and put it where this notebook can read it. The next cell will look for the file in the current environment; if that doesn't work, it will try to fetch the file automatically.
In [8]:
# Nothing to change in this cell
try:
    import community
except ImportError:
    import imp
    import sys
    community = sys.modules["community"] = imp.new_module("community")
    community.__file__ = "https://bitbucket.org/taynaud/python-louvain/raw/default/community/__init__.py"
    exec(requests.get(community.__file__).text, community.__dict__)
community
Out[8]:
Once the library is loaded, let's calculate the number of communities. The way it works is by calculating the best partition for the graph; it returns a dictionary with nodes as keys and the community number as values. Given that information, let's extract the biggest and the smallest communities. First, let's create a dictionary with the number of nodes per community, and print it sorted by the number of nodes.
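As a sketch of the counting step, here is the same idea on a tiny hypothetical partition dict (node id → community id), not the real output:

```python
from collections import Counter

# Hypothetical partition: node id -> community id
partition = {"0": 0, "1": 0, "2": 1, "3": 1, "4": 1, "5": 2}

# Number of nodes per community
communities = dict(Counter(partition.values()))
print(communities)  # {0: 2, 1: 3, 2: 1}

# Sorted by size, descending
for community_id, nodes in sorted(communities.items(), key=lambda kv: kv[1], reverse=True):
    print("Community {:2.0f}: {:3.0f} nodes".format(community_id, nodes))
```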
In [9]:
partition = community.best_partition(facebook)
communities = {}
for i in set(partition.values()):
    communities[i] = ...
communities
Out[9]:
In [10]:
sorted_communities = ...
for community_id, nodes in sorted_communities:
    print("Community {:2.0f}: {:3.0f} nodes".format(community_id, nodes))
Once we have the communities, let's print the members of the biggest and the smallest. Since two communities are tied for the smallest number of nodes, we will print both of them.
In [11]:
# We need to use partition again!
# This list will contain 3 numbers, although depending on the machine,
# they might be different from the ones shown below
for community_id in [...]:
    nodes = ...
    print("Community {}:".format(community_id), nodes)
Let's now see what the smallest communities look like.
In [12]:
small_community_1_nodes = ...
subgraph = facebook...
nx...
In [13]:
small_community_2_nodes = ...
subgraph = facebook...
nx...
And for the list of nodes in the biggest community, let's sort the nodes by degree centrality, clustering coefficient, and betweenness centrality. Show only the first 10.
In [14]:
big_community_nodes = ...
In [15]:
degree = nx...
sorted_degree = ...
for node_id, value in sorted_degree[:10]:
    print("{}: {}".format(node_id, value))
In [16]:
clustering = nx...(..., nodes=big_community_nodes)
sorted_clustering = ...
for node_id, value in sorted_clustering[:10]:
    print("{}: {}".format(node_id, value))
In [17]:
centrality = nx...(..., k=len(big_community_nodes))
sorted_centrality = ...
for node_id, value in sorted_centrality[:10]:
    print("{}: {}".format(node_id, value))
For the nodes with the highest degree let's plot their ego-network of radius 2.
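In case it helps to picture what an ego-network of radius 2 contains, here is a minimal pure-Python sketch (on a hypothetical adjacency dict, not the assignment graph) that collects every node within two hops of the center:

```python
# Hypothetical adjacency dict; "a" is the ego node
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

center, radius = "a", 2
frontier, seen = {center}, {center}
for _ in range(radius):
    if not frontier:
        break  # nothing left to expand
    # one BFS step: neighbors of the frontier not seen yet
    frontier = set.union(*(adj[n] for n in frontier)) - seen
    seen |= frontier

print(seen)  # nodes within two hops of "a"
```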
In [18]:
nx...
For the nodes with the highest clustering coefficient (any of them, if there is more than one) let's plot their ego-network of radius 3.
In [19]:
nx...
For the node with the highest betweenness centrality, with radius 1.
In [20]:
nx...
Let's also plot the degree distribution of the whole graph.
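The degree distribution itself is just a histogram: for each degree d, the number of nodes with that degree. A toy sketch on a hypothetical degree sequence (for a log-log plot one would keep only the degrees with a nonzero count):

```python
# Hypothetical degree sequence: degree of each node in a tiny graph
degrees = [1, 1, 1, 2, 2, 3]

# hist[d] = number of nodes with degree d
hist = [degrees.count(d) for d in range(max(degrees) + 1)]
print(hist)  # [0, 3, 2, 1]

# Points suitable for plotting on log-log axes (zero counts dropped)
points = [(d, n) for d, n in enumerate(hist) if n > 0]
print(points)  # [(1, 3), (2, 2), (3, 1)]
```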
In [21]:
degree_hist = plt...
plt.loglog(..., ...)
plt...
Out[21]:
Facebook suggests people you may be (or should be) friends with. Netflix suggests movies you might like. Amazon suggests products to buy. How do they do that? Although the actual algorithms used by these companies are closely guarded trade secrets, we will now try one simple way to make such suggestions, called "collaborative filtering".
A computer system that makes suggestions is called a recommender system. As background, there are two general approaches: collaborative filtering and content-based filtering.
Let's implement one of those mechanisms for recommending a new friend in a social network. A simple way to state this question is: "For user X, who is the best person to recommend as a friend?" Which is the same as: "For user X, list some non-friends in order, starting with the best friend recommendation and ending with the worst." A non-friend is a user who is not X and is not a friend of X. Depending on the recommendation algorithm, the list may include all non-friends or only some of them. You will do this by assigning each potential friend a number called a score, where higher scores indicate a better match. Then you can sort the list according to the score.
If non-friend Y is your friend's friend, then maybe Y should be your friend too. If person Y is the friend of many of your friends, then Y is an even better recommendation. The best friend recommendation is the person with whom you have the largest number of mutual friends. You will implement this heuristic.
Let's say we are interested in giving "A" recommendations of people that they might want to be friends with. By this algorithm, the number of friends you have in common with someone is a measure of how likely it is that they would be a good friend recommendation for you. Thus the more friends someone has in common with you, the better their "friendship score" is. We will actually use the number of friends in common as the friendship score itself. Thus we want to find out who has the most friends in common with "A". Suppose, for example, that A is friends with B and C, that D has two friends in common with A, that F has one, and that E has none. Then D is the best recommendation for A, and F is the second best recommendation. We would not recommend B or C because A is already friends with them, and we would not recommend E because A has no friends in common with E.
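One hypothetical adjacency dict consistent with this example, with the scoring loop written out by hand:

```python
# Hypothetical friendship graph: A is friends with B and C;
# D shares two friends with A, F shares one, E shares none.
friends = {
    "A": {"B", "C"},
    "B": {"A", "C", "D", "F"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
    "E": {"F"},
    "F": {"B", "E"},
}

user = "A"
scores = {}
for friend in friends[user]:          # each of A's friends...
    for fof in friends[friend]:       # ...and each of their friends
        if fof != user and fof not in friends[user]:
            scores[fof] = scores.get(fof, 0) + 1  # one more common friend

print(scores)  # D has two common friends with A, F has one, E has none
```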
Now let's think about how we might create this list. Obviously we will need to calculate these "friendship scores" for some set of people in the graph. By this "number of common friends" metric, for a given person, we really only care about calculating such scores for people that are "friends of our current friends" who are not yet our friends. (There could be many people in a large graph, so we do not want to simply calculate friendship scores for every person in the graph as many of those scores are likely to be zero.) So it would be useful to be able to calculate the set of "friends of friends" for a given user.
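The "friends of friends" set can be written as plain set algebra; a minimal sketch on a hypothetical adjacency dict (the function you will complete below asks for the same computation on the real graph):

```python
# Hypothetical adjacency dict
friends = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}

user = "A"
# Everyone reachable through a friend, possibly with overlap...
reachable = set.union(*(friends[f] for f in friends[user]))
# ...minus the user's own friends and the user themselves
fof = reachable - friends[user] - {user}
print(fof)  # {'D'}
```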
We will use the node with id "0" in the facebook graph as our user.
In [22]:
def friends_of_friends(graph, user):
    """Returns a set of friends of friends of the given user, in the given graph.

    The result does not include the given user nor any of that user's friends.
    """
    ego_nodes_r2 = ...
    ego_nodes_r1 = ...
    return ... - ... - ...

friends_of_friends(facebook, "0")
Out[22]:
For each of those friends-of-friends we will want to be able to calculate the set of friends that they have in common with the user.
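On a hypothetical adjacency dict, this step is just a set intersection:

```python
# Hypothetical friend sets for two users
friends = {
    "A": {"B", "C", "D"},
    "E": {"C", "D", "F"},
}

# Friends that A and E have in common
common = friends["A"] & friends["E"]
print(common)  # {'C', 'D'}
```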
In [23]:
def common_friends(graph, user1, user2):
    """Returns the set of friends that user1 and user2 have in common."""
    user1_friends = ...
    user2_friends = ...
    return ...

common_friends(facebook, "0", "100")
Out[23]:
If we want to return to the user a ranked list of recommendations from best recommendation to worst, then it would be useful to have a data structure like a dictionary to keep track of the mapping of "friend of friend" to friendship score.
In [24]:
def number_of_common_friends(graph, user):
    """Returns a dictionary from each user U to the number of friends U has
    in common with the given user.

    The map keys are the users who have at least one friend in common with
    the given user, and are neither the given user nor one of the given
    user's friends.

    Take a graph G for example:
    - A and B have two friends in common
    - A and C have one friend in common
    - A and D have one friend in common
    - A and E have no friends in common
    - A is friends with D

    number_of_common_friends(G, "A") => {'B': 2, 'C': 1}
    """
    common_friends_count = {}
    for fof in ...:
        ...
    return common_friends_count

number_of_common_friends(facebook, "0")
Out[24]:
Furthermore, given this mapping of people to friendship scores, we will want to sort the potential friends from best to worst before presenting it to the user.
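One common idiom for this kind of two-level sort (value descending, key ascending for ties) is a compound sort key; a toy sketch on a hypothetical score dict:

```python
# Hypothetical friendship scores
scores = {"A": 1, "B": 3, "C": 1}

# Sort keys by score (greatest first); break ties by the key itself (least first)
ranked = sorted(scores, key=lambda k: (-scores[k], k))
print(ranked)  # ['B', 'A', 'C']
```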
In [25]:
def number_dict_to_sorted_list(dic):
    """Given a dict whose values are numbers, return a list of the keys.

    The keys are sorted by the number they map to, from greatest to least.
    When two keys map to the same number, the keys are sorted by their
    natural sort order, from least to greatest.
    """
    return ...

number_dict_to_sorted_list({'A': 1, 'B': 3, 'C': 1})
Out[25]:
Finally, the function that provides the recommendations for a user.
In [26]:
def recommend_by_number_of_common_friends(graph, user):
    """Return a list of friend recommendations for the given user.

    The friend recommendation list consists of names of people in the graph
    who are not yet friends of the given user.
    The order of the list is determined by the number of common friends.
    """
    return ...

recommend_by_number_of_common_friends(facebook, "0")[:10]
Out[26]: