Over the past year or so I have been working with some friends to create a 501(c)(3) non-profit organization called Anidata. Anidata has a two-pronged mission: 1) we provide mentorship opportunities where aspiring data scientists can get hands-on experience with real-world data problems in order to improve their skills, and 2) we provide data science services to social-good organizations that might not otherwise have access to data science capabilities.
{:style="text-align:center" }
In our first project as an organization, we supported the Fulton County District Attorney's office in its work to investigate and prosecute human trafficking crimes. In consultation with the DA's office, we learned that many of the investigative steps require manually searching adult services websites to build a case.
After gathering this initial information, we held a hackathon with Anidata members and law enforcement/DA staff to converge on a concept for a data product that would help them fight human trafficking.
{:style="text-align:center" }
The hackathon was a big success, with over 30 attendees. Teams presented project ideas in varying states of code completion, and the common theme that emerged for a useful product was a searchable database of internet entities that post on adult ad websites.
The basic idea is to let investigators take one piece of evidence they might have on an individual under investigation--such as a phone number--and search the database for all related phone numbers and email addresses. This helps the DA more efficiently link burner phones to people and posts so that they can build a case.
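To make the search concept concrete, here is a minimal sketch of the lookup. The entity table, identifiers, and `find_related` function are all made up for illustration; they stand in for the real database described later in the post.

```python
# Hypothetical entity table: each resolved entity maps to the set of
# identifiers (phones, emails) seen across its posts.
entity_identifiers = {
    'entity_1': {'555-0101', '555-0102', 'burner@example.com'},
    'entity_2': {'555-0199'},
}

def find_related(identifier):
    """Return every identifier grouped with the given one, if any."""
    for ids in entity_identifiers.values():
        if identifier in ids:
            return ids - {identifier}
    return set()

# Starting from one burner phone, an investigator recovers the rest
print(find_related('555-0101'))  # {'555-0102', 'burner@example.com'}
```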
At the end of the hackathon we had many elements of what would eventually be this search product in place including: 1) a webscraper, 2) a luigi job framework for running daily jobs and dumping the output to a postgresql database, 3) cleaning algorithms to find and clean emails and phone numbers, and 4) an entity resolution algorithm to group postings by entity.
The core algorithmic challenge in this project was to create an ER algorithm that would group posts by posting entity and allow us to extract a set of emails and phone numbers from those posts to associate with that entity. Let's start by taking a look at the data.
I've flattened several months of the database and pruned it down to just the columns we care about. In the following dataframe, each row is an ad posting along with the metadata we were able to scrape from that posting. The sample data comes from three flat files that pair a post_id
with an email address, phone number, or user ID. We have the following columns:
I have the data for this post stored in GitHub. Let's take a look at the head of the dataframe:
In [ ]:
import pandas as pd
df = pd.read_csv(
    'https://github.com/gte620v/graph_entity_resolution/raw/master/data/scraped_data.csv.gz',
    converters={'name': lambda x: str(x).lower(),
                'number': str,
                'oid': str,
                'post_id': str},
    parse_dates=['postdate'],
    compression='gzip')
df.head()
In [ ]:
df[['post_id','name','number','oid','region']].describe(include = 'all')
In [ ]:
df[['posterage']].dropna().describe()
Seaborn has some cool canned functions for examining data distributions by category. I am particularly fond of sns.swarmplot
, which makes plots like the one below. The plot shows the distribution of the first 1000 rows by region and poster age. Strangely, there are a few posters with ages around 100 years old.
In [ ]:
import seaborn as sns
%matplotlib inline
from pylab import rcParams
rcParams['figure.figsize'] = (7.0, 7.0)
sns.swarmplot(y='region',x='posterage',data=df.head(1000),size=2)
In [ ]:
df[['postdate']].describe(include = 'all')
After wrestling with the data a bit, we realized that we can conceptualize the data as a graph and then cluster posts into entities by finding the connected subgraphs. To do this, our graph has the following semantics: each post is a vertex, and two posts are connected by an edge when they share a piece of poster contact information (a phone number, email address, or user ID).
Our approach is to create subgraphs using each type of edge information and then use subgraph clustering to find the entity subgraphs.
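As a toy sketch of the approach (the post IDs here are made up, not from the dataset), posts become nodes, shared contact information becomes typed edges, and the connected components of the combined graph are the entities:

```python
import networkx as nx

# Toy graph: an edge links two posts that share contact information
G = nx.Graph()
G.add_edge('p1', 'p2', type='phone')   # p1 and p2 share a phone number
G.add_edge('p2', 'p3', type='email')   # p2 and p3 share an email address
G.add_edge('p4', 'p5', type='phone')   # a separate poster entirely

# Each connected component is one entity
entities = sorted(sorted(c) for c in nx.connected_components(G))
print(entities)  # [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

Note that p1 and p3 end up in the same entity even though they share nothing directly; the chain through p2 links them.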
The clustered subgraphs will be entities and each entity will have posted some number of ads. The block below analyzes the count of the most posted phone numbers. These are phone numbers that the poster put in the ad as contact information. The output shows that some phone numbers are associated with over a thousand posts.
In [ ]:
df\
.groupby('number')\
.count()\
.sort_values('post_id',ascending=False)[['post_id']]\
.head()
One challenge is to efficiently create the subgraphs. Our first approach was to make a fully connected graph out of each data subset, but this has the shortcoming of creating many graph edges that, as we show later, are not necessary. With a fully connected graph of \(N\) nodes, we have \(N \cdot (N-1)/2\) edges, which is a lot of graph data to store and reason about.
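For a sense of the savings, networkx can count edges for both shapes. For a group of 11 posts sharing one phone number, a complete graph stores 55 edges while a simple ring (a connected shape we use later in the post) stores only 11, and both keep the posts in a single connected component:

```python
import networkx as nx

complete = nx.complete_graph(11)  # every pair of posts connected
ring = nx.cycle_graph(11)         # each post connected only to its neighbors

print(complete.number_of_edges())  # 55
print(ring.number_of_edges())      # 11
print(nx.is_connected(complete), nx.is_connected(ring))  # True True
```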
Here is an example of a phone number that is seen on 11 posts:
In [ ]:
ph_sample = df[df.number=='7865032020']
ph_sample.sort_values('name',ascending=False).head()
To create the subgraph, we iterate through all pairwise combinations of posts and populate the networkx graph. You'll notice in the function below that we also add an attribute to each edge called d3color
.
I made a plotting library called pyd3netviz that uses d3 to visualize network graphs in Python (see my previous post about it). pyd3netviz can take networkx node and edge attributes as directives for styling the plots. In this example, our function takes in an RGB tuple called color
and uses that to specify the edge color.
In [ ]:
from pyd3netviz import ForceChart
import itertools
import networkx as nx
def make_graph_data(in_data, data_type, color, G=None):
    '''Build a fully connected subgraph from an iterable of post IDs.'''
    if G is None:
        G = nx.Graph()
    out = []
    # combinations avoids self-loops and duplicate (a, b)/(b, a) pairs
    for a, b in itertools.combinations(in_data, 2):
        out.append((a, b, {'type': data_type, 'color': color,
                           'd3color': '#%02x%02x%02x' % tuple(int(c * 255) for c in color)}))
    G.add_edges_from(out)
    return G
In [ ]:
G_samp = make_graph_data(ph_sample.post_id, 'phone', [0, 0, 1])

width = 590
height = 400
plot_kargs = {'charge': -10, 'link_distance': 200, 'width': width, 'height': height,
              'default_node_radius': 10, 'default_link_width': 1,
              'link_color_field': 'd3color'}

fc = ForceChart(G_samp, **plot_kargs)
fc.to_notebook()
In [ ]:
em_sample = df[df['name'].str.contains('tuc', case=False)]
em_sample
In [ ]:
G_samp_em = make_graph_data(em_sample.post_id, 'email', [1,0,0])
fc =ForceChart(G_samp_em,**plot_kargs)
fc.to_notebook()
If we combine all the posts from the previous two examples and plot them, we notice that they are part of a common ER subgraph that contains the email address tucenicienta360@gmail.com
and the phone number 7865032020
. We'll see in a minute that there are other phone numbers connected to this entity as well. In the plot below the phone number edges are blue and the email address edges are red. You may have to drag around some of the vertices in order to see the red edges.
In [ ]:
out = []
for a, b in itertools.combinations(em_sample.post_id, 2):
    out.append((a, b, {'type': 'email', 'color': 'r', 'd3color': '#f00'}))
G_samp.add_edges_from(out)

fc = ForceChart(G_samp, **plot_kargs)
fc.to_notebook()
The process above works, but a fully connected set of graphs takes a lot of memory. For our problem, each network of posts only needs to be connected--not fully connected. To do that, we can create subgraphs that are loosely connected loops instead of densely connected balls of edges: a ring over \(N\) posts needs just \(N\) edges.
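As a quick sanity check on this simplification (using toy post IDs, not rows from the dataset), a ring yields exactly the same connected components as the fully connected version, with far fewer edges:

```python
import itertools
import networkx as nx

posts = ['a', 'b', 'c', 'd']

# Fully connected: every pair of posts gets an edge
G_full = nx.Graph()
G_full.add_edges_from(itertools.combinations(posts, 2))

# Ring: each post connects to the next, wrapping around at the end
G_ring = nx.Graph()
G_ring.add_edges_from(zip(posts, posts[1:] + posts[:1]))

full_comps = sorted(sorted(c) for c in nx.connected_components(G_full))
ring_comps = sorted(sorted(c) for c in nx.connected_components(G_ring))
print(full_comps == ring_comps)                             # True
print(G_full.number_of_edges(), G_ring.number_of_edges())   # 6 4
```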
In [ ]:
G_samp_loop = nx.Graph()

# Connect the posts in a ring: pair each post with the next one,
# wrapping the last post back around to the first
v = ph_sample.post_id.values.tolist()
v_right = v[1:] + v[:1]
out = [(a, b, {'type': 'phone', 'color': 'b', 'd3color': '#00f'})
       for a, b in zip(v, v_right)]
G_samp_loop.add_edges_from(out)

colors = [G_samp_loop[u][v]['color'] for u, v in G_samp_loop.edges()]
plot_kargs['charge'] = -150
plot_kargs['link_distance'] = 10

fc = ForceChart(G_samp_loop, **plot_kargs)
fc.to_notebook()
In [ ]:
# Same ring construction for the email edges
v = em_sample.post_id.values.tolist()
v_right = v[1:] + v[:1]
out += [(a, b, {'type': 'email', 'color': 'r', 'd3color': '#f00'})
        for a, b in zip(v, v_right)]
G_samp_loop.add_edges_from(out)

colors = [G_samp_loop[u][v]['color'] for u, v in G_samp_loop.edges()]
fc = ForceChart(G_samp_loop, **plot_kargs)
fc.to_notebook()
When viewed this way, a set of connected posts (vertices) and poster attributes (edges) constitutes an entity. With that clear, we simply have to create a graph out of all the connections and then find the disjoint subgraphs. The function below takes one type of edge information and makes a connectivity list.
In [ ]:
def make_graph(df, color, data_type):
    '''
    Makes a list of edge tuples for each node-edge-node segment in the graph
    '''
    out = []
    for k, v in df.groupby(df.columns[-1]):
        # Ring the posts that share this attribute value: pair each post
        # with the next one, wrapping the last back around to the first
        v = [x[0] for x in v.values.tolist()]
        v_right = v[1:] + v[:1]
        out.append([(a, b, {'type': data_type,
                            'color': color,
                            'd3color': '#%02x%02x%02x' % tuple(int(c * 255) for c in color)})
                    for a, b in zip(v, v_right)])
    out = [item for sublist in out for item in sublist]
    return out
In [ ]:
out = make_graph(df[df.name!=''][['post_id','name']],[1,0,0],'email')
out += make_graph(df[df.number!=''][['post_id','number']],[0,0,1],'number')
out += make_graph(df[df.oid!=''][['post_id','oid']],[0,1,0],'oid')
In [ ]:
G = nx.Graph()
G.add_edges_from(out)

# nx.connected_component_subgraphs was removed in networkx 2.4;
# iterate over the node sets from connected_components instead
sub_graphs = []
for i, nodes in enumerate(nx.connected_components(G)):
    sub_graphs.append(list(zip([i] * len(nodes), nodes)))
sub_graphs = [item for sublist in sub_graphs for item in sublist]
In [ ]:
df_out = pd.DataFrame(sub_graphs, columns=['entity_id', 'post_id'])
df_out.groupby('entity_id').count().head(10)
In [ ]:
df_out = df_out.merge(df,on='post_id')
df_out.set_index(['entity_id','number','name','oid'],inplace=True)
df_out.head(5)
In [ ]:
df_out.query('name=="tucenicienta360@gmail.com"')
In [ ]:
entity_id = df_out.query('name=="tucenicienta360@gmail.com"').index.get_level_values('entity_id')[0]
df_out.query('entity_id=={}'.format(entity_id)).head(20)
Finally, we can get a sense of the network by plotting a subset of 50 entities. There are several orphan posts that aren't connected to a larger entity, as well as several large entities with many connected posts. A quick glance at the plot shows that most of the ID information is phone numbers (blue edges), with a few email addresses (red) and user IDs (green) sprinkled in.
In [ ]:
G_check = G.subgraph(df_out.loc[560:610].post_id.values)
pos = nx.spring_layout(G_check)
colors = [G_check[u][v]['color'] for u, v in G_check.edges()]

plot_kargs['charge'] = -5
plot_kargs['link_distance'] = 5
plot_kargs['default_node_radius'] = 2
plot_kargs['height'] = 600

fc = ForceChart(G_check, **plot_kargs)
fc.to_notebook()
In [ ]:
rcParams['figure.figsize'] = (7.0, 7.0)
nx.draw(G_check,pos,node_color='k',edge_color=colors,width=2,node_size=5)