Anidata

Over the past year or so I have been working with some friends to create a 501(c)3 non-profit organization called Anidata. Anidata has a two-prong mission: 1) we provide mentorship opportunities where aspiring data scientists can get hands-on experience with real-world data problems in order to improve their skills and 2) we provide data science services to social good organizations that might not otherwise have access to data science capabilities.

{:style="text-align:center" } {:width="180px"}

Stop Human Trafficking

In our first project as an organization, we supported the Fulton County's District Attorney's office to help find, fight, investigate, and prosecute human trafficking crimes. In consultation with the DA's office, we learned that many of the investigative steps require manually search adult services websites to build a case.

Hackathon

After gathering this initial information, we held a hackathon with Anidata members and law enforcement/DA staff to converge on a concept for a data product that would help them fight human trafficking.

{:style="text-align:center" } {:width="330px"}{:vertical-align="middle"} {:width="230px"}{:vertical-align="middle"}

The hackathon was a big success with over 30 attendees. Teams presented various project ideas that were in varying states of code completion. The common theme for a useful product that emerged was a searchable database of internet entities that post on adult ad websites.

The basic ides would be to allow investigators to take one piece of evidence they might have on an individual under investigation--such as a phone number--and search on that information in a database for all related phone numbers and email addresses. This helps the DA more efficiently link burner phones to people and posts so that they can build a case.

At the end of the hackathon we had many elements of what would eventually be this search product in place including: 1) a webscraper, 2) a luigi job framework for running daily jobs and dumping the output to a postgresql database, 3) cleaning algorithms to find and clean emails and phone numbers, and 4) an entity resolution algorithm to group postings by entity.

Entity Resolution (ER)

The core algorithmic challenge in this project was to create an ER algorithm that would group posts by posting entity and allow us to extract a set of emails and phone numbers from those posts to associate with that entity. Let's start by taking a look at the data.

Data

I've flatten several months of the database data and pruned it down to just the columns we care about. In the following dataframe each row is a ad posting and the metadata we were able to scrape from that posting. Sample data includes three flat files that pair a post_id with an email, user ID, or email address. We have the following columns:

  • name
  • phone number
  • oid (poster unique ID)
  • posterage
  • region
  • type

Data

I have the data for this post stored in GitHub. Let's take a look at the head of the dataframe:


In [ ]:
import pandas as pd
df = pd.read_csv(
    'https://github.com/gte620v/graph_entity_resolution/raw/master/data/scraped_data.csv.gz',
    converters={'name': lambda x: str(x).lower(),
                'number': str,
                'oid': str,
                'post_id': str},
    parse_dates=['postdate'],
    compression='gzip')
df.head()

Data Description

To get a better sense of the data, we can use DataFrame.describe to find the statistics.

Ordinal Columns


In [ ]:
df[['post_id','name','number','oid','region']].describe(include = 'all')

Age


In [ ]:
df[['posterage']].dropna().describe()

Age Swamplot

Seaborn has some cool canned functions for examining data distributions by category. I am particuraly fond sns.swarmplot that makes plots like the one below. The plot is a sampling of the distribution of first 1000 rows by region and poster age. Strangely, there are a few posters with ages around 100 years old.


In [ ]:
import seaborn as sns
%matplotlib inline
from pylab import rcParams
rcParams['figure.figsize'] = (7.0, 7.0)

sns.swarmplot(y='region',x='posterage',data=df.head(1000),size=2)

Post Date


In [ ]:
df[['postdate']].describe(include = 'all')

Entity Resolution

After wrestling with the data a bit, we realized that we can conceptualize the data as a graph and then cluster posts into entities by finding the connected subgraphs. To do this, our graph has the following semantics:

  • Vertices: Ad posts
  • Edges: Common attributes (email, phone number, or poster ID)

Our approach is to create subgraphs using each type of edge information and then use subgraph clustering to find the entity subgraphs.

Explore Subgraph Sizes

The clustered subgraphs will be entities and each entity will have posted some number of ads. The block below analyzes the count of the most posted phone numbers. These are phone numbers that the poster put in the ad as contact information. The output shows that some phone numbers are associated with over a thousand posts.


In [ ]:
df\
.groupby('number')\
.count()\
.sort_values('post_id',ascending=False)[['post_id']]\
.head()

Example Sub-Graph

One challenge is to efficiently create the sub graphs. Our first approach was to make fully connected graph out of the data subsets. But this has the short coming of creating a bunch of graph edges that we will later show are not necessary. With a fully connected graph of \(N\) nodes, we will have \(N \cdot (N-1)\) connections, which is a lot of graph data to store and reason about.

Here is an example of a phone number that is seen on 11 posts:


In [ ]:
ph_sample = df[df.number=='7865032020']
ph_sample.sort_values('name',ascending=False).head()

Fully Connected Subgraph

To create the subgraph we iterate through all pair-wise combinations of posts (a cross join) and populate the networkx graph. You'll notice in the function below that we also add an attribute to each edge called d3color.

I made a plotting library called pyd3netviz to use d3 to visualize network graphs in python (see my previous post about it). pyd3netviz can take networkx node an edge attributes as directives for styling the plots. In this example, out function takes in a rgb tuple called color and uses that to specify the edge color.


In [ ]:
from pyd3netviz import ForceChart
import itertools

import networkx as nx

def make_graph_data(in_data, data_type, color, G=None):
    ''' Make and plot graph 
    
    do_plot can be 'networkx' or 'pyd3'
    '''
    if G is None:
        G = nx.Graph()
    out = []
    for a, b in itertools.product(in_data, in_data):
        out.append((a, b, {'type': data_type, 'color': color, 
                           'd3color': '#%02x%02x%02x' % tuple(c*255 for c in color)}))
    G.add_edges_from(out)
    return G

Phone Numbers Only

The example below shows the fully connected subgraph for one phone number (the graph is interactive). Out high-level ER objective is solved as long as these nodes are connected as a group. In this case is looks like have the graph fully connected is overkill.


In [ ]:
G_samp = make_graph_data(ph_sample.post_id, 'phone', [0,0,1])
width = 590
height = 400
plot_kargs={'charge':-10,'link_distance':200,'width':width,'height':height,
             'default_node_radius':10,'default_link_width':1,
             'link_color_field':'d3color'};
fc =ForceChart(G_samp,**plot_kargs)
fc.to_notebook()

Email Addresses Only

Looking at a set of carefully chosen email addresses, we see the same sort of graph. I've chosen these email addresses because they share a common post with the phone numbers from the previous example.


In [ ]:
em_sample = df[df['name'].str.contains('tuc',False)]
em_sample

Email Address Graph


In [ ]:
G_samp_em = make_graph_data(em_sample.post_id, 'email', [1,0,0])

fc =ForceChart(G_samp_em,**plot_kargs)
fc.to_notebook()

Combined Graph with Email and Phone Numbers

If we combine all the posts from the previous two examples and plot them, we notice that they are part of a common ER subgraph that contains the email address tucenicienta360@gmail.com and the phone number 7865032020. We'll see in a minute that there are other phone numbers connected to this entity as well. In the plot below the phone number edges are blue and the email address edges are red. You may have to drag around some of the vertices in order to see the red edges.


In [ ]:
out = []
for a, b in itertools.product(em_sample.post_id, em_sample.post_id):
    out.append((a, b, {'type': 'email', 'color': 'r', 'd3color': '#f00'}))
    
G_samp.add_edges_from(out)

fc =ForceChart(G_samp,**plot_kargs)
fc.to_notebook()

Simplifying The Graph

The process above works, but having a fully connected set of graphs ends up taking a bunch of memory. To simplify the graph for our problem, we only need each network of posts to be connected--not fully connected. To do that, we can create subgraphs that are loosely connected loops instead of densely connected balls of edges.

Phone Numbers Only


In [ ]:
G_samp_loop = nx.Graph()

# No product for loop
v = ph_sample.post_id.values.tolist()
v_right = v[1:]
if len(v) == 1:
    v_right = v
else:
    v_right[-1] = v[0]
out = [(a, b,{'type':'phone','color':'b', 'd3color': '#00f'}) for a, b in zip(v, v_right)]

G_samp_loop.add_edges_from(out)

colors = [G_samp_loop[u][v]['color'] for u,v in G_samp_loop.edges()]

plot_kargs['charge'] = -150
plot_kargs['link_distance'] = 10

fc =ForceChart(G_samp_loop,**plot_kargs)
fc.to_notebook()

Phone Numbers and Email Addresses

It is clear that the graph below satisfies our constraint of keeping these nodes connected, while requiring many fewer edges.


In [ ]:
v = em_sample.post_id.values.tolist()
v_right = v[1:]
if len(v) == 1:
    v_right = v
else:
    v_right[-1] = v[0]
out += [(a, b,{'type':'phone','color':'r', 'd3color': '#f00'}) for a, b in zip(v, v_right)]

G_samp_loop.add_edges_from(out)


colors = [G_samp_loop[u][v]['color'] for u,v in G_samp_loop.edges()]

fc =ForceChart(G_samp_loop,**plot_kargs)
fc.to_notebook()

Graph Clusters

When viewed this way, a set of connected posts (vertices) and poster attributes (edges) constitute an entity. With that clear, we simply have to create a graph out of all connections and then find the disjoint subgraphs. The function below takes one form of edge information and makes a connectivity list.


In [ ]:
def make_graph(df, color, data_type):
    '''
    Makes a list of tuple lists for each node-edge-node segment in the graph
    '''
    out = []
    for i, (k, v) in enumerate(df.groupby(df.columns[-1])):
        
        v = v.values.tolist()
        v = [x[0] for x in v]
        v_right = v[1:]
        if len(v) == 1:
            v_right = v
        else:
            v_right[-1] = v[0]
        out.append([(a, b, {'type': data_type,
                            'color': color, 
                           'd3color': '#%02x%02x%02x' % tuple(c*255 for c in color)}) for a, b in zip(v, v_right)])
    out = [item for sublist in out for item in sublist]
    return out

Add Graphs for Each Type of Connection

For each identifying attribute, use make_graph to generate the connectivity list and then combine all lists.


In [ ]:
out = make_graph(df[df.name!=''][['post_id','name']],[1,0,0],'email')
out += make_graph(df[df.number!=''][['post_id','number']],[0,0,1],'number')
out += make_graph(df[df.oid!=''][['post_id','oid']],[0,1,0],'oid')

Use NetworkX to Find Disjoint SubGraphs

Now that we have the mast graph, we can use NetworkX to find all subgraphs and then loop through them to find each distinct entity.


In [ ]:
G = nx.Graph()
G.add_edges_from(out)

sub_graphs = []
for i, x in enumerate(nx.connected_component_subgraphs(G)):
    nodes = nx.nodes(x)
    sub_graphs.append(list(zip([i] * len(nodes), nodes)))

sub_graphs = [item for sublist in sub_graphs for item in sublist]

Check Entity Data

To check the entities we can start by just looking at the number of posts associated with each entity id. That is, the number of posts in each disjoint subgraph.


In [ ]:
df_out = pd.DataFrame(sub_graphs,
                      columns=['entity_id',
                               'post_id'])
df_out.groupby('entity_id').count().head(10)

Merge With Original Data

Finally, we use the entity dataframe to join back in the post data so that we can can see the identity data for each entity.


In [ ]:
df_out = df_out.merge(df,on='post_id')
df_out.set_index(['entity_id','number','name','oid'],inplace=True)
df_out.head(5)

Check Results

To check the results, let's look at the entity we started to manually analyze in the first examples. If we search this entity table by email address we see two phone numbers attached to the entity including the one we originally experimented with. That seems reasonable.


In [ ]:
df_out.query('name=="tucenicienta360@gmail.com"')

Check Entity

Taking that entity_id, we can see all the connected posts. We can see a variety of phone numbers, emails, and post locations.


In [ ]:
entity_id = df_out.query('name=="tucenicienta360@gmail.com"').index.get_level_values('entity_id')[0]
df_out.query('entity_id=={}'.format(entity_id)).head(20)

Plot Network

Finally, we can get a sense of the network by plotting a subset of 50 entities. There are several orphan posts that aren't connected to an entity as well as several large entities with many connected posts. A quick glance at the plot shows that most of the ID information are phone numbers (blue edges), with a few email addresses (red) and poster_ids (green) sprinkled in.


In [ ]:
G_check = G.subgraph(df_out.loc[560:610].post_id.values)

pos = nx.spring_layout(G_check)
colors = [G_check[u][v]['color'] for u,v in G_check.edges()]

plot_kargs['charge'] = -5
plot_kargs['link_distance'] = 5
plot_kargs['default_node_radius'] = 2
plot_kargs['height'] = 600

fc =ForceChart(G_check,**plot_kargs)
fc.to_notebook()

NetworkX Plot

NetworkX also has a plotting tool that we can use to see the graph. The block below shows how we use nx to make the same plot.


In [ ]:
rcParams['figure.figsize'] = (7.0, 7.0)
nx.draw(G_check,pos,node_color='k',edge_color=colors,width=2,node_size=5)

Resources