Goal: This notebook aims to show how to use PyGraphistry to visualize data from Neo4j. We also show how to use graph algorithms in Neo4j and use PyGraphistry to visualize the result of those algorithms.
Prerequisites:
neo4j-driver
- pip install neo4j-driver
pygraphistry
- pip install "graphistry[all]"
In [1]:
# import required dependencies
from neo4j.v1 import GraphDatabase, basic_auth
from pandas import DataFrame
import graphistry
In [2]:
# register Graphistry API key
# request an API key if you don't have one: https://www.graphistry.com/api-request
#graphistry.register(key='YOUR API KEY HERE')
If you haven't already, create an instance of the Russian Twitter Trolls sandbox on Neo4j Sandbox. We'll use the Python driver for Neo4j to fetch data from Neo4j. To do this, we'll need to instantiate a Driver object, passing in the credentials for our Neo4j instance. If you are using Neo4j Sandbox, you can find the credentials for your instance in the "Details" tab. Specifically, we need the IP address, Bolt port, username, and password. Bolt is the binary protocol used by the Neo4j drivers, so a typical database URL string takes the form bolt://<IP_ADDRESS>:<BOLT_PORT>
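As a small illustration of the URL form described above, the following sketch assembles a Bolt connection string from its parts. The helper name bolt_url and the host and port values are placeholders chosen for this example, not part of the Neo4j driver API; substitute the values from your own sandbox.

```python
def bolt_url(host: str, port: int) -> str:
    """Build a connection string of the form bolt://<IP_ADDRESS>:<BOLT_PORT>."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"bolt://{host}:{port}"


# Placeholder values: a local instance on the default Bolt port.
url = bolt_url("127.0.0.1", 7687)
print(url)
```

The resulting string is what gets passed as the first argument to GraphDatabase.driver().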
In [3]:
# instantiate Neo4j driver instance
# be sure to replace the connection string and password with your own
driver = GraphDatabase.driver("bolt://34.201.165.36:34532", auth=basic_auth("neo4j", "capitals-quality-loads"))
Once we've instantiated our Driver, we can use Session objects to execute queries against Neo4j. Here we'll use session.run() to execute a Cypher query. Cypher is the query language for graphs that we use with Neo4j (you can think of Cypher as SQL for graphs).
In [4]:
# neo4j-driver hello world
# execute a simple query to count the number of nodes in the database and print the result
with driver.session() as session:
    results = session.run("MATCH (a) RETURN COUNT(a) AS num")
    for record in results:
        print(record)
If we inspect the data model in Neo4j, we can see that we have information about Tweets and, specifically, Users mentioned in Tweets.
Let's use Graphistry to visualize User-User Tweet mention interactions. We'll do this by querying Neo4j for all tweets that mention users.
Currently, PyGraphistry can work with data as a pandas DataFrame, a NetworkX graph, or an igraph graph object. In this section we'll show how to load data from Neo4j into PyGraphistry by converting results from the Python Neo4j driver into a pandas DataFrame.
Our goal is to visualize User-User Tweet mention interactions. We'll create two pandas DataFrames, one representing our nodes (Users) and a second representing the relationships in our graph (mentions).
Some users are known Troll accounts, so we include a flag variable, troll, to indicate when the user is a Troll. This will be used in our visualization to set the color of the known Troll accounts.
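To make the flag concrete, here is a minimal Python sketch of the same rule that the Cypher CASE expression in the next cell applies: nodes carrying the "Troll" label get the value 5, and all others get 0. The function name troll_flag and the sample label lists are hypothetical and used only for illustration.

```python
def troll_flag(labels):
    """Return 5 for nodes carrying the "Troll" label and 0 otherwise."""
    return 5 if "Troll" in labels else 0


# Hypothetical label lists, shaped like the output of labels(u) in Cypher.
print(troll_flag(["User", "Troll"]))
print(troll_flag(["User"]))
```

Because the flag is a plain number, it can be bound directly as a color value when the graph is plotted.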
In [7]:
# Create User DataFrame by querying Neo4j, converting the results into a pandas DataFrame
with driver.session() as session:
    results = session.run("""
        MATCH (u:User)
        WITH u.user_key AS screen_name, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
        RETURN screen_name, troll""")
    users = DataFrame(results.data())
# show the first 5 rows of the DataFrame
users[:5]
Out[7]:
Next, we need some relationships to visualize. In this case we are interested in visualizing user interactions, specifically where users have mentioned users in Tweets.
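The relationship data we are after is an edge list with one row per (author, mentioned user) pair plus a count. The sketch below shows that aggregation in plain Python on made-up data; the user names and the helper to_edge_list are assumptions for illustration, while the column names u1, u2, and num match the ones used in this notebook.

```python
from collections import Counter

# Hypothetical (author, mentioned_user) pairs, one per mention found in a tweet.
raw_mentions = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
]


def to_edge_list(pairs):
    """Collapse repeated pairs into rows with a mention count, sorted by pair."""
    counts = Counter(pairs)
    return [
        {"u1": u1, "u2": u2, "num": n}
        for (u1, u2), n in sorted(counts.items())
    ]


edges = to_edge_list(raw_mentions)
for edge in edges:
    print(edge)
```

In the notebook itself this grouping is done by Neo4j, so only the aggregated rows are transferred to Python.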
In [8]:
# Query for tweets mentioning a user and create a DataFrame adjacency list using screen_name
# where u1 posted a tweet(s) that mentions u2
# num is the number of times u1 mentioned u2 in the dataset
with driver.session() as session:
    results = session.run("""
        MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User)
        RETURN u1.user_key AS u1, u2.user_key AS u2, COUNT(*) AS num
        """)
    mentions = DataFrame(results.data())
mentions[:5]
Out[8]:
Now we can visualize this mentions network using Graphistry. We'll specify the nodes and relationships for our graph. We'll also use the troll property to color the known Troll nodes red, setting them apart from other users in the graph.
In [9]:
viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_color="troll").nodes(users).edges(mentions)
viz.plot()
Out[9]:
After running the above Python cell you should see an interactive Graphistry visualization like this:
Known Troll user nodes are colored red, and regular users are colored blue. By default, the size of each node is proportional to its degree (its number of relationships). We'll see in the next section how we can use graph algorithms such as PageRank and visualize the results of those algorithms in Graphistry.
The above visualization shows us User-User Tweet mention interactions from the data. What if we wanted to answer the question "Who is the most important user in this network?" One way to answer that would be to look at the degree, or number of relationships, of each node. By default, PyGraphistry uses degree to set the size of each node, allowing us to gauge the importance of nodes at a glance.
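As a rough illustration of degree as an importance measure, the following sketch counts the relationships touching each node in a small, made-up edge list. It assumes degree means incoming plus outgoing relationships; how Graphistry computes its default sizes internally is not covered here, and the names and data are hypothetical.

```python
from collections import Counter


def degree(edges):
    """Count how many relationships touch each node (incoming plus outgoing)."""
    deg = Counter()
    for source, target in edges:
        deg[source] += 1
        deg[target] += 1
    return deg


# Hypothetical mention edges: (author, mentioned_user).
toy_edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
]

deg = degree(toy_edges)
print(deg.most_common())
```

In this toy graph, bob touches every edge and therefore has the highest degree.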
We can also use graph algorithms such as PageRank to determine importance in the network. In this section we show how to run graph algorithms in Neo4j and use the results of these algorithms in our Graphistry visualization.
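For intuition about what PageRank measures, here is a simplified power-iteration sketch in plain Python. It is not the implementation used by Neo4j's algo.pageRank procedure, and its scores are normalized to sum to 1, so they should not be expected to match the values Neo4j writes. The damping factor of 0.85, the iteration count, and the toy edge list are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a baseline share from random jumps.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical mention edges: (author, mentioned_user).
toy_edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
]

ranks = pagerank(toy_edges)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

Unlike degree, a node's score here depends on the scores of the nodes pointing at it: alice has only one incoming mention, but it comes from the highly ranked bob, so alice ends up far ahead of carol and dave.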
In [10]:
# run PageRank on the projected mentions graph and update nodes by adding a pagerank property score
with driver.session() as session:
    session.run("""
        CALL algo.pageRank("MATCH (t:User) RETURN id(t) AS id",
        "MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User)
        RETURN id(u1) as source, id(u2) as target", {graph:'cypher', write: true})
        """)
Now that we've calculated PageRank for each User node, we need to create a new pandas DataFrame for our user nodes by querying Neo4j:
In [11]:
# create a new users DataFrame, now including PageRank score for each user
with driver.session() as session:
    results = session.run("""
        MATCH (u:User)
        WITH u.user_key AS screen_name, u.pagerank AS pagerank, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
        RETURN screen_name, pagerank, troll""")
    users = DataFrame(results.data())
users[:5]
Out[11]:
In [12]:
# render the Graphistry visualization, binding node size to PageRank score
viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_size="pagerank", point_color="troll").nodes(users).edges(mentions)
viz.plot()
Out[12]:
Now when we render the Graphistry visualization, node size is proportional to the node's PageRank score. As a result, a different set of nodes is identified as most important.
By binding node size to the results of graph algorithms we are able to draw insight from the data at a glance and further explore the interactive visualization.