In [1]:
__author__ = 'The Roam Analytics team'
TL;DR The latest build of the Roam Core Public Health Knowledge Graph passed 1 billion edges. This post marks that milestone and talks about why it is meaningful.
The Roam Core Public Health Knowledge Graph is a rich picture of the world of healthcare, with connections into numerous data sources: diverse medical ontologies, provider profiles and networks, product approvals and recalls, population health statistics, academic publications, financial data, clinical trial summaries and statistics, and many others.
We currently rebuild the Graph on a biweekly schedule. The build that finished on June 2, 2017, has 209,053,294 nodes, 1,021,163,726 edges, and 6,231,287,999 node/edge attributes. These big numbers are great, of course, but the statistic that interests us most at Roam is much smaller: the roughly 5:1 edge-to-node ratio, which measures how densely the Graph captures relationships.
The Core Public Health Graph is the foundation of Roam's data infrastructure. Internally, we use it to pursue hypotheses about healthcare and to explore new methods. In commercial engagements, each of our customers gets their own copy of it. We infuse that graph with (typically private, protected) project-specific data sets and serve applications on top of it. Having separate instances per project ensures that sensitive data and insights remain separate. (These project-specific graphs are generally much larger than the Core Public Health Graph because of the sheer volume of, for example, insurance claims.)
The guiding idea here is that everything – a name, a number, a date, an event description – acquires meaning only in context, where it can be compared with other things of its type. In other words, context is essential for understanding, and the more context one has, the fuller one's understanding can be. Having invested in building these huge graphs for healthcare, we always have the richest possible context at our fingertips.
The nodes in these graphs represent entities in healthcare: drugs, procedures, providers, hospitals, statistics, events, etc. The edges capture relationships between nodes: treats, prescribed, has_symptom, is, etc. We have always aspired to have the number of edges grow faster than the number of nodes. The world of healthcare is big, but so are many domains. What makes healthcare stand out is the incredible number of important, complex inter-relationships between entities.
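To make the node/edge vocabulary concrete, here is a minimal sketch of a typed graph in plain Python. The node IDs and the three triples are hypothetical toy data chosen for illustration, not records from the Graph, and the storage layout is an assumption rather than a description of how Roam represents it.

```python
from collections import Counter

# Hypothetical toy data: each node ID maps to an entity type.
nodes = {
    "drug:metformin": "drug",
    "condition:type_2_diabetes": "condition",
    "symptom:fatigue": "symptom",
    "provider:dr_example": "provider",
}

# Edges are (source, edge_type, target) triples; the edge types are the
# ones named in the text (treats, prescribed, has_symptom).
edges = [
    ("drug:metformin", "treats", "condition:type_2_diabetes"),
    ("provider:dr_example", "prescribed", "drug:metformin"),
    ("condition:type_2_diabetes", "has_symptom", "symptom:fatigue"),
]

# Count edges per type and compute the edge-to-node ratio.
edge_type_counts = Counter(edge_type for _, edge_type, _ in edges)
edge_to_node_ratio = len(edges) / len(nodes)

print(edge_type_counts)
print(f"edge-to-node ratio: {edge_to_node_ratio:.2f}")
```

In this toy graph there are fewer edges than nodes, which is exactly the situation we try to avoid at scale: entities that are only thinly connected to anything else.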
That's why we emphasize the number of edges, and why a 5:1 edge-to-node ratio (across 277 edge types) is so meaningful. We're happy to celebrate node-count milestones, of course, but adding entities isn't all that useful unless they get embedded in a web of connections to other entities.
In March, the graph had 200M nodes and 632M edges. That is only about 9M fewer nodes than today but almost 390M fewer edges, for a mere 3:1 edge-to-node ratio. So, of late, we've mainly been adding edges.
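The ratios quoted in this post can be checked with a few lines of arithmetic. This sketch uses only the counts given in the text; the March figures are the rounded values (200M and 632M), so the March results are approximate.

```python
# Node and edge counts as stated in the post (March values are rounded).
builds = {
    "March 2017": {"nodes": 200_000_000, "edges": 632_000_000},
    "June 2, 2017": {"nodes": 209_053_294, "edges": 1_021_163_726},
}

# Edge-to-node ratio for each build.
ratios = {name: b["edges"] / b["nodes"] for name, b in builds.items()}

# Relative growth in nodes and edges between the two builds.
march, june = builds["March 2017"], builds["June 2, 2017"]
node_growth = june["nodes"] / march["nodes"] - 1
edge_growth = june["edges"] / march["edges"] - 1

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} edges per node")
print(f"node growth: {node_growth:.1%}, edge growth: {edge_growth:.1%}")
```

The ratios come out at roughly 3.2 and 4.9, which is where the rounded 3:1 and 5:1 figures come from. The growth figures show edges growing far faster than nodes over the same period.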
We manage the entire graph building process with Airflow. Here's a snapshot of the directed acyclic graph (DAG) of process dependencies for the "connector/enricher" phase. This DAG adds about 40% of the edges (and most of the really crucial ones). Building it is a large and diverse effort that blends software engineering, database management, natural language processing, and machine learning.
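For readers unfamiliar with the idea, a DAG of process dependencies simply says which steps must finish before others can start. The sketch below illustrates that with Python's standard-library `graphlib` (Python 3.9+) rather than Airflow itself. The task names are hypothetical and are not the tasks in our actual connector/enricher DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "load_ontologies": set(),
    "load_provider_profiles": set(),
    "connect_providers_to_drugs": {"load_ontologies", "load_provider_profiles"},
    "enrich_with_publications": {"connect_providers_to_drugs"},
}

# static_order() yields the tasks in an order that respects every dependency
# and raises graphlib.CycleError if the dependencies contain a cycle.
run_order = list(TopologicalSorter(dependencies).static_order())

for step, task in enumerate(run_order, start=1):
    print(step, task)
```

A scheduler such as Airflow adds a great deal on top of this ordering (retries, parallel execution, monitoring), but the dependency structure is the same kind of object.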
There are always more data sets to ingest, and more relationships to uncover. What will this DAG look like at 2 billion edges? At 100 billion?
In [2]:
from IPython.display import display, Image
display(Image('roam-core-public-connector-enricher-dag.png'))