We define enrichment as the process of augmenting your data with new variables by means of a spatial join between your data and a Dataset aggregated at a given spatial resolution in the CARTO Data Observatory. In other words:
"Enrichment is the process of adding variables to a geometry (point, line, polygon…), which we call the target, from a spatial (polygon) dataset, which we call the source."
Before starting, we recommend two other guides: the CARTOframes Quickstart, since this guide reuses some of the DataFrames generated there, and the Data Discovery guide, which explains how to explore the Data Observatory catalog to find variables of interest for your analyses.
Let's pick up where the Data Discovery guide left off. There we subscribed to the AGS demographics dataset and listed the variables available to enrich our own data.
In [1]:
from cartoframes.auth import set_default_credentials
set_default_credentials('creds.json')
In [2]:
from cartoframes.data.observatory import Catalog, Dataset, Variable, Geography
Catalog().subscriptions().datasets
Out[2]:
In [3]:
dataset = Dataset.get('ags_sociodemogr_f510a947')
variables = dataset.variables
variables
Out[3]:
As we saw in the Data Discovery guide, the ags_sociodemogr_f510a947 dataset contains socio-demographic variables aggregated to the Census block group level.
Let's try and find a variable for total population:
In [4]:
vdf = variables.to_dataframe()
vdf[vdf['name'].str.contains('pop', case=False, na=False)]
Out[4]:
We can store the variable instance we need by looking it up in the Catalog by its slug, in this case POPCY_5e23b8f4:
In [5]:
variable = Variable.get('POPCY_5e23b8f4')
variable.to_dict()
Out[5]:
The POPCY variable contains the SUM of the population per block group for the year 2019. Let's enrich our stores DataFrame with that variable.
In the CARTOframes Quickstart you learned how to load your own data (in this case Starbucks stores) and geocode the addresses to coordinates for further analysis.
Let's start by loading those geocoded Starbucks stores:
In [6]:
from geopandas import read_file
stores_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_geocoded.geojson')
stores_gdf.head(5)
Out[6]:
Note: Alternatively, you can load data in any geospatial format supported by GeoPandas or CARTO.
As we can see, for each store we have its name, address, the total revenue by year and a geometry column indicating the location of the store. This is important because, for the enrichment service to work, we need a DataFrame with a geometry column encoded as a shapely object.
We can now create a new Enrichment instance. Since the stores_gdf dataset represents store locations (points), we can use the enrich_points function, passing as arguments the stores DataFrame and a list of Variables (each of which must be covered by a valid subscription in the Data Observatory catalog).
In this case we are only enriching one variable (the total population), but we could enrich a list of them.
In [7]:
from cartoframes.data.observatory import Enrichment
enriched_stores_gdf = Enrichment().enrich_points(stores_gdf, [variable])
enriched_stores_gdf.head(5)
Out[7]:
Once the enrichment finishes, there is a new column in our DataFrame called POPCY with the population projected for the year 2019, taken from the US Census block group that contains each of our Starbucks stores.
All this information is available in the ags_sociodemogr_f510a947 metadata. Let's take a look:
In [8]:
dataset.to_dict()
Out[8]:
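Conceptually, the point enrichment performed earlier is a point-in-polygon lookup: each target point takes the value of the source polygon that contains it. The following is a minimal sketch of that idea in plain Python, not the Data Observatory's implementation. The function names (`point_in_polygon`, `enrich_points_sketch`), the two square "block groups" and their population values are all made up for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # ring of (x, y) vertices, first vertex not repeated


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enrich_points_sketch(points: List[Dict], source: List[Dict], variable: str) -> List[Dict]:
    """Hypothetical helper: copy `variable` from the containing source polygon to each point."""
    enriched = []
    for p in points:
        value = None  # stays None when no source polygon contains the point
        for feature in source:
            if point_in_polygon(p["geometry"], feature["geometry"]):
                value = feature[variable]
                break
        enriched.append({**p, variable: value})
    return enriched


# Toy source dataset: two adjacent square "block groups" with invented populations.
block_groups = [
    {"geoid": "A", "geometry": [(0, 0), (2, 0), (2, 2), (0, 2)], "POPCY": 1200},
    {"geoid": "B", "geometry": [(2, 0), (4, 0), (4, 2), (2, 2)], "POPCY": 800},
]
# Toy target dataset: three "stores", the last one outside every block group.
stores = [
    {"name": "store 1", "geometry": (1.0, 1.0)},
    {"name": "store 2", "geometry": (3.0, 0.5)},
    {"name": "store 3", "geometry": (5.0, 5.0)},
]

enriched = enrich_points_sketch(stores, block_groups, "POPCY")
for row in enriched:
    print(row["name"], row["POPCY"])
```

In this sketch, a point that falls outside every source polygon simply gets no value (`None`).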
Next, let's do a second enrichment, this time using a DataFrame of areas of influence. These were calculated with the CARTOframes isochrones service, which returns the polygons around each store that cover the area within an 8, 17 and 25 minute walk.
In [9]:
aoi_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_isolines.geojson')
aoi_gdf.head(5)
Out[9]:
In this case we have a DataFrame which, for each index in stores_gdf, contains the polygons of the areas of influence around that store at 8, 17 and 25 minute walking intervals. Again, the geometry is encoded as a shapely object.
For polygons, the Enrichment service provides an enrich_polygons function which, in its basic version, works in the same way as the enrich_points function. It just needs a DataFrame with polygon geometries and a list of variables to enrich:
In [10]:
from cartoframes.data.observatory import Enrichment
enriched_aoi_gdf = Enrichment().enrich_polygons(aoi_gdf, [variable])
enriched_aoi_gdf.head(5)
Out[10]:
We now have a new column in our areas of influence DataFrame, SUM_POPCY, which represents the SUM of total population in the Census block groups that intersect with each polygon in our DataFrame.
Let's take a deeper look into what happens under the hood when you execute a polygon enrichment.
Imagine we have polygons representing municipalities, in blue, each of which has a population attribute, and we want to find out the population inside the green circle.
We don’t know how the population is distributed inside these municipalities. It is probably concentrated in cities somewhere, but since we don’t know where those cities are, our best guess is to assume that the population is evenly distributed in the municipality (i.e. every point inside the municipality has the same population density).
Population is an extensive property (it grows with area), so we can subset it (a region inside the municipality will always have a smaller population than the whole municipality), and also aggregate it by summing.
In this case, we’d calculate the population inside each part of the circle that intersects with a municipality.
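The reasoning above amounts to area-weighted interpolation: each municipality contributes its population multiplied by the fraction of its area that falls inside the target. Here is a minimal sketch in plain Python, under the uniform-density assumption described above. To keep the geometry trivial it uses axis-aligned rectangles instead of real polygons and a square target instead of a circle. The names (`estimate_population`, `intersection`) and all the numbers are invented for illustration and are not part of CARTOframes.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(rect: Rect) -> float:
    """Area of a rectangle; an inverted (empty) rectangle has area 0."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def intersection(a: Rect, b: Rect) -> Rect:
    """Overlap of two rectangles; the result is inverted when they do not overlap."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def estimate_population(target: Rect, municipalities: List[Dict]) -> float:
    """Sum each municipality's population scaled by the share of its area inside the target."""
    total = 0.0
    for m in municipalities:
        overlap = area(intersection(target, m["geometry"]))
        # Assumes non-degenerate source polygons (area > 0).
        share = overlap / area(m["geometry"])
        total += m["population"] * share
    return total


# Three made-up municipalities and a target area overlapping all of them.
municipalities = [
    {"name": "M1", "geometry": (0.0, 0.0, 4.0, 4.0), "population": 1600},
    {"name": "M2", "geometry": (4.0, 0.0, 8.0, 4.0), "population": 800},
    {"name": "M3", "geometry": (0.0, 4.0, 8.0, 8.0), "population": 3200},
]
target = (2.0, 2.0, 6.0, 6.0)

estimated = estimate_population(target, municipalities)
print(estimated)
```

The target covers a quarter of each municipality's area, so the estimate is a quarter of each population summed: 400 + 200 + 800.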
Default aggregation methods
In the Data Observatory, we suggest a default aggregation method for certain fields. However, some fields don’t have a clear best method, and some just can’t be aggregated. In these cases, we leave the agg_method field blank and let the user choose the method that best fits their needs.
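To see why the choice of method matters, consider the source polygons that overlap a single target polygon. A count such as population is naturally summed in proportion to the overlapping area, while a rate or an average is better combined as a weighted mean. The sketch below illustrates this in plain Python. The `aggregate` helper, its method names and the sample figures are assumptions made for this example and do not describe how the Data Observatory computes its aggregations.

```python
from typing import List, Tuple

# One entry per source polygon that overlaps the target polygon:
# (value of the variable, overlapping area, total area of the source polygon)
Piece = Tuple[float, float, float]


def aggregate(pieces: List[Piece], method: str) -> float:
    """Hypothetical helper combining the values of overlapping source polygons."""
    if not pieces:
        raise ValueError("no source polygon overlaps the target")
    if method == "SUM":
        # Counts (e.g. population): take the share of each value inside the target.
        return sum(value * overlap / total for value, overlap, total in pieces)
    if method == "AVG":
        # Rates or averages (e.g. income per household): mean weighted by overlap area.
        weight = sum(overlap for _, overlap, _ in pieces)
        return sum(value * overlap for value, overlap, _ in pieces) / weight
    if method == "MAX":
        return max(value for value, _, _ in pieces)
    if method == "MIN":
        return min(value for value, _, _ in pieces)
    raise ValueError(f"unsupported aggregation method: {method!r}")


# Two made-up source polygons: half of the first and all of the second
# fall inside the target polygon.
pieces = [(1000.0, 2.0, 4.0), (3000.0, 6.0, 6.0)]

for method in ("SUM", "AVG", "MAX", "MIN"):
    print(method, aggregate(pieces, method))
```

The same inputs give 3500 with SUM and 2500 with AVG, which is why a field without an obvious best method is left for the user to decide.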
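In this guide you've seen how to use CARTOframes in conjunction with the Data Observatory to enrich a Starbucks dataset with a new population variable for the use case of revenue prediction analysis: you looked up a population variable in the catalog, enriched the store points with it, and then enriched the areas of influence polygons around each store.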
In addition, you were introduced to some more advanced concepts and further explanation of how the enrichment itself works.