PyGw Showcase

This notebook demonstrates some of the utilities provided by the pygw Python package.

In this guide, we will show how you can use pygw to easily:

  • Define a data schema for Geotools SimpleFeature/Vector data (aka create a new data type)
  • Create instances for the new type
  • Create a GeoWave Data Store
  • Register a DataType Adapter & Index to the data store for your new data type
  • Ingest user-created data into the GeoWave Data Store
  • Query data out of the data store

To make this guide more interesting, we will work with a toy data set from Kaggle on Boston Public School buildings.

Installation

We can use pip to install pygw!


In [1]:
# Install pygw
!pip install ../main/python/


Processing d:\programming\java\geowave\python\src\main\python
Building wheels for collected packages: PyGw
  Building wheel for PyGw (setup.py): started
  Building wheel for PyGw (setup.py): finished with status 'done'
  Stored in directory: C:\Users\ngile\AppData\Local\Temp\pip-ephem-wheel-cache-_q9x2vfs\wheels\b5\fe\87\545dd16dd789406d9a2f04e05899a2a8163ea0a6a1eb1de314
Successfully built PyGw
Installing collected packages: PyGw
  Found existing installation: PyGw 0.1.dev0
    Uninstalling PyGw-0.1.dev0:
      Successfully uninstalled PyGw-0.1.dev0
Successfully installed PyGw-0.1.dev0

Importing pygw

pygw connects to a running Java gateway through py4j, so the JavaGateway must be started before pygw is imported. If it is not running, the import fails with PyGwJavaGatewayNotStartedError, as the output of the cell below shows.


In [2]:
import pygw

# --- Importing Relevant Modules ---
# Data Stores module
import pygw.stores
# Index module
import pygw.indices
# Geotools support
import pygw.geotools
# Query module
import pygw.query


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
D:\Conda\lib\site-packages\py4j\java_gateway.py in _get_connection(self)
    957         try:
--> 958             connection = self.deque.pop()
    959         except IndexError:

IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

ConnectionRefusedError                    Traceback (most recent call last)
D:\Conda\lib\site-packages\py4j\java_gateway.py in start(self)
   1095         try:
-> 1096             self.socket.connect((self.address, self.port))
   1097             self.stream = self.socket.makefile("rb")

ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Py4JNetworkError                          Traceback (most recent call last)
D:\Conda\lib\site-packages\pygw\__init__.py in <module>
     16     # This should be called only once.
---> 17     config.init()
     18 except Py4JNetworkError as exc:

D:\Conda\lib\site-packages\pygw\config.py in init(self)
     10                 ### Reflection utility ###
---> 11                 self.reflection_util= self.GATEWAY.jvm.py4j.reflection.ReflectionUtil
     12 

D:\Conda\lib\site-packages\py4j\java_gateway.py in __getattr__(self, name)
   1677             proto.REFL_GET_UNKNOWN_SUB_COMMAND_NAME + name + "\n" + self._id +
-> 1678             "\n" + proto.END_COMMAND_PART)
   1679         if answer == proto.SUCCESS_PACKAGE:

D:\Conda\lib\site-packages\py4j\java_gateway.py in send_command(self, command, retry, binary)
   1011         """
-> 1012         connection = self._get_connection()
   1013         try:

D:\Conda\lib\site-packages\py4j\java_gateway.py in _get_connection(self)
    959         except IndexError:
--> 960             connection = self._create_connection()
    961         return connection

D:\Conda\lib\site-packages\py4j\java_gateway.py in _create_connection(self)
    965             self.gateway_parameters, self.gateway_property)
--> 966         connection.start()
    967         return connection

D:\Conda\lib\site-packages\py4j\java_gateway.py in start(self)
   1107             logger.exception(msg)
-> 1108             raise Py4JNetworkError(msg, e)
   1109 

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:25333)

The above exception was the direct cause of the following exception:

PyGwJavaGatewayNotStartedError            Traceback (most recent call last)
<ipython-input-2-86fcf192413c> in <module>
----> 1 import pygw
      2 
      3 # --- Importing Relevant Modules ---
      4 # Data Stores module
      5 import pygw.stores

D:\Conda\lib\site-packages\pygw\__init__.py in <module>
     17     config.init()
     18 except Py4JNetworkError as exc:
---> 19     raise PyGwJavaGatewayNotStartedError("The JavaGateway must be running before you can import pygw.") from exc

PyGwJavaGatewayNotStartedError: The JavaGateway must be running before you can import pygw.
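
The traceback above ends in a refused connection to 127.0.0.1:25333. As a rough pre-flight check, the sketch below tests whether anything accepts TCP connections on that address before pygw is imported. The helper name gateway_is_listening is hypothetical and not part of pygw. The host and port defaults are taken from the error message above and may differ in other setups.

```python
import socket


def gateway_is_listening(host="127.0.0.1", port=25333, timeout=1.0):
    """Return True if something accepts TCP connections at host:port.

    This only shows that a server is listening; it does not prove that
    the listener is the GeoWave JavaGateway.
    """
    try:
        # create_connection raises OSError (for example ConnectionRefusedError)
        # when nothing is listening or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if gateway_is_listening():
    print("A server is listening on 127.0.0.1:25333; importing pygw can be attempted.")
else:
    print("Nothing is listening on 127.0.0.1:25333; start the JavaGateway first.")
```

A successful check is necessary but not sufficient: the import can still fail if some other process occupies the port.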

Loading the Boston Public Schools Data Set


In [ ]:
import csv

with open("public_schools.csv", encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    raw_data = [row for row in reader]

In [ ]:
# Let's take a look at what the data looks like
raw_data[0]

For the purposes of this exercise, let's just look at the ADDRESS, X, Y, and BLDG_NAME properties of each datapoint.
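
To make the field selection concrete, here is a minimal sketch that trims each row down to the four properties named above. It runs on a made-up two-row sample rather than the real public_schools.csv, so the values, the CITY column, and the names sample_text, WANTED, and trimmed are all illustrative assumptions. The sample is encoded with a byte order mark to show why the notebook opens the file with utf-8-sig.

```python
import csv
import io

# Hypothetical sample standing in for public_schools.csv (values are made up).
sample_text = (
    "X,Y,BLDG_ID,BLDG_NAME,ADDRESS,CITY\n"
    "-71.0049,42.3889,1,Example School A,1 Example Street,Boston\n"
    "-71.0305,42.3785,2,Example School B,2 Example Avenue,Boston\n"
)

# Encoding with utf-8-sig prepends a byte order mark, as some CSV exports do.
raw_bytes = sample_text.encode("utf-8-sig")

# Decoding with utf-8-sig strips the mark, so the first header is "X"
# rather than "\ufeffX".
with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8-sig", newline="") as f:
    rows = list(csv.DictReader(f))

# Keep only the properties used in this guide.
WANTED = ("ADDRESS", "X", "Y", "BLDG_NAME")
trimmed = [{key: row[key] for key in WANTED} for row in rows]

print(trimmed[0])
```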

Creating a new SimpleFeature data type for the Boston Public Schools Data Set

We can define a data schema for our needs & create an appropriate SimpleFeatureType. The SimpleFeatureType constructor takes in varargs for the kinds of attributes we want our type to have.

Attributes are easy to create with data-type-specific convenience methods such as SimpleFeatureTypeAttribute.string.


In [ ]:
from pygw.geotools import SimpleFeatureType as SFT
from pygw.geotools import SimpleFeatureTypeAttribute as SFTAttr

# Creating the Data Type for Public Schools data
pub_school_dt = SFT("public_schools",
                    SFTAttr.string("building_name"),
                    SFTAttr.string("address"),
                    SFTAttr.geometry("coordinates"))  # Let's group X and Y as a coordinate

Creating features for each data point using our new SimpleFeatureType

PyGw allows you to create SimpleFeature instances straight from a SimpleFeatureType. We can use the SimpleFeatureType.create_feature method to do so easily!

SimpleFeatureType.create_feature takes in an id and kwargs corresponding to the attribute descriptions associated with the type when we first created it.


In [ ]:
features = []
for bldg in raw_data:
    
    data_id = int(bldg["BLDG_ID"])
    addr = bldg["ADDRESS"]
    name = bldg["BLDG_NAME"]
    coords = (float(bldg["X"]), float(bldg["Y"]))
    
    ft = pub_school_dt.create_feature(data_id, building_name=name, address=addr, coordinates=coords)
    
    features.append(ft)
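
The loop above calls float() on every X and Y value, so a single blank or malformed cell would stop ingestion with an exception. One possible safeguard, sketched below, parses and range-checks the coordinates first so that bad rows can be skipped. The helper parse_lon_lat is hypothetical and not part of pygw, and it assumes that X holds longitude and Y holds latitude in degrees.

```python
def parse_lon_lat(row, x_key="X", y_key="Y"):
    """Return (lon, lat) as floats, or None if the row cannot be used.

    Assumes X is longitude and Y is latitude, both in degrees.
    """
    try:
        lon = float(row[x_key])
        lat = float(row[y_key])
    except (KeyError, TypeError, ValueError):
        # Missing column, None value, or text that is not a number.
        return None
    # NaN fails both comparisons, so it is rejected here as well.
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return None
    return (lon, lat)


# Made-up rows for illustration only.
good_row = {"X": "-71.0049", "Y": "42.3889"}
blank_row = {"X": "", "Y": "42.3889"}
print(parse_lon_lat(good_row))
print(parse_lon_lat(blank_row))
```

In the loop, a row for which the helper returns None could be skipped or logged instead of being passed to create_feature.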

Creating a Data Store

Let's now create a Data Store to ingest our data. A simple one we can use for this example is RocksDbDs.


In [ ]:
store = pygw.stores.RocksDbDs(gw_namespace="pygw.boston_schools.example", dir="./schools")

An aside: help()

Much of pygw is well documented, and Python's built-in help function can be useful for figuring out what a pygw instance can do. Let's try it out on our store.


In [ ]:
help(store)

Registering our Data Type to the data store

To store data into our data store, we first have to register a DataTypeAdapter and designate an Index to put our data into.


In [ ]:
# We provide a convenience method to get the type adapter straight from the SimpleFeatureType!
pub_school_adapter = pub_school_dt.get_type_adapter()

In [ ]:
# We want to index by coordinates so we want a spatial index
index = pygw.indices.SpatialIndex()

In [ ]:
# Add our type to our data store
store.add_type(pub_school_adapter, index)

In [ ]:
# Check that we've successfully registered an index and type
store.get_types()

In [ ]:
store.get_indices()

Writing data to our store


In [ ]:
# Create a writer for our data
writer = store.create_writer(pub_school_dt.get_name())

In [ ]:
# Writing data to the data store
for ft in features:
    writer.write(ft)

In [ ]:
writer.close()
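
If a write raises an exception, the separate writer.close() cell above is never reached. A common Python pattern for this is contextlib.closing, which calls close() on exit even when an error occurs. The sketch below demonstrates the pattern with a hypothetical StubWriter class rather than a real pygw writer. It should carry over to any object that has a close() method, but that has not been verified against pygw here.

```python
from contextlib import closing


class StubWriter:
    """Hypothetical stand-in for a writer; it only records what it is given."""

    def __init__(self):
        self.written = []
        self.closed = False

    def write(self, item):
        if self.closed:
            raise ValueError("write to a closed writer")
        self.written.append(item)

    def close(self):
        self.closed = True


stub = StubWriter()

# closing() calls stub.close() when the block exits, with or without an error.
with closing(stub) as w:
    for item in ["feature-1", "feature-2", "feature-3"]:
        w.write(item)

print(stub.closed, len(stub.written))
```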

Querying our store to make sure the data was ingested properly


In [ ]:
from pygw.query import Query

# `Query.everything` is a convenience method for creating an 'Everything' query
results = store.query(Query.everything())

In [ ]:
# The results returned above are an iterator, so let's convert them to a list
results = [r for r in results]

In [ ]:
# Do we have anything?
len(results)
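
The conversion to a list matters because a Python iterator can generally be walked only once, and len() does not work on it at all. The sketch below illustrates this with a plain generator. The fake_query function is a hypothetical stand-in, and whether pygw query results behave exactly like this is an assumption.

```python
def fake_query():
    """Hypothetical stand-in for a query result: a one-shot iterator."""
    yield from ("school-1", "school-2", "school-3")


results_iter = fake_query()

# The first pass consumes the iterator.
first_pass = list(results_iter)

# The second pass finds nothing left to read.
second_pass = list(results_iter)

print(len(first_pass), len(second_pass))
```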

Unfortunately, pygw does not yet wrap query results in Python-friendly objects. However, we can use the pygw.debug.print_obj method to see what they look like:


In [ ]:
from pygw.debug import print_obj

In [ ]:
print_obj(results[0])

Something more interesting...

Woo-hoo! We've successfully ingested our custom data into our data store. That's cool, but now what? ... Can pygw do more?

Let's say we wanted to retrieve only the public school buildings in East Boston. How would we go about doing that? For the purposes of this example, let's just say we want schools to the east of Franklin Park Zoo, which has coordinates 42.3055° N, 71.0900° W, i.e. (-71.0900, 42.3055) in (X, Y) order.


In [ ]:
# A CQL query for things east of the zoo.
# BBOX arguments are (attribute, min X, min Y, max X, max Y); latitude (Y) only spans -90 to 90.
cql_query_string = "BBOX(coordinates,-71.0900,-90,180,90)"

In [ ]:
# Getting the results iterable
results = store.query(Query.cql(cql_query_string))

In [ ]:
# list of results
results = [r for r in results]

In [ ]:
# Fewer than before!
len(results)

In [ ]:
print_obj(results[0])
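
To see what the BBOX filter selects, it can help to write the same test in plain Python. The sketch below assumes the arguments after the attribute name are ordered min X, min Y, max X, max Y, and that X is longitude and Y is latitude. The helper in_bbox and the sample points are hypothetical, and the real query is evaluated by GeoWave, not by this code.

```python
def in_bbox(point, min_x, min_y, max_x, max_y):
    """Return True if point (x, y) lies inside the box, edges included."""
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


ZOO_LON = -71.0900  # longitude of Franklin Park Zoo, from the text above

# Made-up points for illustration only.
points = {
    "east_of_zoo": (-71.0300, 42.3700),
    "west_of_zoo": (-71.1500, 42.3500),
}

# Everything from the zoo's longitude eastward, with latitude left unconstrained.
east = [
    name
    for name, pt in points.items()
    if in_bbox(pt, ZOO_LON, -90.0, 180.0, 90.0)
]
print(east)
```

Setting the western edge to the zoo's longitude and opening the other three edges to their limits is what turns the box into an "east of the zoo" filter.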

Let's say we still want buildings to the east of the zoo, but only those whose address ends in "Avenue". We can do that!


In [ ]:
# Same bounding box as before (latitude spans -90 to 90), plus a LIKE filter on the address
cql_query_string = "BBOX(coordinates,-71.0900,-90,180,90) and address like '%Avenue'"

In [ ]:
# Getting the results iterable
results = store.query(Query.cql(cql_query_string))
results = [r for r in results]
len(results)

In [ ]:
print_obj(results[0])
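
The pattern '%Avenue' matches only addresses that end in "Avenue", which is easy to overlook. The sketch below emulates LIKE in plain Python to show which strings would pass. It assumes the usual SQL-style wildcards (% for any run of characters, _ for exactly one) and case-sensitive matching. The helpers like_to_regex and like are hypothetical, and the sample addresses are made up.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Made-up addresses for illustration only.
for address in ["2 Example Avenue", "10 Avenue Road", "1 Example Ave"]:
    print(address, like(address, "%Avenue"))
```

Under these assumptions, an address abbreviated as "Ave" or one with "Avenue" in the middle would not match; a pattern such as '%Avenue%' would be needed for the latter.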

In [ ]:
# DELETE EVERYTHING
store.delete_all()
