PyGw Showcase

This notebook demonstrates some of the utility provided by the pygw Python package.

In this guide, we will show how you can use pygw to easily:

  • Define a data schema for Geotools SimpleFeature/Vector data (aka create a new data type)
  • Create instances for the new type
  • Create a RocksDB GeoWave Data Store
  • Register a DataType Adapter & Index to the data store for your new data type
  • Write user-created data into the GeoWave Data Store
  • Query data out of the data store

In [ ]:
%pip install ../../../../python/src/main/python

Loading state capitals test data set

Load state capitals from CSV


In [1]:
import csv

with open("../../../java-api/src/main/resources/stateCapitals.csv", encoding="utf-8-sig") as f:
    reader = csv.reader(f)
    raw_data = [row for row in reader]
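
The `utf-8-sig` codec used above decodes UTF-8 and drops a leading byte order mark (BOM) if one is present. Whether stateCapitals.csv actually begins with a BOM is an assumption; the sketch below uses made-up in-memory bytes (`raw_bytes` is hypothetical) to show the difference between the two codecs.

```python
import csv
import io

# Hypothetical in-memory bytes standing in for a CSV file that starts with a UTF-8 BOM.
raw_bytes = b"\xef\xbb\xbfAlabama,Montgomery\n"

# Plain utf-8 keeps the BOM as a U+FEFF character at the start of the first field.
plain_rows = list(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))))
# utf-8-sig strips the BOM, so the first field is clean.
sig_rows = list(csv.reader(io.StringIO(raw_bytes.decode("utf-8-sig"))))

print(repr(plain_rows[0][0]))
print(repr(sig_rows[0][0]))
```

Without the `-sig` variant, a BOM would end up inside the first state name and carry through to the stored feature attribute.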

In [2]:
# Let's take a look at what the data looks like
raw_data[0]


Out[2]:
['Alabama',
 'Montgomery',
 '-86.2460375',
 '32.343799',
 '1846',
 '155.4',
 '205764',
 'scala']

For the purposes of this exercise, we will use the state name ([0]), capital name ([1]), longitude ([2]), latitude ([3]), and the year that the capital was established ([4]).
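
As a rough sketch of that field selection, the snippet below converts one raw row into typed values using only the standard library. The `CapitalRecord` and `parse_row` names are hypothetical helpers for illustration and are not part of pygw; the sample row is the one shown in the output above.

```python
from datetime import datetime
from typing import NamedTuple


class CapitalRecord(NamedTuple):
    """Hypothetical container for the five fields used in this notebook."""
    state_name: str
    capital_name: str
    longitude: float
    latitude: float
    established: datetime


def parse_row(row):
    """Convert one raw CSV row (a list of strings) into a typed record."""
    return CapitalRecord(
        state_name=row[0],
        capital_name=row[1],
        longitude=float(row[2]),
        latitude=float(row[3]),
        # Only the year is available, so January 1 is used as a placeholder date.
        established=datetime(int(row[4]), 1, 1),
    )


sample_row = ["Alabama", "Montgomery", "-86.2460375", "32.343799",
              "1846", "155.4", "205764", "scala"]
record = parse_row(sample_row)
print(record.capital_name, record.longitude, record.latitude, record.established.year)
```

The remaining columns of each row are simply ignored.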

Creating a new SimpleFeatureType for the state capitals data set

We can define a data schema for our data by using a SimpleFeatureTypeBuilder to build a SimpleFeatureType.

We can use the convenience methods defined in AttributeDescriptor to define each field of the feature type.


In [3]:
from pygw.geotools import SimpleFeatureTypeBuilder
from pygw.geotools import AttributeDescriptor

# Create the feature type builder
type_builder = SimpleFeatureTypeBuilder()
# Set the name of the feature type
type_builder.set_name("StateCapitals")
# Add the attributes
type_builder.add(AttributeDescriptor.point("location"))
type_builder.add(AttributeDescriptor.string("state_name"))
type_builder.add(AttributeDescriptor.string("capital_name"))
type_builder.add(AttributeDescriptor.date("established"))
# Build the feature type
state_capitals_type = type_builder.build_feature_type()

Creating features for each data point using our new SimpleFeatureType

pygw allows you to create SimpleFeature instances for a SimpleFeatureType using a SimpleFeatureBuilder.

The SimpleFeatureBuilder allows us to specify all of the attributes of a feature, and then build it by providing a feature ID. For this exercise, we will use the index of each row in the data as its unique feature ID. We will use shapely to create the geometries for each feature.


In [4]:
from pygw.geotools import SimpleFeatureBuilder
from shapely.geometry import Point
from datetime import datetime

feature_builder = SimpleFeatureBuilder(state_capitals_type)

features = []
for idx, capital in enumerate(raw_data):
    state_name = capital[0]
    capital_name = capital[1]
    longitude = float(capital[2])
    latitude = float(capital[3])
    established = datetime(int(capital[4]), 1, 1)
    
    feature_builder.set_attr("location", Point(longitude, latitude))
    feature_builder.set_attr("state_name", state_name)
    feature_builder.set_attr("capital_name", capital_name)
    feature_builder.set_attr("established", established)
    
    feature = feature_builder.build(str(idx))
    
    features.append(feature)

Creating a data store

Now that we have a set of SimpleFeatures, let's create a data store to write the features into. pygw supports all of the data store types that GeoWave supports. All that is needed is to first construct the appropriate DataStoreOptions variant that defines the parameters of the data store, then to pass those options to a DataStoreFactory to construct the DataStore. In this example we will create a new RocksDB data store.


In [5]:
from pygw.store import DataStoreFactory
from pygw.store.rocksdb import RocksDBOptions

# Specify the options for the data store
options = RocksDBOptions()
options.set_geowave_namespace("geowave.example")
# NOTE: Directory is relative to the JVM working directory.
options.set_directory("./datastore")
# Create the data store
datastore = DataStoreFactory.create_data_store(options)

An aside: help()

Much of pygw is well-documented, and Python's built-in help function can be useful for figuring out what a pygw instance can do. Let's try it out on our data store.


In [6]:
help(datastore)


Help on DataStore in module pygw.store.data_store object:

class DataStore(pygw.base.geowave_object.GeoWaveObject)
 |  DataStore(java_ref)
 |  
 |  This class models the DataStore interface methods.
 |  
 |  Method resolution order:
 |      DataStore
 |      pygw.base.geowave_object.GeoWaveObject
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, java_ref)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  add_index(self, type_name, *indices)
 |      Add new indices for the given type. If there is data in other indices for this type, for
 |      consistency it will need to copy all of the data into the new indices, which could be a long
 |      process for lots of data.
 |      
 |      Args:
 |          type_name (str): Name of data type to register indices to.
 |          *indices (pygw.index.index.Index): Index to add.
 |  
 |  add_type(self, type_adapter, *initial_indices)
 |      Add this type to the data store. This only needs to be called one time per type.
 |      
 |      Args:
 |          type_adapter (pygw.base.data_type_adapter.DataTypeAdapter): The data type adapter to add to the data store.
 |          *initial_indices (pygw.index.index.Index): The initial indices for this type.
 |  
 |  aggregate(self, q)
 |  
 |  aggregate_statistics(self, q)
 |  
 |  copy_to(self, other, q=None)
 |      Copy data from this data store to another.
 |      
 |      All data is copied if `q` is None, else only the data queried by `q`.
 |      
 |      Args:
 |          other (pygw.store.data_store.DataStore): The data store to copy to.
 |          q (pygw.query.query.Query): Query filter for data to be copied.
 |  
 |  create_writer(self, type_adapter_name)
 |      Returns an index writer to perform batched write operations for the given data type name.
 |      
 |      Assumes the type has already been used previously or added using `add_type` and assumes one or
 |      more indices have been provided for this type.
 |      
 |      Args:
 |          type_name (str): The name of the type to write to.
 |      Returns:
 |          A `pygw.base.writer.Writer`, which can be used to write entries into the data store of the given type.
 |  
 |  delete(self, q)
 |      Delete all data in this data store that matches the query parameter.
 |      
 |      Args:
 |          q (pygw.query.query.Query): The query criteria to use for deletion.
 |      Returns:
 |          True on success, False on fail.
 |  
 |  delete_all(self)
 |      Delete ALL data and ALL metadata for this datastore.
 |      
 |      Returns:
 |          True on success, False on fail.
 |  
 |  get_indices(self, type_name=None)
 |      Get the indices that have been registered with this data store for a given type.
 |      
 |      Gets all registered indices if `type_name` is None.
 |      
 |      Args:
 |          type_name (str): The name of the type.
 |      Returns:
 |          List of `pygw.index.index.Index` in the data store.
 |  
 |  get_types(self)
 |      Get all the data type adapters that have been used within this data store.
 |      
 |      Returns:
 |          List of `pygw.base.data_type_adapter.DataTypeAdapter` used in the data store.
 |  
 |  ingest(self, url, *indices, ingest_options=None)
 |      Ingest from URL.
 |      
 |      If this is a directory, this method will recursively search for valid files to
 |      ingest in the directory. This will iterate through registered IngestFormatPlugins to find one
 |      that works for a given file.
 |      
 |      Args:
 |          url (str): The URL for data to read and ingest into this data store.
 |          *indices (pygw.index.index.Index): Index to ingest into.
 |          ingest_options: Options for ingest (Not yet supported).
 |  
 |  query(self, q)
 |      Returns all data in this data store that matches the query parameter. All data that matches the
 |      query will be returned as an instance of the native data type. The Iterator must be closed when
 |      it is no longer needed - this wraps the underlying scanner implementation and closes underlying
 |      resources.
 |      
 |      Args:
 |          q (pygw.query.query.Query): The query to preform.
 |      Returns:
 |          A closeable iterable of results.  The `pygw.base.closeable_iterator.CloseableIterator.close` method should be called
 |          on the iterator when it is done being used.
 |  
 |  query_statistics(self, q)
 |  
 |  remove_index(self, index_name, type_name=None)
 |      Remove an index for a given data type.
 |      
 |      If `type_name` is None, the specified index is removed for all types.
 |      
 |      Args:
 |          index_name (str): Name of the index to be removed.
 |          type_name (str): Name of data type to remove.
 |      Raises:
 |          Exception: If the index was the last index of a type.
 |  
 |  remove_type(self, type_name)
 |      Remove all data and statistics associated with the given type.
 |      
 |      Args:
 |          type_name (str): Name of the data type.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pygw.base.geowave_object.GeoWaveObject:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  is_instance_of(self, java_class)
 |      Returns:
 |          True if this object is of the type represented by the given java class.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from pygw.base.geowave_object.GeoWaveObject:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from pygw.base.geowave_object.GeoWaveObject:
 |  
 |  __hash__ = None

Adding our data to the data store

To store data into our data store, we first have to register a DataTypeAdapter for our simple feature data and create an index that defines how the data is queried. GeoWave supports simple feature data through the use of a FeatureDataAdapter. All that is needed for a FeatureDataAdapter is a SimpleFeatureType. We will also add both spatial and spatial/temporal indices.


In [7]:
from pygw.geotools import FeatureDataAdapter

# Create an adapter for the feature type
state_capitals_adapter = FeatureDataAdapter(state_capitals_type)

In [8]:
from pygw.index import SpatialIndexBuilder
from pygw.index import SpatialTemporalIndexBuilder

# Create a spatial index
spatial_idx = SpatialIndexBuilder().set_name("spatial_idx").create_index()

# Create a spatial/temporal index
spatial_temporal_idx = SpatialTemporalIndexBuilder().set_name("spatial_temporal_idx").create_index()

In [9]:
# Now we can add our type to the data store with both of our indices
datastore.add_type(state_capitals_adapter, spatial_idx, spatial_temporal_idx)

In [10]:
# Check that we've successfully registered an index and type
registered_types = datastore.get_types()

for t in registered_types:
    print(t.get_type_name())


StateCapitals

In [11]:
registered_indices = datastore.get_indices(state_capitals_adapter.get_type_name())

for i in registered_indices:
    print(i.get_name())


spatial_idx
spatial_temporal_idx

Writing data to our store

Now our data store is ready to receive our feature data. To do this, we must create a Writer for our data type.


In [12]:
# Create a writer for our data
writer = datastore.create_writer(state_capitals_adapter.get_type_name())

In [13]:
# Writing data to the data store
for ft in features:
    writer.write(ft)

In [14]:
# Close the writer when we are done with it
writer.close()

Querying our store to make sure the data was ingested properly

pygw supports querying data in the same fashion as the Java API. You can use a VectorQueryBuilder to create queries on simple feature data sets. We will use one now to query all of the state capitals in the data store.


In [15]:
from pygw.query import VectorQueryBuilder

# Create the query builder
query_builder = VectorQueryBuilder()

# When you don't supply any constraints to the query builder, everything will be queried
query = query_builder.build()

# Execute the query
results = datastore.query(query)

The results returned above are a closeable iterator of SimpleFeature objects. Let's define a function that we can use to print out some information about these features and then close the iterator when we are finished with it.


In [16]:
def print_results(results):
    try:
        for result in results:
            capital_name = result.get_attribute("capital_name")
            state_name = result.get_attribute("state_name")
            established = result.get_attribute("established")
            print("{}, {} was established in {}".format(capital_name, state_name, established.year))
    finally:
        # Close the iterator, even if printing a result fails
        results.close()
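
The print_results function above closes the iterator once it is done with it. The same close-when-done pattern can be sketched with `contextlib.closing` from the standard library, which works with any object that exposes a `close()` method. The `FakeResults` class below is a hypothetical stand-in so that the example runs without a data store; it is not part of pygw.

```python
from contextlib import closing


class FakeResults:
    """Hypothetical stand-in for a closeable iterator; not part of pygw."""

    def __init__(self, items):
        self._items = iter(items)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._items)

    def close(self):
        self.closed = True


fake_results = FakeResults(["Montgomery", "Juneau"])
seen = []
# closing() calls fake_results.close() when the block exits, even if the loop raises.
with closing(fake_results) as it:
    for name in it:
        seen.append(name)

print(seen, fake_results.closed)
```

Because the help text for `query` says the returned iterator should be closed with its `close` method, wrapping it in `closing(...)` would presumably work the same way, though that has not been verified here.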

In [17]:
# Print the results
print_results(results)


Honolulu, Hawaii was established in 1845
Phoenix, Arizona was established in 1889
Baton Rouge, Louisiana was established in 1880
Jackson, Mississippi was established in 1821
Austin, Texas was established in 1839
Topeka, Kansas was established in 1856
Oklahoma City, Oklahoma was established in 1910
Little Rock, Arkansas was established in 1821
Jefferson City, Missouri was established in 1826
Des Moines, Iowa was established in 1857
Saint Paul, Minnesota was established in 1849
Lincoln, Nebraska was established in 1867
Pierre, South Dakota was established in 1889
Cheyenne, Wyoming was established in 1869
Denver, Colorado was established in 1867
Santa Fe, New Mexico was established in 1610
Salt Lake City, Utah was established in 1858
Boise, Idaho was established in 1865
Salem, Oregon was established in 1855
Carson City, Nevada was established in 1861
Sacramento, California was established in 1854
Juneau, Alaska was established in 1906
Olympia, Washington was established in 1853
Helena, Montana was established in 1875
Bismarck, North Dakota was established in 1883
Augusta, Maine was established in 1832
Montpelier, Vermont was established in 1805
Boston, Massachusetts was established in 1630
Concord, New Hampshire was established in 1808
Providence, Rhode Island was established in 1900
Hartford, Connecticut was established in 1875
Dover, Delaware was established in 1777
Raleigh, North Carolina was established in 1792
Richmond, Virginia was established in 1780
Annapolis, Maryland was established in 1694
Harrisburg, Pennsylvania was established in 1812
Trenton, New Jersey was established in 1784
Albany, New York was established in 1797
Columbus, Ohio was established in 1816
Lansing, Michigan was established in 1847
Madison, Wisconsin was established in 1838
Springfield, Illinois was established in 1837
Indianapolis, Indiana was established in 1825
Frankfort, Kentucky was established in 1792
Nashville, Tennessee was established in 1826
Atlanta, Georgia was established in 1868
Charleston, West Virginia was established in 1885
Columbia, South Carolina was established in 1786
Tallahassee, Florida was established in 1824
Montgomery, Alabama was established in 1846

Constraining the results

Querying all of the data can be useful occasionally, but most of the time we will want to filter the data to only return results that we are interested in. pygw supports several types of constraints to make querying data as flexible as possible.

CQL Constraints

One way you might want to query the data is using a simple CQL query.


In [18]:
# A CQL expression for capitals that are in the northeastern part of the US
cql_expression = "BBOX(location, -87.83,36.64,-66.74,48.44)"

In [19]:
# Create the query builder
query_builder = VectorQueryBuilder()
query_builder.add_type_name(state_capitals_adapter.get_type_name())

# If we want, we can tell the query builder to use the spatial index, since we aren't using time
query_builder.index_name(spatial_idx.get_name())

# Get the constraints factory
constraints_factory = query_builder.constraints_factory()
# Create the cql constraints
constraints = constraints_factory.cql_constraints(cql_expression)

# Set the constraints and build the query
query = query_builder.constraints(constraints).build()
# Execute the query
results = datastore.query(query)

In [20]:
# Display the results
print_results(results)


Augusta, Maine was established in 1832
Montpelier, Vermont was established in 1805
Boston, Massachusetts was established in 1630
Concord, New Hampshire was established in 1808
Providence, Rhode Island was established in 1900
Hartford, Connecticut was established in 1875
Dover, Delaware was established in 1777
Richmond, Virginia was established in 1780
Annapolis, Maryland was established in 1694
Harrisburg, Pennsylvania was established in 1812
Trenton, New Jersey was established in 1784
Albany, New York was established in 1797
Columbus, Ohio was established in 1816
Lansing, Michigan was established in 1847
Indianapolis, Indiana was established in 1825
Frankfort, Kentucky was established in 1792
Charleston, West Virginia was established in 1885
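
To make the BBOX expression concrete: its arguments after the attribute name are the minimum longitude, minimum latitude, maximum longitude, and maximum latitude of the box. The plain-Python sketch below applies the same test to a few coordinates from the data set. The `in_bbox` helper is hypothetical and only approximates what the data store does; it ignores indexing entirely.

```python
def in_bbox(lon, lat, min_lon, min_lat, max_lon, max_lat):
    """Return True if the point lies inside (or on the edge of) the box."""
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Same box as the CQL expression: BBOX(location, -87.83, 36.64, -66.74, 48.44)
box = (-87.83, 36.64, -66.74, 48.44)

# (longitude, latitude) values taken from the data set
capitals = {
    "Boston": (-71.0571571, 42.3133735),
    "Raleigh": (-78.6450559, 35.843768),
    "Nashville": (-86.7852455, 36.1866405),
}

for name, (lon, lat) in capitals.items():
    print(name, in_bbox(lon, lat, *box))
```

Boston falls inside the box, while Raleigh and Nashville sit just south of the 36.64 latitude line, which is consistent with their absence from the output above.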

Spatial/Temporal Constraints

You may also want to constrain the data by both space and time using the SpatialTemporalConstraintsBuilder. For this example, we will query all capitals that were established after 1800 and lie within 10 degrees of Washington DC.


In [21]:
# Create the query builder
query_builder = VectorQueryBuilder()
query_builder.add_type_name(state_capitals_adapter.get_type_name())

# We can tell the builder to use the spatial/temporal index
query_builder.index_name(spatial_temporal_idx.get_name())

# Get the constraints factory
constraints_factory = query_builder.constraints_factory()
# Create the spatial/temporal constraints builder
constraints_builder = constraints_factory.spatial_temporal_constraints()
# Create the spatial constraint geometry.
washington_dc_buffer = Point(-77.035, 38.894).buffer(10.0)
# Set the spatial constraint
constraints_builder.spatial_constraints(washington_dc_buffer)
# Set the temporal constraint
constraints_builder.add_time_range(datetime(1800,1,1), datetime.now())
# Build the constraints
constraints = constraints_builder.build()

# Set the constraints and build the query
query = query_builder.constraints(constraints).build()
# Execute the query
results = datastore.query(query)

In [22]:
# Display the results
print_results(results)


Harrisburg, Pennsylvania was established in 1812
Columbus, Ohio was established in 1816
Indianapolis, Indiana was established in 1825
Montpelier, Vermont was established in 1805
Concord, New Hampshire was established in 1808
Providence, Rhode Island was established in 1900
Hartford, Connecticut was established in 1875
Charleston, West Virginia was established in 1885
Atlanta, Georgia was established in 1868
Augusta, Maine was established in 1832
Lansing, Michigan was established in 1847
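
The query above combines a 10-degree buffer around Washington DC with a time range that starts in 1800. A rough plain-Python equivalent is sketched below, assuming a planar distance measured in degrees. The `matches` helper is hypothetical, and because shapely approximates the buffer circle with a polygon, points very close to the boundary could be classified differently by the real query.

```python
import math
from datetime import datetime

WASHINGTON_DC = (-77.035, 38.894)  # (longitude, latitude), as in the query above
RADIUS_DEGREES = 10.0
START = datetime(1800, 1, 1)


def matches(lon, lat, established, now=None):
    """Plain-Python approximation of the spatial and temporal constraints."""
    end = now or datetime.now()
    # Planar distance in degrees, which is what buffering a lon/lat point measures.
    distance = math.hypot(lon - WASHINGTON_DC[0], lat - WASHINGTON_DC[1])
    return distance <= RADIUS_DEGREES and START <= established <= end


# (longitude, latitude, established) values taken from the data set
harrisburg = matches(-76.8804255, 40.2821445, datetime(1812, 1, 1))
richmond = matches(-77.49326139999999, 37.524661, datetime(1780, 1, 1))
honolulu = matches(-157.7989705, 21.3280681, datetime(1845, 1, 1))
print(harrisburg, richmond, honolulu)
```

Harrisburg satisfies both constraints, Richmond is close enough but was established before 1800, and Honolulu is in the time range but far outside the buffer.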

Filter Factory Constraints

We can also use the FilterFactory to create more complicated filters. For example, we can find all of the capitals within 500 miles of Washington DC that contain the letter L and were established after 1830.


In [23]:
from pygw.query import FilterFactory

# Create the filter factory
filter_factory = FilterFactory()

# Create a filter that passes when the capital location is within 500 miles of the
# literal location of Washington DC
location_prop = filter_factory.property("location")
washington_dc_lit = filter_factory.literal(Point(-77.035, 38.894))
distance_km = 500 * 1.609344 # Convert miles to kilometers
distance_filter = filter_factory.dwithin(location_prop, washington_dc_lit, distance_km, "kilometers")

# Create a filter that passes when the capital name contains the letter L.
capital_name_prop = filter_factory.property("capital_name")
name_filter = filter_factory.like(capital_name_prop, "*l*")

# Create a filter that passes when the established date is after 1830
established_prop = filter_factory.property("established")
date_lit = filter_factory.literal(datetime(1830, 1, 1))
date_filter = filter_factory.after(established_prop, date_lit)

# Combine the name, distance, and date filters
combined_filter = filter_factory.and_([distance_filter, name_filter, date_filter])

# Create the query builder
query_builder = VectorQueryBuilder()
query_builder.add_type_name(state_capitals_adapter.get_type_name())

# Get the constraints factory
constraints_factory = query_builder.constraints_factory()
# Create the filter constraints
constraints = constraints_factory.filter_constraints(combined_filter)

# Set the constraints and build the query
query = query_builder.constraints(constraints).build()
# Execute the query
results = datastore.query(query)

In [24]:
# Display the results
print_results(results)


Lansing, Michigan was established in 1847
Atlanta, Georgia was established in 1868
Charleston, West Virginia was established in 1885
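
The way the individual filters are combined can be sketched in plain Python as a logical AND of simple predicates. The helper names below are hypothetical, the distance filter is left out for brevity, and the name match is treated as case-insensitive only because the output above includes Lansing, whose sole "L" is uppercase.

```python
from datetime import datetime
from fnmatch import fnmatchcase


def name_matches(capital_name, pattern="*l*"):
    """Wildcard match on the lowercased name (case-insensitivity is an assumption)."""
    return fnmatchcase(capital_name.lower(), pattern)


def established_after(established, cutoff=datetime(1830, 1, 1)):
    """True when the date is strictly later than the cutoff."""
    return established > cutoff


def passes_all(capital_name, established):
    """Logical AND of the individual predicates, mirroring the role of and_."""
    return all([name_matches(capital_name), established_after(established)])


print(passes_all("Lansing", datetime(1847, 1, 1)))
print(passes_all("Columbus", datetime(1816, 1, 1)))
print(passes_all("Boston", datetime(1630, 1, 1)))
```

Lansing passes both predicates, Columbus contains an "l" but predates 1830, and Boston fails both.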

Using Pandas with GeoWave query results

It's fairly easy to load vector features from GeoWave queries into a Pandas DataFrame. To do this, make sure pandas is installed.


In [ ]:
%pip install pandas

Next we will import pandas and issue a query to the datastore to load into a dataframe.


In [25]:
from pandas import DataFrame

# Query everything
query = VectorQueryBuilder().build()
results = datastore.query(query)

# Load the results into a pandas dataframe
dataframe = DataFrame.from_records([feature.to_dict() for feature in results])

# Close the iterator now that all of the features have been read
results.close()

# Display the dataframe
dataframe


Out[25]:
id location state_name capital_name established
0 10 POINT (-157.7989705 21.3280681) Hawaii Honolulu 1845-01-01 00:00:00
1 2 POINT (-112.125051 33.6054149) Arizona Phoenix 1889-01-01 00:00:00
2 17 POINT (-91.11141859999999 30.441474) Louisiana Baton Rouge 1880-01-01 00:00:00
3 23 POINT (-90.1888874 32.3103284) Mississippi Jackson 1821-01-01 00:00:00
4 42 POINT (-97.7534014 30.3077609) Texas Austin 1839-01-01 00:00:00
5 15 POINT (-95.70803100000001 39.0130545) Kansas Topeka 1856-01-01 00:00:00
6 35 POINT (-97.4791974 35.4826479) Oklahoma Oklahoma City 1910-01-01 00:00:00
7 3 POINT (-92.33792750000001 34.7240049) Arkansas Little Rock 1821-01-01 00:00:00
8 24 POINT (-92.1624049 38.5711659) Missouri Jefferson City 1826-01-01 00:00:00
9 14 POINT (-93.606516 41.5666699) Iowa Des Moines 1857-01-01 00:00:00
10 22 POINT (-93.10605339999999 44.9397075) Minnesota Saint Paul 1849-01-01 00:00:00
11 26 POINT (-96.6907283 40.800609) Nebraska Lincoln 1867-01-01 00:00:00
12 40 POINT (-100.3205385 44.3708241) South Dakota Pierre 1889-01-01 00:00:00
13 49 POINT (-104.7674045 41.1475325) Wyoming Cheyenne 1869-01-01 00:00:00
14 5 POINT (-104.8551114 39.7643389) Colorado Denver 1867-01-01 00:00:00
15 30 POINT (-105.983036 35.6824934) New Mexico Santa Fe 1610-01-01 00:00:00
16 43 POINT (-111.920485 40.7766079) Utah Salt Lake City 1858-01-01 00:00:00
17 11 POINT (-116.2338979 43.6008061) Idaho Boise 1865-01-01 00:00:00
18 36 POINT (-123.0282074 44.9329915) Oregon Salem 1855-01-01 00:00:00
19 27 POINT (-119.7526546 39.1678334) Nevada Carson City 1861-01-01 00:00:00
20 4 POINT (-121.4429125 38.5615405) California Sacramento 1854-01-01 00:00:00
21 1 POINT (-134.1765792 58.3844634) Alaska Juneau 1906-01-01 00:00:00
22 46 POINT (-122.8938687 47.0393335) Washington Olympia 1853-01-01 00:00:00
23 25 POINT (-112.0156939 46.5933579) Montana Helena 1875-01-01 00:00:00
24 33 POINT (-100.7670546 46.809076) North Dakota Bismarck 1883-01-01 00:00:00
25 18 POINT (-69.730692 44.3334319) Maine Augusta 1832-01-01 00:00:00
26 44 POINT (-72.5687199 44.2739708) Vermont Montpelier 1805-01-01 00:00:00
27 20 POINT (-71.0571571 42.3133735) Massachusetts Boston 1630-01-01 00:00:00
28 28 POINT (-71.5626055 43.2308015) New Hampshire Concord 1808-01-01 00:00:00
29 38 POINT (-71.42118050000001 41.8169925) Rhode Island Providence 1900-01-01 00:00:00
30 6 POINT (-72.680087 41.7656874) Connecticut Hartford 1875-01-01 00:00:00
31 7 POINT (-75.5134199 39.1564159) Delaware Dover 1777-01-01 00:00:00
32 32 POINT (-78.6450559 35.843768) North Carolina Raleigh 1792-01-01 00:00:00
33 45 POINT (-77.49326139999999 37.524661) Virginia Richmond 1780-01-01 00:00:00
34 19 POINT (-76.5046945 38.9724689) Maryland Annapolis 1694-01-01 00:00:00
35 37 POINT (-76.8804255 40.2821445) Pennsylvania Harrisburg 1812-01-01 00:00:00
36 29 POINT (-74.7741221 40.2162772) New Jersey Trenton 1784-01-01 00:00:00
37 31 POINT (-73.8113997 42.6681399) New York Albany 1797-01-01 00:00:00
38 34 POINT (-82.99082900000001 39.9829515) Ohio Columbus 1816-01-01 00:00:00
39 21 POINT (-84.559032 42.7086815) Michigan Lansing 1847-01-01 00:00:00
40 48 POINT (-89.4064204 43.0849935) Wisconsin Madison 1838-01-01 00:00:00
41 12 POINT (-89.6708313 39.7638375) Illinois Springfield 1837-01-01 00:00:00
42 13 POINT (-86.13275 39.7797845) Indiana Indianapolis 1825-01-01 00:00:00
43 16 POINT (-84.8666254 38.1944455) Kentucky Frankfort 1792-01-01 00:00:00
44 41 POINT (-86.7852455 36.1866405) Tennessee Nashville 1826-01-01 00:00:00
45 9 POINT (-84.420604 33.7677129) Georgia Atlanta 1868-01-01 00:00:00
46 47 POINT (-81.6405384 38.3560436) West Virginia Charleston 1885-01-01 00:00:00
47 39 POINT (-80.9375649 34.0375089) South Carolina Columbia 1786-01-01 00:00:00
48 8 POINT (-84.25685590000001 30.4671395) Florida Tallahassee 1824-01-01 00:00:00
49 0 POINT (-86.2460375 32.343799) Alabama Montgomery 1846-01-01 00:00:00
