Install pandas, sqlalchemy and pg8000 using pip or any other package manager. The latter is the driver used to connect to the db.
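For example, with pip:
pip install pandas sqlalchemy pg8000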
In [1]:
import pandas as pd
from sqlalchemy import create_engine
After importing the required packages, first create the engine to connect to the DB. The approach I generally use is to build the connection string from the username and password. The code is wrapped in a function, so you just need to fill in the username, password and db name.
This lets you create different engines to connect to several dbs.
In [2]:
def connection(user, passwd, dbname, echo_i=False):
    # build the connection string for pg8000 and return an AUTOCOMMIT engine
    str1 = ('postgresql+pg8000://' + user + ':' + passwd + '@switch-db2.erg.berkeley.edu:5432/'
            + dbname + '?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory')
    engine = create_engine(str1, echo=echo_i, isolation_level='AUTOCOMMIT')
    return engine
In [3]:
user = 'jdlara'
passw = 'Amadeus-2010'
dbname = 'apl_cec'
engine_db = connection(user, passw, dbname)
Afterwards, use pandas to import the data from Excel files or any other text file format. Make sure that the data is in good shape before trying to push it into the server. In this example I use previous knowledge of the structure of the tabs in the Excel file to upload each tab in turn and match the name of the table with the tab name.
If you are using CSV files, just change the command to pd.read_csv(); the pandas documentation covers the details.
Before doing this I had already checked that the data is properly organized; create new cells to explore the data beforehand if needed.
excel_file = 'substations_table.xlsx'
tab_name = 'sheet1'
schema_for_upload = 'geographic_data'
# read one tab from the Excel file and push it to the server
pd_data = pd.read_excel(excel_file, sheet_name=tab_name)
pd_data.to_sql(tab_name, engine_db, schema=schema_for_upload, if_exists='replace', chunksize=100)
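Since the idea is to upload every tab and match each table name to its tab name, a minimal sketch of that loop (assuming all the tabs are clean and their names are valid table names) could look like this:
# Iterate over every tab in the workbook and upload each one,
# naming the table after its tab.
xls = pd.ExcelFile(excel_file)
for tab in xls.sheet_names:
    pd_data = xls.parse(tab)
    pd_data.to_sql(tab, engine_db, schema=schema_for_upload,
                   if_exists='replace', chunksize=100)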
In [ ]:
#excel_file = 'substations_table.xlsx'
#tab_name = 'sheet1'
csv_name = ['LEMMA_ADS_AllSpp_2016_Turbo_01252016.csv']
schema_for_upload = 'lemma2016'
for name in csv_name:
    pd_data = pd.read_csv(name, encoding='UTF-8')
    # strip the .csv extension so the table is not named 'file.csv'
    pd_data.to_sql(name.replace('.csv', ''), engine_db, schema=schema_for_upload,
                   if_exists='replace', chunksize=1000)
Once the data is uploaded, it is possible to run the SQL commands to properly create geom columns in the tables, which can be done as follows. The objective is to run an SQL query like this:
set search_path = SCHEMA, public;
alter table vTABLE drop column if exists geom;
SELECT AddGeometryColumn ('SCHEMA','vTABLE','geom',4326,'POINT',2);
UPDATE vTABLE set geom = ST_SetSRID(st_makepoint(vTABLE.lon, vTABLE.lat), 4326)::geometry;
where SCHEMA and vTABLE are the variable portions. Also note that this query assumes that your columns with latitude and longitude are named lat and lon respectively; moreover, it also assumes that the coordinates are in the 4326 projection.
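If the latitude and longitude columns in your data carry different names, a quick rename before uploading keeps the rest of the workflow unchanged. Here is a sketch, where the original column names ('Latitude', 'Longitude') are hypothetical:
# Rename the coordinate columns so the geom query can find lat/lon.
pd_data = pd_data.rename(columns={'Latitude': 'lat', 'Longitude': 'lon'})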
The following function runs the query for you, again assuming that the data is clean.
In [9]:
def create_geom(table, schema, engine, projection=4326):
    k = engine.connect()
    query = ('set search_path = "' + schema + '", public;')
    print(query)
    k.execute(query)
    query = ('alter table ' + table + ' drop column if exists geom;')
    print(query)
    k.execute(query)
    # use the projection argument here rather than a hard-coded 4326
    query = ('SELECT AddGeometryColumn (\'' + schema + '\',\'' + table + '\',\'geom\',' +
             str(projection) + ',\'POINT\',2);')
    print(query)
    k.execute(query)
    query = ('UPDATE ' + table + ' set geom = ST_SetSRID(st_makepoint(' + table + '.lon, ' +
             table + '.lat),' + str(projection) + ')::geometry;')
    print(query)
    k.execute(query)
    k.close()
    engine.dispose()
    return 'geom column added with SRID ' + str(projection)
In [10]:
table = 'substation_table'
schema = 'geographic_data'
create_geom(table,schema,engine_db)
Out[10]:
'geom column added with SRID 4326'
The function created the geom column; the next step is to define a function to create the primary key in the db. Remember that the index from the data frame is included as a column in the db; sometimes that index is not really needed and might need to be dropped.
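If the extra index column is not wanted at all, one alternative (a sketch, not part of the original workflow) is to keep pandas from writing it in the first place:
# index=False tells pandas not to write the DataFrame index as a column.
pd_data.to_sql('substation_table', engine_db, schema='geographic_data',
               if_exists='replace', chunksize=1000, index=False)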
In [ ]:
def create_pk(table, schema, column, engine):
    k = engine.connect()
    query = ('set search_path = "' + schema + '", public;')
    print(query)
    k.execute(query)
    query = ('alter table ' + table + ' ADD CONSTRAINT ' + table + '_pk PRIMARY KEY (' + column + ');')
    print(query)
    k.execute(query)
    k.close()
    engine.dispose()
    return 'Primary key created with column ' + column
In [ ]:
col = ''  # fill in the name of the column to use as the primary key
create_pk(table, schema, col, engine_db)
The reason why we use PostGIS is to improve geospatial queries and provide a better data structure for geospatial operations. Many of the ST_ functions have improved performance when a geospatial index is created. The process implemented here comes from this workshop; it re-creates the steps with Python functions so that they can be easily replicated for many tables.
The query to create a geospatial index is as follows:
set search_path = SCHEMA, public;
CREATE INDEX vTABLE_gix ON vTABLE USING GIST (geom);
This assumes that the column with the geometry is named geom. If the process follows from the previous code, this will be the case.
The following step is to run VACUUM; creating an index is not enough to allow PostgreSQL to use it effectively. VACUUMing must be performed whenever a new index is created or after a large number of UPDATEs, INSERTs or DELETEs are issued against a table.
VACUUM ANALYZE vTABLE;
The final step is CLUSTERING; this process re-orders the table according to the geospatial index we created. It ensures that records with similar attributes have a high likelihood of being found in the same page, reducing the number of pages that must be read into memory for some types of queries. When a query for nearest neighbors, or for records within a certain area, is needed, geometries that are near each other in space are near each other on disk. The query to perform this clustering is as follows:
CLUSTER vTABLE USING vTABLE_gix;
ANALYZE vTABLE;
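As an illustration of the kind of query that benefits from this index (not part of the original workflow; the table name and coordinates are placeholders), a nearest-neighbor search can use the <-> distance operator:
# Hypothetical example: the 10 records nearest to a point, which the
# GIST index and the CLUSTER step above make cheap to answer.
k = engine_db.connect()
query = ('SELECT * FROM geographic_data.substation_table '
         'ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-122.27, 37.87), 4326) '
         'LIMIT 10;')
result = k.execute(query).fetchall()
k.close()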
In [11]:
def create_gidx(table, schema, engine, column='geom'):
    k = engine.connect()
    query = ('set search_path = "' + schema + '", public;')
    k.execute(query)
    print(query)
    query = ('CREATE INDEX ' + table + '_gix ON ' + table + ' USING GIST (' + column + ');')
    k.execute(query)
    print(query)
    # VACUUM cannot run inside a transaction; the AUTOCOMMIT isolation
    # level set on the engine is what makes this call work.
    query = ('VACUUM ANALYZE ' + table + ';')
    k.execute(query)
    print(query)
    query = ('CLUSTER ' + table + ' USING ' + table + '_gix;')
    k.execute(query)
    print(query)
    query = ('ANALYZE ' + table + ';')
    k.execute(query)
    print(query)
    k.close()
    engine.dispose()
    return 'geospatial index created and table clustered'
In [12]:
create_gidx(table,schema,engine_db)