In [1]:
#by convention, we use these shorter two-letter names
import pysal as ps
import pandas as pd
import numpy as np
PySAL has two simple ways to read in data. First, though, you need the path from where your notebook is running to the location of the data on your computer. For example, to find where the notebook is running:
In [2]:
!pwd
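The same information is available from plain Python, which also works outside the notebook:
import os
print(os.getcwd())   # current working directory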
PySAL has a command that it uses to get the paths of its example datasets. Let's work with a commonly-used dataset first.
In [3]:
dbf_path = ps.examples.get_path('NAT.dbf')
print(dbf_path)
For the purposes of this part of the workshop, we'll use the NAT.dbf example data and the usjoin.csv data.
In [4]:
csv_path = ps.examples.get_path('usjoin.csv')
To read in a shapefile, we will need the path to the file.
In [5]:
shp_path = ps.examples.get_path('NAT.shp')
print(shp_path)
Then, we open the file using the ps.open command:
In [6]:
f = ps.open(shp_path)
f is what we call a "file handle." That means that it only points to the data and provides ways to work with it. By itself, it does not read the whole dataset into memory. To see basic information about the file, we can use a few different methods.
For instance, the header of the file, which contains most of the metadata about the file:
In [7]:
f.header
Out[7]:
To actually read in the shapes from memory, you can use the following commands:
In [8]:
f.by_row(14) #gets the shape at index 14 from the file
Out[8]:
In [9]:
all_polygons = f.read() #reads all of the polygons from the file into memory
In [10]:
len(all_polygons)
Out[10]:
So, all 3085 polygons have been read in from the file. These are stored in PySAL shape objects, which can be used by PySAL and converted to other Python shape objects.
They typically have a few useful attributes and methods. Since we've read in polygonal data, we can get some properties of the polygons. Let's have a look at the first polygon:
In [11]:
all_polygons[0].centroid #the centroid of the first polygon
Out[11]:
In [12]:
all_polygons[0].area
Out[12]:
In [13]:
all_polygons[0].perimeter
Out[13]:
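Beyond the centroid, area, and perimeter, the shape objects carry a few other attributes. A quick sketch, assuming they expose vertices and a bounding_box the way PySAL polygons typically do:
first = all_polygons[0]
first.vertices[:3]      # first three (x, y) vertex pairs (assumed attribute)
first.bounding_box      # rectangle enclosing the polygon (assumed attribute)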
While in the Jupyter Notebook, you can examine what properties an object has by using the tab key.
In [14]:
polygon = all_polygons[0]
In [15]:
polygon. #press tab when the cursor is right after the dot
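Outside the notebook, or wherever tab completion is unavailable, Python's built-in dir() lists an object's attributes; a quick sketch:
[name for name in dir(polygon) if not name.startswith('_')]   # public attributes and methods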
When you're working with tables of data, like a csv or dbf, you can extract your data in the following way. Let's open the dbf file we got the path for above.
In [16]:
f = ps.open(dbf_path)
Just like with the shapefile, we can examine the header of the dbf file:
In [17]:
f.header
Out[17]:
So, the header is a list containing the names of all of the fields we can read. Suppose we were interested in the ['NAME', 'STATE_NAME', 'HR90', 'HR80'] fields.
If we just wanted to grab the data of interest, HR90, we can use either by_col or by_col_array, depending on the format we want the resulting data in:
In [18]:
HR90 = f.by_col('HR90')
print(type(HR90).__name__, HR90[0:5])
HR90 = f.by_col_array('HR90')
print(type(HR90).__name__, HR90[0:5])
As you can see, the by_col function returns a list of data, with no shape. It can only return one column at a time:
In [19]:
HRs = f.by_col('HR90', 'HR80')
This error message is called a "traceback," as you see in the top right, and it usually provides feedback on why the previous command did not execute correctly. Here, you see that one too many arguments were provided to __call__, which tells us we cannot pass as many arguments as we did to by_col.
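If you do want several columns as plain lists, one workaround is simply to call by_col once per column; a minimal sketch using the handle opened above:
hr90 = f.by_col('HR90')   # one list per call
hr80 = f.by_col('HR80')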
If you want to read in many columns at once and store them to an array, use by_col_array:
In [20]:
HRs = f.by_col_array('HR90', 'HR80')
In [21]:
HRs
Out[21]:
It is best to use by_col_array on data of a single type. That is, if you read in a lot of columns, some of them numbers and some of them strings, all columns will get converted to the same datatype:
In [22]:
allcolumns = f.by_col_array(['NAME', 'STATE_NAME', 'HR90', 'HR80'])
In [23]:
allcolumns
Out[23]:
Note that the numerical columns, HR90 and HR80, are now considered strings, since they show up with single tickmarks around them, like '0.0'.
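If you do need the numbers back, one option (a sketch assuming allcolumns is a NumPy array of strings with its columns in the order requested above) is to slice out the numeric columns and cast them:
hr_strings = allcolumns[:, 2:]        # the HR90 and HR80 columns, still strings
hr_floats = hr_strings.astype(float)  # cast back to numbers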
These methods work similarly for .csv files as well.
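For example, the usjoin.csv file can be opened with the same ps.open call. A brief sketch, assuming the csv handler exposes header and by_col the same way the dbf handler does:
csv_file = ps.open(csv_path)
print(csv_file.header)                           # column names (assumes the csv handler has .header)
first_col = csv_file.by_col(csv_file.header[0])  # read one column by name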
Functionality recently added to PySAL allows you to work with shapefile/dbf pairs using pandas. This optional extension is only enabled if you have pandas installed. The extension is the ps.pdio module:
In [24]:
ps.pdio
Out[24]:
To use it, you can read in shapefile/dbf pairs using the ps.pdio.read_files command.
In [25]:
data_table = ps.pdio.read_files(shp_path)
This reads in the entire database table and adds a column to the end, called geometry, that stores the geometries read in from the shapefile.
Now, you can work with it like a standard pandas dataframe.
In [26]:
data_table.head()
Out[26]:
The read_files function only works on shapefile/dbf pairs. If you need to read in data using CSVs, use pandas directly:
In [27]:
usjoin = pd.read_csv(csv_path)
#usjoin = ps.pdio.read_files(usjoin) #will not work, not a shp/dbf pair
The nice thing about working with pandas dataframes is that they have very powerful baked-in support for relational-style queries. By this, I mean that it is very easy to find things like:
The number of counties in each state:
In [28]:
data_table.groupby("STATE_NAME").size()
Out[28]:
Or, to get the rows of the table that are in Arizona, we can use the query function of the dataframe:
In [29]:
data_table.query('STATE_NAME == "Arizona"')
Out[29]:
Behind the scenes, this uses a fast vectorized library, numexpr, to essentially do the following. First, compare each row's STATE_NAME column to 'Arizona' and return True if the row matches:
In [30]:
data_table.STATE_NAME == 'Arizona'
Out[30]:
Then, use that mask to keep only the rows where the condition is true:
In [31]:
data_table[data_table.STATE_NAME == 'Arizona']
Out[31]:
We might need this behind-the-scenes knowledge when we want to chain together conditions, or when we need to do spatial queries.
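For example, chaining two conditions is just a matter of combining boolean masks with &, with parentheses around each comparison. Here HR90 is one of the homicide-rate columns already in the table, and the cutoff of 5 is arbitrary:
# Counties in Arizona with a 1990 homicide rate above 5: combine two masks
data_table[(data_table.STATE_NAME == 'Arizona') & (data_table.HR90 > 5)]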
This is because spatial queries are somewhat more complex. Let's say, for example, we want all of the counties in the US west of -121 longitude. We need a way to express that question. Ideally, we want something like:
SELECT *
FROM data_table
WHERE x_centroid < -121
So, let's refer to an arbitrary polygon in the dataframe's geometry column as poly. The centroid of a PySAL polygon is stored as an (X,Y) pair, so the longitude is the first element of the pair, poly.centroid[0].
Then, applying this condition to each geometry, we get the same kind of filter we used above to grab only counties in Arizona:
In [32]:
data_table.geometry.apply(lambda poly: poly.centroid[0] < -121)
Out[32]:
If we use this as a filter on the table, we can get only the rows that match that condition, just like we did for the STATE_NAME query:
In [33]:
data_table[data_table.geometry.apply(lambda x: x.centroid[0] < -121)]
Out[33]:
The same pattern works for other spatial queries.
For instance, if we wanted to find all of the counties within a threshold distance of an observation's centroid, we could do it in the following way.
First, specify the observation. Here, we'll use Cook County, IL:
In [34]:
data_table.query('(NAME == "Cook") & (STATE_NAME == "Illinois")')
Out[34]:
In [35]:
geom = data_table.query('(NAME == "Cook") & (STATE_NAME == "Illinois")').geometry
In [36]:
geom.values[0].centroid
Out[36]:
In [37]:
cook_county_centroid = geom.values[0].centroid
In [44]:
import scipy.spatial.distance as d
def near_target_point(polygon, target=cook_county_centroid, threshold=1):
    # Euclidean distance between centroids, in the dataset's coordinate units
    return d.euclidean(polygon.centroid, target) < threshold
In [45]:
data_table[data_table.geometry.apply(near_target_point)]
Out[45]:
Most things in PySAL are explicit about what type of input they expect. Most of the time, PySAL functions require either lists or arrays. This is why the file-handler methods are the default IO method in PySAL: the rest of the computational tools are built around their datatypes.
However, it is very easy to get the correct datatype from pandas using the values and tolist commands. tolist() will convert its entries to a list, but it can only be called on individual columns (called Series in the pandas documentation).
So, to turn the NAME column into a list:
In [46]:
data_table.NAME.tolist()
Out[46]:
To extract many columns, you must select the columns you want and call their .values attribute.
If we were interested in grabbing all of the HR variables in the dataframe, we could first select those column names:
In [47]:
HRs = [col for col in data_table.columns if col.startswith('HR')]
We can use this to focus only on the columns we want:
In [48]:
data_table[HRs]
Out[48]:
With this, calling .values gives an array containing all of the entries in this subset of the table:
In [49]:
data_table[HRs].values
Out[49]:
Using the PySAL pdio tools means that if you're comfortable with working in Pandas, you can continue to do so.
If you're more comfortable using Numpy or raw Python to do your data processing, PySAL's IO tools naturally support this.
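As a closing sketch (assuming the objects created earlier in this notebook are still in scope), here are the two routes to the HR90 values side by side; note that the file-handler route typically returns an (n, 1) column array rather than a flat one:
via_pandas = data_table['HR90'].values                 # flat array from the dataframe
via_handler = ps.open(dbf_path).by_col_array('HR90')   # column array from the file handler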