Leveraging Open-Source Python Packages for Data Analysis within the ArcGIS Environment (Direct Integration Strategy)

Using NumPy as the common denominator

  • Could use the ArcPy Data Access Module directly, but there are host of issues/information one must take into account:
    • How to deal with projections and other environment settings?
    • How Cursors affect the accounting of features?
    • How to deal with bad records/bad data and error handling?
    • How to honor/account for full field object control?
    • How do I create output features that correspond to my inputs?
      • Points are easy, what about Polygons and Polylines?
  • Spatial Statistics Data Object (SSDataObject)
    • Almost 30 Spatial Statistics Tools written in Python that ${\bf{must}}$ behave like traditional GP Tools
    • Use SSDataObject and your code should adhere

Basic Imports


In [1]:
import arcpy as ARCPY
import numpy as NUM
import SSDataObject as SSDO
import scipy as SCIPY
import pandas as PANDA
import pysal as PYSAL

Initialize Data Object and Query Attribute Fields


In [2]:
inputFC = r'../data/CA_Polygons.shp'
ssdo = SSDO.SSDataObject(inputFC)
for fieldName, fieldObject in ssdo.allFields.iteritems():
    print fieldName, fieldObject.type


FID OID
POP1999 Integer
POP1998 Integer
POP1997 Integer
POP1996 Integer
POP1995 Integer
POP1994 Integer
POP1993 Integer
POP1992 Integer
POP1991 Integer
POP1990 Integer
POPDEN70 Double
SOUTH Integer
AMERI_ES Double
MED_AGE_F Double
MULT_RACE Double
HAWN_PI Double
PERCVACANT Double
POP2009 Integer
NEW_NAME String
POP1988 Integer
POP1989 Integer
POP1984 Integer
POP1985 Integer
POP1986 Integer
POP1987 Integer
POP1980 Integer
POP1981 Integer
POP1982 Integer
POP1983 Integer
PCR1991 Double
PCR1990 Double
PCR1993 Double
GROWTH Double
PCR1995 Double
PCR1994 Double
PCR1997 Double
PCR1996 Double
PCR1999 Double
PCR1998 Double
POV1979 Integer
NEW_FIPS String
SHAPE_LENG Double
SHAPE_AREA Double
AGE_UNDER5 Double
AGE_18_21 Double
PCR1988 Double
PCR1989 Double
PCR1986 Double
PCR1987 Double
PCR1984 Double
PCR1985 Double
PCR1982 Double
PCR1983 Double
PCR1980 Double
PCR1981 Double
BIRTH1970 Integer
HOUSEHOLDS Double
PERCPOV69 Double
POP2010 Integer
CRUDEDEATH Double
AGE_65_UP Double
POV2009 Integer
BEA_6 Integer
AVG_SALE97 Double
VACANT Double
GR_69_02 Double
SOCAL SmallInteger
POP2000 Integer
POP2001 Integer
POP2002 Integer
POP2003 Integer
POP2004 Integer
POP2005 Integer
POP2006 Integer
POP2007 Integer
POP2008 Integer
PCR2010 Double
CRUDEBIRTH Double
LOGPCR69 Double
BEA_4 Double
MYID Integer
AGE_22_29 Double
BEA_1 Integer
BEA_2 Integer
BEA_3 Integer
FEMALES Double
BEA_5 Double
RENTER_OCC Double
HISPANIC Double
BEA_8 Integer
PCR1971 Double
PCR1973 Double
PCR1972 Double
SHAPE Geometry
PCR1970 Double
PCR1977 Double
PCR1976 Double
PCR1975 Double
PCR1974 Double
POINT_X Double
PCR1979 Double
PCR1978 Double
ASIAN Double
PERCNOHS Double
POV1989 Integer
STATE String
STATEFIP String
AGE_40_49 Double
MED_AGE Double
BEA_7 Integer
POINT_Y Double
OWNER_OCC Double
PCR1969 Double
BEA_NAME String
POP1975 Integer
POP1974 Integer
POP1977 Integer
POP1976 Integer
POP1971 Integer
POP1970 Integer
POP1973 Integer
POP1972 Integer
POP1979 Integer
POP1978 Integer
BLACK Double
POV1969 Integer
TOTVACANT7 Integer
TOTHOUSE70 Integer
MALES Double
PCR2006 Double
PCR2007 Double
PCR2004 Double
PCR2005 Double
PCR2002 Double
PCR2003 Double
PCR2000 Double
PCR2001 Double
MED_AGE_M Double
PCR2008 Double
PCR2009 Double
POP1969 Integer
OTHER Double
WHITE Double
AVE_HH_SZ Double
AGE_30_39 Double
TESTZVAR Double
AGE_50_64 Double
DEATH1970 Integer
PCR1992 Double
POV1999 Integer

Select Fields to Read Into NumPy Arrays

  • The Unique ID Field (Object ID in this example) will keep track of the order of your features
    • You have no control over Object ID Fields. It is quick, assures "uniqueness", but can't assume they will not get "scrambled" during copies.
    • To assure full control I advocate the "Add Field (LONG)" --> "Calculate Field (From Object ID)" workflow.

In [3]:
ssdo.obtainData(ssdo.oidName, ['GROWTH', 'PCR1970', 'POPDEN70', 'PERCNOHS'])

In [4]:
popInfo = ssdo.fields['POPDEN70']
popData = popInfo.data
print popData[0:5]


[  5.56947324e-01   2.68032831e-04   7.75864375e-03   2.38301421e-02
   5.10823361e-03]

Adding Results to Input/Output

  • Example: Adding a field of random standard normal values to your input/output

Create a Dictionary of Candidate Fields


In [5]:
import numpy.random as RAND
ARCPY.env.overwriteOutput = True
outArray = RAND.normal(0,1, (ssdo.numObs,))
outDict = {}
outField = SSDO.CandidateField('STDNORM', 'DOUBLE', outArray, alias = 'My Standard Normal Result')
outDict[outField.name] = outField

Add New Field to Input

  • Be Carefull!

In [6]:
ssdo.addFields2FC(outDict)

Copy Features, Selected Attribute Field(s), New Result Field(s) to Output


In [7]:
import os as OS
outputFC = OS.path.abspath(r'../data/testMyOutput.shp')
ssdo.output2NewFC(outputFC, outDict, appendFields = ['GROWTH', 'PERCNOHS'])
del ssdo

Getting More Advanced - SciPy and PANDAS


In [8]:
ssdo = SSDO.SSDataObject(inputFC)
years = NUM.arange(1975, 2015, 5)
fieldNames = ['PCR' + str(i) for i in years]
fieldNamesAll = fieldNames + ['NEW_NAME', 'SOCAL']
ssdo.obtainData("MYID", fieldNamesAll)
ids = [ssdo.order2Master[i] for i in xrange(ssdo.numObs)]
convertDictDF = {}
for fieldName, fieldObject in ssdo.fields.iteritems():
    convertDictDF[fieldName] = fieldObject.data
df = PANDA.DataFrame(convertDictDF, index = ids)
print df[0:5]


      NEW_NAME   PCR1975   PCR1980   PCR1985   PCR1990   PCR1995   PCR2000   PCR2005  \
158    Alameda  1.169255  1.195712  1.200988  1.165406  1.158115  1.307115  1.248997   
159     Alpine  0.844546  0.906803  0.855655  0.924508  0.820581  0.949886  0.930033   
160     Amador  0.991467  0.963228  0.921839  0.823639  0.815521  0.814954  0.864324   
161      Butte  0.910668  0.898385  0.817796  0.794387  0.773955  0.763665  0.790418   
162  Calaveras  0.941372  0.875469  0.891595  0.870938  0.806776  0.867385  0.880388   

      PCR2010  SOCAL  
158  1.206422      0  
159  1.079837      0  
160  0.886305      0  
161  0.816018      0  
162  0.877746      0  

Using GroupBy for Conditional Statistics

Example: One Liner for Average Incomes Based on Southern/Non-Southern California


In [9]:
groups = df.groupby('SOCAL')
print groups.mean()


        PCR1975   PCR1980   PCR1985   PCR1990   PCR1995   PCR2000   PCR2005   PCR2010
SOCAL                                                                                
0      1.075988  1.063532  0.978680  0.946142  0.932030  0.970583  0.971438  0.977341
1      1.076797  1.097469  1.051641  1.016739  0.952055  0.943493  0.986927  0.954825

Now the Median...


In [10]:
print groups.median()


        PCR1975   PCR1980   PCR1985   PCR1990   PCR1995   PCR2000   PCR2005   PCR2010
SOCAL                                                                                
0      1.015875  1.002767  0.902323  0.871296  0.841908  0.837941  0.862593  0.887863
1      1.071270  1.093269  1.076200  1.012586  0.960921  0.965993  1.015992  1.013383

Example: Calculating the Trend of Rolling Means


In [11]:
pcr = df.ix[:,1:9]
rollMeans = NUM.apply_along_axis(PANDA.rolling_mean, 1, pcr, 4)
timeInts = NUM.arange(0, 5)
outArray = NUM.empty((ssdo.numObs, 5), float)
for i in xrange(ssdo.numObs):
    outArray[i] = SCIPY.stats.linregress(timeInts, rollMeans[i,3:])

Write to Output (Same as Always...)


In [12]:
outputFC = OS.path.abspath(r'../data/testMyRollingMeanInfo.shp')
outFields = [ "SLOPE", "INTERCEPT", "R_SQRAURED", "P_VALUE", "STD_ERR" ]
outDict = {}
for fieldInd, fieldName in enumerate(outFields):
    outDict[fieldName] = SSDO.CandidateField(fieldName, "DOUBLE", outArray[:,fieldInd])
ssdo.output2NewFC(outputFC, outDict, fieldOrder = outFields)
del ssdo

Even More Advanced: PySAL

Example: Max(p) Regional Clustering


In [18]:
ssdo = SSDO.SSDataObject(inputFC)
ssdo.obtainData(ssdo.oidName, ['GROWTH', 'POP1970', 'PERCNOHS'])
w = PYSAL.weights.knnW(ssdo.xyCoords, k=5)
X = NUM.empty((ssdo.numObs,2), float)
X[:,0] = ssdo.fields['GROWTH'].data
X[:,1] = ssdo.fields['PERCNOHS'].data
floorVal = 1000000.0
floorVar = ssdo.fields['POP1970'].returnDouble()
maxp = PYSAL.region.Maxp(w, X, floorVal, floor_variable = floorVar)
outArray = NUM.empty((ssdo.numObs,), int)
for regionID, orderIDs in enumerate(maxp.regions):
    outArray[orderIDs] = regionID
    print regionID, orderIDs


0 [7, 11, 52, 44, 22, 51, 17, 5, 24, 16, 3, 46, 8, 45, 10, 4, 54, 57, 50, 21, 9]
1 [1, 2, 33, 47, 56, 25, 13, 37, 27, 30, 28, 31]
2 [36, 32]
3 [41, 55, 29]
4 [15, 23, 53, 34, 14, 49, 19, 38]
5 [40, 0, 6, 42]
6 [26, 39, 43]
7 [18]
8 [20, 48]
9 [12, 35]

In [19]:
outputFC = OS.path.abspath(r'../data/testMaxPInfo.shp')
outDict = {}
outDict["REGIONID"] = SSDO.CandidateField("REGIONID", "DOUBLE", outArray)
ssdo.output2NewFC(outputFC, outDict, appendFields = ['GROWTH', 'POP1970', 'PERCNOHS'])
del ssdo

In [ ]: