In [4]:
import pickle
mydata = [1,2,3,4,5,6,7,8,9,10]
pickle.dump(mydata, open('mydata.p','wb'))
In [6]:
more_data = [10,9,8,7,6,5,4,3,2,1]
pickle.dump([mydata,more_data], open('so_much_data.p','wb'))
In [7]:
mydata = pickle.load(open("mydata.p",'rb'))
print(mydata)
In [12]:
[mydata, more_data] = pickle.load(open('so_much_data.p','rb'))
print(mydata)
print(more_data)
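The open() calls in the cells above never explicitly close their file handles. A with block (a small addition here, not one of the original cells) does that automatically and is otherwise equivalent:
In [ ]:
# Same round trip as above, but the context manager closes each file for us
with open('so_much_data.p', 'wb') as f:
    pickle.dump([mydata, more_data], f)
with open('so_much_data.p', 'rb') as f:
    mydata, more_data = pickle.load(f)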
In [119]:
my_list = [1,3,2,4,7,'Sandwich']
print(len(my_list))
print(my_list[0:2])
print(my_list[-1])
print(my_list[0:4:2])
In [14]:
import numpy as np
my_array = np.random.poisson(lam=3,size=10)
print(my_array)
print(my_array.shape)
In [16]:
import pandas as pd
my_series = pd.Series(my_list)
my_series.shape
Out[16]:
In [17]:
my_series = pd.Series(my_array,
                      index = [1,2,3,'cat','dog','10','n',8,7,6])
print(my_series)
In [18]:
print(my_series.mean())
my_series = pd.Series(['hello world','hello planet'])
print(my_series.str.replace('hello','goodbye'))
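Other vectorized string methods work the same way; for instance, .str.contains flags which elements match a pattern (a small added illustration, not one of the original cells):
In [ ]:
# Boolean Series marking which strings contain 'planet'
print(my_series.str.contains('planet'))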
In [20]:
new_list = list(my_array)
print(new_list)
In [24]:
my_2d_list = [[1,4],[2,1],[8,10],[4,7],[9,2],[4,5]]
my_3var_list = [(1,4,7),(2,1,0),(8,10,2),(4,7,4),(9,2,7),(4,5,3)]
In [25]:
# Append each element of my_list as a third value in the matching row
for i,new_var in enumerate(my_list):
    my_2d_list[i].append(new_var)
print(my_2d_list)
In [90]:
my_dict = {
    'var1': [1,2,8,4,9,4],
    'var2': [4,1,10,7,2,5]
}
my_dict['var3']=my_list
print(my_dict['var3'])
In [120]:
my_matrix = np.array(my_2d_list)
my_other_matrix = np.array(my_3var_list)
print(my_matrix)
print(my_matrix[0,0:2])
In [37]:
big_matrix = np.concatenate([my_matrix, my_other_matrix],axis=0)
print(big_matrix)
In [41]:
big_matrix = np.concatenate([my_matrix, my_other_matrix],axis=1)
print(big_matrix)
In [43]:
print(my_matrix.T + my_other_matrix.T*5)
In [126]:
my_rand_matrix = np.random.randn(5,3)
print(my_rand_matrix)
In [127]:
my_rand_matrix[:,0]=my_rand_matrix[:,0]*.5+5
my_rand_matrix[:,1]=my_rand_matrix[:,1]*.5-5
my_rand_matrix[:,2]=my_rand_matrix[:,2]*10+50
print(my_rand_matrix.T)
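As a quick added sanity check (not one of the original cells), the column means and standard deviations of the rescaled matrix should land roughly near (5, -5, 50) and (0.5, 0.5, 10), up to sampling noise from only five draws:
In [ ]:
# Column-wise sample statistics of the rescaled matrix
print(my_rand_matrix.mean(axis=0))
print(my_rand_matrix.std(axis=0))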
In [128]:
BIG_array = np.zeros((100,100))
rows = (1,6,29,40,43,50)
columns = (3,6,90,58,34,88)
BIG_array[(rows,columns)]=[4,6,14,1,3,22]
In [129]:
import scipy as sp
from scipy import sparse
BIG_array = sparse.csc_matrix(BIG_array)
print(BIG_array)
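For reference (an addition, not one of the original cells), a sparse matrix reports how many entries it actually stores via .nnz and can be converted back to a dense array with .toarray():
In [ ]:
# Stored (nonzero) entries, and a dense copy for comparison
print(BIG_array.nnz)
dense_again = BIG_array.toarray()
print(dense_again.shape)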
In [130]:
df = pd.DataFrame(my_dict)
df
Out[130]:
In [131]:
df = pd.DataFrame(my_2d_list,
                  columns = ['var1','var2','var3'])
df
Out[131]:
In [132]:
df = pd.DataFrame(my_rand_matrix,
                  columns = ['dist_1','dist_2','dist_3'],
                  index = ['obs1','obs2','obs3','obs4','fred'])
df
Out[132]:
In [133]:
df.describe()
Out[133]:
In [134]:
import matplotlib.pyplot as plt
%matplotlib inline
df.plot.density()
Out[134]:
In [135]:
df.dist_1.plot.hist(bins=3)
Out[135]:
In [95]:
baad_covars = pd.read_csv('BAAD_1_Lethality_Data.tab',sep='\t')
In [96]:
baad_covars.head(3)
Out[96]:
In [76]:
print(baad_covars.shape)
baad_covars.columns
Out[76]:
In [77]:
baad_covars.rename(columns = {'cowmastercountry':'country',
                              'masterccode':'ccode',
                              'mastertccode3606':'group_code',
                              'fatalities19982005':'fatalities'},
                   inplace = True)
baad_covars.replace({'country':{'United States of America':'US'}},
                    inplace = True)
print('Dimensions: ',baad_covars.shape)
baad_covars.head()
Out[77]:
In [39]:
#Set the index
baad_covars.set_index(['group_code'],inplace = True)
baad_covars.head()
Out[39]:
In [78]:
baad_covars.to_csv('updated_baad.csv')
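One way to read the file back in, assuming the cells ran in the order shown so that group_code was saved as the index (this check cell and the name baad_check are additions):
In [ ]:
# Re-load the saved file, restoring group_code as the index
baad_check = pd.read_csv('updated_baad.csv', index_col='group_code')
baad_check.head()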
In [40]:
baad_covars.loc[:, ['fatalities']].head()
Out[40]:
In [41]:
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
                ['group','country']].head()
Out[41]:
In [102]:
baad_covars.loc[(baad_covars.ContainRelig==1) &
                (baad_covars.terrStrong==1), ['group']]
Out[102]:
In [103]:
baad_covars.OrgAge.plot.hist(bins=10)
Out[103]:
In [83]:
state_level = baad_covars.loc[:, ['country','OrgAge',
                                  'ordsize','degree',
                                  'fatalities']
                             ].groupby(['country']).sum()
state_level.head()
Out[83]:
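If more than one summary per country is wanted, a groupby also accepts a list of aggregation functions; this sketch (an addition, with the hypothetical name state_level_multi) computes sums and means together:
In [ ]:
# Multiple aggregations per country
state_level_multi = baad_covars.loc[:, ['country','OrgAge',
                                        'ordsize','degree',
                                        'fatalities']
                                   ].groupby(['country']).agg(['sum','mean'])
state_level_multi.head()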
In [62]:
baad_covars['big'] = 0
baad_covars.loc[(baad_covars.fatalities>1) |
                (baad_covars.degree>=1),
                'big'] = 1
baad_covars.big.head()
Out[62]:
In [54]:
print(type(np.nan))
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
                ['terrStrong']] = None
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
                ['terrStrong']].head()
Out[54]:
In [56]:
baad_covars.loc[baad_covars.terrStrong.isnull(),'terrStrong'].head()
Out[56]:
In [58]:
baad_covars['terrStrong'] = baad_covars.terrStrong.fillna(-77)
baad_covars.terrStrong.head()
Out[58]:
In [60]:
baad_covars_dropped = baad_covars.dropna(axis='index',
                                         subset=['terrStrong'],
                                         inplace=False)
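Because terrStrong was already filled with -77 two cells above, the drop removes nothing here; the added check below makes that explicit by comparing shapes:
In [ ]:
# Shapes before and after dropna; identical because no NaN values remain in terrStrong
print(baad_covars.shape, baad_covars_dropped.shape)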
In [63]:
baad_covars.reset_index(inplace=True,
                        drop = False)
baad_covars.head()
Out[63]:
In [64]:
baad_covars.set_index(['group','country'],inplace = True)
baad_covars.head()
Out[64]:
In [65]:
indonesia_grps = baad_covars.xs('Indonesia',level = 'country',drop_level=False)
indonesia_grps = indonesia_grps.loc[indonesia_grps.fatalities>=1,
                                    ['degree','ContainRelig',
                                     'ContainEthno','terrStrong',
                                     'ordsize','OrgAge']]
indonesia_grps.head()
Out[65]:
In [137]:
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 'Sandwich'
print('after:')
print(copied_df)
In [139]:
import copy
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df.copy()
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 'Sandwich'
print('after:')
print(copied_df)
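The cell above imports the copy module but uses the DataFrame's own .copy(); copy.deepcopy gives an equally independent copy, as this small added check suggests:
In [ ]:
# copy.deepcopy also produces a separate object with the same contents
deep_df = copy.deepcopy(little_df)
print(deep_df is little_df)
print(deep_df.equals(little_df))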
In [140]:
C = pd.DataFrame(['apple','orange','grape','pear','banana'],
                 columns = ['C'],
                 index = [2,4,3,0,1])
little_df['C'] = C
little_df
Out[140]:
In [141]:
C = pd.DataFrame(['apple','orange','grape','apple'],
                 columns = ['C'],
                 index = [2,4,3,'a'])
C['cuts']=['slices','wedges','whole','spirals']
print('C:')
print(C)
print('Inner: Intersection')
print(little_df.merge(right=C,
                      how='inner',
                      on=None,
                      left_index = True,
                      right_index = True))
print('Outer: Keep all rows')
print(little_df.merge(right=C,
                      how='outer',
                      on=None,
                      left_index = True,
                      right_index = True))
print('Left: Keep little_df')
print(little_df.merge(right=C,
                      how='left',
                      on=None,
                      left_index = True,
                      right_index = True))
print('Right: Keep C')
print(little_df.merge(right=C,
                      how='right',
                      on=None,
                      left_index = True,
                      right_index = True))
print('Outer, merging on column instead of index')
print(little_df.merge(right=C,
                      how='outer',
                      on='C',
                      left_index = False,
                      right_index = False))
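merge also accepts an indicator argument that records where each row came from; a short added sketch:
In [ ]:
# indicator=True adds a _merge column showing each row's origin
print(little_df.merge(right=C,
                      how='outer',
                      left_index = True,
                      right_index = True,
                      indicator = True))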
In [142]:
add_df = pd.DataFrame({'A':[6],'B':[7],'C':'peach'},index= ['p'])
little_df = pd.concat([little_df,add_df])
little_df
Out[142]:
In [107]:
asthma_data = pd.read_csv('asthma-emergency-department-visit-rates-by-zip-code.csv')
asthma_data.head(2)
Out[107]:
In [108]:
asthma_data[['zip','coordinates']] = asthma_data.loc[:,'ZIP code'].str.split(
    pat='\n', expand=True)
asthma_data.drop('ZIP code', axis=1,inplace=True)
asthma_data.head(2)
Out[108]:
In [109]:
asthma_grouped = asthma_data.groupby(by=['Year','zip']).sum()
asthma_grouped.head(4)
Out[109]:
In [110]:
asthma_grouped.drop('County Fips code',axis=1,inplace=True)
temp_grp = asthma_data.groupby(by=['Year','zip']).first()
asthma_grouped[['fips',
                'county',
                'coordinates']] = temp_grp.loc[:,['County Fips code',
                                                  'County',
                                                  'coordinates']].copy()
asthma_grouped.loc[:,'Number of Visits'] = \
    asthma_grouped.loc[:,'Number of Visits']/2
asthma_grouped.head(2)
Out[110]:
In [111]:
asthma_unstacked = asthma_data.pivot_table(index = ['Year',
                                                    'zip',
                                                    'County',
                                                    'coordinates',
                                                    'County Fips code'],
                                           columns = 'Age Group',
                                           values = 'Number of Visits')
asthma_unstacked.reset_index(drop=False,inplace=True)
asthma_unstacked.head(2)
Out[111]:
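The reverse reshape is a melt back to long form; this sketch (an addition, with the hypothetical name asthma_long) uses the column names as they stand before the rename in the next cell:
In [ ]:
# Melt the wide table back into one row per (year, zip, age group)
asthma_long = asthma_unstacked.melt(id_vars = ['Year','zip','County',
                                               'coordinates','County Fips code'],
                                    var_name = 'Age Group',
                                    value_name = 'Number of Visits')
asthma_long.head(2)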
In [113]:
asthma_unstacked.rename(columns={'zip':'Zip',
                                 'coordinates':'Coordinates',
                                 'County Fips code':'Fips',
                                 'Adults (18+)':'Adults',
                                 'All Ages':'Incidents',
                                 'Children (0-17)':'Children'},
                        inplace=True)
asthma_2015 = asthma_unstacked.loc[asthma_unstacked.Year==2015,:]
asthma_2015.head(2)
Out[113]:
In [114]:
asthma_2015.to_csv('asthma_2015.csv')
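Passing index=False writes a CSV without the row index; the file name below is a hypothetical alternative so the file written above is left untouched:
In [ ]:
# Optional: write a copy without the row index
asthma_2015.to_csv('asthma_2015_noindex.csv', index=False)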
In [ ]: