Manipulating Data in Python

Laila A. Wahedi

Massive Data Institute Postdoctoral Fellow
McCourt School of Public Policy

Follow along: Wahedi.us, Current Presentation

Installing packages:

On a Mac:

Open terminal

On Windows:

Type cmd into the start menu

What is this?

Like an explorer window. The text at the start of the line tells you where you are.
To see what's there, type:
- On Mac: ls (That's an L, not an i)
- On Windows: dir

Installing Packages:

Install Packages

Install some packages so they are done by the time we need them. Type:

pip install pandas

When that's done, type:

pip install matplotlib
pip install pickle
pip install statsmodels

Note: if you installed the Continuum Python distribution, you may already have some of these packages installed.

What just happened?

Pip is a download manager, it automatically looks online to find the package and installs it on your computer
Similar to the CRAN Mirror
Bonus: if you followed the link I sent to install Anaconda, you have the conda download manager too.
- conda install works similarly to pip, and also works on many R packages when used with Jupyter notebooks.

Opening Jupyter

Open up another terminal/command prompt while your packages download
Navigate to the directory you want your script to be in
- Type ls to see what folders are in the current folder
- Type cd .. to go one folder up
- Type cd folder_name to go to a sub-folder
  - Type mkdir folder_name to make a folder
- Type:
  - jupyter notebook

What happened?

A browser window popped up
Address something along the lines of localhost:8888/tree
Address is listed in the terminal/command prompt
Go to this address from any browser to open your tree

Your Jupyter Tree

You should be in the folder you navigated to in terminal/cmd
You can make folders here too
Top right, select new notebook, python3

Open this notebook:

Move the file into the directory for your Jupyter Tree
Select the file from teh Jupyter Tree
A new tab should open in your browser.
Add new cells to take notes as you go along.
- Change the type of note cells to Markdown

Some Python Basics

What is code?

Set of clear instructions to the computer
Start with basic set of building blocks
Put those building blocks together

Goal of programming:

Break your task into building blocks
Put building blocks together into an algorithm

Our tasks:

Move and manipulate data using these building blocks.

Building Blocks

Ball and urn example

Our First Algorithm

Basic Data Structures

Strings

Text variables
Declare with either '', or ""



In [20]:

    
my_string = 'Hello World'
print(my_string)









    



Hello World

Numbers:

ints: integers
floats: numbers with decimals
Combine them to get a new float



In [25]:

    
my_int = 2
my_float = 2.2
new_float = my_int+my_float
print(new_float)
type(new_float)









    



4.2






    Out[25]:





float

Lists

Declare with: []
Ordered list of objects



In [ ]:

    
my_list = [0,1,2,3,4]

Zero indexed



In [12]:

    
print(my_list[1])

Lists are Mutable:



In [13]:

    
my_list[2]='hello'
print(my_list)









    



[0, 1, 'hello', 3, 4]

Dictionaries

key value pairs indexed by key
Declared by {key:value}



In [14]:

    
my_dictionary = {'apple':4,
                'pear':'yum'}
print(my_dictionary['apple'])

Add key value pairs later using indexing with brackets []
You can store data type you want



In [17]:

    
my_dictionary['numbers'] = my_list
print(my_dictionary['numbers'])









    



[0, 1, 'hello', 3, 4]

Sets

Like a list, but:
- Contains unique values
- Unordered
Declare with {}



In [18]:

    
my_set = {'thing1','thing2','cat in hat','thing1', 4,4}
print(my_set)









    



{'thing2', 4, 'thing1', 'cat in hat'}

Tuples

Like an ordered list, but can't be changed
Declare with ()



In [19]:

    
my_tuple = (1,3,2)
print(my_tuple)









    



(1, 3, 2)

Our First Algorithm: Strings



In [23]:

    
# Declare Data
my_data = 'hello '
my_other_data = 'world'
#Manipulate it
manipulated_data = my_data+my_other_data
#Output it:
print(manipulated_data)









    



hello world

Our First Algorithm: Numbers



In [26]:

    
# Declare Data
my_data =  1
my_other_data = 5
#Manipulate it
manipulated_data = 1/5
#Output it:
print(manipulated_data)

0.2

Adding Choices

Writing Instructions Python Can Understand:

Whitespace matters

tabs indicate a block of code within an if statement or loop
: marks the start of the block
\ for a linebreak mid-line
Line breaks allowed inside (), [], {}

If statements

= to assign
- Like <- in R
== to evaluate



In [2]:

    
my_variable = 5
print(my_variable)
print(my_variable == 5)









    



5
True

Compare values with:
- <, <=, >, >=, ==, !=, in, not in,



In [5]:

    
print(my_variable > 6)
print(my_variable in [1,4,7])









    



False
False

Booleans and Indicators

booleans are indicators of True or False
In Python3, True = 1, and False = 0



In [6]:

    
True + True









    Out[6]:





2

None is a null value
None evaluates to false in an if statement.

If statements:

Used to change behavior depending on conditionals
Declare the statement
Declare the conditional action within a code block, or indentation:
Declare alternative actions in else
Stack conditionals with elif



In [7]:

    
my_bool = 'ice cream'
if my_bool == 'ice cream': 
    print('yay')
elif my_bool == 'cake':
    print('woo!')
else:
    print('Woe and great tragedy!')

yay

Play around yourself:



In [9]:

    
check = True
# check = False
# check = None
# check = 'monkey'
# check = 0
# check = 10
print('Check is:', check)
if check == 'monkey':
    print('banana')
elif check:
    print('yes')
else:
    print('no')
    
if 1 not in [1,2,3]:
    print('not not in')
if 1 in [1,2,3]:
    print('in')









    



Check is: True
yes
in

Repetition

While loops

Do something in block of code until condition met
Make sure you change the condition in every loop, or it will go forever



In [23]:

    
n = 0
while n < 5:
    print(n)
    n= n+1

For loops

repeat block of code for:
- a certain number of iterations
- For every element in a list
range() gives you an iterator



In [22]:

    
print('use a range:')
for i in range(3):
    print(i)
print('use a range slice:')
for i in range(3,6):
    print(i)
print('iterate throubh a list:')
for i in my_list:
    print(i)









    



use a range:
0
1
2
use a range slice:
3
4
5
iterate throubh a list:
0
1
hello
3
4

Put them together:

Play around on your own



In [10]:

    
my_list = [0,1,'cat',None,'frog',3]
animals = []
nums = []
for i in my_list:
    if type(i)==str:
        animals.append(i)
    elif type(i)==int:
        nums.append(i)
    else:
        pass
print(animals)
print(nums)









    



['cat', 'frog']
[0, 1, 3]

Learn More:

These are the basic building blocks for scripting. To learn more about how to put them together to greater effect:

Take an introductory computer science course online
Check out GU Women Coders.
- Can't make it in person? Their website has lots of great resources.
- http://guwecode.georgetown.domains/

Import packages

Like library(package) in R
Pandas is a dataframe data structure like dataframes in R
matplotlib is a plotting package similar to plotting in Matlab
- you can view plots inline in pandas notebooks with the inline command
ggplot is a plotting package built on matplotlib like ggplot2 in R
Pickle lets you save your workspace
Statsmodels contains many of the statistical models you know and love



In [2]:

    
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# from ggplot import *
import pickle
import statsmodels.api as sm

Pandas Data Frames

Using Documentation

Pandas website
Stack Overflow
Copy errors into google
Look up syntax differences with R ## Load Data from a comma separated file
Start by googling it: http://lmgtfy.com/?q=pandas+load+csv

We will use the Big Allied and Dangerous Data from START

https://dataverse.harvard.edu/file.xhtml?fileId=2298519&version=RELEASED&version=.0



In [3]:

    
baad_covars = pd.read_csv('BAAD_1_Lethality_Data.tab',sep='\t')

Look at the data



In [85]:

    
baad_covars.head()









    Out[85]:







  
    
      
      mastertccode3606
      group
      statespond
      cowmastercountry
      masterccode
      fatalities19982005
      OrgAge
      ordsize
      terrStrong
      degree
      ContainRelig
      ContainEthno
      LeftNoReligEthno
      PureRelig
      PureEthno
      ReligEthno
      ContainRelig2
      ContainEthno2
      Islam
    
  
  
    
      0
      50
      Animal Liberation Front (ALF)
      0
      United States of America
      2
      0
      30
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      89
      Army of God
      0
      United States of America
      2
      1
      24
      0
      0
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0
    
    
      2
      113
      Cambodian Freedom Fighters (CFF)
      0
      United States of America
      2
      0
      8
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
    
    
      3
      126
      Coalition to Save the Preserves (CSP)
      0
      United States of America
      2
      0
      6
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      153
      Earth Liberation Front (ELF)
      0
      United States of America
      2
      0
      14
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0

Rename things and adjust values



In [8]:

    
baad_covars.rename(columns = {'cowmastercountry':'country',
                              'masterccode':'ccode',
                       'mastertccode3606':'group_code',
                       'fatalities19982005':'fatalities'},
            inplace = True)
baad_covars.replace({'country':{'United States of America':'US'}},
            inplace = True)
print('Dimensions: ',baad_covars.shape)
baad_covars.head()









    



('Dimensions: ', (395, 19))






    Out[8]:






  
    
      
      group_code
      group
      statespond
      country
      ccode
      fatalities
      OrgAge
      ordsize
      terrStrong
      degree
      ContainRelig
      ContainEthno
      LeftNoReligEthno
      PureRelig
      PureEthno
      ReligEthno
      ContainRelig2
      ContainEthno2
      Islam
    
  
  
    
      0
      50
      Animal Liberation Front (ALF)
      0
      US
      2
      0
      30
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      89
      Army of God
      0
      US
      2
      1
      24
      0
      0
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0
    
    
      2
      113
      Cambodian Freedom Fighters (CFF)
      0
      US
      2
      0
      8
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
    
    
      3
      126
      Coalition to Save the Preserves (CSP)
      0
      US
      2
      0
      6
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      153
      Earth Liberation Front (ELF)
      0
      US
      2
      0
      14
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0

Set a useful index



In [9]:

    
#Set the index
baad_covars.set_index(['group_code'],inplace = True)
baad_covars.head()









    Out[9]:






  
    
      
      group
      statespond
      country
      ccode
      fatalities
      OrgAge
      ordsize
      terrStrong
      degree
      ContainRelig
      ContainEthno
      LeftNoReligEthno
      PureRelig
      PureEthno
      ReligEthno
      ContainRelig2
      ContainEthno2
      Islam
    
    
      group_code
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      50
      Animal Liberation Front (ALF)
      0
      US
      2
      0
      30
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      89
      Army of God
      0
      US
      2
      1
      24
      0
      0
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0
    
    
      113
      Cambodian Freedom Fighters (CFF)
      0
      US
      2
      0
      8
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
    
    
      126
      Coalition to Save the Preserves (CSP)
      0
      US
      2
      0
      6
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      153
      Earth Liberation Front (ELF)
      0
      US
      2
      0
      14
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0

Slicing

Get specific values from the dataframe.
Pandas has several slice operators.
- iloc can be used to index the row by ordered integer. i.e. first row is 0, second row is 1, etc. Use this option sparingly. Better practice to use the index you have created.
- loc uses the named index and columns.
Index using [row, columns]
For multiple columns, put your column names in a list
Use : for all values
Notice that the output keeps the index names.



In [10]:

    
baad_covars.loc[:, 'fatalities'].head()









    Out[10]:





group_code
50     0
89     1
113    0
126    0
153    0
Name: fatalities, dtype: int64

Look at the data



In [15]:

    
baad_covars.loc[:,['OrgAge']].plot.density()
print(baad_covars.loc[:,['OrgAge']].mean())
baad_covars.loc[:,['fatalities']].plot.hist(bins=20)









    



OrgAge    11.270886
dtype: float64






    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d5cb00>

Slicing Using Conditionals

Put conditionals in parentheses
Stack multiple conditionals using:
- & when both conditions must always apply
- | when at least one condition must apply



In [89]:

    
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['group','country']].head()









    Out[89]:







  
    
      
      group
      country
    
    
      group_code
      
      
    
  
  
    
      50
      Animal Liberation Front (ALF)
      US
    
    
      153
      Earth Liberation Front (ELF)
      US
    
    
      30035
      Maras Salvatruchas
      US
    
    
      10042
      Group of Guerilla Combatants of Jose Maria Mor...
      Mexico
    
    
      246
      Justice Army of the Defenseless People
      Mexico

Handling Missing Values

First lets make some:



In [90]:

    
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['terrStrong']] = None
baad_covars.loc[(baad_covars.fatalities>1) | (baad_covars.degree>=1),
               ['terrStrong']].head()

Handling Missing Values

We could index by them



In [91]:

    
baad_covars.loc[baad_covars.terrStrong.isnull(),'terrStrong'].head()









    Out[91]:





group_code
50      NaN
153     NaN
30035   NaN
10042   NaN
246     NaN
Name: terrStrong, dtype: float64

Handling Missing Values

We could fill them:



In [92]:

    
baad_covars['terrStrong'] = baad_covars.terrStrong.fillna(-77)
baad_covars.terrStrong.head()









    Out[92]:





group_code
50    -77.0
89      0.0
113     0.0
126     0.0
153   -77.0
Name: terrStrong, dtype: float64

Making New Columns

Assign values to a new column based on other columns:



In [93]:

    
baad_covars['big'] = 0
baad_covars.loc[(baad_covars.fatalities>1) | 
                (baad_covars.degree>=1),
               'big']=1
baad_covars.big.head()









    Out[93]:





group_code
50     1
89     0
113    0
126    0
153    1
Name: big, dtype: int64

Reindexing: Pop the index out without losing it



In [95]:

    
baad_covars.reset_index(inplace=True)
baad_covars.head()









    Out[95]:







  
    
      
      group_code
      group
      statespond
      country
      ccode
      fatalities
      OrgAge
      ordsize
      terrStrong
      degree
      ContainRelig
      ContainEthno
      LeftNoReligEthno
      PureRelig
      PureEthno
      ReligEthno
      ContainRelig2
      ContainEthno2
      Islam
      big
    
  
  
    
      0
      50
      Animal Liberation Front (ALF)
      0
      US
      2
      0
      30
      0
      -77.0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      1
      89
      Army of God
      0
      US
      2
      1
      24
      0
      0.0
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0
      0
    
    
      2
      113
      Cambodian Freedom Fighters (CFF)
      0
      US
      2
      0
      8
      0
      0.0
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      3
      126
      Coalition to Save the Preserves (CSP)
      0
      US
      2
      0
      6
      0
      0.0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      153
      Earth Liberation Front (ELF)
      0
      US
      2
      0
      14
      0
      -77.0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1

Set a multi-index



In [96]:

    
baad_covars.set_index(['group','country'],inplace = True)
baad_covars.head()









    Out[96]:







  
    
      
      
      group_code
      statespond
      ccode
      fatalities
      OrgAge
      ordsize
      terrStrong
      degree
      ContainRelig
      ContainEthno
      LeftNoReligEthno
      PureRelig
      PureEthno
      ReligEthno
      ContainRelig2
      ContainEthno2
      Islam
      big
    
    
      group
      country
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Animal Liberation Front (ALF)
      US
      50
      0
      2
      0
      30
      0
      -77.0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      Army of God
      US
      89
      0
      2
      1
      24
      0
      0.0
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0
      0
    
    
      Cambodian Freedom Fighters (CFF)
      US
      113
      0
      2
      0
      8
      0
      0.0
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      Coalition to Save the Preserves (CSP)
      US
      126
      0
      2
      0
      6
      0
      0.0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      Earth Liberation Front (ELF)
      US
      153
      0
      2
      0
      14
      0
      -77.0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1

Using the new index, make a new dataframe

Note the new slicing operator for multi-index



In [103]:

    
indonesia_grps = baad_covars.xs('Indonesia',level = 'country',drop_level=False)
indonesia_grps = indonesia_grps.loc[indonesia_grps.fatalities>=1,['degree','ContainRelig',
                                            'ContainEthno','terrStrong',
                                            'ordsize','OrgAge']]
indonesia_grps.head()









    Out[103]:







  
    
      
      
      degree
      ContainRelig
      ContainEthno
      terrStrong
      ordsize
      OrgAge
    
    
      group
      country
      
      
      
      
      
      
    
  
  
    
      Free Aceh Movement (GAM)
      Indonesia
      1
      1
      1
      -77.0
      2
      31
    
    
      Jemaah Islamiya (JI)
      Indonesia
      2
      1
      0
      -77.0
      1
      13
    
    
      Laskar Jihad
      Indonesia
      3
      1
      1
      -77.0
      0
      6
    
    
      South Maluku Republic (RMS)
      Indonesia
      0
      0
      1
      -77.0
      2
      8

Warning: Making copies

If you set a variable as equal to an object, Python creates a reference rather than copying the whole object. More efficient, unless you really want to make a copy



In [67]:

    
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 7
print('after')
copied_df

What happened?

copied_df changed when little_df changed.
Let's fix that: import "copy"



In [68]:

    
import copy
little_df = pd.DataFrame([1,2,3,4,5],columns = ['A'])
little_df['B']=[0,1,0,1,1]
copied_df = little_df.copy()
print('before:')
print(copied_df)
little_df.loc[little_df.A == 3,'B'] = 7
print('after')
copied_df

Saving

Unlike R or Stata, can't just save your workspace
Save as a csv now that we have the data we want
Pickle a variable to recreate it without having to reset indexes, etc.



In [107]:

    
indonesia_grps.to_csv('indonesia.csv')
pickle.dump(indonesia_grps, open('indonesia.p','wb'))
indonesia_grps = pickle.load(open('indonesia.p','rb'))

Next time:

Scrape Data from the internet
Clean the data
Merge that data into our data
Run a basic stats and ML model



In [ ]:



In [ ]:

Until then, here's some reference code on merging your data sets

Merging and Concatenating

Merges automatically if shared index



In [73]:

    
C = pd.DataFrame(['apple','orange','grape','pear','banana'],
                 columns = ['C'],
                 index = [2,4,3,0,1])
little_df['C'] = C
little_df

Joins

Same as SQL, inner and outer



In [85]:

    
C = pd.DataFrame(['apple','orange','grape','apple'],
                 columns = ['C'],
                 index = [2,4,3,'a'])
C['cuts']=['slices','wedges','whole','spirals']
print('C:')
print(C)
print('Inner: Intersection')
print(little_df.merge(right=C,
                how='inner',
                on=None,
                left_index = True,
                right_index =True))
print('Outer: Keep all rows')
print(little_df.merge(right=C,
                how='outer',
                on=None,
                left_index = True,
                right_index =True))

print('Left: Keep little_df')
print(little_df.merge(right=C,
                how='left',
                on=None,
                left_index = True,
                right_index =True))
print('Right: Keep C')
print(little_df.merge(right=C,
                how='right',
                on=None,
                left_index = True,
                right_index =True))

print('Outer, merging on column instead of index')
print(little_df.merge(right=C,
                how='outer',
                on='C',
                left_index = True,
                right_index =True))









    



C:
        C     cuts
2   apple   slices
4  orange   wedges
3   grape    whole
a   apple  spirals
Inner: Intersection
   A  B     C_x     C_y    cuts
2  3  7   apple   apple  slices
3  4  1   grape   grape   whole
4  5  1  orange  orange  wedges
Outer: Keep all rows
     A    B     C_x     C_y     cuts
0  1.0  0.0    pear     NaN      NaN
1  2.0  1.0  banana     NaN      NaN
2  3.0  7.0   apple   apple   slices
3  4.0  1.0   grape   grape    whole
4  5.0  1.0  orange  orange   wedges
a  NaN  NaN     NaN   apple  spirals
Left: Keep little_df
   A  B     C_x     C_y    cuts
0  1  0    pear     NaN     NaN
1  2  1  banana     NaN     NaN
2  3  7   apple   apple  slices
3  4  1   grape   grape   whole
4  5  1  orange  orange  wedges
Right: Keep C
     A    B     C_x     C_y     cuts
2  3.0  7.0   apple   apple   slices
4  5.0  1.0  orange  orange   wedges
3  4.0  1.0   grape   grape    whole
a  NaN  NaN     NaN   apple  spirals
Outer, merging on column instead of index
     A    B       C     cuts
0  1.0  0.0    pear      NaN
1  2.0  1.0  banana      NaN
2  3.0  7.0   apple   slices
3  4.0  1.0   grape    whole
4  5.0  1.0  orange   wedges
a  NaN  NaN   apple  spirals






    



/home/law98/anaconda/envs/p3env/lib/python3.6/site-packages/pandas/core/indexes/base.py:3033: RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
  return this.join(other, how=how, return_indexers=return_indexers)

Concatenate

Stack dataframes on top of one another
Stack dataframes beside one another



In [94]:

    
add_df = pd.DataFrame({'A':[6],'B':[7],'C':'peach'},index= ['p'])
little_df = pd.concat([little_df,add_df])
little_df

	mastertccode3606	group	cowmastercountry	masterccode	fatalities19982005	OrgAge	degree	ContainRelig	ContainEthno	PureRelig	ContainRelig2	ContainEthno2
0	50	Animal Liberation Front (ALF)	United States of America	2	0	30	1	0	0	0	0	0
1	89	Army of God	United States of America	2	1	24	0	1	0	1	1	0
2	113	Cambodian Freedom Fighters (CFF)	United States of America	2	0	8	0	0	1	0	0	1
3	126	Coalition to Save the Preserves (CSP)	United States of America	2	0	6	0	0	0	0	0	0
4	153	Earth Liberation Front (ELF)	United States of America	2	0	14	1	0	0	0	0	0

	group	statespond	country	ccode	fatalities	OrgAge	ordsize	terrStrong	degree	ContainRelig	ContainEthno	LeftNoReligEthno	PureRelig	PureEthno	ReligEthno	ContainRelig2	ContainEthno2	Islam
group_code
50	Animal Liberation Front (ALF)	0	US	2	0	30	0	0	1	0	0	0	0	0	0	0	0	0
89	Army of God	0	US	2	1	24	0	0	0	1	0	0	1	0	0	1	0	0
113	Cambodian Freedom Fighters (CFF)	0	US	2	0	8	0	0	0	0	1	0	0	0	0	0	1	0
126	Coalition to Save the Preserves (CSP)	0	US	2	0	6	0	0	0	0	0	0	0	0	0	0	0	0
153	Earth Liberation Front (ELF)	0	US	2	0	14	0	0	1	0	0	0	0	0	0	0	0	0

	group_code	group	country	ccode	fatalities	OrgAge	terrStrong	degree	ContainRelig	ContainEthno	PureRelig	ContainRelig2	ContainEthno2	big
0	50	Animal Liberation Front (ALF)	US	2	0	30	-77.0	1	0	0	0	0	0	1
1	89	Army of God	US	2	1	24	0.0	0	1	0	1	1	0	0
2	113	Cambodian Freedom Fighters (CFF)	US	2	0	8	0.0	0	0	1	0	0	1	0
3	126	Coalition to Save the Preserves (CSP)	US	2	0	6	0.0	0	0	0	0	0	0	0
4	153	Earth Liberation Front (ELF)	US	2	0	14	-77.0	1	0	0	0	0	0	1

		group_code	statespond	ccode	fatalities	OrgAge	ordsize	terrStrong	degree	ContainRelig	ContainEthno	LeftNoReligEthno	PureRelig	PureEthno	ReligEthno	ContainRelig2	ContainEthno2	Islam	big
group	country
Animal Liberation Front (ALF)	US	50	0	2	0	30	0	-77.0	1	0	0	0	0	0	0	0	0	0	1
Army of God	US	89	0	2	1	24	0	0.0	0	1	0	0	1	0	0	1	0	0	0
Cambodian Freedom Fighters (CFF)	US	113	0	2	0	8	0	0.0	0	0	1	0	0	0	0	0	1	0	0
Coalition to Save the Preserves (CSP)	US	126	0	2	0	6	0	0.0	0	0	0	0	0	0	0	0	0	0	0
Earth Liberation Front (ELF)	US	153	0	2	0	14	0	-77.0	1	0	0	0	0	0	0	0	0	0	1

		degree	ContainRelig	ContainEthno	terrStrong	ordsize	OrgAge
group	country
Free Aceh Movement (GAM)	Indonesia	1	1	1	-77.0	2	31
Jemaah Islamiya (JI)	Indonesia	2	1	0	-77.0	1	13
Laskar Jihad	Indonesia	3	1	1	-77.0	0	6
South Maluku Republic (RMS)	Indonesia	0	0	1	-77.0	2	8

Manipulating Data in Python

Laila A. Wahedi

Massive Data Institute Postdoctoral Fellow McCourt School of Public Policy

Follow along: Wahedi.us, Current Presentation

Installing packages:

On a Mac:

On Windows:

What is this?

Installing Packages:

Install Packages

What just happened?

Opening Jupyter

What happened?

Your Jupyter Tree

Open this notebook:

Some Python Basics

What is code?

Goal of programming:

Our tasks:

Building Blocks

Our First Algorithm

Basic Data Structures

Strings

Numbers:

Lists

Dictionaries

Sets

Tuples

Our First Algorithm: Strings

Our First Algorithm: Numbers

Adding Choices

Writing Instructions Python Can Understand:

Whitespace matters

If statements

Booleans and Indicators

If statements:

Play around yourself:

Repetition

While loops

For loops

Put them together:

Learn More:

These are the basic building blocks for scripting. To learn more about how to put them together to greater effect:

Import packages

Pandas Data Frames

Using Documentation

We will use the Big Allied and Dangerous Data from START

Look at the data

Rename things and adjust values

Set a useful index

Slicing

Look at the data

Slicing Using Conditionals

Handling Missing Values

First lets make some:

Handling Missing Values

We could index by them

Handling Missing Values

We could fill them:

Making New Columns

Assign values to a new column based on other columns:

Reindexing: Pop the index out without losing it

Set a multi-index

Using the new index, make a new dataframe

Warning: Making copies

What happened?

Saving

Next time:

Until then, here's some reference code on merging your data sets

Merging and Concatenating

Joins

Concatenate

Massive Data Institute Postdoctoral Fellow
McCourt School of Public Policy