Python for Humanists (Part II)

This workshop is licensed under a Creative Commons Attribution 4.0 International License. See Part I.

Loops

A for loop executes commands once for each value in a collection.

  • Doing calculations on the values in a list one by one is as painful as working with pressure_001, pressure_002, etc.
  • A for loop tells Python to execute some statements once for each value in a list, a character string, or some other collection.
  • "for each thing in this group, do these operations"

In [1]:
for number in [2, 3, 5]:
    print(number)


2
3
5
  • This for loop is equivalent to:

In [2]:
print(2)
print(3)
print(5)


2
3
5
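
The same pattern works on a character string, where the loop visits one character at a time. A minimal sketch, assuming the made-up word "tin" and a counter variable named count that is not part of the original lesson:

```python
# Loop over the characters of a string instead of the items of a list.
count = 0
for letter in "tin":
    print(letter)        # prints t, then i, then n, each on its own line
    count = count + 1    # keep track of how many characters we have seen

print(count, "letters")  # prints: 3 letters
```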

The first line of the for loop must end with a colon, and the body must be indented.

  • The colon at the end of the first line signals the start of a block of statements.
  • Python uses indentation rather than {} or begin/end to show nesting.
    • Any consistent indentation is legal, but almost everyone uses four spaces.

In [3]:
for number in [2, 3, 5]:
print(number)


  File "<ipython-input-3-3a0b55365d6d>", line 2
    print(number)
        ^
IndentationError: expected an indented block
  • Indentation is always meaningful in Python.

In [ ]:
firstName="Jon"
  lastName="Smith"
  • This error can be fixed by removing the extra spaces at the beginning of the second line.

A for loop is made up of a collection, a loop variable, and a body.


In [ ]:
for number in [2, 3, 5]:
    print(number)
  • The collection, [2, 3, 5], is what the loop is being run on.
  • The body, print(number), specifies what to do for each value in the collection.
  • The loop variable, number, is what changes for each iteration of the loop.
    • The "current thing".

Loop variables can be called anything.

  • As with all variables, loop variables are:
    • Created on demand.
    • Meaningless: their names can be anything at all.

In [ ]:
for kitten in [2, 3, 5]:
    print(kitten)

The body of a loop can contain many statements.

  • But no loop should be more than a few lines long.
  • Hard for human beings to keep larger chunks of code in mind.
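
As a small illustration, the loop below has a body of three statements that all run once for each value. The variable names squared and cubed are chosen here for the sketch only.

```python
# Every indented line belongs to the body and runs once per value.
for number in [2, 3, 5]:
    squared = number ** 2
    cubed = number ** 3
    print(number, squared, cubed)

# This line is not indented, so it runs once, after the loop has finished.
print('done')
```

Because the last print is not indented, Python treats it as the first statement after the loop rather than as part of the body.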

Exercise 1

  • Each of the four cells below contains some code that is incomplete.
    • The comment above the code will tell you what the desired output is.
  • Fill in the blanks and run the cell.
    • Keep trying until you get the desired output.

In [4]:
# Total length of the strings in the list: ["red", "green", "blue"] => 12
# Desired output: 12
total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)


12

In [5]:
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
# Desired output: [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)


[3, 5, 4]

In [6]:
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
# Desired output: "redgreenblue"
words = ["red", "green", "blue"]
result = ''
for word in words:
    result = result + word
print(result)


redgreenblue

In [7]:
# Create acronym: ["red", "green", "blue"] => "RGB"
# write the whole thing
words = ["red", "green", "blue"]
acronym = ''
for word in words:
    acronym = acronym + word[0].capitalize()
print(acronym)


RGB

Conditionals

Use if statements to control whether or not a block of code is executed.

  • An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
  • Structure is similar to a for statement:
    • First line opens with if and ends with a colon
    • Body containing one or more statements is indented (usually by 4 spaces)

In [8]:
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')

mass = 2.07
if mass > 3.0:
    print(mass, 'is large')


3.54 is large

Conditionals are often used inside loops.

  • Not much point using a conditional when we know the value (as above).
  • But useful when we have a collection to process.

In [9]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')


3.54 is large
9.22 is large

Use else to execute a block of code when an if condition is not true.

  • else can be used following an if.
  • Allows us to specify an alternative to execute when the if branch isn't taken.

In [10]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')


3.54 is large
2.07 is small
9.22 is large
1.86 is small
1.71 is small

Use elif to specify additional tests.

  • May want to provide several alternative choices, each with its own test.
  • Use elif (short for "else if") and a condition to specify these.
  • Always associated with an if.
  • Must come before the else (which is the "catch all").

In [11]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')


3.54 is large
2.07 is small
9.22 is HUGE
1.86 is small
1.71 is small

Conditions are tested once, in order.

  • Python steps through the branches of the conditional in order, testing each in turn.
  • So ordering matters.

In [12]:
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')


grade is C
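
The example above reports C for a grade of 85 because the first test that passes wins. One possible fix, sketched here, is to test the most demanding condition first. The label variable and the final else branch are additions made for this sketch.

```python
grade = 85

# Test from the highest threshold down, so the first match is the best match.
if grade >= 90:
    label = 'A'
elif grade >= 80:
    label = 'B'
elif grade >= 70:
    label = 'C'
else:
    label = 'below C'   # catch-all branch, added for this sketch

print('grade is', label)   # prints: grade is B
```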

  • Does not automatically go back and re-evaluate if values change.

In [13]:
velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0


adjusting velocity

Working with Data

Use with open() to open any single file

  • with open() can open files to read in data or to write out data to a file
  • If writing and the file doesn't exist, Python will create it for you.
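
A minimal sketch of both directions with a plain text file. The file name notes.txt is hypothetical, and the file is placed in a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

# Build a path inside a fresh temporary folder (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')
print(os.path.exists(path))   # False: the file does not exist yet

# Opening with 'w' creates the file, then we write two lines to it.
with open(path, 'w') as out_file:
    out_file.write('first line\n')
    out_file.write('second line\n')

# Opening with 'r' lets us loop over the lines of the file.
with open(path, 'r') as in_file:
    for line in in_file:
        print(line.strip())   # strip() removes the trailing newline
```

The file is closed automatically when the indented block under with ends.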

Writing a .csv file


In [14]:
import csv

primes = [2,3,5]

with open('output.csv', 'w', newline='') as outFile:
    writer = csv.writer(outFile)  # create the writer once and reuse it for every row
    for prime in primes:
        squared = prime ** 2
        cubed = prime ** 3
        row = [prime, squared, cubed]
        writer.writerow(row)

Reading a .csv file


In [15]:
with open('output.csv','r') as dataFile:
    data = csv.reader(dataFile)
    for row in data:
        print(row)


['2', '4', '8']
['3', '9', '27']
['5', '25', '125']
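
Note the quotation marks in the output: csv.reader returns every value as a character string, even when it looks like a number. A minimal sketch of converting the values back to integers with int(), using a small temporary file (hypothetical name example.csv) so that it runs on its own:

```python
import csv
import os
import tempfile

# Write a small file first so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow([2, 4, 8])
    writer.writerow([3, 9, 27])

# Read it back and convert each string to an integer.
rows = []
with open(path, 'r', newline='') as data_file:
    for row in csv.reader(data_file):
        numbers = []
        for value in row:
            numbers.append(int(value))
        rows.append(numbers)

print(rows)   # prints: [[2, 4, 8], [3, 9, 27]]
```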

pandas

Use the Pandas library to open tabular data.

  • Pandas is a widely-used Python library for statistics, particularly on tabular data.
  • Borrows many features from R's dataframes.
    • A 2-dimensional table whose columns have names and potentially have different data types.
  • Load it with import pandas.
  • Read a Comma Separated Values (CSV) data file with pandas.read_csv.
    • Argument is the name of the file to be read.
    • Assign result to a variable to store the data that was read.

In [16]:
import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)


       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  
  • The columns in a dataframe are the observed variables, and the rows are the observations.
  • Pandas uses backslash \ to show wrapped lines when output is too wide to fit the screen.

Use index_col to specify that a column's values should be used as row headings.

  • Row headings are numbers (0 and 1 in this case).
  • Really want to index by country.
  • Pass the name of the column to read_csv as its index_col parameter to do this.

In [17]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)


             gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                       
Australia       10039.59564     10949.64959     12217.22686     14526.12465   
New Zealand     10556.57566     12247.39532     13175.67800     14463.91893   

             gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                       
Australia       16788.62948     18334.19751     19477.00928     21888.88903   
New Zealand     16046.03728     16233.71770     17632.41040     19007.19129   

             gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                      
Australia       23424.76683     26997.93657     30687.75473     34435.36744  
New Zealand     18363.32494     21050.41377     23189.80135     25185.00911  

Use DataFrame.info to find out more about a dataframe.


In [18]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
gdpPercap_1952    2 non-null float64
gdpPercap_1957    2 non-null float64
gdpPercap_1962    2 non-null float64
gdpPercap_1967    2 non-null float64
gdpPercap_1972    2 non-null float64
gdpPercap_1977    2 non-null float64
gdpPercap_1982    2 non-null float64
gdpPercap_1987    2 non-null float64
gdpPercap_1992    2 non-null float64
gdpPercap_1997    2 non-null float64
gdpPercap_2002    2 non-null float64
gdpPercap_2007    2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes
  • This is a DataFrame
  • Two rows named 'Australia' and 'New Zealand'
  • Twelve columns, each of which has two actual 64-bit floating point values.
    • We will talk later about null values, which are used to represent missing observations.
  • Uses 208 bytes of memory.

The DataFrame.columns variable stores information about the dataframe's columns.

  • Note that this is data, not a method.
    • Like math.pi.
    • So do not use () to try to call it.
  • Called a member variable, or just member.
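
A short sketch of the difference, using a tiny hypothetical table built by hand (the column names and values are made up). DataFrame.shape is another member variable and is shown only for comparison.

```python
import pandas as pd

# A hypothetical two-row, two-column table.
small = pd.DataFrame(
    {'gdp_1952': [1.0, 2.0], 'gdp_1957': [3.0, 4.0]},
    index=['A', 'B'],
)

# Member variables are read without parentheses.
print(small.columns)
print(small.shape)    # prints: (2, 2)

# Methods are called with parentheses.
print(small.max())

# Writing small.columns() would raise a TypeError,
# because columns is data and cannot be called.
```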

In [19]:
print(data.columns)


Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')

Use DataFrame.T to transpose a dataframe.

  • Sometimes want to treat columns as rows and vice versa.
  • Transpose (written .T) doesn't copy the data, just changes the program's view of it.
  • Like columns, it is a member variable.

In [20]:
print(data.T)


country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911

Use DataFrame.describe to get summary statistics about data.

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.

  • Not particularly useful with just two records, but very helpful when there are thousands.
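
To see the effect of include='all', the sketch below builds a small table in which country is an ordinary text column rather than the index. The table is constructed by hand from two of the Oceania values shown earlier; the variable names are chosen for this sketch.

```python
import pandas as pd

# country is kept as a regular (non-numerical) column here.
table = pd.DataFrame({
    'country': ['Australia', 'New Zealand'],
    'gdpPercap_1952': [10039.59564, 10556.57566],
})

numeric_only = table.describe()
everything = table.describe(include='all')

print(list(numeric_only.columns))   # prints: ['gdpPercap_1952']
print(list(everything.columns))     # prints: ['country', 'gdpPercap_1952']
```

With include='all', statistics that do not apply to a column (for example the mean of a text column) are reported as missing values.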

In [21]:
print(data.describe())


       gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
count        2.000000        2.000000        2.000000        2.000000   
mean     10298.085650    11598.522455    12696.452430    14495.021790   
std        365.560078      917.644806      677.727301       43.986086   
min      10039.595640    10949.649590    12217.226860    14463.918930   
25%      10168.840645    11274.086022    12456.839645    14479.470360   
50%      10298.085650    11598.522455    12696.452430    14495.021790   
75%      10427.330655    11922.958888    12936.065215    14510.573220   
max      10556.575660    12247.395320    13175.678000    14526.124650   

       gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
count         2.00000        2.000000        2.000000        2.000000   
mean      16417.33338    17283.957605    18554.709840    20448.040160   
std         525.09198     1485.263517     1304.328377     2037.668013   
min       16046.03728    16233.717700    17632.410400    19007.191290   
25%       16231.68533    16758.837652    18093.560120    19727.615725   
50%       16417.33338    17283.957605    18554.709840    20448.040160   
75%       16602.98143    17809.077557    19015.859560    21168.464595   
max       16788.62948    18334.197510    19477.009280    21888.889030   

       gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
count        2.000000        2.000000        2.000000        2.000000  
mean     20894.045885    24024.175170    26938.778040    29810.188275  
std       3578.979883     4205.533703     5301.853680     6540.991104  
min      18363.324940    21050.413770    23189.801350    25185.009110  
25%      19628.685413    22537.294470    25064.289695    27497.598692  
50%      20894.045885    24024.175170    26938.778040    29810.188275  
75%      22159.406358    25511.055870    28813.266385    32122.777857  
max      23424.766830    26997.936570    30687.754730    34435.367440  

Writing Data

  • As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files.
  • You can use help to get information on how to use to_csv.

For example, to write a DataFrame named americas to a file called processed.csv, you would run americas.to_csv('processed.csv'). The cell below does the same for the transposed Oceania data.


In [22]:
data.T.to_csv('oceania_transposed.csv')
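
One way to check what to_csv wrote is to read the file straight back in. A minimal sketch, assuming a small hand-built table and a hypothetical file name, roundtrip.csv, in a temporary folder:

```python
import os
import tempfile

import pandas as pd

# A small table indexed by country, built from two of the Oceania values.
table = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566]},
    index=pd.Index(['Australia', 'New Zealand'], name='country'),
)

# Write it out, then read it back using the same index column.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
table.to_csv(path)
restored = pd.read_csv(path, index_col='country')

print(restored)
```

The row labels are written as the first column of the file, which is why index_col='country' is needed when reading it back.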

Note about Pandas DataFrames/Series

A DataFrame is a collection of Series. The DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.

What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its support for relational-database operations between DataFrames.
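
A brief sketch of both points, using a small table built by hand from the Oceania values shown earlier: selecting one column gives a Series, and NumPy functions such as np.log10 and np.mean can be applied to it directly.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    {
        'gdpPercap_1952': [10039.59564, 10556.57566],
        'gdpPercap_1957': [10949.64959, 12247.39532],
    },
    index=['Australia', 'New Zealand'],
)

# One column of a DataFrame is a Series.
column = table['gdpPercap_1952']
print(type(column))

# NumPy functions accept a Series; element-wise ones keep the row labels.
print(np.log10(column))
print(np.mean(column))
```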

Selecting values

To access the value at position [i, j] of a DataFrame, we have two options, depending on what i and j mean. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row therefore has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use DataFrame.iloc[..., ...] to select values by their (entry) position

  • Specify a location by numerical position, like a 2D version of selecting characters in a string.

In [23]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.iloc[0, 0])


1601.056136

Use DataFrame.loc[..., ...] to select values by their (entry) label.

  • Specify a location by row and column name, like a 2D version of dictionary keys.

In [24]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])


1601.056136

Result of slicing can be used in further operations.

  • Usually don't just print a slice.
  • All the statistical operators that work on entire dataframes work the same way on slices.
  • E.g., calculate max of a slice.

In [25]:
albania = data.loc["Albania"]
print(albania)
print(albania.describe())


gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64
count      12.000000
mean     3255.366633
std      1192.351513
min      1601.056136
25%      2451.300665
50%      3253.238396
75%      3657.893725
max      5937.029526
Name: Albania, dtype: float64

In [26]:
gdp1952 = data["gdpPercap_1952"]
print(gdp1952)
print(gdp1952.max())
print(gdp1952.idxmax())


country
Albania                    1601.056136
Austria                    6137.076492
Belgium                    8343.105127
Bosnia and Herzegovina      973.533195
Bulgaria                   2444.286648
Croatia                    3119.236520
Czech Republic             6876.140250
Denmark                    9692.385245
Finland                    6424.519071
France                     7029.809327
Germany                    7144.114393
Greece                     3530.690067
Hungary                    5263.673816
Iceland                    7267.688428
Ireland                    5210.280328
Italy                      4931.404155
Montenegro                 2647.585601
Netherlands                8941.571858
Norway                    10095.421720
Poland                     4029.329699
Portugal                   3068.319867
Romania                    3144.613186
Serbia                     3581.459448
Slovak Republic            5074.659104
Slovenia                   4215.041741
Spain                      3834.034742
Sweden                     8527.844662
Switzerland               14734.232750
Turkey                     1969.100980
United Kingdom             9979.508487
Name: gdpPercap_1952, dtype: float64
14734.23275
Switzerland
  • Would get the same result printing data.loc[:,"gdpPercap_1952"]

  • Also get the same result printing data.gdpPercap_1952 (since it's a column name)

Select multiple columns or rows using DataFrame.loc and a named slice.


In [27]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])


             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
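
The difference is easiest to see side by side. The sketch below uses a small table with made-up values; the column names y1962, y1967, and y1972 are hypothetical.

```python
import pandas as pd

table = pd.DataFrame(
    {'y1962': [1, 2, 3], 'y1967': [4, 5, 6], 'y1972': [7, 8, 9]},
    index=['Italy', 'Montenegro', 'Netherlands'],
)

# loc: both end labels are included.
by_label = table.loc['Italy':'Montenegro', 'y1962':'y1967']
print(by_label.shape)      # prints: (2, 2)

# iloc: the end position is excluded, as with list and string slices.
by_position = table.iloc[0:1, 0:1]
print(by_position.shape)   # prints: (1, 1)

# To get the same 2 x 2 block by position, stop one position later.
print(table.iloc[0:2, 0:2].equals(by_label))   # prints: True
```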

Exercise 3

  • Assume we've run the code below:
import pandas as pd
df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the:

  1. GDP per capita of Serbia in 2007.
  2. GDP per capita for all countries in 1982.
  3. GDP per capita for Denmark for all years.
  4. GDP per capita for all countries for years after 1985.

In [ ]:
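# Exercise 3
print(df.loc['Serbia', 'gdpPercap_2007'])   # 1. a single value
print(df['gdpPercap_1982'])                 # 2. one column, all countries
print(df.loc['Denmark', :])                 # 3. one row, all years
print(df.loc[:, 'gdpPercap_1985':])         # 4. no 1985 column exists; the labels are sorted, so the slice starts at gdpPercap_1987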

Batch processing files

Use a for loop to process files given a list of their names.

  • A filename is just a character string.
  • And lists can contain character strings.

In [32]:
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename,'\n', data.min())


data/gapminder_gdp_africa.csv 
 gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv 
 gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64

Use glob.glob to find sets of files whose names match a pattern.

  • In Unix, the term "globbing" means "matching a set of files with a pattern".
  • The most common patterns are:
    • * meaning "match zero or more characters"
    • ? meaning "match exactly one character"
  • Python contains the glob library to provide pattern matching functionality
  • The glob library contains a function also called glob to match file patterns
  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
  • Result is a (possibly empty) list of character strings.
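
The sketch below tries both wildcards on a handful of empty files created in a temporary folder. All file names are made up for the illustration, and sorted() is used only to make the order predictable, since glob.glob does not promise any particular order.

```python
import glob
import os
import tempfile

# Create a few empty files to match against (hypothetical names).
folder = tempfile.mkdtemp()
for name in ['notes1.txt', 'notes2.txt', 'notes10.txt', 'data.csv']:
    with open(os.path.join(folder, name), 'w') as new_file:
        new_file.write('')

# * matches zero or more characters: all three .txt files.
star = sorted(glob.glob(os.path.join(folder, '*.txt')))
print(len(star))       # prints: 3

# ? matches exactly one character: notes1.txt and notes2.txt, not notes10.txt.
question = sorted(glob.glob(os.path.join(folder, 'notes?.txt')))
print(len(question))   # prints: 2

# No match gives an empty list, not an error.
nothing = glob.glob(os.path.join(folder, '*.pdb'))
print(nothing)         # prints: []
```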

In [33]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))


all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

In [34]:
print('all PDB files:', glob.glob('*.pdb'))


all PDB files: []

Use glob and for to process batches of files.

  • Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
  • Use pd.concat() to stack a list of DataFrames on top of one another, matching their columns by name.
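
To see what pd.concat() does on a small scale, the sketch below stacks two hypothetical one-row tables (countries A and B, with made-up values). The second table lists its columns in a different order to show that values are matched by column name, not by position.

```python
import pandas as pd

first = pd.DataFrame(
    {'gdpPercap_1952': [1.0], 'gdpPercap_1957': [2.0]},
    index=pd.Index(['A'], name='country'),
)
# Same columns, but listed in the opposite order.
second = pd.DataFrame(
    {'gdpPercap_1957': [4.0], 'gdpPercap_1952': [3.0]},
    index=pd.Index(['B'], name='country'),
)

# Rows are stacked; columns are lined up by name.
combined = pd.concat([first, second])
print(combined)
print(combined.shape)   # prints: (2, 2)
```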

In [35]:
csvs = []
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename, index_col='country')
    csvs.append(data)
dataAll = pd.concat(csvs)
print(dataAll)


                        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                  
Argentina                  5911.315053     6856.856212     7133.166023   
Bolivia                    2677.326347     2127.686326     2180.972546   
Brazil                     2108.944355     2487.365989     3336.585802   
Canada                    11367.161120    12489.950060    13462.485550   
Chile                      3939.978789     4315.622723     4519.094331   
Colombia                   2144.115096     2323.805581     2492.351109   
Costa Rica                 2627.009471     2990.010802     3460.937025   
Cuba                       5586.538780     6092.174359     5180.755910   
Dominican Republic         1397.717137     1544.402995     1662.137359   
Ecuador                    3522.110717     3780.546651     4086.114078   
El Salvador                3048.302900     3421.523218     3776.803627   
Guatemala                  2428.237769     2617.155967     2750.364446   
Haiti                      1840.366939     1726.887882     1796.589032   
Honduras                   2194.926204     2220.487682     2291.156835   
Jamaica                    2898.530881     4756.525781     5246.107524   
Mexico                     3478.125529     4131.546641     4581.609385   
Nicaragua                  3112.363948     3457.415947     3634.364406   
Panama                     2480.380334     2961.800905     3536.540301   
Paraguay                   1952.308701     2046.154706     2148.027146   
Peru                       3758.523437     4245.256698     4957.037982   
Puerto Rico                3081.959785     3907.156189     5108.344630   
Trinidad and Tobago        3023.271928     4100.393400     4997.523971   
United States             13990.482080    14847.127120    16173.145860   
Uruguay                    5716.766744     6150.772969     5603.357717   
Venezuela                  7689.799761     9802.466526     8422.974165   
Albania                    1601.056136     1942.284244     2312.888958   
Austria                    6137.076492     8842.598030    10750.721110   
Belgium                    8343.105127     9714.960623    10991.206760   
Bosnia and Herzegovina      973.533195     1353.989176     1709.683679   
Bulgaria                   2444.286648     3008.670727     4254.337839   
...                                ...             ...             ...   
Cambodia                    368.469286      434.038336      496.913648   
China                       400.448611      575.987001      487.674018   
Hong Kong China            3054.421209     3629.076457     4692.648272   
India                       546.565749      590.061996      658.347151   
Indonesia                   749.681655      858.900271      849.289770   
Iran                       3035.326002     3290.257643     4187.329802   
Iraq                       4129.766056     6229.333562     8341.737815   
Israel                     4086.522128     5385.278451     7105.630706   
Japan                      3216.956347     4317.694365     6576.649461   
Jordan                     1546.907807     1886.080591     2348.009158   
Korea Dem. Rep.            1088.277758     1571.134655     1621.693598   
Korea Rep.                 1030.592226     1487.593537     1536.344387   
Kuwait                   108382.352900   113523.132900    95458.111760   
Lebanon                    4834.804067     6089.786934     5714.560611   
Malaysia                   1831.132894     1810.066992     2036.884944   
Mongolia                    786.566857      912.662609     1056.353958   
Myanmar                     331.000000      350.000000      388.000000   
Nepal                       545.865723      597.936356      652.396859   
Oman                       1828.230307     2242.746551     2924.638113   
Pakistan                    684.597144      747.083529      803.342742   
Philippines                1272.880995     1547.944844     1649.552153   
Saudi Arabia               6459.554823     8157.591248    11626.419750   
Singapore                  2315.138227     2843.104409     3674.735572   
Sri Lanka                  1083.532030     1072.546602     1074.471960   
Syria                      1643.485354     2117.234893     2193.037133   
Taiwan                     1206.947913     1507.861290     1822.879028   
Thailand                    757.797418      793.577415     1002.199172   
Vietnam                     605.066492      676.285448      772.049160   
West Bank and Gaza         1515.592329     1827.067742     2198.956312   
Yemen Rep.                  781.717576      804.830455      825.623201   

                        gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  \
country                                                                  
Argentina                  8052.953021     9443.038526    10079.026740   
Bolivia                    2586.886053     2980.331339     3548.097832   
Brazil                     3429.864357     4985.711467     6660.118654   
Canada                    16076.588030    18970.570860    22090.883060   
Chile                      5106.654313     5494.024437     4756.763836   
Colombia                   2678.729839     3264.660041     3815.807870   
Costa Rica                 4161.727834     5118.146939     5926.876967   
Cuba                       5690.268015     5305.445256     6380.494966   
Dominican Republic         1653.723003     2189.874499     2681.988900   
Ecuador                    4579.074215     5280.994710     6679.623260   
El Salvador                4358.595393     4520.246008     5138.922374   
Guatemala                  3242.531147     4031.408271     4879.992748   
Haiti                      1452.057666     1654.456946     1874.298931   
Honduras                   2538.269358     2529.842345     3203.208066   
Jamaica                    6124.703451     7433.889293     6650.195573   
Mexico                     5754.733883     6809.406690     7674.929108   
Nicaragua                  4643.393534     4688.593267     5486.371089   
Panama                     4421.009084     5364.249663     5351.912144   
Paraguay                   2299.376311     2523.337977     3248.373311   
Peru                       5788.093330     5937.827283     6281.290855   
Puerto Rico                6929.277714     9123.041742     9770.524921   
Trinidad and Tobago        5621.368472     6619.551419     7899.554209   
United States             19530.365570    21806.035940    24072.632130   
Uruguay                    5444.619620     5703.408898     6504.339663   
Venezuela                  9541.474188    10505.259660    13143.950950   
Albania                    2760.196931     3313.422188     3533.003910   
Austria                   12834.602400    16661.625600    19749.422300   
Belgium                   13149.041190    16672.143560    19117.974480   
Bosnia and Herzegovina     2172.352423     2860.169750     3528.481305   
Bulgaria                   5577.002800     6597.494398     7612.240438   
...                                ...             ...             ...   
Cambodia                    523.432314      421.624026      524.972183   
China                       612.705693      676.900092      741.237470   
Hong Kong China            6197.962814     8315.928145    11186.141250   
India                       700.770611      724.032527      813.337323   
Indonesia                   762.431772     1111.107907     1382.702056   
Iran                       5906.731805     9613.818607    11888.595080   
Iraq                       8931.459811     9576.037596    14688.235070   
Israel                     8393.741404    12786.932230    13306.619210   
Japan                      9847.788607    14778.786360    16610.377010   
Jordan                     2741.796252     2110.856309     2852.351568   
Korea Dem. Rep.            2143.540609     3701.621503     4106.301249   
Korea Rep.                 2029.228142     3030.876650     4657.221020   
Kuwait                    80894.883260   109347.867000    59265.477140   
Lebanon                    6006.983042     7486.384341     8659.696836   
Malaysia                   2277.742396     2849.094780     3827.921571   
Mongolia                   1226.041130     1421.741975     1647.511665   
Myanmar                     349.000000      357.000000      371.000000   
Nepal                       676.442225      674.788130      694.112440   
Oman                       4720.942687    10618.038550    11848.343920   
Pakistan                    942.408259     1049.938981     1175.921193   
Philippines                1814.127430     1989.374070     2373.204287   
Saudi Arabia              16903.048860    24837.428650    34167.762600   
Singapore                  4977.418540     8597.756202    11210.089480   
Sri Lanka                  1135.514326     1213.395530     1348.775651   
Syria                      1881.923632     2571.423014     3195.484582   
Taiwan                     2643.858681     4062.523897     5596.519826   
Thailand                   1295.460660     1524.358936     1961.224635   
Vietnam                     637.123289      699.501644      713.537120   
West Bank and Gaza         2649.715007     3133.409277     3682.831494   
Yemen Rep.                  862.442146     1265.047031     1829.765177   

                        gdpPercap_1982  gdpPercap_1987  gdpPercap_1992  \
country                                                                  
Argentina                  8997.897412     9139.671389     9308.418710   
Bolivia                    3156.510452     2753.691490     2961.699694   
Brazil                     7030.835878     7807.095818     6950.283021   
Canada                    22898.792140    26626.515030    26342.884260   
Chile                      5095.665738     5547.063754     7596.125964   
Colombia                   4397.575659     4903.219100     5444.648617   
Costa Rica                 5262.734751     5629.915318     6160.416317   
Cuba                       7316.918107     7532.924763     5592.843963   
Dominican Republic         2861.092386     2899.842175     3044.214214   
Ecuador                    7213.791267     6481.776993     7103.702595   
El Salvador                4098.344175     4140.442097     4444.231700   
Guatemala                  4820.494790     4246.485974     4439.450840   
Haiti                      2011.159549     1823.015995     1456.309517   
Honduras                   3121.760794     3023.096699     3081.694603   
Jamaica                    6068.051350     6351.237495     7404.923685   
Mexico                     9611.147541     8688.156003     9472.384295   
Nicaragua                  3470.338156     2955.984375     2170.151724   
Panama                     7009.601598     7034.779161     6618.743050   
Paraguay                   4258.503604     3998.875695     4196.411078   
Peru                       6434.501797     6360.943444     4446.380924   
Puerto Rico               10330.989150    12281.341910    14641.587110   
Trinidad and Tobago        9119.528607     7388.597823     7370.990932   
United States             25009.559140    29884.350410    32003.932240   
Uruguay                    6920.223051     7452.398969     8137.004775   
Venezuela                 11152.410110     9883.584648    10733.926310   
Albania                    3630.880722     3738.932735     2497.437901   
Austria                   21597.083620    23687.826070    27042.018680   
Belgium                   20979.845890    22525.563080    25575.570690   
Bosnia and Herzegovina     4126.613157     4314.114757     2546.781445   
Bulgaria                   8224.191647     8239.854824     6302.623438   
...                                ...             ...             ...   
Cambodia                    624.475478      683.895573      682.303175   
China                       962.421380     1378.904018     1655.784158   
Hong Kong China           14560.530510    20038.472690    24757.603010   
India                       855.723538      976.512676     1164.406809   
Indonesia                  1516.872988     1748.356961     2383.140898   
Iran                       7608.334602     6642.881371     7235.653188   
Iraq                      14517.907110    11643.572680     3745.640687   
Israel                    15367.029200    17122.479860    18051.522540   
Japan                     19384.105710    22375.941890    26824.895110   
Jordan                     4161.415959     4448.679912     3431.593647   
Korea Dem. Rep.            4106.525293     4106.492315     3726.063507   
Korea Rep.                 5622.942464     8533.088805    12104.278720   
Kuwait                    31354.035730    28118.429980    34932.919590   
Lebanon                    7640.519521     5377.091329     6890.806854   
Malaysia                   4920.355951     5249.802653     7277.912802   
Mongolia                   2000.603139     2338.008304     1785.402016   
Myanmar                     424.000000      385.000000      347.000000   
Nepal                       718.373095      775.632450      897.740360   
Oman                      12954.791010    18115.223130    18616.706910   
Pakistan                   1443.429832     1704.686583     1971.829464   
Philippines                2603.273765     2189.634995     2279.324017   
Saudi Arabia              33693.175250    21198.261360    24841.617770   
Singapore                 15169.161120    18861.530810    24769.891200   
Sri Lanka                  1648.079789     1876.766827     2153.739222   
Syria                      3761.837715     3116.774285     3340.542768   
Taiwan                     7426.354774    11054.561750    15215.657900   
Thailand                   2393.219781     2982.653773     4616.896545   
Vietnam                     707.235786      820.799445      989.023149   
West Bank and Gaza         4336.032082     5107.197384     6017.654756   
Yemen Rep.                 1977.557010     1971.741538     1879.496673   

                        gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                 
Argentina                 10967.281950     8797.640716    12779.379640  
Bolivia                    3326.143191     3413.262690     3822.137084  
Brazil                     7957.980824     8131.212843     9065.800825  
Canada                    28954.925890    33328.965070    36319.235010  
Chile                     10118.053180    10778.783850    13171.638850  
Colombia                   6117.361746     5755.259962     7006.580419  
Costa Rica                 6677.045314     7723.447195     9645.061420  
Cuba                       5431.990415     6340.646683     8948.102923  
Dominican Republic         3614.101285     4563.808154     6025.374752  
Ecuador                    7429.455877     5773.044512     6873.262326  
El Salvador                5154.825496     5351.568666     5728.353514  
Guatemala                  4684.313807     4858.347495     5186.050003  
Haiti                      1341.726931     1270.364932     1201.637154  
Honduras                   3160.454906     3099.728660     3548.330846  
Jamaica                    7121.924704     6994.774861     7320.880262  
Mexico                     9767.297530    10742.440530    11977.574960  
Nicaragua                  2253.023004     2474.548819     2749.320965  
Panama                     7113.692252     7356.031934     9809.185636  
Paraguay                   4247.400261     3783.674243     4172.838464  
Peru                       5838.347657     5909.020073     7408.905561  
Puerto Rico               16999.433300    18855.606180    19328.709010  
Trinidad and Tobago        8792.573126    11460.600230    18008.509240  
United States             35767.433030    39097.099550    42951.653090  
Uruguay                    9230.240708     7727.002004    10611.462990  
Venezuela                 10165.495180     8605.047831    11415.805690  
Albania                    3193.054604     4604.211737     5937.029526  
Austria                   29095.920660    32417.607690    36126.492700  
Belgium                   27561.196630    30485.883750    33692.605080  
Bosnia and Herzegovina     4766.355904     6018.975239     7446.298803  
Bulgaria                   5970.388760     7696.777725    10680.792820  
...                                ...             ...             ...  
Cambodia                    734.285170      896.226015     1713.778686  
China                      2289.234136     3119.280896     4959.114854  
Hong Kong China           28377.632190    30209.015160    39724.978670  
India                      1458.817442     1746.769454     2452.210407  
Indonesia                  3119.335603     2873.912870     3540.651564  
Iran                       8263.590301     9240.761975    11605.714490  
Iraq                       3076.239795     4390.717312     4471.061906  
Israel                    20896.609240    21905.595140    25523.277100  
Japan                     28816.584990    28604.591900    31656.068060  
Jordan                     3645.379572     3844.917194     4519.461171  
Korea Dem. Rep.            1690.756814     1646.758151     1593.065480  
Korea Rep.                15993.527960    19233.988180    23348.139730  
Kuwait                    40300.619960    35110.105660    47306.989780  
Lebanon                    8754.963850     9313.938830    10461.058680  
Malaysia                  10132.909640    10206.977940    12451.655800  
Mongolia                   1902.252100     2140.739323     3095.772271  
Myanmar                     415.000000      611.000000      944.000000  
Nepal                      1010.892138     1057.206311     1091.359778  
Oman                      19702.055810    19774.836870    22316.192870  
Pakistan                   2049.350521     2092.712441     2605.947580  
Philippines                2536.534925     2650.921068     3190.481016  
Saudi Arabia              20586.690190    19014.541180    21654.831940  
Singapore                 33519.476600    36023.105400    47143.179640  
Sri Lanka                  2664.477257     3015.378833     3970.095407  
Syria                      4014.238972     4090.925331     4184.548089  
Taiwan                    20206.820980    23235.423290    28718.276840  
Thailand                   5852.625497     5913.187529     7458.396327  
Vietnam                    1385.896769     1764.456677     2441.576404  
West Bank and Gaza         7110.667619     4515.487575     3025.349798  
Yemen Rep.                 2117.484526     2234.820827     2280.769906  

[142 rows x 12 columns]

Plotting

matplotlib is the most widely used scientific plotting library in Python.

  • We commonly use a sub-library called matplotlib.pyplot.
  • The Jupyter Notebook will render plots inline if we ask it to using a "magic" command.

In [36]:
%matplotlib inline
import matplotlib.pyplot as plt

Simple plots are then (fairly) simple to create.


In [37]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')


Out[37]:
Text(0, 0.5, 'Position (km)')

Plot data directly from a Pandas dataframe.

  • We can also plot Pandas dataframes.
  • This implicitly uses matplotlib.pyplot.
  • Before plotting, we convert the column headings from strings to integers, since they represent years.

In [38]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

#print(data)
# Extract year from last 4 characters of each column name
# (str.strip('gdpPercap_') would remove a set of characters, not a prefix)
years = data.columns.str[-4:]
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)

#print(data)
#print(data.loc['Australia'])

plt.plot(data.loc['Australia'])


Out[38]:
[<matplotlib.lines.Line2D at 0x11e1a4e80>]
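
To see the heading conversion in isolation, here is a minimal sketch on a hand-made frame. The frame toy and its values are invented for illustration; only the column-name pattern gdpPercap_YYYY comes from the data set.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the Gapminder column names.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=['Region A', 'Region B'],
    columns=['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962'],
)

# Keep the last four characters of each heading, then convert to integers.
years = toy.columns.str[-4:].astype(int)
toy.columns = years

# The headings are now numbers, so they can serve as x values in a plot.
print(toy.columns.tolist())
```

Slicing by position is used here because it matches the stated intent (the last four characters); it assumes every heading ends in a four-digit year.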

Select and transform data, then plot it.

  • By default, DataFrame.plot plots with the rows as the X axis.
  • We can transpose the data in order to plot multiple series.

In [39]:
plt.plot(data.T)
plt.ylabel('GDP per capita')
plt.xlabel('Year')
plt.title('GDP Per Capita for Oceania (1952-2007)')


Out[39]:
Text(0.5, 1.0, 'GDP Per Capita for Oceania (1952-2007)')
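
The effect of transposing can be checked on a small frame. This is a sketch with invented values (toy, Region A and Region B are hypothetical names), using DataFrame.plot rather than plt.plot to show the behaviour described in the bullets above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; rows are regions and columns are years.
toy = pd.DataFrame(
    {1952: [10.0, 12.0], 1957: [11.0, 14.0], 1962: [13.0, 15.0]},
    index=['Region A', 'Region B'],
)

# After transposing, years form the index and each region is a column,
# so DataFrame.plot draws one line per region against the years.
ax = toy.T.plot()
ax.set_xlabel('Year')
ax.set_ylabel('Value')
```

Without the transpose, the regions would sit on the X axis and each year would become its own line, which is rarely what is wanted for a time series.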

Many styles of plot are available.


In [40]:
plt.style.use('ggplot')

plt.bar(list(data.columns), data.loc['Australia'])

plt.ylabel('GDP per capita')


Out[40]:
Text(0, 0.5, 'GDP per capita')

More Custom Styles

  • The command is plt.plot(x, y)
  • The color / format of markers can also be specified as an optional argument: e.g. 'b-' is a blue line, 'g--' is a green dashed line.

Get Australia data from dataframe


In [41]:
years = data.columns
gdp_australia = data.loc['Australia']

plt.plot(years, gdp_australia, 'g--')


Out[41]:
[<matplotlib.lines.Line2D at 0x11e420f28>]

Plotting multiple data

Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data. This can be done in matplotlib in two stages:

  • Provide a label for each dataset in the figure:
    plt.plot(years, gdp_australia, label='Australia')
    plt.plot(years, gdp_nz, label='New Zealand')
    

Adding a Legend

  • Instruct matplotlib to create the legend.
    plt.legend()
    
    By default matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, this can be done with the loc= argument, e.g. to place the legend in the upper left corner of the plot, specify loc='upper left'.

In [42]:
# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']

# Plot with differently-colored lines.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')


Out[42]:
Text(0, 0.5, 'GDP per capita ($)')
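
The two stages (label each dataset, then ask for the legend) can also be tried without the Gapminder files. The sketch below uses invented numbers and hypothetical names (series_a, series_b); it reads the legend text back to confirm that it comes from the label= arguments.

```python
import matplotlib.pyplot as plt

# Hypothetical series, invented for this sketch.
years = [1952, 1957, 1962]
series_a = [10.0, 11.0, 13.0]
series_b = [12.0, 14.0, 15.0]

plt.figure()  # start a fresh figure so earlier plots are not reused

# 'b-' is a solid blue line; 'g--' is a dashed green line.
line_a = plt.plot(years, series_a, 'b-', label='Series A')[0]
line_b = plt.plot(years, series_b, 'g--', label='Series B')[0]

# legend() collects the label= text given to each plot call.
legend = plt.legend(loc='upper left')
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)  # ['Series A', 'Series B']
```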

Scatterplots

  • Plot a scatter plot correlating the GDP of Australia and New Zealand
  • Use either plt.scatter or DataFrame.plot.scatter

In [43]:
plt.scatter(gdp_australia, gdp_nz)


Out[43]:
<matplotlib.collections.PathCollection at 0x11e563860>
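
The cell above uses plt.scatter. As a sketch of the other option named in the bullets, DataFrame.plot.scatter takes the names of two columns. The frame below is invented for illustration (Country A and Country B are hypothetical), arranged with one column per country.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; each column holds one country's series.
toy = pd.DataFrame(
    {'Country A': [10.0, 11.0, 13.0], 'Country B': [12.0, 14.0, 15.0]},
    index=[1952, 1957, 1962],
)

# One point per year: x from the first column, y from the second.
# The axis labels are taken from the column names.
ax = toy.plot.scatter(x='Country A', y='Country B')
```

With the workshop's frame, where countries are rows, the data would need to be transposed first so that each country becomes a column.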

Saving your plot to a file

If you are satisfied with the plot you see, you may want to save it to a file, perhaps to include it in a publication. The savefig function in the matplotlib.pyplot module does this. Calling it, e.g. with

plt.savefig('my_figure.png')

will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).

Note that functions in plt refer to a global figure variable and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.

When using dataframes, data is often generated and plotted to screen in one line, and plt.savefig may not seem applicable. One way to save the figure to a file is then to

  • save a reference to the current figure in a local variable (with plt.gcf)
  • call the savefig method on that variable.

In [44]:
plt.scatter(gdp_australia, gdp_nz)
plt.savefig('my_fig.png')
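
The plt.gcf approach described in the bullets might look like the following sketch. The frame, its values, and the output file name are assumptions made for illustration; the file is written to the system's temporary directory so that nothing in the working folder is overwritten.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for illustration.
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})

# Generate and draw the plot in one line, as is common with dataframes.
toy.plot(kind='scatter', x='x', y='y')

# Grab a reference to the current figure, then save through that reference.
fig = plt.gcf()
out_path = os.path.join(tempfile.gettempdir(), 'my_figure_sketch.png')
fig.savefig(out_path)
```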


Data Subsets

Use comparisons to select data based on value.

  • Comparison is applied element by element.
  • Returns a similarly-shaped dataframe of True and False.

In [29]:
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)


Subset of data:
              gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993

Where are values large?
              gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                 False            True            True
Montenegro            False           False           False
Netherlands            True            True            True
Norway                 True            True            True
Poland                False           False           False

Select values or NaN using a Boolean mask.

  • A frame full of Booleans is sometimes called a mask because of how it can be used.

In [30]:
mask = subset > 10000
print(subset[mask])


             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy                   NaN     10022.40131     12269.27378
Montenegro              NaN             NaN             NaN
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
Poland                  NaN             NaN             NaN
  • Get the value where the mask is true, and NaN (Not a Number) where it is false.
  • Useful because NaNs are ignored by operations like max, min, average, etc.

In [31]:
print(subset[subset > 10000].describe())


       gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
count        2.000000        3.000000        3.000000
mean     13120.625535    13915.843047    16676.358320
std        466.373656     3408.589070     3817.597015
min      12790.849560    10022.401310    12269.273780
25%      12955.737547    12692.826335    15532.009725
50%      13120.625535    15363.251360    18794.745670
75%      13285.513523    15862.563915    18879.900590
max      13450.401510    16361.876470    18965.055510
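
To check by hand that NaN values are skipped, the same masking steps can be run on a tiny frame. This is a sketch with invented numbers; toy, a and b are hypothetical names.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to check by hand.
toy = pd.DataFrame(
    {'a': [5.0, 15.0, 25.0], 'b': [1.0, 2.0, 30.0]},
    index=['row1', 'row2', 'row3'],
)

mask = toy > 10          # element-by-element comparison, same shape as toy
masked = toy[mask]       # values where mask is True, NaN elsewhere

# NaN entries are skipped, so each statistic uses only the surviving values.
print(masked.count())    # number of non-NaN values per column
print(masked.mean())     # mean of the non-NaN values per column
```

Column a keeps 15 and 25, so its mean is 20; column b keeps only 30, so its count is 1 and its mean is 30.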

Workshop materials are derived from work that is Copyright ©Software Carpentry.