Python for Humanists (Part II)

This workshop is licensed under a Creative Commons Attribution 4.0 International License.

See Part I

Loops

A for loop executes commands once for each value in a collection.

  • Doing calculations on the values in a list one by one is as painful as working with pressure_001, pressure_002, etc.
  • A for loop tells Python to execute some statements once for each value in a list, a character string, or some other collection.
  • "for each thing in this group, do these operations"

In [ ]:
print(2)
print(3)
print(5)
  • This is the equivalent using a for loop:

In [ ]:

The first line of the for loop must end with a colon, and the body must be indented.

  • The colon at the end of the first line signals the start of a block of statements.
  • Python uses indentation rather than {} or begin/end to show nesting.
    • Any consistent indentation is legal, but almost everyone uses four spaces.

In [ ]:
for number in [2, 3, 5]:
print(number)
  • Indentation is always meaningful in Python.

In [ ]:
firstName="Jon"
  lastName="Smith"
  • This error can be fixed by removing the extra spaces at the beginning of the second line.

A for loop is made up of a collection, a loop variable, and a body.


In [ ]:
for number in [2, 3, 5]:
    print(number)
  • The collection, [2, 3, 5], is what the loop is being run on.
  • The body, print(number), specifies what to do for each value in the collection.
  • The loop variable, number, is what changes for each iteration of the loop.
    • The "current thing".

Loop variables can be called anything.

  • As with all variables, loop variables are:
    • Created on demand.
    • Meaningless: their names can be anything at all.

In [ ]:
for kitten in [2, 3, 5]:
    print(kitten)

The body of a loop can contain many statements.

  • But no loop should be more than a few lines long.
  • Hard for human beings to keep larger chunks of code in mind.
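
As a small sketch of a loop body with several statements (the names primes, p, squared, and cubed are illustrative and do not come from the text above):

```python
# Every statement in the indented body runs once per value in the list.
primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)
```

After the loop finishes, the variables still hold the values from the last pass.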

Exercise 1

  • Each of the three cells below contains some code that is incomplete
    • The comment above the code will tell you what the desired output is
  • Fill in the blanks and run the cell
    • Keep trying until you get the desired output

In [ ]:
# Total length of the strings in the list: ["red", "green", "blue"] => 12
# Desired output: 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

In [ ]:
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
# Desired output: [3, 5, 4]
lengths = ____
for word in ["red", "green", "blue"]:
    lengths.____(____)
print(lengths)

In [ ]:
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
# Desired output: "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)

In [ ]:
# Create acronym: ["red", "green", "blue"] => "RGB"
# write the whole thing
words = ["red", "green", "blue"]

Conditionals

Use if statements to control whether or not a block of code is executed.

  • An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
  • Structure is similar to a for statement:
    • First line opens with if and ends with a colon
    • Body containing one or more statements is indented (usually by 4 spaces)

In [ ]:
mass = 3.54

mass = 2.07
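
One way the cell above might be completed, as a hedged sketch: the threshold of 3.0 and the message are assumptions borrowed from the later cells in this section.

```python
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')   # runs, because 3.54 > 3.0

mass = 2.07
if mass > 3.0:
    print(mass, 'is large')   # skipped, because 2.07 is not greater than 3.0
```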

Conditionals are often used inside loops.

  • Not much point using a conditional when we know the value (as above).
  • But useful when we have a collection to process.

In [ ]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
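
A minimal sketch of a conditional inside a loop. The threshold of 3.0 and the list named large are assumptions added for illustration:

```python
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
large = []                      # hypothetical list, used only to collect results
for m in masses:
    if m > 3.0:                 # the test is evaluated again for every value
        print(m, 'is large')
        large.append(m)
print(large)
```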

Use else to execute a block of code when an if condition is not true.

  • else can be used following an if.
  • Allows us to specify an alternative to execute when the if branch isn't taken.

In [ ]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Use elif to specify additional tests.

  • May want to provide several alternative choices, each with its own test.
  • Use elif (short for "else if") and a condition to specify these.
  • Always associated with an if.
  • Must come before the else (which is the "catch all").

In [ ]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Conditions are tested once, in order.

  • Python steps through the branches of the conditional in order, testing each in turn.
  • So ordering matters.

In [ ]:
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')

  • Does not automatically go back and re-evaluate if values change.
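
The two points above can be sketched as follows. The variable letter, the final else branch, and the velocity values are hypothetical and only serve the illustration:

```python
# Ordering: test the most restrictive condition first so each grade
# reaches the branch intended for it.
grade = 85
if grade >= 90:
    letter = 'A'
elif grade >= 80:
    letter = 'B'
elif grade >= 70:
    letter = 'C'
else:
    letter = 'below C'
print('grade is', letter)

# No re-evaluation: the condition is checked once, before the body runs.
velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0   # now greater than 20.0, but the if branch is not revisited
print(velocity)
```

To have the test checked again after a value changes, the conditional would need to sit inside a loop.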

Exercise 2

  • What does the program below print?
pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)

In [ ]:
# Exercise 2

Working with Data

Use with open() to open any single file

  • with open() can open files to read in data or to write out data to a file
  • If writing and the file doesn't exist, Python will create it for you.
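
A minimal sketch of writing and then reading a text file with with open(). The file name notes.txt and the temporary directory are assumptions made so the example cleans up after itself:

```python
import os
import tempfile

# Work in a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'notes.txt')   # hypothetical file name

    # Mode 'w' creates the file if it does not exist (and overwrites it if it does).
    with open(path, 'w') as outfile:
        outfile.write('first line\n')
        outfile.write('second line\n')

    # Mode 'r' (the default) opens the file for reading.
    with open(path, 'r') as infile:
        lines = infile.read().splitlines()

print(lines)
```

The with statement closes each file automatically when its indented block ends.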

Writing a .csv file


In [ ]:
import csv

primes = [2,3,5]
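
One possible way to write the list to a .csv file and read it back, sketched with the standard csv module. The file name primes.csv and the temporary directory are assumptions:

```python
import csv
import os
import tempfile

primes = [2, 3, 5]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'primes.csv')   # hypothetical file name

    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(primes)                  # writes a single row: 2,3,5

    with open(path, 'r', newline='') as infile:
        rows = [row for row in csv.reader(infile)]

print(rows)   # the values come back as strings, not integers
```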

Reading a .csv file


In [ ]:

pandas

Use the Pandas library to open tabular data.

  • Pandas is a widely-used Python library for statistics, particularly on tabular data.
  • Borrows many features from R's dataframes.
    • A 2-dimensional table whose columns have names and potentially have different data types.
  • Load it with import pandas.
  • Read a Comma Separated Values (CSV) data file with pandas.read_csv.
    • Argument is the name of the file to be read.
    • Assign result to a variable to store the data that was read.

In [ ]:
import pandas as pd
  • The columns in a dataframe are the observed variables, and the rows are the observations.
  • Pandas uses backslash \ to show wrapped lines when output is too wide to fit the screen.

Use index_col to specify that a column's values should be used as row headings.

  • Row headings are numbers (0 and 1 in this case).
  • Really want to index by country.
  • Pass the name of the column to read_csv as its index_col parameter to do this.

In [ ]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
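
The effect of index_col can be sketched without the data file by reading from an in-memory string. The numbers below are made up and only mimic the layout of the real file:

```python
import io
import pandas as pd

# Made-up numbers standing in for the real file; only the layout matters here.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Australia,100.0,110.0
New Zealand,200.0,220.0
"""

raw = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(raw.index))       # default integer row labels
print(list(indexed.index))   # country names used as row labels
```

With index_col set, country is no longer an ordinary column; it becomes the row index.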

Use DataFrame.info to find out more about a dataframe.


In [ ]:
data.info()
  • This is a DataFrame
  • Two rows named 'Australia' and 'New Zealand'
  • Twelve columns, each of which has two actual 64-bit floating point values.
    • We will talk later about null values, which are used to represent missing observations.
  • Uses 208 bytes of memory.

The DataFrame.columns variable stores information about the dataframe's columns.

  • Note that this is data, not a method.
    • Like math.pi.
    • So do not use () to try to call it.
  • Called a member variable, or just member.

In [ ]:
print(data.columns)

Use DataFrame.T to transpose a dataframe.

  • Sometimes want to treat columns as rows and vice versa.
  • Transpose (written .T) doesn't copy the data, just changes the program's view of it.
  • Like columns, it is a member variable.

In [ ]:
print(data.T)

Use DataFrame.describe to get summary statistics about data.

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.


In [ ]:
print(data.describe())

Writing Data

  • As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files.
  • You can use help to get information on how to use to_csv.

In order to write the DataFrame americas to a file called processed.csv, execute the following command: americas.to_csv('processed.csv')
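
The americas dataframe is not defined in this notebook, so the following is only a sketch with a toy stand-in and made-up values. It writes to a temporary directory and reads the file back to check the round trip:

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the `americas` dataframe, with made-up values.
americas = pd.DataFrame(
    {'gdpPercap_1952': [100.0, 200.0]},
    index=pd.Index(['Canada', 'Mexico'], name='country'),
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'processed.csv')
    americas.to_csv(path)                               # the index is written as the first column
    restored = pd.read_csv(path, index_col='country')   # read it back to check

print(restored)
```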


In [ ]:

Note about Pandas DataFrames/Series

A DataFrame is a collection of Series. The DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its relational-database operations between DataFrames.

Selecting values

To access a value at position [i,j] of a DataFrame, we have two options, depending on what i and j mean. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use DataFrame.iloc[..., ...] to select values by their (entry) position

  • Can specify location by numerical index analogously to 2D version of character selection in strings.

In [ ]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print()

Use DataFrame.loc[..., ...] to select values by their (entry) label.

  • Can specify location by row name analogously to 2D version of dictionary keys.

In [ ]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print()
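
Selection by position and by label can be compared on a toy dataframe. The values below are made up; the real file has one row per European country:

```python
import pandas as pd

# Toy frame with made-up values.
data = pd.DataFrame(
    {'gdpPercap_1952': [100.0, 200.0], 'gdpPercap_1957': [110.0, 220.0]},
    index=pd.Index(['Albania', 'Austria'], name='country'),
)

by_position = data.iloc[0, 0]                       # first row, first column
by_label = data.loc['Albania', 'gdpPercap_1952']    # the same cell, named explicitly
print(by_position, by_label)
```

Both expressions reach the same cell; iloc counts from zero, while loc uses the row and column names.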

Result of slicing can be used in further operations.

  • Usually don't just print a slice.
  • All the statistical operators that work on entire dataframes work the same way on slices.
  • E.g., calculate max of a slice.

In [ ]:
albania = data.loc["Albania"]
print()
print()

In [ ]:
gdp1952 = data["gdpPercap_1952"]
print()
print()
  • Would get the same result printing data.loc[:,"gdpPercap_1952"]

  • Also get the same result printing data.gdpPercap_1952 (since it's a column name)

Select multiple columns or rows using DataFrame.loc and a named slice.


In [ ]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
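
The difference between the two kinds of slice can be sketched on a toy dataframe (made-up values, four rows and three columns):

```python
import pandas as pd

# Toy frame with made-up values.
data = pd.DataFrame(
    {'gdpPercap_1962': [1.0, 2.0, 3.0, 4.0],
     'gdpPercap_1967': [5.0, 6.0, 7.0, 8.0],
     'gdpPercap_1972': [9.0, 10.0, 11.0, 12.0]},
    index=['Italy', 'Montenegro', 'Netherlands', 'Norway'],
)

by_label = data.loc['Italy':'Netherlands', 'gdpPercap_1962':'gdpPercap_1967']
by_position = data.iloc[0:2, 0:1]

print(by_label.shape)      # both end labels are included
print(by_position.shape)   # the stop positions are excluded
```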

Exercise 3

  • Assume we've only run the code below:
import pandas as pd
df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the:

  1. GDP per capita of Serbia in 2007.
  2. GDP per capita for all countries in 1982.
  3. GDP per capita for Denmark for all years.
  4. GDP per capita for all countries for years after 1985.

In [ ]:
# Exercise 3

Batch processing files

Use a for loop to process files given a list of their names.

  • A filename is just a character string.
  • And lists can contain character strings.

In [ ]:
filenames = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

Use glob.glob to find sets of files whose names match a pattern.

  • In Unix, the term "globbing" means "matching a set of files with a pattern".
  • The most common patterns are:
    • * meaning "match zero or more characters"
    • ? meaning "match exactly one character"
  • Python contains the glob library to provide pattern matching functionality
  • The glob library contains a function also called glob to match file patterns
  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
  • Result is a (possibly empty) list of character strings.
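
A small sketch of the two patterns, run against a temporary directory so it does not depend on the workshop data. The file names are hypothetical:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create three empty files with hypothetical names.
    for name in ['a.csv', 'b.csv', 'notes.txt']:
        with open(os.path.join(tmpdir, name), 'w'):
            pass

    # '*' matches zero or more characters, so this finds every .csv file.
    csv_paths = glob.glob(os.path.join(tmpdir, '*.csv'))
    # '?' matches exactly one character, so '?.csv' also matches a.csv and b.csv.
    one_char = glob.glob(os.path.join(tmpdir, '?.csv'))

# glob does not promise any particular order, so sort before comparing.
csv_names = sorted(os.path.basename(p) for p in csv_paths)
print(csv_names)
```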

In [ ]:
print()

In [ ]:
print()

Use glob and for to process batches of files.

  • Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
  • Use pd.concat() to join DataFrames by column names.

In [ ]:
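import glob

csvs = []
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename, index_col='country')
    csvs.append(data)  # keep each dataframe so it is not overwritten
combined = pd.concat(csvs)  # stack the dataframes, matching columns by name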
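
The whole pattern (find files, read each one, combine the results) can be sketched end to end with two tiny files written to a temporary directory. The file names, country names, and values are all made up:

```python
import glob
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    # Write two tiny CSV files with made-up values and hypothetical names.
    samples = {
        'gapminder_gdp_a.csv': 'country,gdpPercap_1952\nAlpha,1.0\nBeta,2.0\n',
        'gapminder_gdp_b.csv': 'country,gdpPercap_1952\nGamma,3.0\n',
    }
    for name, text in samples.items():
        with open(os.path.join(tmpdir, name), 'w') as outfile:
            outfile.write(text)

    frames = []
    for filename in sorted(glob.glob(os.path.join(tmpdir, 'gapminder_*.csv'))):
        frames.append(pd.read_csv(filename, index_col='country'))

# Stack the frames on top of each other; columns are matched by name.
combined = pd.concat(frames)
print(combined)
```

Sorting the glob result is a design choice here: it makes the row order of the combined dataframe predictable.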

Plotting

matplotlib is the most widely used scientific plotting library in Python.

  • Commonly use a sub-library called matplotlib.pyplot.
  • The Jupyter Notebook will render plots inline if we ask it to using a "magic" command.

In [ ]:
%matplotlib inline

Simple plots are then (fairly) simple to create.


In [ ]:
import matplotlib.pyplot as plt

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
plt.plot(time, position)

Plot data directly from a Pandas dataframe.

  • We can also plot Pandas dataframes.
  • This implicitly uses matplotlib.pyplot.
  • Before plotting, we convert the column headings from a string to integer data type, since they represent numerical values

In [ ]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

# Extract year from last 4 characters of each column name
years = data.columns.str[-4:]

# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)

Select and transform data, then plot it.

  • By default, DataFrame.plot plots with the rows as the X axis.
  • We can transpose the data in order to plot multiple series.
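
A minimal sketch of plotting a transposed dataframe, using made-up values. The matplotlib.use('Agg') line is an assumption that lets the sketch run without a display; drop it in a notebook that uses %matplotlib inline.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with made-up values: one row per country, one column per year.
data = pd.DataFrame(
    {1952: [100.0, 200.0], 1957: [110.0, 220.0], 1962: [125.0, 240.0]},
    index=['Australia', 'New Zealand'],
)

# After transposing, years are the rows (X axis) and each country is one line.
ax = data.T.plot()
ax.set_ylabel('GDP per capita')
print(len(ax.get_lines()))
```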

In [ ]:

Many styles of plot are available.


In [ ]:

More Custom Styles

  • The command is plt.plot(x, y)
  • The color / format of markers can also be specified as an optional argument: e.g. 'b-' is a blue line, 'g--' is a green dashed line.

Get Australia data from dataframe


In [ ]:

Plotting multiple data

Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data. This can be done in matplotlib in two stages:

  • Provide a label for each dataset in the figure:
    plt.plot(years, gdp_australia, label='Australia')
    plt.plot(years, gdp_nz, label='New Zealand')
    

Adding a Legend

  • Instruct matplotlib to create the legend.
    plt.legend()
    
    By default matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, this can be done with the loc= argument, e.g., to place the legend in the upper left corner of the plot, specify loc='upper left'.
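
The two stages above, together with the format strings from the previous section, can be sketched as follows. The values are made up, and the matplotlib.use('Agg') line is only there so the sketch runs without a display (drop it in a notebook that uses %matplotlib inline).

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, an assumption for running outside a notebook
import matplotlib.pyplot as plt

# Made-up values standing in for the two countries' data.
years = [1952, 1957, 1962]
gdp_australia = [100.0, 110.0, 125.0]
gdp_nz = [200.0, 220.0, 240.0]

plt.figure()                                                # start from a fresh figure
plt.plot(years, gdp_australia, 'b-', label='Australia')     # blue solid line
plt.plot(years, gdp_nz, 'g--', label='New Zealand')         # green dashed line
legend = plt.legend(loc='upper left')

labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```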

In [ ]:
# Select two countries' worth of data.

# Plot with differently-colored markers.


# Create legend.

Scatterplots

  • Plot a scatter plot correlating the GDP of Australia and New Zealand
  • Use either plt.scatter or DataFrame.plot.scatter

In [ ]:

Saving your plot to a file

If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with

plt.savefig('my_figure.png')

will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).

Note that functions in plt refer to a global figure variable and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.

When using dataframes, data is often generated and plotted to screen in one line, so there is no obvious place to call plt.savefig. One way to save the figure to a file is then to

  • create the plot, then save a reference to the current figure in a local variable (with plt.gcf)
  • call the savefig method on that variable.
data.plot(kind='bar')
fig = plt.gcf() # get current figure (the one the plot was just drawn on)
fig.savefig('my_figure.png')

In [ ]:
plt.scatter(gdp_australia, gdp_nz)

Data Subsets

Use comparisons to select data based on value.

  • Comparison is applied element by element.
  • Returns a similarly-shaped dataframe of True and False.

In [ ]:
# Reload the Europe data, then use a subset to keep output readable.
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)

Select values or NaN using a Boolean mask.

  • A frame full of Booleans is sometimes called a mask because of how it can be used.

In [ ]:
mask = subset > 10000
print(subset[mask])
  • Get the value where the mask is true, and NaN (Not a Number) where it is false.
  • Useful because NaNs are ignored by operations like max, min, average, etc.

In [ ]:
print(subset[subset > 10000].describe())
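
The masking steps above can be tried on a toy dataframe with made-up values, which makes it easy to see where the NaNs land:

```python
import pandas as pd

# Toy frame with made-up values.
subset = pd.DataFrame(
    {'gdpPercap_1962': [8000.0, 12000.0], 'gdpPercap_1967': [10500.0, 9000.0]},
    index=['Italy', 'Norway'],
)

mask = subset > 10000          # dataframe of True/False, same shape as subset
masked = subset[mask]          # values where the mask is True, NaN where it is False

print(masked)
print(masked.max())            # NaN entries are skipped when computing the max
```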

Workshop materials are derived from work that is Copyright © Software Carpentry.