WORKSHOP
"Python for Journalists and Data Nerds"


Marta Alonso @malonfe
Kathleen Burnett @kmb232
Based on an idea by @crisodisy from PyLadiesBCN

1. INTRO TO JUPYTER NOTEBOOK

1.1 Jupyter Kernel

Be sure you are using the appropriate kernel.

Go to the menu "Kernel > Change Kernel" and select the one that refers to your virtual environment.

1.2 Text cells

Double-click on this cell and you will see the text without formatting.

This allows you to edit this block of text, which is written in Markdown. You can also use HTML to format it.

Hit shift + enter or shift + return on your keyboard to show the formatted text again. This is called running the cell, and you can also do it using the run button in the toolbar.

1.3 Code cells

One great advantage of Jupyter notebooks is that you can show your Python code alongside the results, add comments to the code, or even add blocks of text using Markdown. The following cell is a code cell.


In [ ]:
# Hit shift + enter or use the run button to run this cell and see the results

print('Hello PyLadies')

In [ ]:
# The last line of every code cell will be displayed by default, 
# even if you don't print it. Run this cell to see how this works.

2 + 2 # The result of this line will not be displayed
3 + 3 # The result of this line will be displayed, because it is the last line of the cell

1.4 Creating cells

To create a new code cell, click "Insert > Insert Cell [Above or Below]". A code cell will automatically be created.

To create a new markdown cell, first follow the process above to create a code cell, then change the type from "Code" to "Markdown" using the dropdown next to the run, stop, and restart buttons.

You can also use the + button to create a cell and the scissors button to cut it.

1.5 Re-running cells

If you find a bug in your code, you can always update the cell and re-run it. However, any cells that come afterward won't be automatically updated. Try it out below. First run each of the three cells. The first two don't have any output, but you will be able to tell they've run because a number will appear next to them, for example, "In [5]". The third cell should output the message "We love Python!"


In [ ]:
who = "Python"

In [ ]:
message = "We love " + who + "!"

In [ ]:
message

Once you've run all three cells, try modifying the first one to set who to anything else.

Rerun the first and third cells without rerunning the second and see what happens: the message still shows the old value. You need to rerun all three cells to get the expected result.

Often, after changing a cell, you'll want to rerun all the cells below it. You can do that quickly by clicking "Cell > Run All Below".

1.6 Autocompletion

Use TAB to autocomplete Python methods and variables.


In [ ]:
# Try typing "len(mes" and hitting TAB to autocomplete the variable name
print(len(message))

1.7 Final tips

If the output is too long, click on the area to the left of the result:

  • One click will confine the result to a box
  • Double click will hide the result

One final thing to remember: if you shut down the kernel after saving your notebook, the cells' output will still show up as you left it at the end of your session when you start the notebook back up. However, the state of the kernel will be reset. If you are actively working on a notebook, remember to re-run your cells to set up your working environment to really pick up where you last left off.

2. WORKSHOP

Handy documentation:

Pandas Docs

Pandas API Reference

Pandas DataFrames

THE "ZIKA REPORTS"

2.1 Obtaining and Preparing data

For this exercise we are going to use a real data set of Zika infection cases. The original data is provided by the Centers for Disease Control and Prevention (CDC) as raw HTML. You can find it on the CDC site, from where it is scraped and formatted as CSV files.

The CSVs are available from:

You've already downloaded all the data, so you should have everything you need to start right now.

2.2 Check that we have all necessary files in current directory

First of all, we have to ensure that all the files we downloaded are in the working directory and all data files are available.

In this exercise we are going to work with data stored in the data folder. Check that you have that folder by running the following cell.


In [ ]:
import os
print(os.getcwd())
print(os.listdir('.'))
print(os.listdir('data'))

2.3 Load .csv in a DataFrame with the data

We have to import the pandas library.


In [ ]:
import pandas as pd 
#from now on we use pd as an abbreviation for pandas

We create a Pandas DataFrame from a CSV file using the read_csv method. The column names will be taken from the CSV header row.

Check the Pandas Input/Output documentation for other available read methods, such as read_json or read_excel.
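As a quick illustration of one of those alternatives, here is a sketch of read_json reading from an in-memory buffer instead of a file on disk (the inline records are made up for the example):

```python
import io

import pandas as pd

# A tiny JSON document standing in for a downloaded file (invented data)
json_text = '[{"state": "Florida", "value": 5}, {"state": "Texas", "value": 3}]'

# read_json accepts a path or a file-like object, just like read_csv
tiny_df = pd.read_json(io.StringIO(json_text))
print(tiny_df.shape)  # (2, 2): two rows, two columns
```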


In [ ]:
"""shape is an attribute (invoked without parentheses) that
tells us how many rows and columns our DataFrame has"""
august_df = pd.read_csv("data/CDC_Report-2016-08-31.csv")
august_df.shape

To explore the DataFrame we can use the head() method, which returns the first 5 rows of the DataFrame by default.


In [ ]:
"""head() is a method; to call it you use parentheses"""
august_df.head()

The columns attribute returns a list of the column names.


In [ ]:
august_df.columns

2.4 Concatenate several DataFrames

If we want to have the historical data to see the evolution of the disease, we need to create a new DataFrame containing the contents of the different CSV files. We use pd.concat() to concatenate one DataFrame per CSV file.


In [ ]:
import glob 

#glob finds all the pathnames matching a specified pattern
csv_list = glob.glob("data/*.csv")

df_list = []
for f in csv_list:
    df = pd.read_csv(f)
    df_list.append(df)
year_df = pd.concat(df_list, ignore_index=True)
# NOTE: a more pythonic way of writing the last five lines would be:
# year_df = pd.concat((pd.read_csv(f) for f in csv_list), ignore_index=True)

year_df.shape

In [ ]:
year_df.head(2)

2.5 Arrange columns

2.5.1 Processing dates

We want to process the report_date column, so we can extract the year, the month, or any other part of the date:

  • we need to import the datetime library
  • we use strptime to create a datetime object according to a format (see the link to understand the format directives)

In [ ]:
import datetime as dt

#creates a datetime object from a string representing a date and time and a corresponding format string.
my_date = dt.datetime.strptime('2010-06-17', '%Y-%m-%d')

In [ ]:
# from the datetime object we can extract the following attributes and functions
print(my_date.year)
print(my_date.month)
print(my_date.day)
print(my_date.hour)
print(my_date.minute)
print(my_date.isocalendar()) #returns a tuple (ISO year, ISO week number, ISO weekday)
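The format string is flexible; here is a short sketch with a couple of other common directives (the dates are arbitrary examples, not workshop data):

```python
import datetime as dt

# %d/%m/%Y parses day-first dates
other_date = dt.datetime.strptime('17/06/2010', '%d/%m/%Y')
print(other_date.year)  # 2010

# %B parses the full month name (in the default English locale)
named_month = dt.datetime.strptime('June 17, 2010', '%B %d, %Y')
print(named_month.month)  # 6
```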

2.5.2 New column -- 'week'

Now we are going to create a column week from the column report_date using the apply method:

  • We first define a function that extracts the week number for each of the dates.
  • Then we need to apply this function to all the elements in the report_date column.

In [ ]:
def get_week_number(any_date):
    return dt.datetime.strptime(any_date,'%Y-%m-%d').isocalendar()[1]
    
# we apply the function to each of the elements of the column "report_date"
year_df['week'] = year_df['report_date'].apply(get_week_number)

year_df.head()
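The same transformation can also be written inline with a lambda instead of a named function; a sketch on a throwaway Series (the dates below are invented for the example):

```python
import datetime as dt

import pandas as pd

dates = pd.Series(['2016-01-04', '2016-02-01'])
# apply works the same with an anonymous function
weeks = dates.apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').isocalendar()[1])
print(list(weeks))  # [1, 5]
```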

2.5.3 New column -- 'state'

Now we are going to create a column state from the column location:

  • We first define a function that extracts the state from each location.
  • Then we need to apply this function to all the elements in the location column.

In [ ]:
def get_state(location):
    return location.split("-")[1]

year_df['state'] = year_df['location'].apply(get_state)
year_df.head(10)

2.5.4 Deleting columns

We also want to delete the columns that we don't need. We use the drop method with:

  • axis = 1, specifying that we are deleting a column (0 deletes rows)
  • inplace = True, specifying that we are deleting the column in our object; with inplace = False we would create a new DataFrame without that column.

In [ ]:
year_df.drop('time_period', axis=1, inplace=True)
year_df.head()

NOTE: If you try to execute the drop twice you'll get an error, because that column doesn't exist anymore. If you want to reset the kernel and start from scratch, go to the menu "Kernel", select one of the "Restart" options, and afterwards go to the menu "Cell" and run the cells above or below the focused cell.

2.5.5 Exercise - try it on your own!

  • Try to do this exercise without looking at the solution in the next paragraph
  • Create a column country, similar to the one you created before
  • Delete the columns time_period_type, unit, data_field_code and location, which you are not using

In [ ]:
# your solution goes here

2.5.6 Solution to the previous exercise


In [ ]:
# Adding "country" column
def get_country(location):
    return location.split("-")[0]    
year_df['country'] = year_df['location'].apply(get_country)

# Deleting extra columns
year_df.drop('time_period_type', axis=1, inplace=True) 
year_df.drop('unit', axis=1, inplace=True) 
year_df.drop('data_field_code', axis=1, inplace=True) 
year_df.drop('location', axis=1, inplace=True)
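As an aside, drop also accepts a list of labels, so the four calls above could be collapsed into one. A sketch on a toy DataFrame (the column names here are invented):

```python
import pandas as pd

toy_df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

# One drop call can remove several columns at once
toy_df.drop(['b', 'c'], axis=1, inplace=True)
print(list(toy_df.columns))  # ['a', 'd']
```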

In [ ]:
year_df.head()

2.6 Select and filter data

2.6.1 Selectors


In [ ]:
# .ix is deprecated; .loc selects by label (rows 0 through 3, columns report_date through data_field)
year_df.loc[:3, 'report_date':'data_field']

Creating a new DataFrame from a fragment of another.

It is good practice to use the copy() method when creating a new DataFrame out of a piece of another one, to avoid problems later. Pandas cannot guarantee whether the piece you are taking is a view or a copy of the original DataFrame, and we want to be sure it's a copy, so that we do not modify the original one if we change the new one. Explanation: here


In [ ]:
two_columns_df = year_df[["week","value"]].copy()
two_columns_df.head()

# Another way of copying: select all the rows for certain columns with .loc; no need to add the copy() method
# two_columns_df = year_df.loc[:,["week","value"]]

2.6.2 Applying conditions

Which were the zika infection cases in the District of Columbia in week 30?


In [ ]:
year_df[(year_df["state"] == "District_of_Columbia") & (year_df["week"] == 30)]

But we should store the value of the latest week, so we can look up the most recent results whenever we need them.


In [ ]:
max_week = year_df["week"].max()
print(max_week)

Which are the current zika infection cases in the District of Columbia?


In [ ]:
year_df[(year_df["state"] == "District_of_Columbia") & (year_df["week"] == max_week)]

Which states or territories have zika cases due to local infection?


In [ ]:
year_df[(year_df["value"] != 0) & (year_df["week"] == max_week) & (year_df["data_field"] == "zika_reported_local") ]
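Conditions can also be combined with | (or) or expressed with isin; a sketch on a toy DataFrame (the rows are invented, not real CDC data):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'state': ['Florida', 'Texas', 'New_York', 'Ohio'],
    'value': [5, 0, 7, 2],
})

# | keeps rows matching either condition (note the parentheses, as with &)
either = toy_df[(toy_df['state'] == 'Texas') | (toy_df['value'] > 4)]
print(list(either['state']))  # ['Florida', 'Texas', 'New_York']

# isin keeps rows whose value belongs to a given collection
subset = toy_df[toy_df['state'].isin(['Ohio', 'Texas'])]
print(list(subset['state']))  # ['Texas', 'Ohio']
```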

2.6.3 Exercise - try it on your own!

  • Try to do this exercise without looking at the solution in the next paragraph
  • Use the year_df dataframe and conditions to create a new df called syear_df where the state column contains only state names, not territory names
  • To ensure that we are making a copy, use the copy() method

In [ ]:
# Your solution goes here

2.6.4 Solution to exercise 2.6.3


In [ ]:
syear_df = year_df[year_df['location_type'] == 'state'].copy()
syear_df.head()

2.7 Grouping and ordering

2.7.1 Which states have the most zika cases?

  • First we select the chunk that we want to use: the latest report data

In [ ]:
latest_df = syear_df[syear_df['week'] == max_week].copy()
latest_df.head()
  • Now we group the data by state to sum both local and travel cases

In [ ]:
sum_df = latest_df.groupby('state').sum()
sum_df.head()
# see how all the numerical columns are added up, although summing the week column doesn't make any sense
# pay attention to how the resulting DF has states as indexes
  • We order the df by the value column

In [ ]:
sorted_df = sum_df.sort_values(by='value', ascending=False)
sorted_df[:10]
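To make the groupby-then-sort pattern concrete, here is a self-contained sketch on invented numbers:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'state': ['Florida', 'Florida', 'Texas'],
    'value': [10, 5, 3],
})

# groupby('state').sum() adds up the values per state and
# turns the grouping column into the index
summed = toy_df.groupby('state').sum()
ordered = summed.sort_values(by='value', ascending=False)
print(list(ordered.index))    # ['Florida', 'Texas']
print(list(ordered['value'])) # [15, 3]
```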

2.8 Plotting Data

matplotlib is one of the most common libraries for making 2D plots of arrays in Python. The Pandas library has a plot() method that wraps the plotting functionality of matplotlib; we'll use this method when possible, for simplicity.

Even so, in order to see the plots immediately in our notebook we need to use the matplotlib magic line, as we'll see in the next paragraph. More about magic lines here.

2.8.1 Bar graph with the aggregated cases of zika in the US


In [ ]:
"""This is the matplotlib magic line; by default matplotlib defers drawing until the end
of the script, but we need matplotlib to work interactively and draw the plots right away"""
%matplotlib inline

# Seaborn is a library that makes your plots prettier
import seaborn as sns

In [ ]:
# we are drawing a plot bar with values in axis Y
# by default the indexes of the DF (states) are used for axis X
sorted_df[:10].plot.bar(y='value', figsize=(8,4), title='zika cases')
  • Let's draw it horizontally:

In [ ]:
# Remember how we sorted:
#sorted_df = sum_df.sort_values(by='value', ascending=False)
sorted_df[:10].plot.barh(y='value', figsize=(8,4), title='zika cases')
  • Wouldn't it be clearer if the highest value were at the top of the plot?

2.8.2 Exercise - try it on your own!

  • Try to do this exercise without looking at the solution in the next paragraph
  • Draw the same horizontal bar graph that we drew before, but ordered from top to bottom
  • You need to rewrite the code we used for the graph; read the instructions in the following cell

In [ ]:
# This is the code we used:
# sorted_df = sum_df.sort_values(by='value', ascending=False)
# sorted_df[:10].plot.barh(y='value', figsize=(8,4), title='zika cases')

# 1. sort the dataframe in an ascending way
# 2. get the last 10 positions of the dataframe instead of the first ten

2.8.3 Solution to 2.8.2


In [ ]:
sorted_df = sum_df.sort_values(by='value', ascending=True)
sorted_df[-10:].plot.barh(y='value', figsize=(8,4), title='zika cases',color='sandybrown')

2.8.4 Line graph with the evolution of cases through 2016 in the US


In [ ]:
# Remember that syear_df is our dataframe with the annual data for the states
weekly_df = syear_df.groupby('week').sum()
weekly_df.head()
# now the weeks are the new indexes

In [ ]:
# again the X axis takes by default the indexes of the DF (the weeks)

weekly_df.plot.line(y='value',figsize=(8,4), title='zika evolution in the US', color='orange')

2.8.5 Add some labels!! (extra)


In [ ]:
"""Our plotting function returns the axes object, which allows us to access
the rectangles and add text to them"""

# first reorder the data descendingly
sorted_df = sum_df.sort_values(by='value', ascending=False)
# access the axes object when creating the bar plot
axes = sorted_df[:10].plot.bar(y='value', figsize=(8,4), title='zika cases')
# loop over the rectangles and add a text label
for p in axes.patches:
    axes.text(p.get_x() + p.get_width()/2, # x positions
            p.get_height()-35,             # y position
            int(p.get_height()),           # label text
            ha='center', va='bottom', color='white', fontsize=9) # fontdict with font alignment and properties

2.8.6 Stacked bar plots (extra)

We want to show in the same bar plot which cases of zika were transmitted locally and which were due to travelling to affected areas.

In order to plot two different variables travel and local we need to create two new columns for these variables.


In [ ]:
# We'll use latest_df dataframe which contains the most recent values, already filtered only for states.

latest_df['travel'] = latest_df.loc[latest_df['data_field'] == 'zika_reported_travel','value']
latest_df['local'] = latest_df.loc[latest_df['data_field'] == 'zika_reported_local','value']
latest_df.head()

In [ ]:
group_df = latest_df.groupby('state').sum()
group_df.sort_values(by="value", ascending = False, inplace = True)
group_df[:10].plot.bar(y=['travel','local'], stacked = True, figsize=(8,4), title='zika cases')

2.8.7 Arranging the axes for dates (extra)

This example is a little bit more complex: in this case we have to go down to the original matplotlib library and use its plot_date function, which allows us to configure the axes according to our needs.


In [ ]:
import matplotlib.pyplot as plt
import matplotlib.dates as dates

# group by date returns a df with report_date as indexes
bydate_df = syear_df.groupby('report_date').sum()

# Series with the values to plot
y_values = bydate_df['value']
# Get the values of the indexes (our dates) and return a list of datetime objects
my_date_index = pd.DatetimeIndex(bydate_df.index.values).to_pydatetime()

f = plt.figure(figsize=(8, 4))
ax = f.gca() # get current axes
# set ticks location in the X axis, in the 14th day of each month (to see Sept)
ax.xaxis.set_major_locator(dates.MonthLocator(bymonthday=14))
# set ticks format in the X axis => (%b is abbreviated month)+new line+ year
ax.xaxis.set_major_formatter(dates.DateFormatter('%b\n%Y'))
plt.plot_date(
            my_date_index, # x values
            y_values,      # y values
            fmt='-',     # format of the line
            xdate=True,  # x-axis will be labeled with dates
            ydate=False, # y-axis don't
            color='red')

2.9 Store data in a file

Again, consult the Pandas Input/Output documentation to check the other file formats available.


In [ ]:
group_df.to_csv("output.csv")
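As a quick sanity check (a sketch using an in-memory buffer rather than a file on disk, with invented data), a DataFrame written with to_csv can be read straight back with read_csv:

```python
import io

import pandas as pd

toy_df = pd.DataFrame({'state': ['Florida', 'Texas'], 'value': [5, 3]})

buffer = io.StringIO()
# index=False leaves the automatic 0,1,... index out of the file
toy_df.to_csv(buffer, index=False)

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.equals(toy_df))  # True
```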