Introduction to Python for Data Analysis

Introductory Exercise:

Using what you already know, write a Python script, or describe an algorithm in words, that:

  1. loads a file from a website: https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv

  2. from that file, counts the number of 1s and 0s under the 'Gender' column, and the number of 1s and 0s under the 'Age' column

At a minimum, click on the link, look at the general layout of the data, and write out some notes as to how you'd go about a solution.

Solution with Python...
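
One possible sketch using only the Python 2 standard library (an assumption here: "counting the 1s and 0s" means tallying how often each column literally equals 0 or 1):


In [ ]:
# download the file and parse it as CSV (Python 2 standard library)
import urllib2
import csv

url = 'https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv'
reader = csv.DictReader(urllib2.urlopen(url))

# tally how often Gender and Age equal '0' or '1'
gender_counts = {'0': 0, '1': 0}
age_counts = {'0': 0, '1': 0}
for row in reader:
    if row['Gender'] in gender_counts:
        gender_counts[row['Gender']] += 1
    if row['Age'] in age_counts:
        age_counts[row['Age']] += 1

print gender_counts
print age_counts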

Solution with Pandas...
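
And a sketch of the same idea in pandas, where value_counts tallies each distinct value in a column:


In [ ]:
# load the csv straight from the URL into a DataFrame
import pandas as pd

nyt = pd.read_csv('https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv')

# count each distinct value of Gender (it contains only 0s and 1s)
print nyt['Gender'].value_counts()
# Age takes many values, so restrict the tally to 0s and 1s
print nyt[nyt['Age'].isin([0, 1])]['Age'].value_counts()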

Prepare Your Environment

Launching a Python Notebook from Domino

  • Sign in to your dominodatalab.com account
  • Click on the "New Project" button and give your project a name like "PythonDataWorkshop", then click "Create Project"
  • You'll be brought to the project overview. Click on "Runs"
  • Click on the "Notebook" menu and select "Python session"
    • You'll see the loading icon and then a message saying "Notebook Session Ready"
  • Once the session has started running, click on the "Open Notebook" button
    • Your Jupyter notebook will open in a new tab; switch to that tab.
    • You'll see the Jupyter home page.
  • Click on "New" and select "Python 2" under the notebooks section.
    • Your new notebook will open.
  • Click on where it says "Untitled" and give your notebook a name like "My Python for Data Workshop Notes"
  • You're now running a fully functional data science server in the cloud!

Discussion

  • What are the advantages of running Python in the cloud?
  • What are the underlying technologies?
    • Amazon EC2
    • Jupyter Notebooks

Exploring your Environment

  • User Interface
  • Keyboard Shortcuts
  • Notebook Help - Web-based interactive computing system.
  • Markdown
  • Python 2.7.10 - An easy to learn, powerful programming language.
  • IPython - Enhanced Python shell, web-based notebook, architecture for interactive parallel computing
  • NumPy - Fundamental scientific computing library -- multi-dimensional arrays and fast math.
  • SciPy - Advanced scientific processing built on numpy
  • Matplotlib - Data visualization engine
  • SymPy - Symbolic mathematics
  • pandas - Powerful Python data analysis toolkit

Documenting our Work

We'll be using the following practices to take advantage of the various parts of a Jupyter notebook:

  • the "code" cell allows us to run python code. It'll allow us to write multiple lines of code at a time.
  • the "markdown" cell allows us to save text, add images, etc.

With these two cell types we'll take notes for the workshop today, using the following steps:

  1. Always include a markdown cell above code. Use it to take notes, as you normally would, about the code below it.
  2. For those completely fresh to programming, we'll be "commenting through the code": each line of code gets a Python comment that breaks down what it does.

Let's practice!

With the following cell, we'll go through each line first, explain what it's doing, and then include a markdown cell above/below to summarize what was learned.


In [ ]:
# assign the integer 7 to the variable x
x = 7
# print the result of adding 5 to x
print x + 5

# This is a comment! Comments are super helpful!

y = 2
# dividing two integers in Python 2 truncates: 7 / 2 gives 3
print x / y
# converting to floats first gives the full answer, 3.5
print float(x) / float(y)

# help() prints documentation for an object
help(x)

Basic Data Types: Strings, Lists, Tuples

Try the same thing on your own to learn about the following python objects: strings, lists, and tuples.


In [ ]:
some_string1 = 'apples'
some_string2 = 'and'
some_string3 = 'bananas'
print some_string1, some_string2, some_string3
print some_string1 + some_string2 + some_string3

mutable_list = ["apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry"]
immutable_tuple = ("apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry")

print len(some_string1)
print len(mutable_list)
print len(immutable_tuple)

print some_string1[0:5]
print mutable_list[0:4]
print immutable_tuple[5:6]

# strings are immutable: assigning to an index raises a TypeError
try:
    some_string1[5] = 'd'
except TypeError as e:
    print e

# lists are mutable, so item assignment works
mutable_list[5] = 'mango'
# tuples are immutable, so this raises a TypeError too
try:
    immutable_tuple[5] = 'not going to work'
except TypeError as e:
    print e

# a list comprehension: build a list of ten 3s
a = [3 for i in range(10)]
print a

Libraries for Data Science

Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

  1. numpy
  2. scipy
  3. pandas
  4. matplotlib
  5. statsmodels
  6. scikit-learn
  7. nltk

Countless others are available as well.

For today, we'll focus primarily on the library that does the bulk of our day-to-day work: pandas

pandas and object-oriented programming

pandas is a library built on top of numpy that gives us Excel-like tables inside Python. These special tables are called DataFrames, and the DataFrame is the primary object in pandas.

Earlier we loaded a csv file; note that pandas can parse one from a file on your computer or directly from a URL.


In [ ]:
import pandas as pd

nyt = pd.read_csv('https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv')

As with everything else in python, the DataFrame is an object. We'll use the built-in function type to identify the object's class:


In [ ]:
print type(nyt)

This translates to:

  1. Found in the pandas library...
  2. ... in the core subpackage...
  3. ... in the frame module...
  4. ...as a class, DataFrame

Objects in python are filled with variables and functions, and we use dot notation to access them.

# object.variable
nyt.dtypes
# object.function()
nyt.describe()

One great advantage of IPython is tab completion: we can type nyt., press tab, and it'll show us what variables and functions exist.

Practice this in the code cell below, using the following steps (a worked example follows the cells):

  1. wrap the code with the type function: what data type does it return?
  2. if it returns an "instancemethod" object, add () to execute the function. What data type does the function return?

In [ ]:
# example:
print nyt.dtypes
nyt.dtypes?
print type(nyt.dtypes)

In [ ]:
# Try wrapping type with the following:
print nyt.describe
print nyt.describe()
print nyt.shape
print nyt.index
print nyt.columns
print nyt.groupby
print nyt.groupby('Age')
print nyt.Age
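
Here's a worked version of those two steps using describe (a sketch; the type names are what Python 2 reports):


In [ ]:
# step 1: wrap with type -- describe without () is an instancemethod
print type(nyt.describe)
# step 2: add () to execute it -- the function returns a DataFrame
print type(nyt.describe())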

Exercise: early data exploration and pandas syntax

Practice the following steps with these 4 data sets located at http://teachingdatascience.github.io/Rdatasets/datasets.html.

Copy the csv link and use it with read_csv to import the data. (A sketch of the workflow follows the instructions below.)

DATA SETS

  1. chickwts
  2. infert
  3. msleep
  4. Cars93

INSTRUCTIONS

  1. import the data into python using pd.read_csv()
  2. call the dtypes variable to identify the columns and their data types.
  3. call the describe function to collect statistics about numerical data types.
  4. for each remaining (object) column, use
    • dataframe['column'].unique() and
    • dataframe.groupby('column')['column'].count()
  5. Write a short markdown that describes the data set and its columns.
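
As a starting point, here is a sketch of that workflow. The URL below is a placeholder -- copy the real csv link for each data set from the page above. (The chickwts column names, weight and feed, come from the R dataset of the same name.)


In [ ]:
# hypothetical URL: substitute the actual csv link from the datasets page
chickwts = pd.read_csv('http://example.com/chickwts.csv')

# steps 2 and 3: column data types, then summary statistics
print chickwts.dtypes
print chickwts.describe()

# step 4: for the object column 'feed'
print chickwts['feed'].unique()
print chickwts.groupby('feed')['feed'].count()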

Data Munging basics with pandas

What interesting data point do we learn when we run the following code?


In [ ]:
print nyt.groupby(['Signed_In', 'Gender']).Age.describe()

In the Signed_In 0 group, we notice that Age and Gender are also always 0. We know this because:

  • in the Signed_In 0 group, only one Gender value (0) appears in the DataFrame index
  • the min/max Age for the Signed_In 0 group is also 0.

This intuitively makes sense when working with marketing data: if a user is not logged in, you likely do not know their age or gender either.

Data does not always come in forms that we expect, so it is generally a requirement for us to work through a process called "data munging," which is the process of extracting and cleaning up a data set. Given that, we will explore the basics of data munging and aggregation:

  • Filter
  • Sort
  • Select and Unique
  • Mutate and Transform
  • Pivot and Aggregate

Much of this fits into the Split-Apply-Combine strategy of data analysis, popularized by Hadley Wickham's R package plyr (and later, dplyr). In fact, many of these concepts are shared between R and pandas, the primary difference being the syntax.
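
In miniature, split-apply-combine in pandas looks like this (a quick sketch on the nyt data):


In [ ]:
# split the rows by Gender, apply mean to Age within each group, combine the results
print nyt.groupby('Gender').Age.mean()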

While we go through each of these examples, take good notes, comment through the code, and write questions that we can come back to.

Filter

Say we want to focus our marketing analysis only on users we have data for--after all, it's easier to target ads at a demographic we know performs well. Earlier we used a filtering command that keeps the rows where a condition returns True.


In [ ]:
# these two lines are equivalent ways to keep only the signed-in rows
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_signedin_only = nyt[nyt['Signed_In'] == 1]
nyt_signedin_only.groupby('Gender').describe()

Practice filtering for the following (a hint on combining conditions follows the list):

  • rows where age is less than 30
  • rows where gender is 0 but signed_in is 1
  • rows where age is less than 30 or more than 60.
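
Hint: pandas combines filter conditions with & ("and") and | ("or"), and each condition needs its own parentheses. A sketch of the pattern on a different column:


In [ ]:
# each condition sits in parentheses, joined with & ("and") or | ("or")
mid_impressions = nyt[(nyt.Impressions > 3) & (nyt.Impressions < 8)]
print mid_impressions.describe()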

Selecting Rows, Columns, and Finding Uniques

Earlier we selected columns (or a list of columns) using key notation (like with dictionaries). Pandas also accepts dot notation (akin to attribute access on an object, as in JavaScript).

To select rows based on the index, we can use .ix[], which accepts either keys (when the index is a string) or a range (when the index is integer based). pandas also offers the more explicit loc and iloc, shown in a sketch after the cell below.

Otherwise, passing a range in will do fine.


In [ ]:
# Finding the first ten rows
print nyt[0:10]

# This will still return the first ten rows
print nyt.set_index('Age')[0:10]

# With the default integer index, .ix slices by label and includes the endpoint, so this returns rows 0 through 10.
print nyt.ix[0:10]

# The first lookup works fine: it returns all rows where Age == 40.
# The second fails: .ix slicing on a label index that is not sorted and unique raises an error.
print nyt.set_index('Age').ix[40]
try:
    print nyt.set_index('Age').ix[40:45]
except Exception as e:
    print e.message

# Using dot notation and returning uniques of that column
print nyt.Age.head()
print nyt.Age.unique()
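
As mentioned above, loc and iloc make the label/position distinction explicit. A quick sketch:


In [ ]:
# iloc is strictly positional: always the first ten rows
print nyt.iloc[0:10]

# loc is label-based: all rows whose Age label is 40
print nyt.set_index('Age').loc[40].head()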

Mutate, Transform (apply)

There are two common techniques for mutating or transforming data.

One is to derive new columns directly from the values of other columns. The other is to apply a function to a column, using either a named function or a lambda.

You can create a new column simply by assigning to its name, as if it were a dictionary key.


In [ ]:
nyt['column_of_ones'] = 1

nyt['saw_ad_many_times'] = nyt['Impressions'].apply(lambda x: 1 if x > 5 else 0)

# the same transformation with a named function instead of a lambda
def saw_ad_func(x):
    if x > 5:
        return 1
    else:
        return 0

nyt['saw_ad_many_times'] = nyt['Impressions'].apply(saw_ad_func)

# practice on one column:
# a common approach to missing data is to set the values to null (np.nan) or to -1
# (one possible sketch follows this cell)
# import numpy as np
# np.nan


# by default apply works with a single column, however you can use it to iterate by row as well (though it is slower)
nyt['not_signed_in'] = nyt.apply(lambda row: 0 if row['Signed_In'] else 1, axis=1)
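
One way the missing-data practice hinted at in the comments above might look (Age_clean is just an illustrative name; treating Age == 0 as "unknown" is an assumption based on the Signed_In groupby we ran earlier):


In [ ]:
# replace the placeholder Age of 0 with NaN so statistics ignore those rows
import numpy as np
nyt['Age_clean'] = nyt['Age'].apply(lambda x: np.nan if x == 0 else x)
print nyt['Age_clean'].describe()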

Pivot and Aggregate

Earlier we saw the groupby function in use. A pandas groupby object offers further functionality, such as computing statistics for one column or for all columns, or aggregating only the columns you specify.

Similarly, pivot tables can be used to explore groups side by side.


In [ ]:
import numpy as np  # needed below for np.inf
nyt['column_of_ones'] = 1
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt['Click_Thru'] = nyt['Click_Thru'].apply(lambda x: 0 if x == np.inf else x)
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_group = nyt.groupby('Age')

print nyt_group.mean()

In [ ]:
print nyt_group['Click_Thru'].mean()

In [ ]:
print nyt_group.agg({
    'Click_Thru': ['mean', 'max'],
    'Gender': 'mean',
})

In [ ]:
print nyt[nyt.Signed_In == 1].pivot_table(
    values='column_of_ones',
    index='Gender',
    columns='Age',
    aggfunc='count'
)

Practice (a hint for the first item follows the list):

  1. Aggregate the mean and standard deviation (std) of Age when Signed_In equals 1
  2. Find the min values of Age for each value of Impressions
  3. Find the max values of Clicks for each value of Impressions
  4. Determine: Are Signed In users more or less likely to click thru on an ad?
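
Hint for the first item: filter first, then group, then aggregate. A sketch of that pattern on a different column:


In [ ]:
# pass a list to agg to compute several statistics at once
print nyt[nyt.Signed_In == 1].groupby('Gender')['Impressions'].agg(['mean', 'std'])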

Plotting

matplotlib is Python's core plotting library. While calling .describe() on a DataFrame gives a rough numerical summary, plots let you see what your data really looks like. For today, we'll use matplotlib within the context of pandas.

Consider the following data set and code:


In [ ]:
anscombe = pd.DataFrame({
    'x' :  [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'y1' : [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'y2' : [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'y3' : [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    'x4' : [8,8,8,8,8,8,8,19,8,8,8],
    'y4' : [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89],
})

anscombe.describe()

Just from reading the DataFrame definition you can tell the four data sets look different, yet .describe() reports very similar summary statistics for each (this is Anscombe's quartet). The two primary plotting tools we use from matplotlib are histograms and scatterplots, which help us understand the shape of data.


In [ ]:
%matplotlib inline
# above magic allows plots to show in our notebook.

for y in ['y1', 'y2', 'y3', 'y4']:
    if y != 'y4':
        print anscombe.plot(kind='scatter', x='x', y=y)
    else:
        print anscombe.plot(kind='scatter', x='x4', y=y)

Use these new tools to visualize some of this New York Times ad performance data.


In [ ]:
nyt[nyt.Signed_In == 1].Age.hist()

In [ ]:
print nyt[nyt.Signed_In == 1].Clicks.hist()

In [ ]:
nyt.describe()

In [ ]:
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt[nyt.Signed_In == 1].plot(kind='scatter', x='Age', y='Click_Thru')

Independent Practice

In small groups of 3 or 4, practice much of the same functionality we covered today on two of the four data sets from the repo above.

Primarily, your goals are to:

  • Show how to load a csv file into python
  • Use pandas to understand the numerical portions of the data
  • Use matplotlib to visualize the data
  • Use scikit-learn to fit the data to some dependent feature (y)

Summarize your work by answering the following questions:

  1. What is the dataset about? What do the columns mean?
  2. What does the dataset look like?
  3. What data relationships can you find?
  4. Given those relationships/correlations, what does that help explain or help you understand about the data?

We will come back as a class and have each group present their answers to the above to one of the data sets they explored.

Next Steps

I encourage you to pick up and start reading the following books to continue learning about programming and data analytics.

Still uncomfortable with Python?

That's okay! Continue practicing your Python basics with Learn Python the Hard Way. It'll strengthen your chops. Do it multiple times; you'll keep learning!

Feel like Pandas can revolutionize how you work?

Purchase and read Wes McKinney's Python for Data Analysis. It'll get you cranking on everything Wes did with Pandas and why it was designed the way it was.

Want to dig into the "why" of Data Science?

Check out Provost and Fawcett's Data Science for Business. You'll come away understanding more about the business practice of Data Science (vs the academic side) and some fundamentals of applying algorithms and statistics.