Introduction to Python for Data Analysis

Introductory Exercise:

With your best knowledge, write a python script, or write an algorithm in words, that:

  1. loads a file from a website: https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv

  2. from that file, counts the number of 1s and 0s under the 'Gender' column, and the number of 1s and 0s under the 'Age' column

At a minimum, click on the link, look at the general layout of the data, and write out some notes as to how you'd go about a solution.

Solution with Python...

Solution with Pandas...

Prepare Your Environment

Launching a Python Notebook from Domino

  • Sign in to your dominodatalab.com account
  • Click on the "New Project" button and give your project a name like "PythonDataWorkshop", then click "Create Project"
  • You'll be brought to the project overview. Click on "Runs"
  • Click on the "Notebook" menu and select "Python session"
    • You'll see the loading icon and then a message saying "Notebook Session Ready"
  • Once the session has started running, click on the "Open Notebook" button
    • Your Jupyter notebook will open in a new tab, go to this tab.
    • You'll see the Jupyter home page.
  • Click on "New" and select "Python 2" under the notebooks section.
    • Your new notebook will open.
  • Click on where it says "Untitled" and give your notebook a name like "My Python for Data Workshop Notes"My
  • You're now running a fully functional data science server in the cloud!

Discussion

  • What are the advantages of running Python in the cloud?
  • What are the underlying technologies?
    • Amazon EC2
    • Jupyter Notebooks

Exploring your Environment

  • User Interface
  • Keyboard Shortcuts
  • Notebook Help - Web-based interactive computing system.
  • Markdown
  • Python 2.7.10 - An easy to learn, powerful programming language.
  • iPython - Enhnaced Python Shell, Web-based Notebook, Architecture for Interactive Parallel Computing
  • NumPy - Fundamental scientific computing library -- multi-dimensional arrays and fast math.
  • SciPy - Advanced scientific processing built on numpy
  • Matplotlib - Data visualization engine
  • SymPy - Symbolic mathematics
  • pandas: Powerful Python Data Analysis Toolkit

Documenting our Work

We'll be using the following practices to take advantage of the various parts of a Jupyter notebook:

  • the "code" cell allows us to run python code. It'll allow us to write multiple lines of code at a time.
  • the "markdown" cell allows us to save text, add images, etc.

With these two cell types we'll take notes for the workshop today, using the following steps:

  1. Always include a markdown cell above code. Write notes how you'd usually write notes related to code below.
  2. With code, for those completely fresh to programming, we'll be "commenting through the code." That just means for each line of code there will be a python comment that breaks down each line.

Let's practice!

With the following cell, we'll go through each line first, explain what it's doing, and then include a markdown cell above/below to summarize what was learned.


In [30]:
x = 7
print x + 5

# This is a comment! Comments are super helpful!

y = 2
print x / y
print float(x) / float(y)

help(x)


12
3
3.5
Help on int object:

class int(object)
 |  int(x=0) -> int or long
 |  int(x, base=10) -> int or long
 |  
 |  Convert a number or string to an integer, or return 0 if no arguments
 |  are given.  If x is floating point, the conversion truncates towards zero.
 |  If x is outside the integer range, the function returns a long instead.
 |  
 |  If x is not a number or if base is given, then x must be a string or
 |  Unicode object representing an integer literal in the given base.  The
 |  literal can be preceded by '+' or '-' and be surrounded by whitespace.
 |  The base defaults to 10.  Valid bases are 0 and 2-36.  Base 0 means to
 |  interpret the base from the string as an integer literal.
 |  >>> int('0b100', base=0)
 |  4
 |  
 |  Methods defined here:
 |  
 |  __abs__(...)
 |      x.__abs__() <==> abs(x)
 |  
 |  __add__(...)
 |      x.__add__(y) <==> x+y
 |  
 |  __and__(...)
 |      x.__and__(y) <==> x&y
 |  
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |  
 |  __coerce__(...)
 |      x.__coerce__(y) <==> coerce(x, y)
 |  
 |  __div__(...)
 |      x.__div__(y) <==> x/y
 |  
 |  __divmod__(...)
 |      x.__divmod__(y) <==> divmod(x, y)
 |  
 |  __float__(...)
 |      x.__float__() <==> float(x)
 |  
 |  __floordiv__(...)
 |      x.__floordiv__(y) <==> x//y
 |  
 |  __format__(...)
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getnewargs__(...)
 |  
 |  __hash__(...)
 |      x.__hash__() <==> hash(x)
 |  
 |  __hex__(...)
 |      x.__hex__() <==> hex(x)
 |  
 |  __index__(...)
 |      x[y:z] <==> x[y.__index__():z.__index__()]
 |  
 |  __int__(...)
 |      x.__int__() <==> int(x)
 |  
 |  __invert__(...)
 |      x.__invert__() <==> ~x
 |  
 |  __long__(...)
 |      x.__long__() <==> long(x)
 |  
 |  __lshift__(...)
 |      x.__lshift__(y) <==> x<<y
 |  
 |  __mod__(...)
 |      x.__mod__(y) <==> x%y
 |  
 |  __mul__(...)
 |      x.__mul__(y) <==> x*y
 |  
 |  __neg__(...)
 |      x.__neg__() <==> -x
 |  
 |  __nonzero__(...)
 |      x.__nonzero__() <==> x != 0
 |  
 |  __oct__(...)
 |      x.__oct__() <==> oct(x)
 |  
 |  __or__(...)
 |      x.__or__(y) <==> x|y
 |  
 |  __pos__(...)
 |      x.__pos__() <==> +x
 |  
 |  __pow__(...)
 |      x.__pow__(y[, z]) <==> pow(x, y[, z])
 |  
 |  __radd__(...)
 |      x.__radd__(y) <==> y+x
 |  
 |  __rand__(...)
 |      x.__rand__(y) <==> y&x
 |  
 |  __rdiv__(...)
 |      x.__rdiv__(y) <==> y/x
 |  
 |  __rdivmod__(...)
 |      x.__rdivmod__(y) <==> divmod(y, x)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __rfloordiv__(...)
 |      x.__rfloordiv__(y) <==> y//x
 |  
 |  __rlshift__(...)
 |      x.__rlshift__(y) <==> y<<x
 |  
 |  __rmod__(...)
 |      x.__rmod__(y) <==> y%x
 |  
 |  __rmul__(...)
 |      x.__rmul__(y) <==> y*x
 |  
 |  __ror__(...)
 |      x.__ror__(y) <==> y|x
 |  
 |  __rpow__(...)
 |      y.__rpow__(x[, z]) <==> pow(x, y[, z])
 |  
 |  __rrshift__(...)
 |      x.__rrshift__(y) <==> y>>x
 |  
 |  __rshift__(...)
 |      x.__rshift__(y) <==> x>>y
 |  
 |  __rsub__(...)
 |      x.__rsub__(y) <==> y-x
 |  
 |  __rtruediv__(...)
 |      x.__rtruediv__(y) <==> y/x
 |  
 |  __rxor__(...)
 |      x.__rxor__(y) <==> y^x
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  __sub__(...)
 |      x.__sub__(y) <==> x-y
 |  
 |  __truediv__(...)
 |      x.__truediv__(y) <==> x/y
 |  
 |  __trunc__(...)
 |      Truncating an Integral returns itself.
 |  
 |  __xor__(...)
 |      x.__xor__(y) <==> x^y
 |  
 |  bit_length(...)
 |      int.bit_length() -> int
 |      
 |      Number of bits necessary to represent self in binary.
 |      >>> bin(37)
 |      '0b100101'
 |      >>> (37).bit_length()
 |      6
 |  
 |  conjugate(...)
 |      Returns self, the complex conjugate of any int.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  denominator
 |      the denominator of a rational number in lowest terms
 |  
 |  imag
 |      the imaginary part of a complex number
 |  
 |  numerator
 |      the numerator of a rational number in lowest terms
 |  
 |  real
 |      the real part of a complex number
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T

Basic Data Types: Strings, Lists, Tuples

Try the same thing on your own to learn about the following python objects: strings, lists, and tuples.


In [34]:
some_string1 = 'apples'
some_string2 = 'and'
some_string3 = 'bananas'
print some_string1, some_string2, some_string3
print some_string1 + some_string2 + some_string3

mutable_list = ["apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry"]
immutable_tuple = ("apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry")

print len(some_string1)
print len(mutable_list)
print len(immutable_tuple)

print some_string1[0:5]
print mutable_list[0:4]
print immutable_tuple[5:6]

try:
    some_string1[5] = 'd'
except TypeError as e:
    print e

mutable_list[5] = 'mango'
try:
    immutable_tuple[5] = 'not going to work'
except TypeError as e:
    print e

a = [3 for i in range(10)]
print a


apples and bananas
applesandbananas
6
7
7
apple
['apple', 'apple', 'banana', 'kiwi']
('strawberry',)
'str' object does not support item assignment
'tuple' object does not support item assignment
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

Libraries for Data Science

Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

1.numpy
2.scipy
3.pandas
4.matplotlib
5.statsmodels
6.scikit-learn
7.nltk

Though there are countless others available.

For today, we'll primarily focus ourselves around the library that is 99% of our work: pandas

pandas and object oriented programming

pandas is a library built on top of numpy, which allows us to use excel-like matrices in the python programming space. These special matrices are called DataFrames, the primary object in pandas.

Earlier we loaded up the csv file from our computer; pandas can also parse from a URL.


In [35]:
nyt = pd.read_csv('https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv')

Like with everything else in python, the DataFrame is an object. We'll use a function called type to identify the object name:


In [36]:
print type(nyt)


<class 'pandas.core.frame.DataFrame'>

This translates to:

  1. Found in the pandas library...
  2. ... in the core module...
  3. ... in the frame module...
  4. ...as a class, DataFrame

Objects in python are filled with variables and functions, and we use dot notation to access them.

# object.variable
nyt.dtypes
# object.function()
nyt.describe()

One great advantage with iPython is it has tab completion which means we can type nyt., press tab, and it'll show us what variables and functions exist.

Practice this with the following code cell with the following steps:

  1. wrap the code with the type function: what data type does it return?
  2. if it returns an "instancemethod" object, try the wrapper with a () to excute the function. What data type does the function return?

In [37]:
# example:
print nyt.dtypes
nyt.dtypes?
print type(nyt.dtypes)


Age            int64
Gender         int64
Impressions    int64
Clicks         int64
Signed_In      int64
dtype: object
<class 'pandas.core.series.Series'>

In [38]:
# Try wrapping type with the following:
print nyt.describe
print nyt.describe()
print nyt.shape
print nyt.index
print nyt.columns
print nyt.groupby
print nyt.groupby('Age')
print nyt.Age


<bound method DataFrame.describe of         Age  Gender  Impressions  Clicks  Signed_In
0        36       0            3       0          1
1        73       1            3       0          1
2        30       0            3       0          1
3        49       1            3       0          1
4        47       1           11       0          1
5        47       0           11       1          1
6         0       0            7       1          0
7        46       0            5       0          1
8        16       0            3       0          1
9        52       0            4       0          1
10        0       0            8       1          0
11       21       0            3       0          1
12        0       0            4       0          0
13       57       0            6       0          1
14       31       0            5       0          1
15        0       0            6       0          0
16       40       1            3       0          1
17       31       1            5       0          1
18       38       0            4       0          1
19        0       0            5       0          0
20       59       1            4       0          1
21       61       0            6       0          1
22       48       0            7       0          1
23       29       1            2       0          1
24        0       0            4       0          0
25       19       1            4       0          1
26       19       0            3       0          1
27       48       1            9       0          1
28       48       1            4       0          1
29       21       1            5       0          1
...     ...     ...          ...     ...        ...
458411   55       1            2       0          1
458412   68       1            7       0          1
458413    0       0            7       1          0
458414   21       0            5       0          1
458415   35       0            6       0          1
458416   26       1            6       0          1
458417   41       0           11       0          1
458418   58       1            4       0          1
458419   46       1            5       0          1
458420   45       0           11       0          1
458421   46       1            4       0          1
458422   47       0            6       0          1
458423   22       1            8       0          1
458424   21       0           10       0          1
458425   40       1            6       0          1
458426   49       0            9       0          1
458427   43       1            9       0          1
458428   40       1            4       0          1
458429   49       1           11       0          1
458430    0       0            6       0          0
458431   21       1            5       0          1
458432   30       0            4       0          1
458433   21       1            2       0          1
458434   61       1            6       0          1
458435   51       0            7       0          1
458436    0       0            2       0          0
458437    0       0            4       0          0
458438   72       1            5       0          1
458439    0       0            5       0          0
458440    0       0            3       0          0

[458441 rows x 5 columns]>
                 Age         Gender    Impressions         Clicks  \
count  458441.000000  458441.000000  458441.000000  458441.000000   
mean       29.482551       0.367037       5.007316       0.092594   
std        23.607034       0.481997       2.239349       0.309973   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       3.000000       0.000000   
50%        31.000000       0.000000       5.000000       0.000000   
75%        48.000000       1.000000       6.000000       0.000000   
max       108.000000       1.000000      20.000000       4.000000   

           Signed_In  
count  458441.000000  
mean        0.700930  
std         0.457851  
min         0.000000  
25%         0.000000  
50%         1.000000  
75%         1.000000  
max         1.000000  
(458441, 5)
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9, 
            ...
            458431, 458432, 458433, 458434, 458435, 458436, 458437, 458438,
            458439, 458440],
           dtype='int64', length=458441)
Index([u'Age', u'Gender', u'Impressions', u'Clicks', u'Signed_In'], dtype='object')
<bound method DataFrame.groupby of         Age  Gender  Impressions  Clicks  Signed_In
0        36       0            3       0          1
1        73       1            3       0          1
2        30       0            3       0          1
3        49       1            3       0          1
4        47       1           11       0          1
5        47       0           11       1          1
6         0       0            7       1          0
7        46       0            5       0          1
8        16       0            3       0          1
9        52       0            4       0          1
10        0       0            8       1          0
11       21       0            3       0          1
12        0       0            4       0          0
13       57       0            6       0          1
14       31       0            5       0          1
15        0       0            6       0          0
16       40       1            3       0          1
17       31       1            5       0          1
18       38       0            4       0          1
19        0       0            5       0          0
20       59       1            4       0          1
21       61       0            6       0          1
22       48       0            7       0          1
23       29       1            2       0          1
24        0       0            4       0          0
25       19       1            4       0          1
26       19       0            3       0          1
27       48       1            9       0          1
28       48       1            4       0          1
29       21       1            5       0          1
...     ...     ...          ...     ...        ...
458411   55       1            2       0          1
458412   68       1            7       0          1
458413    0       0            7       1          0
458414   21       0            5       0          1
458415   35       0            6       0          1
458416   26       1            6       0          1
458417   41       0           11       0          1
458418   58       1            4       0          1
458419   46       1            5       0          1
458420   45       0           11       0          1
458421   46       1            4       0          1
458422   47       0            6       0          1
458423   22       1            8       0          1
458424   21       0           10       0          1
458425   40       1            6       0          1
458426   49       0            9       0          1
458427   43       1            9       0          1
458428   40       1            4       0          1
458429   49       1           11       0          1
458430    0       0            6       0          0
458431   21       1            5       0          1
458432   30       0            4       0          1
458433   21       1            2       0          1
458434   61       1            6       0          1
458435   51       0            7       0          1
458436    0       0            2       0          0
458437    0       0            4       0          0
458438   72       1            5       0          1
458439    0       0            5       0          0
458440    0       0            3       0          0

[458441 rows x 5 columns]>
<pandas.core.groupby.DataFrameGroupBy object at 0x7f71c46ed910>
0         36
1         73
2         30
3         49
4         47
5         47
6          0
7         46
8         16
9         52
10         0
11        21
12         0
13        57
14        31
15         0
16        40
17        31
18        38
19         0
20        59
21        61
22        48
23        29
24         0
25        19
26        19
27        48
28        48
29        21
          ..
458411    55
458412    68
458413     0
458414    21
458415    35
458416    26
458417    41
458418    58
458419    46
458420    45
458421    46
458422    47
458423    22
458424    21
458425    40
458426    49
458427    43
458428    40
458429    49
458430     0
458431    21
458432    30
458433    21
458434    61
458435    51
458436     0
458437     0
458438    72
458439     0
458440     0
Name: Age, dtype: int64

Exercise: early data exploration and pandas syntax

Practice the following steps with these 4 data sets located at http://teachingdatascience.github.io/Rdatasets/datasets.html.

Copy the csv link and use that with read_csv to import it.

DATA SETS

  1. chickwts
  2. infert
  3. msleep
  4. Cars93

INSTRUCTIONS

  1. import the data into python using pd.read_csv()
  2. call the dtypes variable to identify the columns and their data types.
  3. call the describe function to collect statistics about numerical data types.
  4. for each other column (object columns):use
    • dataframe['column'].unique() and
    • dataframe.groupby('column')['column'].count()
  5. Write a short markdown that describes the data set and its columns.

Data Munging basics with pandas

What interesting data point do we learn when we run the following code?


In [72]:
print nyt.groupby(['Signed_In', 'Gender']).Age.describe()


Signed_In  Gender       
0          0       count    137106.000000
                   mean          0.000000
                   std           0.000000
                   min           0.000000
                   25%           0.000000
                   50%           0.000000
                   75%           0.000000
                   max           0.000000
1          0       count    153070.000000
                   mean         43.423336
                   std          16.763906
                   min           7.000000
                   25%          30.000000
                   50%          42.000000
                   75%          55.000000
                   max         108.000000
           1       count    168265.000000
                   mean         40.823701
                   std          15.780505
                   min           7.000000
                   25%          28.000000
                   50%          40.000000
                   75%          52.000000
                   max         107.000000
dtype: float64

In the Signed_In 0 group, we notice both Age and Gender are also always 0. We know this by:

  • In the Signed_In 0 group, there is only one Gender group as well in the DataFrame Index (0)
  • the min/max age for Signed_in group 0 is also 0.

This intuitively makes sense when working with marketing data: if a user is not logged in, you likely do not know their age or gender either.

Data does not always come in forms that we expect, so it is generally a requirement for us to work through a process called "data munging," which is the process of extracting and cleaning up a data set. Given that, we will explore the basics of data munging and aggregation:

  • Filter
  • Sort
  • Select and Unique
  • Mutate and Transform
  • Pivot and Aggregate

Much of this fits into the Split-Apply-Combine strategy of data analysis, popularized by Hadley Wickham's R package plyr (and later, dplyr). In fact, many of these concepts are shared between R and pandas, the primary difference being the syntax.

While we go through each of these examples, take good notes, comment through the code, and write questions that we can come back to.

Filter

Say we want to only focus on our marketing analysis on users we have data for--after all, it'd be easier to target users to convert on ads if we have a target demographic that performs well. Earlier we used a filtering command that finds rows that return True.


In [75]:
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_signedin_only = nyt[nyt['Signed_In'] == 1]
nyt_signedin_only.groupby('Gender').describe()


Out[75]:
Age Clicks Impressions Signed_In
Gender
0 count 153070.000000 153070.000000 153070.000000 153070
mean 43.423336 0.073117 5.012733 1
std 16.763906 0.271194 2.238426 0
min 7.000000 0.000000 0.000000 1
25% 30.000000 0.000000 3.000000 1
50% 42.000000 0.000000 5.000000 1
75% 55.000000 0.000000 6.000000 1
max 108.000000 3.000000 17.000000 1
1 count 168265.000000 168265.000000 168265.000000 168265
mean 40.823701 0.069991 5.008629 1
std 15.780505 0.266324 2.239114 0
min 7.000000 0.000000 0.000000 1
25% 28.000000 0.000000 3.000000 1
50% 40.000000 0.000000 5.000000 1
75% 52.000000 0.000000 6.000000 1
max 107.000000 3.000000 20.000000 1

Practice filtering for the following:

  • rows where age is less than 30
  • rows where gender is 0 but signed_in is 1
  • rows where age is less than 30 or more than 60.

Selecting Rows, Columns, and Finding Uniques

Earlier we selected columns (or a list of columns) using key notation (like with dictionaries). Pandas also accepts dot notation (akin to an object mapper, or javascript object notation).

To select rows based on the index, we can use .ix[], which accepts either keys (when the index is a string) or a range (when the index is integer based). (pandas does allow more specific typing, loc and iloc)

Otherwise, passing a range in will do fine.


In [77]:
# Finding the first ten rows
print nyt[0:10]

# This will still return the first ten rows
print nyt.set_index('Age')[0:10]

# Since the index is positional, this will also return the first 10 rows.
print nyt.ix[0:10]

# This first one will work fine, as it returns back all rows where Age == 40.
# The second will fail, as .ix does not treat the index as positional, when the index does not represent the row order.
print nyt.set_index('Age').ix[40]
try:
    print nyt.set_index('Age').ix[40:45]
except Exception as e:
    print e.message

# Using dot notation and returning uniques of that column
print nyt.Age.head()
print nyt.Age.unique()


   Age  Gender  Impressions  Clicks  Signed_In
0   36       0            3       0          1
1   73       1            3       0          1
2   30       0            3       0          1
3   49       1            3       0          1
4   47       1           11       0          1
5   47       0           11       1          1
6    0       0            7       1          0
7   46       0            5       0          1
8   16       0            3       0          1
9   52       0            4       0          1
     Gender  Impressions  Clicks  Signed_In
Age                                        
36        0            3       0          1
73        1            3       0          1
30        0            3       0          1
49        1            3       0          1
47        1           11       0          1
47        0           11       1          1
0         0            7       1          0
46        0            5       0          1
16        0            3       0          1
52        0            4       0          1
    Age  Gender  Impressions  Clicks  Signed_In
0    36       0            3       0          1
1    73       1            3       0          1
2    30       0            3       0          1
3    49       1            3       0          1
4    47       1           11       0          1
5    47       0           11       1          1
6     0       0            7       1          0
7    46       0            5       0          1
8    16       0            3       0          1
9    52       0            4       0          1
10    0       0            8       1          0
     Gender  Impressions  Clicks  Signed_In
Age                                        
40        1            3       0          1
40        1            3       0          1
40        0            3       0          1
40        1            7       0          1
40        1            9       0          1
40        1            6       0          1
40        1            9       0          1
40        0            4       0          1
40        0            5       0          1
40        0            6       0          1
40        0            3       0          1
40        1            8       0          1
40        0            2       0          1
40        0            4       0          1
40        1            6       0          1
40        1            6       0          1
40        1            4       0          1
40        0            2       0          1
40        0            8       0          1
40        1            2       0          1
40        1            4       0          1
40        0            5       0          1
40        0            2       0          1
40        0            2       0          1
40        1            4       0          1
40        0            7       0          1
40        1            8       0          1
40        1           11       0          1
40        1            2       0          1
40        1            3       0          1
..      ...          ...     ...        ...
40        1            5       0          1
40        0           11       0          1
40        1            9       0          1
40        0            4       0          1
40        1            3       0          1
40        1            5       0          1
40        1            8       0          1
40        1            8       0          1
40        1            3       0          1
40        1            2       0          1
40        1            3       0          1
40        0            4       0          1
40        0            2       0          1
40        1            3       0          1
40        0            7       0          1
40        1            7       0          1
40        0            4       0          1
40        0            1       0          1
40        0            5       0          1
40        0            3       0          1
40        0            2       0          1
40        1            6       0          1
40        1            3       0          1
40        1           11       0          1
40        1            4       0          1
40        1            6       0          1
40        1            2       0          1
40        1            6       0          1
40        1            6       0          1
40        1            4       0          1

[7702 rows x 4 columns]
Cannot get left slice bound for non-unique label: 40
0    36
1    73
2    30
3    49
4    47
Name: Age, dtype: int64
[ 36  73  30  49  47   0  46  16  52  21  57  31  40  38  59  61  48  29
  19  23  66  44  32  22  74  65  63  35  33  34  43  18  17  62  27  11
  42  41  14  64  24  26  39  71  45  28  83  79  67  51  60  58  53  25
  50  76  15  78  55  37  56  77  85  82  13  70  84  54  20  10  81  68
  72  69  87  80  88  12  86  75  89   9  96 102  98  90  99  91  92  95
   8  94  93  97   7 100 103 107 101 108 104]

Mutate, Transform (apply)

There are three most common techniques in mutating or transforming the data.

One approach would be to directly use values from other columns to derive new columns. The other would be to apply functions to columns using a named or lambda function.

You can generate the new columns by defining their name as a key.


In [ ]:
nyt['column_of_ones'] = 1

nyt['saw_ad_many_times'] = nyt['Impressions'].apply(lambda x: 1 if x > 5 else 0)

def saw_add_func(x):
    if x > 5:
        return 1
    else:
        return 0

nyt['saw_ad_many_times'] = nyt['Impressions'].apply(saw_add_func)

# practice on one column: 
# a common approach to missing data, let's either set the values to null, or -1:
# import numpy as np
# np.nan


# by default apply works with a single column, however you can use it to iterate by row as well (though it is slower)
nyt['not_signed_in'] = nyt.apply(lambda row: 0 if row['Signed_In'] else 1, axis=1)

Pivot and Aggregate

Earlier we saw use of the function groupby. A pandas groupby object has other functionality, such as returning statistics about a column, or all columns, or aggregating only what's defined.

Similarly, pivot tables can be used to explore groups side by side.


In [ ]:
# import numpy as np
nyt['column_of_ones'] = 1
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt['Click_Thru'] = nyt['Click_Thru'].apply(lambda x: 0 if x == np.inf else x)
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_group = nyt.groupby('Age')

print nyt_group.mean()

In [ ]:
print nyt_group['Click_Thru'].mean()

In [ ]:
print nyt_group.agg({
    'Click_Thru': ['mean', 'max'],
    'Gender': 'mean',
})

In [ ]:
print nyt[nyt.Signed_In == 1].pivot_table(
    values='column_of_ones',
    index='Gender',
    columns='Age',
    aggfunc='count'
)

Practice:

  1. Aggregate the mean and standard deviation (std) of Age when Signed_In equals 1
  2. Find the min values of Age for each value of Impressions
  3. Find the max values of Clicks for each value of Impressions
  4. Determine: Are Signed In users more or less likely to click thru on an ad?

Plotting

matplotlib's core functionality serves as a plotting tool within python. While calling .describe() on DataFrames is useful to get a rough idea of what your data looks like, plots allow you to visualize what your data really looks like. For today, we'll use matplotlib within the context of pandas.

Consider the following data set and code:


In [80]:
anscombe = pd.DataFrame({
    'x' :  [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'y1' : [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'y2' : [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'y3' : [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    'x4' : [8,8,8,8,8,8,8,19,8,8,8],
    'y4' : [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89],
})

anscombe.describe()


Out[80]:
x x4 y1 y2 y3 y4
count 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
mean 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
std 3.316625 3.316625 2.031568 2.031657 2.030424 2.030579
min 4.000000 8.000000 4.260000 3.100000 5.390000 5.250000
25% 6.500000 8.000000 6.315000 6.695000 6.250000 6.170000
50% 9.000000 8.000000 7.580000 8.140000 7.110000 7.040000
75% 11.500000 8.000000 8.570000 8.950000 7.980000 8.190000
max 14.000000 19.000000 10.840000 9.260000 12.740000 12.500000

Visually from creating the data frame you can tell the data looks different, yet in the .describe() call, the data shares very similar features. The two primary plotting tools we uses from matplotlib are histograms and scatterplots, which help us understand the shape of data.


In [85]:
%matplotlib inline
# above magic allows plots to show in our notebook.

for y in ['y1', 'y2', 'y3', 'y4']:
    if y != 'y4':
        print anscombe.plot(kind='scatter', x='x', y=y)
    else:
        print anscombe.plot(kind='scatter', x='x4', y=y)


Axes(0.125,0.125;0.775x0.775)
Axes(0.125,0.125;0.775x0.775)
Axes(0.125,0.125;0.775x0.775)
Axes(0.125,0.125;0.775x0.775)

Use these new tools in order to visualize some of this New York Times ad performance data.


In [87]:
nyt[nyt.Signed_In == 1].Age.hist()


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f719f0751d0>

In [88]:
print nyt[nyt.Signed_In == 1].Clicks.hist()


Axes(0.125,0.125;0.775x0.775)

In [90]:
nyt.describe()


Out[90]:
Age Gender Impressions Clicks Signed_In column_of_ones saw_ad_many_times
count 458441.000000 458441.000000 458441.000000 458441.000000 458441.000000 458441 458441.000000
mean 29.482551 0.367037 5.007316 0.092594 0.700930 1 0.385127
std 23.607034 0.481997 2.239349 0.309973 0.457851 0 0.486626
min 0.000000 0.000000 0.000000 0.000000 0.000000 1 0.000000
25% 0.000000 0.000000 3.000000 0.000000 0.000000 1 0.000000
50% 31.000000 0.000000 5.000000 0.000000 1.000000 1 0.000000
75% 48.000000 1.000000 6.000000 0.000000 1.000000 1 1.000000
max 108.000000 1.000000 20.000000 4.000000 1.000000 1 1.000000

In [92]:
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt[nyt.Signed_In == 1].plot(kind='scatter', x='Age', y='Click_Thru')


Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f719eec2790>

Independent Practice

Practice doing much of the same functionality we did today with two of the four different data sets from the repo above in small groups of 3 or 4.

Primarily, your goals are to:

  • Show how to load a csv file into python
  • Use pandas to understand the numerical portions of the data
  • Use matplotlib to visualize the data
  • Use scikit-learn to fit the data to some dependent feature (y)

Summarise your work with the following questions:

  1. What is the dataset about? What do the columns mean?
  2. What does the dataset look like?
  3. What data relationships can you find?
  4. Given those relationships/correlations, what does that help explain or help you understand about the data?

We will come back as a class and have each group present their answers to the above to one of the data sets they explored.

Next Steps

I encourage you to pick up and start reading the following books to continue learning about programming and data analytics.

Still uncomfortable with Python?

That's okay! Continue practicing your python basics with Learn Python the Hard Way. It'll strengthen your chops. Do it multiple times—you'll keep learning!

Feel like Pandas can revolutionize how you work?

Purchase and read Wes McKinney's Python for Data Analysis. It'll get you cranking on everything Wes did with Pandas and why it was designed the way it was.

Want to dig into why Data Science?

Check out Provost and Fawcett's Data Science for Business. You'll start understanding more about the business practice of Data Science (vs the academic side) and some fundamentals about algorithm and stats application.