Pandas: Introduction

Pandas is Python's library for dealing with structured or tabular data. Its main contributor, Wes McKinney, was inspired by R's data.frame and implemented a similar structure for Python.
It combines the speed of NumPy with the ease of SQL, as Wes puts it, and I completely agree. If you have used R and the dplyr package, you know how easy it is to manipulate data with it.

We will be learning about various methods to deal with data, and occasionally we will make things a little challenging to mimic real-world conditions. And while we're at it, we will throw in visualisations using Matplotlib too! The best way to learn is to write the code yourself, but don't worry if you don't understand all of it in the first go. And of course, feel free to take a step back and revisit Lesson 10.

By the end of it, we will have worked through a few case studies, which should be an excellent start for your portfolio.

We will cover at the very least the following topics:

  • Indexing and Selection
  • Creating new columns
  • Renaming
  • Grouping
  • Handling missing values
  • Merge, join
  • map(), apply(), applymap()
  • Pivot Tables
  • Basic statistics
  • Plots (throughout the exercise)

I say "at the very least" because in my opinion, this is the bare minimum you should know to handle data science problems 'in the wild', as in, problems that aren't toy problems, and the kind that data scientists deal with every day.

Importing Libraries

As usual, we begin by importing our libraries. Just as with NumPy, which we import as np, we will import the pandas library as pd. It's just convention, and you're free to import it as chuck_norris, really_long_name_for_no_reason_in_particular or just plain and simple, pd.


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

Data Structures

There are three fundamental data structures supported by Pandas:

  • Series: a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). For those coming from an R background, Series is much like a Vector.
  • DataFrame: a 2-dimensional labeled data structure with columns of potentially different types.
  • Panel: also called longitudinal data or cross-sectional time series data, where multiple cases (people, firms, countries etc) are observed at two or more time periods. This is rarely used though, and I personally haven't come across it outside of some Econometrics courses I took in my undergraduate years. Note that Panel has since been deprecated, and removed in recent versions of pandas, in favour of DataFrames with a MultiIndex, as sketched below.
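For reference, since Panel is gone from modern pandas, here is a minimal sketch of the recommended replacement: a DataFrame with a two-level MultiIndex. The firms and figures are made up purely for illustration.

In [ ]:
import pandas as pd

# Hypothetical panel-style data: two firms observed over two years
idx = pd.MultiIndex.from_product([['Acme', 'Initech'], [2015, 2016]],
                                 names=['firm', 'year'])
panel_df = pd.DataFrame({'revenue': [233, 215, 93, 85]}, index=idx)
panel_df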

Series

The basic format to create a Series is:
series_a = pd.Series(data, index = index_name)

The default index is 0, 1, 2, 3... and so on, and doesn't need to be specified, except in the case of scalars.


In [ ]:
import pandas as pd

# From a list of values
series_1 = pd.Series([1,2,3,4,5])
series_1

Notice the 0,1,2,3... on the left side? That's called the Index. It starts from 0, but you can rename it.


In [ ]:
series_1 = pd.Series([1,2,3,4,5], index = ['Mon','Tue','Wed','Thu','Fri'])
series_1

In [ ]:
# From a scalar value
series_2 = pd.Series(1.0, index = ['a','b','c','d','e'])
series_2

In [ ]:
import pandas as pd
import numpy as np

# From an array

# Just copy this for now, we'll cover the 'seed' in DataFrames
np.random.seed(42) 

series_3 = pd.Series(np.random.randn(5))
series_3

In [ ]:
np.random.seed(42)
series_3 = pd.Series(np.random.randn(5), index = ['a','b','c','d','e'])
series_3

In [ ]:
np.random.seed(42)
ind_1 = ['a','b','c','d','e']
series_3 = pd.Series(np.random.randn(5), index = ind_1)
series_3

In [ ]:
series_4 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series_4

We can subset and get values from the series.


In [ ]:
series_4['a'] == series_4[0]
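Note that in newer versions of pandas, positional access with plain square brackets on a labelled Series is deprecated; the explicit indexers are safer. A quick sketch of the equivalent:

In [ ]:
# .loc is label-based, .iloc is position-based
series_4.loc['a'] == series_4.iloc[0]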

In [ ]:
series_4[series_4>3]

In [ ]:
series_4[series_4%2==0]

In [ ]:
series_5 = pd.Series([1,2,3,4,5], index = ['HP', 'GS', 'IBM', 'AA', 'FB'])
series_5

In [ ]:
series_5['IBM']

In [ ]:
tech_pf1 = series_5[['HP', 'IBM', 'FB']]
tech_pf1

In [ ]:
# From a Dictionary
dict_01 = {'Gavin' : 50, 'Russ' : 100, 'Erlich' : 150}
series_6 = pd.Series(dict_01)
series_6

In [ ]:
# Reordering the previous series
index = ['Gavin', 'Russ', 'Erlich', 'Peter']
series_7 = pd.Series(dict_01, index=index)
series_7

Notice the NaN, which stands for Not a Number. We will be dealing with it extensively when working with DataFrames. It is an indicator for missing or corrupted data. Here's how we test for it.


In [ ]:
pd.isnull(series_7)

And here's a nice discussion on the topic from our friends at StackOverflow.


In [ ]:
# Pandas is very smart, and aligns the series for mathematical operations
series_6 + series_7
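If you would rather treat missing labels as zero instead of propagating NaN, the add method takes a fill_value argument. A small sketch, where bonus is a hypothetical series:

In [ ]:
# Labels present on only one side are treated as 0, not NaN
bonus = pd.Series({'Gavin': 10, 'Peter': 25})
series_6.add(bonus, fill_value=0)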

In [ ]:
# Renaming an Index
series_7.index.name = "Names"
series_7

In [ ]:
# Naming a Series
series_7.name = "SV"
series_7

Mini-Project


In [ ]:
goals = pd.Series([20,19,21,24,1], index = ["Messi", "Neymar", "Zlatan", "Ronaldo", "N’Gog"])
goals

In [ ]:
# Who scored less than 20 goals?
goals[goals<20]

In [ ]:
# What is the average number of goals scored?
goals.mean()

In [ ]:
# What is the median number of goals scored?
goals.median()

In [ ]:
# What is the range of goals scored? (Range = Max - Min)
goals_range = goals.max() - goals.min()
print(goals_range)

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (15,7)

In [ ]:
# Plot the goals in a bar chart
goals.plot(kind = "bar")

In [ ]:
# Let's beautify that a little
goals.plot(kind = "barh", title = "Goal Scorers")

Read more about these here.

DataFrames

The DataFrame is, in many respects, the real Pandas. Usually, if you're using Pandas, it will be to use DataFrames.
We will begin by creating DataFrames and covering the usual indexing and selection mechanisms. In reality, you will probably never have to 'create' a DataFrame, but practise these skills here to get comfortable with hierarchies, indices and selections. Then we will move on to reading data from multiple formats, including spreadsheets, JSON files and API endpoints.

By the way, during these examples, we will always set the seed first when generating random numbers. If you're coming from R, this is the same as set.seed(). In Python, we use the random.seed function from numpy, which you can read about here. You can set it to any number you like; I usually set it to 42 just out of habit, but that's not to say you can't set it to an arbitrary number like 27 or 2012. Use the same numbers as this notebook, though, to replicate the results. Also note that we need to set the seed in every cell whose results we want replicated.

You will see later how this is good practice, especially when sharing your work with other members of the team - they will be able to reproduce your work on their machines because the pseudo-random numbers are generated deterministically from the seed.


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
# Let's start with a standard array
arr1 = np.array([[40,40,75,95],[80,85,120,130],
                 [155,160,165,170],[200,245,250,260]])
print(arr1.shape)
print(arr1.size)
print(arr1)

In [ ]:
# It is quite common to assign a dataframe the name 'df', although you can
# use a relevant name, such as baseball_stats or book_sales
# It's always good to use context-driven names - you should code expecting
# someone else to read it a few months down the line

df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","), 
                  columns = "Jan,Feb,Mar,Apr".split(","))
df

Indexing and Selection

Selecting Columns


In [ ]:
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","), 
                  columns = "Jan,Feb,Mar,Apr".split(","))
df

In [ ]:
# Selecting columns
df[['Jan']]

In [ ]:
df[['Jan','Feb']]

In [ ]:
df[['Mar','Jan']]

It's interesting to note that the official Pandas documentation describes a DataFrame as follows:

Can be thought of as a dict-like container for Series objects.

You can access a single column as a Series, as below:


In [ ]:
df['Jan']

In [ ]:
print('Series:', type(df['Jan']))
print('DataFrame:',type(df[['Jan']]))

Using loc and iloc


In [ ]:
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","), 
                  columns = "Jan,Feb,Mar,Apr".split(","))
df

In [ ]:
# For selecting by Label
df.loc[['Tony']]

In [ ]:
df.loc[['Peter','Bruce']]

In [ ]:
df.loc[['Peter','Bruce'],['Jan','Feb']]

In [ ]:
# All of Peter's data
df.loc[["Peter"]][:]

In [ ]:
df.loc["Peter"][:]

In [ ]:
df

In [ ]:
# Integer-location based indexing for selection by position
# Note how this returns a DataFrame
df.iloc[[0]]

In [ ]:
# and this returns a Series
df.iloc[0]

In [ ]:
# Narrowing down further
df.iloc[[0],[1]]

In [ ]:
# Replicating the results from our use of the loc statement
df.iloc[[0,2]]

In [ ]:
# Compare to df.loc[['Peter','Bruce'],['Jan','Apr']]
df.iloc[[0,2],[0,3]]

There's another indexer named ix. I have rarely used it, and both loc and iloc take care of all my selection needs. Note that ix has since been deprecated, and removed in modern versions of pandas, so treat the cell below as historical. You can read about it here.

Also, check out the similarity of outputs below:


In [ ]:
# ix is deprecated; shown for comparison only
df.ix[0:3]

In [ ]:
df.iloc[0:3]

Conditional Selection

While exploring data sets, one often has to use conditional selection, whether to probe the data or to create subsets to work with.


In [ ]:
df

In [ ]:
df[df%2 == 0]

In [ ]:
df%2 == 0

In [ ]:
df < 100

In [ ]:
df[df<100]

In [ ]:
df

In [ ]:
df[df['Jan']>100][['Apr']]

In [ ]:
df[df['Jan']<100][['Feb','Apr']]

In [ ]:
# Using multiple conditions
df[(df['Jan'] >= 80) & (df['Mar']>100)]

Did you notice that we used & instead of and? When using Pandas, we have to use the symbol, not the word. Here's a StackOverflow discussion on this.
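The same parenthesised style applies to the other boolean operators, | for 'or' and ~ for 'not'. A quick sketch:

In [ ]:
# | is 'or'; each condition still needs its own parentheses
df[(df['Jan'] < 80) | (df['Apr'] > 200)]

In [ ]:
# ~ negates a condition
df[~(df['Jan'] >= 80)]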

Creating New Columns


In [ ]:
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","), columns = "Jan,Feb,Mar,Apr".split(","))
df

In [ ]:
df["Dec"] = df["Jan"] + df["Mar"]
df

Removing Columns

While fundamentally adding and removing columns ought to be similar operations, there are a few differences. Let's see if you can figure them out.


In [ ]:
df

In [ ]:
df.drop('Dec', axis = 1)

First, we had to specify the axis: 0 is for rows, 1 is for columns.


In [ ]:
df

Why is 'Dec' still there? Here lies the difference - while removing columns, we have to specify that the operation should be inplace. Read about it in the official documentation.


In [ ]:
df.drop('Dec', axis = 1, inplace = True)
df

And just for the sake of completion, let's temporarily kick out Tony from the table. Temporary, since it's not inplace.


In [ ]:
df.drop('Tony', axis = 0)
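As an aside, many pandas users prefer plain reassignment over inplace = True, since it makes the data flow explicit. A minimal sketch:

In [ ]:
# Assigning the result leaves the original df untouched
df_no_tony = df.drop('Tony', axis = 0)
df_no_tony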

In [ ]:
# Renaming Columns
df.rename(columns={'Jan': 'January'}, inplace=True)
df

In [ ]:
df.rename(columns={'Feb': 'February', 'Mar': 'March', 'Apr': 'April'}, inplace=True)
df

DataFrame from a Dictionary

Let's create a new dataframe from a dictionary, and then apply some of the selection techniques we just learnt.


In [ ]:
dict1 = {'first_name': ['Erlich', 'Richard', "Dinesh", 'Gilfoyle', 'Nelson'],
         'second_name': ['Bachman', 'Hendricks', np.nan, np.nan, 'Bighetti'],
        'occupation': ['Investor', 'Entrepreneur', 'Coder', 'Coder', 'Bench Warmer'],
        'age': [40, 30, 28, 29, 28]}
df = pd.DataFrame(dict1, columns = ['first_name', 'second_name','occupation', 'age'])
df

In [ ]:
# Who is under 30 years of age?
df[df["age"]<30]

In [ ]:
# Who are the coders?
df[df["occupation"] == "Coder"]

In [ ]:
# Multiple Conditions : Coders, below 30
# Note that conditions are Boolean Series, as shown below
coders = df["occupation"] == "Coder"
und_30 = df["age"]<30
df[coders & und_30]

In [ ]:
df[df["second_name"].notnull()]

Exercise


In [ ]:
np.random.seed(42)
np.random.randn(4,4)

In [ ]:
np.random.seed(42)
df = pd.DataFrame(np.random.randn(4,4), index = "Peter,Clarke,Bruce,Tony".split(","), columns = "Jan,Feb,Mar,Apr".split(","))
df

In [ ]:
# Who scored greater than 0 in Apr?
df[df["Apr"] > 0][["Apr"]]

In [ ]:
# Who scored below 0 in March?

In [ ]:
# In which month/months did Clarke score above 0?

In [ ]:
# Find the highest scores for each month 
# Hint: .max()

In [ ]:
# Find the lowest scores for each month

In [ ]:
# Plot the highest score for each month in a bar graph

Handling Missing Data

Pay special attention to this section. If needed, spend some extra time to cover all the relevant techniques.
Never in my experience have I come across a 100% clean data set 'in the wild'. Most of the data sets you train with will be complete, but real-world data is messy and incomplete. Even high-quality financial data from exchanges will often have missing data points. The less said about unstructured data like text, the better.

TL;DR: If you're going to fight Mike Tyson, don't train to fight Mr Bean.

What is Missing Data?

Data can be missing because:

  • It was never captured
  • The data does not exist
  • It was captured but got corrupted

In Pandas, missing data will be represented as None or NaN.


In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
                  'SF':[4,3,8,7,15],
                  'CHI':[4,np.nan,np.nan,14,6],
                  'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df

The first thing we can do is drop rows with missing values, using the dropna() function. By default rows are dropped, but you can change this to columns as well.


In [ ]:
df.dropna()

In [ ]:
df.dropna(axis = 0)

In [ ]:
df.dropna(axis = 1)

While this can be helpful in some ways, if your dataset is small, you are losing a significant portion of your data. For example, if 100 rows out of 1 million rows have missing data, that's negligible, and can potentially be thrown away. What if you have 10 out of 85 rows with incorrect, unusable or missing data?
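One compromise worth knowing about is dropna's thresh parameter, which keeps any row with at least the given number of non-missing values, and subset, which restricts the check to particular columns. A quick sketch:

In [ ]:
# Keep rows with at least 3 non-missing values
df.dropna(thresh = 3)

In [ ]:
# Only drop rows where 'NYC' itself is missing
df.dropna(subset = ['NYC'])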


In [ ]:
df2 = df.copy()

In [ ]:
df2

In [ ]:
df2.mean()

In [ ]:
# Are these really the means though?
df

In [ ]:
mean = df2['SF'].mean()
mean

Imputation

Using the fillna function, we can replace missing values.


In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
                  'SF':[4,3,8,7,15],
                  'CHI':[4,np.nan,np.nan,14,6],
                  'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df

In [ ]:
df.mean()

In [ ]:
df.fillna(value = df.mean(), inplace = True)
df

In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
                  'SF':[4,3,8,7,15],
                  'CHI':[4,np.nan,np.nan,14,6],
                  'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df

In [ ]:
df3 = df.copy()
df3

In [ ]:
median = df3['SF'].median()
median

In [ ]:
df3.fillna(value = median, inplace = True)
df3

In [ ]:
df3.mode()
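Another common strategy, not shown above, is forward- or backward-filling, which carries the nearest valid observation into the gap. This is especially natural for time-indexed data. A small sketch:

In [ ]:
# ffill() carries the last valid value down each column; bfill() works upwards
df.ffill()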

But sometimes the missing information isn't part of the table at all. Consider the scenario below: we know that the table contains names of female babies, but gender is missing from our dataset.


In [ ]:
baby_names = {
        'id': ['101', '102', '103', '104', '105'],
        'first_name': ['Emma', 'Madison', 'Hannah', 'Grace', 'Emily']
            }
df_baby = pd.DataFrame(baby_names, columns = ['id', 'first_name'])
df_baby

In [ ]:
df_baby.columns

In [ ]:
df_baby["gender"] = "F"

In [ ]:
df_baby

In [ ]:
df_baby['gender'] = 0

In [ ]:
df_baby

Interpolation

Read up more on the interpolate function here and here.


In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df

In [ ]:
df["Date"].head()

In [ ]:
df["Date"] = pd.to_datetime(df["Date"])

In [ ]:
df.set_index(["Date"], inplace = True)

In [ ]:
df.head()

In [ ]:
df.tail()

In [ ]:
df.head(3)

In [ ]:
df.describe()

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,5)

In [ ]:
df.plot(kind="line")

In [ ]:
df["Water"].plot(kind="line")

In [ ]:
df.interpolate(method = "linear", inplace = True)

In [ ]:
df.head(5)

In [ ]:
df.interpolate().count()

In [ ]:
df[["Latte", "Water"]].plot(kind="line")

Keep in mind though, that these are at best approximations.

A Quick Detour into some Data Viz

Install Vincent by running the following line in your command line:

Python 2.x: pip install vincent
Python 3.x: pip3 install vincent


In [ ]:
import vincent
vincent.core.initialize_notebook()

In [ ]:
line = vincent.Line(df)
line.axis_titles(x='Date', y='Amount')

In [ ]:
line = vincent.Line(df[["Latte", "Water"]])
line.axis_titles(x='Date', y='Amount')

In [ ]:
stacked = vincent.StackedArea(df)
stacked.axis_titles(x='Date', y='Amount')
stacked.legend(title='Cafe Sales')
stacked.colors(brew='Spectral')

Read about using the Vincent package here.

The latest update to Matplotlib, v2.0.0, has really improved the quality of the graphics, but it's still not quite production-ready; on the positive side, it is stable and has a large community of users. Niche packages like Vincent can produce some amazing graphics right out of the box with minimal tweaking, but they may not be very mature. Nevertheless, as Data Scientists, it's good to learn about new packages, especially those that help you communicate your results to a non-technical audience. If people don't understand what you do, they won't think what you do is important!

Merge, Join, Concatenate

Merge


In [ ]:
customers = {
        'customer_id': ['101', '102', '103', '104', '105'],
        'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'], 
        'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
df_1 = pd.DataFrame(customers, columns = ['customer_id', 'first_name', 'last_name'])
df_1

In [ ]:
orders = {
        'customer_id': ['101', '104', '105', '108', '111'],
        'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'], 
        'order_value': ['10000', '25000', '1100', '5000', '4400']}
df_2 = pd.DataFrame(orders, columns = ['customer_id', 'order_date', 'order_value'])
df_2

In [ ]:
pd.merge(df_1, df_2, how = 'inner', on = 'customer_id')

In [ ]:
pd.merge(df_1, df_2, how = 'left', on = 'customer_id')

In [ ]:
pd.merge(df_1, df_2, how = 'right', on = 'customer_id')

In [ ]:
pd.merge(df_1, df_2, how = 'outer', on = 'customer_id')
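One handy option worth knowing: passing indicator = True adds a _merge column that tells you whether each row came from the left frame, the right frame, or both. A quick sketch:

In [ ]:
# _merge will read 'left_only', 'right_only' or 'both'
pd.merge(df_1, df_2, how = 'outer', on = 'customer_id', indicator = True)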

Join


In [ ]:
customers = {
        'customer_id': ['101', '102', '103', '104', '105'],
        'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'], 
        'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
customers

In [ ]:
orders = {
        'customer_id': ['101', '104', '105', '108', '111'],
        'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'], 
        'order_value': ['10000', '25000', '1100', '5000', '4400']}
orders

In [ ]:
df1_new = pd.DataFrame.from_dict(customers, orient='columns', dtype=None)

In [ ]:
df1_new

In [ ]:
df1_new = df1_new.set_index('customer_id')
df1_new

In [ ]:
df2_new = pd.DataFrame.from_dict(orders, orient='columns', dtype=None)
df2_new

In [ ]:
df2_new = df2_new.set_index('customer_id')
df2_new

In [ ]:
df1_new.join(df2_new,how = "inner")

In [ ]:
df1_new.join(df2_new,how = "outer")

In [ ]:
df1_new.join(df2_new,how = "left")

In [ ]:
df1_new.join(df2_new,how = "right")

In [ ]:
# Alternate Way : I don't recommend this
df_1.join(df_2, on = "customer_id", lsuffix='_l', rsuffix='_r')

Concatenate


In [ ]:
customers = {
        'customer_id': ['101', '102', '103', '104', '105'],
        'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'], 
        'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
df_1 = pd.DataFrame(customers, columns = ['customer_id', 'first_name', 'last_name'])
df_1

In [ ]:
orders = {
        'customer_id': ['101', '104', '105', '108', '111'],
        'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'], 
        'order_value': ['10000', '25000', '1100', '5000', '4400']}
df_2 = pd.DataFrame(orders, columns = ['customer_id', 'order_date', 'order_value'])
df_2

In [ ]:
pd.concat([df_1,df_2])

In [ ]:
pd.concat([df_1,df_2],axis=0)

In [ ]:
pd.concat([df_1,df_2],axis=1)
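Notice the duplicated row labels in the row-wise concatenation above. If you don't care about the original indices, ignore_index = True builds a fresh 0 to n-1 index. A small sketch:

In [ ]:
# Rebuild a clean integer index instead of keeping duplicates
pd.concat([df_1, df_2], ignore_index = True)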

One final resource on why you would want to perform these operations in Pandas - and evidence on how fast it really is! http://wesmckinney.com/blog/high-performance-database-joins-with-pandas-dataframe-more-benchmarks/

Grouping, a.k.a. split-apply-combine

While analysing data, a Data Scientist very often has to perform aggregations, apply transformations like standardising data, and filter the dataset down to the relevant samples.

This is what the groupby function is primarily used for. Read more here.


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (15,7)

In [ ]:
paintball = {'Team': ['Super Ducks','Super Ducks', 'Super Ducks', 'Super Ducks', 'Super Ducks', 'Bobcats', 'Bobcats', 'Bobcats', 'Bobcats', 'Tigers', 'Tigers', 'Tigers', 'Tigers','Tigers','Tigers'], 
        'Name': ['Tony', 'Antonio', 'Felipe', 'Ryan', 'Mario', 'Sergio', 'Tanaka', 'Anderson', 'Joe', 'Floyd', 'Manny', 'Chris', 'Junior', 'George','Brock'],
        'Kills': ['1', '1', '1', '4', '3', '2', '2', '2','5', '1', '1', '7', '4','8','5'],    
        'Shots Fired Before': [17, 19, 22, 8, 13, 85, 64, 49, 74, 14, 20, 24,13,31,37],
        'Shots Fired After': [41, 73, 57, 30, 74, 37, 28, 40, 43, 18, 19, 21,13,32,39]}
df = pd.DataFrame(paintball, columns = ['Team', 'Name', 'Shots Fired Before', 'Shots Fired After','Kills'])
df

In [ ]:
# 'Name' and 'Kills' hold strings; numeric_only=True tells pandas to
# average only the numeric columns (newer versions require this)
df.groupby('Team').mean(numeric_only=True)

In [ ]:
byteam = df.groupby('Team')
byteam.count()

In [ ]:
byteam.describe()

In [ ]:
byteam.describe().transpose()['Bobcats']
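groupby also pairs nicely with agg, which lets you pick a different aggregation for each column. A quick sketch:

In [ ]:
# A different aggregation per column
df.groupby('Team').agg({'Shots Fired Before': 'mean',
                        'Shots Fired After': 'max'})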

In [ ]:
Team_Before = df[['Shots Fired Before']].groupby(df['Team']).mean()
Team_After = df[['Shots Fired After']].groupby(df['Team']).mean()

In [ ]:
Team_Before

In [ ]:
Team_After

In [ ]:
Team_Before.join(Team_After)

In [ ]:
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7) 
Team_Before.join(Team_After).plot(kind="bar")

Cool graph, but can we improve it, visually speaking? Yes of course we can! Let's look at some of the styles available within Matplotlib.


In [ ]:
plt.style.available

Personally I am quite partial to ggplot and seaborn, but not so much to fivethirtyeight. Let's try these.


In [ ]:
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7) 
Team_Before.join(Team_After).plot(kind="bar")

What about fivethirtyeight?


In [ ]:
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (15,7) 
Team_Before.join(Team_After).plot(kind="bar")

And seaborn. Note that seaborn is a visualisation library that works with Matplotlib. You can mimic the style without actually using it.


In [ ]:
plt.style.use('seaborn')
plt.rcParams["figure.figsize"] = (15,7) 
Team_Before.join(Team_After).plot(kind="bar")

In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
plt.style.use('seaborn-poster')
plt.rcParams["figure.figsize"] = (15,7) 
Team_Before.join(Team_After).plot(kind="bar")

In [ ]:
pd.crosstab(df["Team"], df["Kills"], margins = True)
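crosstab can also normalise the counts into proportions, which is often easier to read than raw frequencies. A small sketch:

In [ ]:
# normalize = 'index' converts each row to proportions that sum to 1
pd.crosstab(df["Team"], df["Kills"], normalize = 'index')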

In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,7)
plt.style.use('seaborn-deep')
df.groupby('Kills').mean(numeric_only=True).plot(kind="bar")

Apply

We can use the apply function to perform an operation over an axis in a dataframe.


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()

In [ ]:
df["Date"] = pd.to_datetime(df["Date"])
df.set_index(["Date"], inplace = True)
df.interpolate(method = "linear", inplace = True)

In [ ]:
df.head()

In [ ]:
#print(df.apply(np.cumsum))
df.apply(np.average)

In [ ]:
df.apply(lambda x: x.max() - x.min())

In [ ]:
# What columns have missing values?
df.apply(lambda x: sum(x.isnull()),axis=0)
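For what it's worth, the idiomatic shortcut for the same per-column count is a one-liner:

In [ ]:
# Equivalent to the apply above
df.isnull().sum()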

In [ ]:
# Using Apply to find missing values
# Obviously don't do this for datasets with thousands or millions of rows!
empty = df.apply(lambda col: pd.isnull(col))
empty

Map

The map function iterates over each element of a series.


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()

In [ ]:
df["Latte"] = df["Latte"].map(lambda x: x+2)

In [ ]:
df.head()

In [ ]:
df.interpolate(method = "linear", inplace = True)
df["Water"] = df["Water"].map(lambda x: x-1 if (x>0) else 0)

In [ ]:
df.head()
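map also accepts a dict or a Series as a lookup table, not just a function. A quick sketch with a hypothetical series:

In [ ]:
# Values not found in the dict become NaN
sizes = pd.Series(['S', 'L', 'S', 'M'])
sizes.map({'S': 'Small', 'M': 'Medium', 'L': 'Large'})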

ApplyMap


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()

In [ ]:
def to_int(x):
    if type(x) is float:
        x = int(x)
        return x 
    else:
        return x

In [ ]:
df.interpolate(method = "linear", inplace = True)
df.applymap(to_int).head()

Further Reading
Wes McKinney's amazing book, Python for Data Analysis, covers these functions in detail. Refer to Page 132.

Pivot Tables

Pivot tables are summarisation tables that help the user sort, count, total or average the data available in a dataset. If you have used Excel, you will be very familiar with them. If not, let's look at it from a fresh Pandas perspective.

Typically, there are four parameters, but you don't always have to specify every one of them, as we will see in the examples below.

  • index: An array or column of the dataset that will be used as the index of our new reshaped and aggregated DataFrame
  • columns: An array of the dataset that will provide columns to the new DataFrame
  • values: These are the values we wish to aggregate in each cell.
  • aggfunc: The function we will use to perform the aggregation

Sales Reports


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
# The 'xlrd' module gets imported automatically; if not, install it with
# 'pip install xlrd' (newer versions of pandas read .xlsx via 'openpyxl')
df = pd.read_excel("Data/bev-sales.xlsx")
df.head()

In [ ]:
df.tail()

In [ ]:
df.describe()

In [ ]:
help(pd.pivot_table)

In [ ]:
df.head()

In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc="sum")

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

In [ ]:
pd.pivot_table(df, index=["Sales Exec"],values=["Revenue"],aggfunc="sum").plot(kind="bar")

In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc="mean")

In [ ]:
pd.pivot_table(df, index=["Sales Exec", "Item"], values=["Revenue"], aggfunc="sum")

In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc=[np.sum])

In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Units sold", "Revenue"],aggfunc=[np.sum])

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams["figure.figsize"] = (15,7)


pd.pivot_table(df,index=["Sales Exec", "Item"],values=["Revenue"],aggfunc=[np.sum]).plot(kind="bar")
plt.title('January Sales Report')

In [ ]:
pd.pivot_table(df,index=["Sales Exec", "Item"],values=["Units sold", "Revenue"],
               columns=["Price per Unit"], aggfunc="sum", margins = True)

Tips


In [ ]:
df = pd.read_csv("Data/tips.csv")
df.head()

In [ ]:
df["tip_pc"] = df["tip"] / df["total_bill"]

In [ ]:
df.head()

In [ ]:
pd.pivot_table(df,index=["sex"], values = ["tip_pc"], aggfunc="mean")

In [ ]:
pd.pivot_table(df, index = ["smoker", "sex"], values = ["tip_pc"], aggfunc = "mean")

In [ ]:
pd.pivot_table(df,index=["sex"], values = ["total_bill","tip"], aggfunc="sum")

Bada Bing!


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [ ]:
df = pd.read_excel("Data/Sopranos/sopranos-killings.xlsx")
df.head()

In [ ]:
pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="first")

In [ ]:
pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="count", margins=True)

In [ ]:
whacked = pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="count")
whacked

In [ ]:
plt.style.available

In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.style.use('seaborn-deep')
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (15,7)
whacked.plot(kind = "bar", legend=None)
plt.title('How People Died on The Sopranos')

In [ ]:
with plt.style.context('ggplot', after_reset=True):
    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.rcParams["figure.figsize"] = (15,7)
    whacked.plot(kind = "bar", legend=None)
    plt.title('How People Died on The Sopranos')

In [ ]:
killer = pd.pivot_table(df,index=["Killer"],values = ["Season"], aggfunc="count")

In [ ]:
killer = killer.sort_values(by=["Season"], ascending = False)
killer

In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7)

killer[:10].plot(kind = "bar", legend=None)
plt.title('Top 10 Killers')

Basic Statistical Operations/Explorations


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("data/cafe_sales2015.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.set_index(["Date"], inplace = True)
df.interpolate(method = "linear", inplace = True)

In [3]:
df.head()


Out[3]:
            Latte  Cappuccino  Red Bull  Water
Date
2015-01-01   36.0          50      27.0   14.0
2015-01-02   53.0          73       0.0    2.0
2015-01-03   33.0          74       9.0    0.0
2015-01-04   45.0          42      30.0   24.0
2015-01-05   96.0          20      33.0   15.0

In [4]:
df.tail()


Out[4]:
            Latte  Cappuccino  Red Bull  Water
Date
2015-01-26   73.0          32      20.0   20.0
2015-01-27   30.0          75      16.0   49.0
2015-01-28   55.0          82      31.0   40.0
2015-01-29   37.0          59      50.0   30.0
2015-01-30   37.0          99      47.0    9.0

In [5]:
df.describe()


Out[5]:
           Latte  Cappuccino   Red Bull      Water
count  30.000000   30.000000  30.000000  30.000000
mean   47.800000   56.333333  26.333333  25.366667
std    26.990931   26.526933  14.816890  16.206072
min     1.000000    1.000000   0.000000   0.000000
25%    30.250000   43.500000  16.500000  13.250000
50%    43.000000   55.000000  27.000000  20.750000
75%    69.000000   74.750000  37.000000  40.750000
max    99.000000   99.000000  50.000000  49.000000

In [6]:
print("Mean\n", df.mean())
print("\n\nMedian\n", df.median())
print("\n\nMode\n", df.mode())


Mean
 Latte         47.800000
Cappuccino    56.333333
Red Bull      26.333333
Water         25.366667
dtype: float64


Median
 Latte         43.00
Cappuccino    55.00
Red Bull      27.00
Water         20.75
dtype: float64


Mode
    Latte  Cappuccino  Red Bull  Water
0   21.0        50.0      50.0    2.0
1   37.0        52.0       NaN   20.0
2   73.0        99.0       NaN   39.0
3    NaN         NaN       NaN   40.0
4    NaN         NaN       NaN   45.0
5    NaN         NaN       NaN   46.0

In [7]:
print("The Maximum value is:\n",df.max())
print("\n\nThe Minimum value is:\n",df.min())
print("\n\nKurtosis:\n",df.kurtosis())


The Maximum value is:
 Latte         99.0
Cappuccino    99.0
Red Bull      50.0
Water         49.0
dtype: float64


The Minimum value is:
 Latte         1.0
Cappuccino    1.0
Red Bull      0.0
Water         0.0
dtype: float64


Kurtosis:
 Latte        -0.723402
Cappuccino   -0.540181
Red Bull     -0.844065
Water        -1.460348
dtype: float64
