Pandas is Python's library for dealing with structured or tabular data. Its main creator, Wes McKinney, was inspired by R's data.frame and implemented a similar structure for Python.
As Wes puts it, it combines the speed of NumPy with the ease of use of SQL, and I completely agree with that.
If you have used R, and the dplyr package, you know how easy it is to manipulate data with it.
We will be learning about various methods to deal with data, and occasionally we will make things a little challenging to mimic real-world conditions. And while at it, we will throw in visualisations using Matplotlib too! The best way to learn is to write the code yourself, but don't worry if you don't understand all of it on the first go. And of course, feel free to take a step back and revisit Lesson 10.
By the end of it, we will have worked through a few case studies, which should be an excellent start for your portfolio.
We will cover at the very least the following topics:
I say "at the very least" because in my opinion, this is the bare minimum you should know to handle data science problems 'in the wild': real problems rather than toy problems, the kind that data scientists deal with every day.
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
There are three fundamental data structures supported by Pandas: the Series, the DataFrame, and the (rarely used) Panel. We will concentrate on the first two, starting with the Series.
In [ ]:
import pandas as pd
# From Scalar Values
series_1 = pd.Series([1,2,3,4,5])
series_1
Notice the 0,1,2,3... on the left side? That's called the Index. It starts from 0, but you can rename it.
In [ ]:
series_1 = pd.Series([1,2,3,4,5], index = ['Mon','Tue','Wed','Thu','Fri'])
series_1
In [ ]:
series_2 = pd.Series(1.0, index = ['a','b','c','d','e'])
series_2
In [ ]:
import pandas as pd
import numpy as np
# From an array
# Just copy this for now, we'll cover the 'seed' in DataFrames
np.random.seed(42)
series_3 = pd.Series(np.random.randn(5))
series_3
In [ ]:
np.random.seed(42)
series_3 = pd.Series(np.random.randn(5), index = ['a','b','c','d','e'])
series_3
In [ ]:
np.random.seed(42)
ind_1 = ['a','b','c','d','e']
series_3 = pd.Series(np.random.randn(5), index = ind_1)
series_3
In [ ]:
series_4 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series_4
We can subset and get values from the series.
In [ ]:
series_4['a'] == series_4[0]
In [ ]:
series_4[series_4>3]
In [ ]:
series_4[series_4%2==0]
In [ ]:
series_5 = pd.Series([1,2,3,4,5], index = ['HP', 'GS', 'IBM', 'AA', 'FB'])
series_5
In [ ]:
series_5['IBM']
In [ ]:
tech_pf1 = series_5[['HP', 'IBM', 'FB']]
tech_pf1
In [ ]:
# From a Dictionary
dict_01 = {'Gavin' : 50, 'Russ' : 100, 'Erlich' : 150}
series_6 = pd.Series(dict_01)
series_6
In [ ]:
# Reordering the previous series
index = ['Gavin', 'Russ', 'Erlich', 'Peter']
series_7 = pd.Series(dict_01, index=index)
series_7
Notice the NaN, which stands for Not a Number. It is an indicator of missing or corrupted data, and we will deal with it extensively when working with DataFrames. Here's how we test for it.
In [ ]:
pd.isnull(series_7)
And here's a nice discussion on the topic from our friends at StackOverflow.
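The complement, pd.notnull(), is handy when you want to keep only the entries that are actually present:
In [ ]:
# notnull() is True wherever a value is NOT missing
series_7[pd.notnull(series_7)]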
In [ ]:
# Pandas is very smart, and aligns the series for mathematical operations
series_6 + series_7
In [ ]:
# Renaming an Index
series_7.index.name = "Names"
series_7
In [ ]:
# Naming a Series
series_7.name = "SV"
series_7
In [ ]:
goals = pd.Series([20,19,21,24,1], index = ["Messi", "Neymar", "Zlatan", "Ronaldo", "N’Gog"])
goals
In [ ]:
# Who scored less than 20 goals?
goals[goals<20]
In [ ]:
# What is the average number of goals scored?
goals.mean()
In [ ]:
# What is the median number of goals scored?
goals.median()
In [ ]:
# What is the range of goals scored? (Range = Max - Min)
goals_range = goals.max() - goals.min()
print(goals_range)
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,7)
In [ ]:
# Plot the goals in a bar chart
goals.plot(kind = "bar")
In [ ]:
# Let's beautify that a little
goals.plot(kind = "barh", title = "Goal Scorers")
Read more about these here.
DataFrames are, in many respects, the real Pandas. Usually, if you're using Pandas, it will be to use DataFrames.
We will begin with creating DataFrames, and the usual indexing and selection mechanisms. In reality, you will probably never have to 'create' a DataFrame by hand, but practise these skills here to get comfortable with hierarchies, indices and selections. Then we will move on to reading data from multiple formats, including spreadsheets, JSON files and API endpoints.
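As a taste of what's ahead, Pandas exposes a read_* function for most common formats. The file names below are just placeholders to show the pattern:
In [ ]:
# Hypothetical paths, purely illustrative
# df_csv = pd.read_csv("data/some_file.csv")
# df_xlsx = pd.read_excel("data/some_file.xlsx")
# df_json = pd.read_json("data/some_file.json")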
By the way, during these examples we will always set the seed first when generating random numbers. If you're coming from R, this is the same as set.seed(). In Python, we use numpy's random.seed function, which you can read about here. You can set it to any number you like; I usually set it to 42 just out of habit, but that's not to say you can't set it to an arbitrary number like 27 or 2012. Use the same numbers as this notebook, though, to replicate the results. Also note that we need to set the seed in every cell whose results we want to replicate.
You will see later how this is good practice, especially when sharing your work with other members of the team: they will be able to reproduce your results on their machines, because the pseudo-random numbers are generated deterministically from the seed.
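To convince yourself, draw twice with the same seed; the two draws come out identical:
In [ ]:
import numpy as np
np.random.seed(42)
first_draw = np.random.randn(3)
np.random.seed(42)
second_draw = np.random.randn(3)
# Same seed, same pseudo-random numbers
print(np.array_equal(first_draw, second_draw))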
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
# Let's start with a standard array
arr1 = np.array([[40,40,75,95],[80,85,120,130],
[155,160,165,170],[200,245,250,260]])
print(arr1.shape)
print(arr1.size)
print(arr1)
In [ ]:
# It is quite common to assign a dataframe the name 'df', although you can
# use a relevant name, such as baseball_stats or book_sales
# It's always good to use context driven names - you should code expecting
# someone else to read it a few months down the line
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","),
columns = "Jan,Feb,Mar,Apr".split(","))
df
In [ ]:
# Selecting columns
df[['Jan']]
In [ ]:
df[['Jan','Feb']]
In [ ]:
df[['Mar','Jan']]
It's interesting to note that the official Pandas documentation refers to a DataFrame as:
Can be thought of as a dict-like container for Series objects.
You can access it as a Series as below:
In [ ]:
df['Jan']
In [ ]:
print('Series:', type(df['Jan']))
print('DataFrame:',type(df[['Jan']]))
In [ ]:
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","),
columns = "Jan,Feb,Mar,Apr".split(","))
df
In [ ]:
# For selecting by Label
df.loc[['Tony']]
In [ ]:
df.loc[['Peter','Bruce']]
In [ ]:
df.loc[['Peter','Bruce'],['Jan','Feb']]
In [ ]:
# All of Peter's data
df.loc[["Peter"]][:]
In [ ]:
df.loc["Peter"][:]
In [ ]:
df
In [ ]:
# Integer-location based indexing for selection by position
# Note how this returns a DataFrame
df.iloc[[0]]
In [ ]:
# and this returns a Series
df.iloc[0]
In [ ]:
# Narrowing down further
df.iloc[[0],[1]]
In [ ]:
# Replicating the results from our use of the loc statement
df.iloc[[0,2]]
In [ ]:
# Compare to df.loc[['Peter','Bruce'],['Jan','Apr']]
df.iloc[[0,2],[0,3]]
There's another indexer named ix, which has since been deprecated. I have rarely used it, and both loc and iloc take care of all my selection needs. You can read about it here.
Also, check out the similarity of outputs below:
In [ ]:
df.ix[0:3]
In [ ]:
df.iloc[0:3]
In [ ]:
df
In [ ]:
df[df%2 == 0]
In [ ]:
df%2 == 0
In [ ]:
df < 100
In [ ]:
df[df<100]
In [ ]:
df
In [ ]:
df[df['Jan']>100][['Apr']]
In [ ]:
df[df['Jan']<100][['Feb','Apr']]
In [ ]:
# Using multiple conditions
df[(df['Jan'] >= 80) & (df['Mar']>100)]
Did you notice that we used & instead of and? When using Pandas, we have to use the symbol, not the word, and wrap each condition in parentheses. Here's a StackOverflow discussion on this.
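The same goes for the other Boolean operators: | replaces or, and ~ replaces not. For example:
In [ ]:
# Rows where Jan is under 80 OR Apr is over 200
df[(df['Jan'] < 80) | (df['Apr'] > 200)]
In [ ]:
# Rows where Jan is NOT at least 80
df[~(df['Jan'] >= 80)]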
In [ ]:
df = pd.DataFrame(arr1, index = "Peter,Clarke,Bruce,Tony".split(","), columns = "Jan,Feb,Mar,Apr".split(","))
df
In [ ]:
df["Dec"] = df["Jan"] + df["Mar"]
df
In [ ]:
df
In [ ]:
df.drop('Dec', axis = 1)
First, we had to specify the axis: 0 is for rows, 1 is for columns.
In [ ]:
df
Why is 'Dec' still there? Here lies the difference: drop() returns a new DataFrame and leaves the original untouched, unless we specify that the operation should be inplace. Read about it in the official documentation.
In [ ]:
df.drop('Dec', axis = 1, inplace = True)
df
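As an aside, you can achieve the same effect without inplace by reassigning the result, which some people find clearer:
In [ ]:
# Equivalent to the inplace drop above; commented out since 'Dec' is already gone
# df = df.drop('Dec', axis = 1)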
And just for the sake of completeness, let's temporarily kick Tony out of the table. Temporarily, since it's not inplace.
In [ ]:
df.drop('Tony', axis = 0)
In [ ]:
# Renaming Columns
df.rename(columns={'Jan': 'January'}, inplace=True)
df
In [ ]:
df.rename(columns={'Feb': 'February', 'Mar': 'March', 'Apr': 'April'}, inplace=True)
df
In [ ]:
dict1 = {'first_name': ['Erlich', 'Richard', "Dinesh", 'Gilfoyle', 'Nelson'],
'second_name': ['Bachman', 'Hendricks', np.nan, np.nan, 'Bighetti'],
'occupation': ['Investor', 'Entrepreneur', 'Coder', 'Coder', 'Bench Warmer'],
'age': [40, 30, 28, 29, 28]}
df = pd.DataFrame(dict1, columns = ['first_name', 'second_name','occupation', 'age'])
df
In [ ]:
# Who is under 30 years of age?
df[df["age"]<30]
In [ ]:
# Who are the coders?
df[df["occupation"] == "Coder"]
In [ ]:
# Multiple Conditions : Coders, below 30
# Note that conditions are Boolean Series, as shown below
coders = df["occupation"] == "Coder"
und_30 = df["age"]<30
df[coders & und_30]
In [ ]:
df[df["second_name"].notnull()]
In [ ]:
np.random.seed(42)
np.random.randn(4,4)
In [ ]:
np.random.seed(42)
df = pd.DataFrame(np.random.randn(4,4), index = "Peter,Clarke,Bruce,Tony".split(","), columns = "Jan,Feb,Mar,Apr".split(","))
df
In [ ]:
# Who scored greater than 0 in Apr?
df[df>0][["Apr"]]
In [ ]:
# Who scored below 0 in March?
In [ ]:
# In which month/months did Clarke score above 0?
In [ ]:
# Find the highest scores for each month
# Hint: .max()
In [ ]:
# Find the lowest scores for each month
In [ ]:
# Plot the highest score for each month in a bar graph
Pay special attention to this section. If needed, spend some extra time to cover all the relevant techniques.
Never in my experience have I come across a 100% clean data set "in the wild". Most of the data sets you train with will be complete, but real-world data is messy and incomplete.
Even high-quality financial data from exchanges will often have missing data points. The less said about unstructured data like text, the better.
TL;DR: If you're going to fight Mike Tyson, don't train to fight Mr Bean.
In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
'SF':[4,3,8,7,15],
'CHI':[4,np.nan,np.nan,14,6],
'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df
The first thing we can do is drop rows with missing values using the dropna() function. By default rows are dropped, but you can drop columns instead, as shown below.
In [ ]:
df.dropna()
In [ ]:
df.dropna(axis = 0)
In [ ]:
df.dropna(axis = 1)
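dropna() also takes how and thresh arguments for finer control:
In [ ]:
# Drop a row only if ALL of its values are missing
df.dropna(how = 'all')
In [ ]:
# Keep only rows with at least 3 non-missing values
df.dropna(thresh = 3)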
While this can be helpful, if your dataset is small you may be losing a significant portion of your data. If 100 rows out of 1 million have missing data, that's negligible and they can potentially be thrown away. But what if 10 out of 85 rows have incorrect, unusable or missing data?
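Before deciding, it helps to quantify how much data is actually missing. A quick way:
In [ ]:
# Count of missing values per column, and as a fraction of all rows
print(df.isnull().sum())
print(df.isnull().sum() / len(df))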
In [ ]:
df2 = df.copy()
In [ ]:
df2
In [ ]:
df2.mean()
In [ ]:
# Are these really the means though?
df
In [ ]:
mean = df2['SF'].mean()
mean
In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
'SF':[4,3,8,7,15],
'CHI':[4,np.nan,np.nan,14,6],
'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df
In [ ]:
df.mean()
In [ ]:
df.fillna(value = df.mean(), inplace = True)
df
In [ ]:
df = pd.DataFrame({'NYC':[3,np.nan,7,9,6],
'SF':[4,3,8,7,15],
'CHI':[4,np.nan,np.nan,14,6],
'MIA':[3, 9,12,8,9]}, index = ['Mon','Tue','Wed','Thu','Fri'])
df
In [ ]:
df3 = df.copy()
df3
In [ ]:
median = df3['SF'].median()
median
In [ ]:
df3.fillna(value = median, inplace = True)
df3
In [ ]:
df3.mode()
But sometimes the data isn't part of the table at all. Consider the scenario below: we know that the table contains names of female babies, but gender is missing from our dataset.
In [ ]:
baby_names = {
'id': ['101', '102', '103', '104', '105'],
'first_name': ['Emma', 'Madison', 'Hannah', 'Grace', 'Emily']
}
df_baby = pd.DataFrame(baby_names, columns = ['id', 'first_name'])
df_baby
In [ ]:
df_baby.columns
In [ ]:
df_baby["gender"] = "F"
In [ ]:
df_baby
In [ ]:
df_baby['gender'] = 0
In [ ]:
df_baby
In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df
In [ ]:
df["Date"].head()
In [ ]:
df["Date"] = pd.to_datetime(df["Date"])
In [ ]:
df.set_index(["Date"], inplace = True)
In [ ]:
df.head()
In [ ]:
df.tail()
In [ ]:
df.head(3)
In [ ]:
df.describe()
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,5)
In [ ]:
df.plot(kind="line")
In [ ]:
df["Water"].plot(kind="line")
In [ ]:
df.interpolate(method = "linear", inplace = True)
In [ ]:
df.head(5)
In [ ]:
df.interpolate().count()
In [ ]:
df[["Latte", "Water"]].plot(kind="line")
Keep in mind, though, that these are at best approximations.
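Also, "linear" is only one of several interpolation methods. With a datetime index like ours, method = "time" weights the interpolation by the gaps between dates. A quick sketch, re-reading the raw file since our df has already been filled:
In [ ]:
raw = pd.read_csv("data/cafe_sales2015.csv")
raw["Date"] = pd.to_datetime(raw["Date"])
raw.set_index(["Date"], inplace = True)
raw.interpolate(method = "time").head()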
In [ ]:
import vincent
vincent.core.initialize_notebook()
In [ ]:
line = vincent.Line(df)
line.axis_titles(x='Date', y='Amount')
In [ ]:
line = vincent.Line(df[["Latte", "Water"]])
line.axis_titles(x='Date', y='Amount')
In [ ]:
stacked = vincent.StackedArea(df)
stacked.axis_titles(x='Date', y='Amount')
stacked.legend(title='Cafe Sales')
stacked.colors(brew='Spectral')
Read about using the Vincent package here.
The latest update to Matplotlib, v2.0.0, has really improved the quality of the graphics, though the defaults are still not quite presentation-ready; on the positive side, it is stable and has a large community of users. Niche packages like Vincent can produce some amazing graphics right out of the box with minimal tweaking, but they may not be very mature. Nevertheless, as Data Scientists, it's good to learn about new packages, especially those that help you communicate your results to a non-technical audience. If people don't understand what you do, they won't think what you do is important!
In [ ]:
customers = {
'customer_id': ['101', '102', '103', '104', '105'],
'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'],
'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
df_1 = pd.DataFrame(customers, columns = ['customer_id', 'first_name', 'last_name'])
df_1
In [ ]:
orders = {
'customer_id': ['101', '104', '105', '108', '111'],
'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'],
'order_value': ['10000', '25000', '1100', '5000', '4400']}
df_2 = pd.DataFrame(orders, columns = ['customer_id', 'order_date', 'order_value'])
df_2
In [ ]:
pd.merge(df_1, df_2, how = 'inner', on = 'customer_id')
In [ ]:
pd.merge(df_1, df_2, how = 'left', on = 'customer_id')
In [ ]:
pd.merge(df_1, df_2, how = 'right', on = 'customer_id')
In [ ]:
pd.merge(df_1, df_2, how = 'outer', on = 'customer_id')
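When a join surprises you, the indicator flag of merge is handy for seeing where each row came from:
In [ ]:
# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
pd.merge(df_1, df_2, how = 'outer', on = 'customer_id', indicator = True)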
In [ ]:
customers = {
'customer_id': ['101', '102', '103', '104', '105'],
'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'],
'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
customers
In [ ]:
orders = {
'customer_id': ['101', '104', '105', '108', '111'],
'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'],
'order_value': ['10000', '25000', '1100', '5000', '4400']}
orders
In [ ]:
df1_new = pd.DataFrame.from_dict(customers, orient='columns', dtype=None)
In [ ]:
df1_new
In [ ]:
df1_new = df1_new.set_index('customer_id')
df1_new
In [ ]:
df2_new = pd.DataFrame.from_dict(orders, orient='columns', dtype=None)
df2_new
In [ ]:
df2_new = df2_new.set_index('customer_id')
df2_new
In [ ]:
df1_new.join(df2_new,how = "inner")
In [ ]:
df1_new.join(df2_new,how = "outer")
In [ ]:
df1_new.join(df2_new,how = "left")
In [ ]:
df1_new.join(df2_new,how = "right")
In [ ]:
# Alternate Way : I don't recommend this
df_1.join(df_2, on = "customer_id", lsuffix='_l', rsuffix='_r')
In [ ]:
customers = {
'customer_id': ['101', '102', '103', '104', '105'],
'first_name': ['Tony', 'Silvio', 'Paulie', 'Corrado', 'Christopher'],
'last_name': ['Soprano', 'Dante', 'Gualtieri', 'Soprano', 'Moltisanti']}
df_1 = pd.DataFrame(customers, columns = ['customer_id', 'first_name', 'last_name'])
df_1
In [ ]:
orders = {
'customer_id': ['101', '104', '105', '108', '111'],
'order_date': ['2015-01-01', '2015-01-08', '2015-01-19', '2015-02-10', '2015-02-11'],
'order_value': ['10000', '25000', '1100', '5000', '4400']}
df_2 = pd.DataFrame(orders, columns = ['customer_id', 'order_date', 'order_value'])
df_2
In [ ]:
pd.concat([df_1,df_2])
In [ ]:
pd.concat([df_1,df_2],axis=0)
In [ ]:
pd.concat([df_1,df_2],axis=1)
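concat also accepts a keys argument, which adds an outer index level recording which frame each row came from:
In [ ]:
# The result has a MultiIndex: ('customers', 0), ..., ('orders', 4)
pd.concat([df_1, df_2], keys = ['customers', 'orders'])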
One final resource on why you would want to perform these operations in Pandas - and evidence on how fast it really is! http://wesmckinney.com/blog/high-performance-database-joins-with-pandas-dataframe-more-benchmarks/
While analysing data, a Data Scientist very often has to perform aggregations, apply transformations such as standardising data, and filter the dataset down to the relevant samples.
This is what the groupby function is primarily used for.
Read more here.
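Before we dive into a real dataset, here is a minimal sketch of the three flavours on a toy frame: aggregate, transform (standardise within a group) and filter:
In [ ]:
import pandas as pd
toy = pd.DataFrame({'team': ['A', 'A', 'B', 'B'], 'score': [1, 2, 3, 4]})
# Aggregate: one row per group
print(toy.groupby('team').mean())
# Transform: same shape as the input, de-meaned within each team
print(toy.groupby('team')['score'].transform(lambda s: s - s.mean()))
# Filter: keep only the groups whose mean score exceeds 2
print(toy.groupby('team').filter(lambda g: g['score'].mean() > 2))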
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (15,7)
In [ ]:
paintball = {'Team': ['Super Ducks','Super Ducks', 'Super Ducks', 'Super Ducks', 'Super Ducks', 'Bobcats', 'Bobcats', 'Bobcats', 'Bobcats', 'Tigers', 'Tigers', 'Tigers', 'Tigers','Tigers','Tigers'],
'Name': ['Tony', 'Antonio', 'Felipe', 'Ryan', 'Mario', 'Sergio', 'Tanaka', 'Anderson', 'Joe', 'Floyd', 'Manny', 'Chris', 'Junior', 'George','Brock'],
# Kills stored as integers so the numeric aggregations below behave correctly
'Kills': [1, 1, 1, 4, 3, 2, 2, 2, 5, 1, 1, 7, 4, 8, 5],
'Shots Fired Before': [17, 19, 22, 8, 13, 85, 64, 49, 74, 14, 20, 24,13,31,37],
'Shots Fired After': [41, 73, 57, 30, 74, 37, 28, 40, 43, 18, 19, 21,13,32,39]}
df = pd.DataFrame(paintball, columns = ['Team', 'Name', 'Shots Fired Before', 'Shots Fired After','Kills'])
df
In [ ]:
df.groupby('Team').mean()
In [ ]:
byteam = df.groupby('Team')
byteam.count()
In [ ]:
byteam.describe()
In [ ]:
byteam.describe().transpose()['Bobcats']
In [ ]:
Team_Before = df[['Shots Fired Before']].groupby(df['Team']).mean()
Team_After = df[['Shots Fired After']].groupby(df['Team']).mean()
In [ ]:
Team_Before
In [ ]:
Team_After
In [ ]:
Team_Before.join(Team_After)
In [ ]:
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7)
Team_Before.join(Team_After).plot(kind="bar")
Cool graph, but can we improve it, visually speaking? Yes of course we can! Let's look at some of the styles available within Matplotlib.
In [ ]:
plt.style.available
Personally I am quite partial to ggplot and seaborn, but not so much to fivethirtyeight. Let's try these.
In [ ]:
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7)
Team_Before.join(Team_After).plot(kind="bar")
What about fivethirtyeight?
In [ ]:
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (15,7)
Team_Before.join(Team_After).plot(kind="bar")
And seaborn. Note that seaborn is a visualisation library built on top of Matplotlib; here we only mimic its style without actually importing the library.
In [ ]:
plt.style.use('seaborn')
plt.rcParams["figure.figsize"] = (15,7)
Team_Before.join(Team_After).plot(kind="bar")
In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
plt.style.use('seaborn-poster')
plt.rcParams["figure.figsize"] = (15,7)
Team_Before.join(Team_After).plot(kind="bar")
In [ ]:
pd.crosstab(df["Team"], df["Kills"], margins = True)
In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,7)
plt.style.use('seaborn-deep')
df.groupby('Kills').mean().plot(kind="bar")
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()
In [ ]:
df["Date"] = pd.to_datetime(df["Date"])
df.set_index(["Date"], inplace = True)
df.interpolate(method = "linear", inplace = True)
In [ ]:
df.head()
In [ ]:
#print(df.apply(np.cumsum))
df.apply(np.average)
In [ ]:
df.apply(lambda x: x.max() - x.min())
In [ ]:
# What columns have missing values?
df.apply(lambda x: sum(x.isnull()),axis=0)
In [ ]:
# Using Apply to find missing values
# Obviously don't do this for datasets with thousands or millions of rows!
empty = df.apply(lambda col: pd.isnull(col))
empty
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()
In [ ]:
df["Latte"] = df["Latte"].map(lambda x: x+2)
In [ ]:
df.head()
In [ ]:
df.interpolate(method = "linear", inplace = True)
df["Water"] = df["Water"].map(lambda x: x-1 if (x>0) else 0)
In [ ]:
df.head()
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df.head()
In [ ]:
def to_int(x):
    if type(x) is float:
        return int(x)
    else:
        return x
In [ ]:
df.interpolate(method = "linear", inplace = True)
df.applymap(to_int).head()
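The root cause, for the curious: NaN is itself a float, so any integer column that contains a NaN is upcast to float, which is why the interpolated columns come back as floats. A quick demonstration:
In [ ]:
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 3])
print(s.dtype)   # int64
s[1] = np.nan
print(s.dtype)   # float64: the NaN forces the upcast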
Further Reading
Wes McKinney's amazing book covers this issue. Refer to Page 132.
Pivot tables are summarisation tables that help the user sort, count, total or average the data available in a dataset. If you have used Excel, you will be very familiar with them. If not, let's look at it from a fresh Pandas perspective.
Typically, there are four parameters, but you don't always have to specify every one of them, as we will see in the examples below.
index: An array of the dataset that will be used as indices of the new, reshaped and aggregated DataFrame
columns: An array of the dataset that will provide the columns of the new DataFrame
values: The values we wish to aggregate in each cell
aggfunc: The function we will use to perform the aggregation
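To see all four parameters in a single call, here's a sketch on a tiny made-up frame:
In [ ]:
import pandas as pd
toy = pd.DataFrame({'shop': ['North', 'North', 'South', 'South'],
                    'item': ['Tea', 'Coffee', 'Tea', 'Coffee'],
                    'qty': [3, 5, 2, 7]})
pd.pivot_table(toy, index = ['shop'], columns = ['item'], values = ['qty'], aggfunc = 'sum')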
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
# pandas needs the 'xlrd' module to read Excel files; if it's missing, install it with 'pip install xlrd'
df = pd.read_excel("Data/bev-sales.xlsx")
df.head()
In [ ]:
df.tail()
In [ ]:
df.describe()
In [ ]:
help(pd.pivot_table)
In [ ]:
df.head()
In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc="sum")
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
pd.pivot_table(df, index=["Sales Exec"],values=["Revenue"],aggfunc="sum").plot(kind="bar")
In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc="mean")
In [ ]:
pd.pivot_table(df, index=["Sales Exec", "Item"], values=["Revenue"], aggfunc="sum")
In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Revenue"],aggfunc=[np.sum])
In [ ]:
pd.pivot_table(df,index=["Sales Exec"],values=["Units sold", "Revenue"],aggfunc=[np.sum])
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams["figure.figsize"] = (15,7)
pd.pivot_table(df,index=["Sales Exec", "Item"],values=["Revenue"],aggfunc=[np.sum]).plot(kind="bar")
plt.title('January Sales Report')
In [ ]:
pd.pivot_table(df,index=["Sales Exec", "Item"],values=["Units sold", "Revenue"],
columns=["Price per Unit"], aggfunc="sum", margins = True)
In [ ]:
df = pd.read_csv("Data/tips.csv")
df.head()
In [ ]:
df["tip_pc"] = df["tip"] / df["total_bill"]
In [ ]:
df.head()
In [ ]:
pd.pivot_table(df,index=["sex"], values = ["tip_pc"], aggfunc="mean")
In [ ]:
pd.pivot_table(df, index = ["smoker", "sex"], values = ["tip_pc"], aggfunc = "mean")
In [ ]:
pd.pivot_table(df,index=["sex"], values = ["total_bill","tip"], aggfunc="sum")
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
df = pd.read_excel("Data/Sopranos/sopranos-killings.xlsx")
df.head()
In [ ]:
pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="first")
In [ ]:
pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="count", margins=True)
In [ ]:
whacked = pd.pivot_table(df,index=["Cause of Death"],values = ["Season"], aggfunc="count")
whacked
In [ ]:
plt.style.available
In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.style.use('seaborn-deep')
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,7)
whacked.plot(kind = "bar", legend=None)
plt.title('How People Died on The Sopranos')
In [ ]:
# Apply a style only within this block, resetting the defaults afterwards
with plt.style.context('ggplot', after_reset=True):
    plt.rcParams["figure.figsize"] = (15,7)
    whacked.plot(kind = "bar", legend=None)
    plt.title('How People Died on The Sopranos')
In [ ]:
killer = pd.pivot_table(df,index=["Killer"],values = ["Season"], aggfunc="count")
In [ ]:
killer = killer.sort_values(by=["Season"], ascending = False)
killer
In [ ]:
plt.rcParams.update(plt.rcParamsDefault)
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,7)
killer[:10].plot(kind = "bar", legend=None)
plt.title('Top 10 Killers')
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
df = pd.read_csv("data/cafe_sales2015.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.set_index(["Date"], inplace = True)
df.interpolate(method = "linear", inplace = True)
In [ ]:
df.head()
In [ ]:
df.tail()
In [ ]:
df.describe()
print("Mean\n", df.mean())
print("\n\nMedian\n", df.median())
print("\n\nMode\n", df.mode())
In [ ]:
print("The Maximum value is:\n",df.max())
print("\n\nThe Minimum value is:\n",df.min())
print("\n\nKurtosis:\n",df.kurtosis())