Introduction

One of the most common tasks for pandas and Python is to automate the process of aggregating data from multiple files and spreadsheets.

This article will walk through the basic flow required to parse multiple Excel files, combine some data, clean it up and analyze it.

Please refer to the original post for the full notebook.

Collecting the Data

Import pandas and numpy


In [1]:
import pandas as pd
import numpy as np

Let's take a look at the files in our input directory, using the convenient shell commands in IPython.


In [2]:
!ls ../data


All-Web-Site-Data-Audience-Overview.xlsx  sales-jan-2014.xlsx
customer-status.xlsx			  sales-mar-2014.xlsx
March-2017-forecast-article.xlsx	  sales_transactions.xlsx
mn-budget-detail-2014.csv		  sample-sales-reps.xlsx
sales-estimate.xlsx			  sample-sales-tax.csv
sales-feb-2014.xlsx			  sample-salesv3.xlsx
salesfunnel.xlsx

There are a lot of files, but we only want to look at the sales .xlsx files.


In [3]:
!ls ../data/sales-*-2014.xlsx


../data/sales-feb-2014.xlsx  ../data/sales-mar-2014.xlsx
../data/sales-jan-2014.xlsx

Use the Python glob module to easily list out the files we need.


In [4]:
import glob

In [5]:
glob.glob("../data/sales-*-2014.xlsx")


Out[5]:
['../data/sales-feb-2014.xlsx',
 '../data/sales-jan-2014.xlsx',
 '../data/sales-mar-2014.xlsx']

This gives us what we need. Let's import each of our files and combine them into one DataFrame.

pandas' concat and append can do this for us. I'm going to use append in this example.

The code snippet below will initialize an empty DataFrame, then append each of the individual files into the all_data DataFrame.


In [6]:
all_data = pd.DataFrame()
for f in glob.glob("../data/sales-*-2014.xlsx"):
    # read each monthly sales file and append it to the combined DataFrame
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
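
Note that DataFrame.append was removed in pandas 2.0, so on current versions the same combination is done with pd.concat. A minimal sketch:

# read each file into a DataFrame and concatenate them all at once
all_data = pd.concat(
    (pd.read_excel(f) for f in glob.glob("../data/sales-*-2014.xlsx")),
    ignore_index=True)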

Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.


In [7]:
all_data.describe()


Out[7]:
       account number    quantity  unit price    ext price
count      384.000000  384.000000  384.000000   384.000000
mean    478125.989583   24.372396   56.651406  1394.517344
std     220902.947401   14.373219   27.075883  1117.809743
min     141962.000000   -1.000000   10.210000   -97.160000
25%     257198.000000   12.000000   32.612500   482.745000
50%     424914.000000   23.500000   58.160000  1098.710000
75%     714466.000000   37.000000   80.965000  2132.260000
max     786968.000000   49.000000   99.730000  4590.810000

A lot of these statistics may not make much sense for this data set, but I'm most interested in the count row to make sure the number of data elements makes sense.


In [8]:
all_data.head()


Out[8]:
account number name sku quantity unit price ext price date
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20

It is not critical in this example, but the best practice is to convert the date column to a datetime object.


In [9]:
all_data['date'] = pd.to_datetime(all_data['date'])
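
If any of the source files contain malformed dates, to_datetime will raise an error. Passing errors='coerce' converts unparseable values to NaT instead; a sketch:

all_data['date'] = pd.to_datetime(all_data['date'], errors='coerce')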

Combining Data

Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.

First, we read in the data.


In [10]:
status = pd.read_excel("../data/customer-status.xlsx")
status


Out[10]:
account number name status
0 740150 Barton LLC gold
1 714466 Trantow-Barrows silver
2 218895 Kulas Inc bronze
3 307599 Kassulke, Ondricka and Metz bronze
4 412290 Jerde-Hilpert bronze
5 729833 Koepp Ltd silver
6 146832 Kiehn-Spinka silver
7 688981 Keeling LLC silver
8 786968 Frami, Hills and Schmidt silver
9 239344 Stokes LLC gold
10 672390 Kuhn-Gusikowski silver
11 141962 Herman LLC gold
12 424914 White-Trantow silver
13 527099 Sanford and Sons bronze
14 642753 Pollich LLC bronze
15 257198 Cronin, Oberbrunner and Spencer gold

We want to merge this data with our concatenated data set of sales. We use pandas' merge function and tell it to do a left join, which is similar to Excel's vlookup function.


In [11]:
all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()


Out[11]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 NaN
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
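
By default, merge joins on every column the two DataFrames have in common (here, both account number and name). You can make the key explicit with the on parameter; a sketch, assuming we want to match on the account number alone:

all_data_st = pd.merge(all_data, status, how='left',
                       on='account number', suffixes=('', '_status'))
# the shared 'name' column now comes through twice: 'name' from the
# sales data and 'name_status' from the status file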

This looks pretty good but let's look at a specific account.


In [12]:
all_data_st[all_data_st["account number"]==737550].head()


Out[12]:
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 NaN
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 NaN
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 NaN
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 NaN
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 NaN

This account number was not in our status file, so we have a bunch of NaNs. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.


In [13]:
all_data_st['status'].fillna('bronze',inplace=True)
all_data_st.head()


Out[13]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
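
As an aside, calling fillna with inplace=True on a single column triggers warnings on newer pandas versions; the plain assignment form is equivalent:

all_data_st['status'] = all_data_st['status'].fillna('bronze')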

Check the data just to make sure we're all good.


In [14]:
all_data_st[all_data_st["account number"]==737550].head()


Out[14]:
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 bronze
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 bronze
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 bronze
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 bronze
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 bronze

Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability.

Using Categories

One of the relatively new capabilities in pandas is support for categorical data. From the pandas documentation:

"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales."

For our purposes, the status field is a good candidate for a category type.

You must make sure you have a recent version of pandas installed for this example to work.


In [15]:
pd.__version__


Out[15]:
'0.20.2'

First, we typecast it to a category using astype.


In [16]:
all_data_st["status"] = all_data_st["status"].astype("category")

This doesn't appear to change anything yet.


In [17]:
all_data_st.head()


Out[17]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver

But you can see that it is a new data type.


In [18]:
all_data_st.dtypes


Out[18]:
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
status                  category
dtype: object

Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically.


In [19]:
all_data_st.sort_values(by=["status"]).head()


Out[19]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
196 218895 Kulas Inc S2-83881 41 78.27 3209.07 2014-01-20 09:37:58 bronze
197 383080 Will LLC B1-33364 26 90.19 2344.94 2014-01-20 09:39:59 bronze
198 604255 Halvorson, Crona and Champlin S2-11481 37 96.71 3578.27 2014-01-20 13:07:28 bronze
200 527099 Sanford and Sons B1-05914 18 64.32 1157.76 2014-01-20 21:40:58 bronze

We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.


In [20]:
all_data_st["status"].cat.set_categories([ "gold","silver","bronze"],inplace=True)

Now, we can sort it so that gold shows on top.


In [21]:
all_data_st.sort_values(by=["status"]).head()


Out[21]:
account number name sku quantity unit price ext price date status
68 740150 Barton LLC B1-38851 17 81.22 1380.74 2014-02-17 17:12:16 gold
63 257198 Cronin, Oberbrunner and Spencer S1-27722 28 10.21 285.88 2014-02-15 17:27:44 gold
207 740150 Barton LLC B1-86481 20 30.41 608.20 2014-01-22 16:33:51 gold
61 740150 Barton LLC B1-20000 28 81.39 2278.92 2014-02-15 07:45:16 gold
60 239344 Stokes LLC S2-83881 30 43.00 1290.00 2014-02-15 02:13:23 gold

In [22]:
all_data_st["status"].describe()


Out[22]:
count        384
unique         3
top       bronze
freq         172
Name: status, dtype: object
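
describe on a categorical column only reports the most frequent value. To see the full distribution, use value_counts:

all_data_st["status"].value_counts()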

For instance, you can take a quick look at how your top tier customers are performing compared to the bottom. Use groupby to get the average of the values.


In [23]:
all_data_st.groupby(["status"])["quantity","unit price","ext price"].mean()


Out[23]:
         quantity  unit price    ext price
status
gold    24.375000   53.723889  1351.944583
silver  22.842857   57.272714  1320.032214
bronze  25.616279   57.371163  1472.965930

Of course, you can run multiple aggregation functions on the data to get really useful information.


In [24]:
all_data_st.groupby(["status"])["quantity","unit price","ext price"].agg([np.sum,np.mean, np.std])


Out[24]:
       quantity                        unit price                        ext price
            sum       mean        std        sum       mean        std         sum         mean          std
status
gold       1755  24.375000  14.575145    3868.12  53.723889  28.740080    97340.01  1351.944583  1182.657312
silver     3198  22.842857  14.512843    8018.18  57.272714  26.556242   184804.51  1320.032214  1086.384051
bronze     4406  25.616279  14.136071    9867.84  57.371163  26.857370   253350.14  1472.965930  1116.683843
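
On newer pandas versions, the column selection after a groupby must be a list (note the double brackets), and the aggregation functions can be passed as strings; an equivalent sketch:

all_data_st.groupby("status")[["quantity", "unit price", "ext price"]].agg(["sum", "mean", "std"])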

So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.

Maybe we should look at how many bronze customers we have and see what is going on.

What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.

I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean.


In [25]:
all_data_st.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1,7]].groupby(["status"])["name"].count()


Out[25]:
status
gold      4
silver    7
bronze    9
Name: name, dtype: int64
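
A more readable equivalent, as a sketch, is to drop the duplicate accounts and run value_counts directly on the status column:

all_data_st.drop_duplicates(subset=["account number", "name"])["status"].value_counts()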

OK. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers.

