Introduction

One of the most common tasks for pandas and Python is to automate the process of aggregating data from multiple files and spreadsheets.

This article will walk through the basic flow required to parse multiple Excel files, combine some data, clean it up and analyze it.

Please refer to the original post for the full notebook.

Collecting the Data

Import pandas and numpy


In [1]:
import pandas as pd
import numpy as np

Let's take a look at the files in our input directory, using the convenient shell commands in IPython.


In [2]:
!ls ../data


All-Web-Site-Data-Audience-Overview.xlsx  sales-jan-2014.xlsx
customer-status.xlsx			  sales-mar-2014.xlsx
March-2017-forecast-article.xlsx	  sales_transactions.xlsx
mn-budget-detail-2014.csv		  sample-sales-reps.xlsx
sales-estimate.xlsx			  sample-sales-tax.csv
sales-feb-2014.xlsx			  sample-salesv3.xlsx
salesfunnel.xlsx

There are a lot of files, but we only want to look at the sales .xlsx files.


In [3]:
!ls ../data/sales-*-2014.xlsx


../data/sales-feb-2014.xlsx  ../data/sales-mar-2014.xlsx
../data/sales-jan-2014.xlsx

Use the Python glob module to easily list out the files we need.


In [4]:
import glob

In [5]:
glob.glob("../data/sales-*-2014.xlsx")


Out[5]:
['../data/sales-feb-2014.xlsx',
 '../data/sales-jan-2014.xlsx',
 '../data/sales-mar-2014.xlsx']

This gives us what we need. Let's import each of our files and combine them into one DataFrame.

pandas' concat and append can do this for us. I'm going to use append in this example.

The code snippet below will initialize an empty DataFrame, then append each of the individual files into the all_data DataFrame.


In [6]:
all_data = pd.DataFrame()
for f in glob.glob("../data/sales-*-2014.xlsx"):
    # read each monthly sales file and append it to the combined DataFrame
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
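
Note that DataFrame.append was removed in pandas 2.0, so on current versions the same combination is done with pd.concat. A minimal sketch:

# read each file into a DataFrame and concatenate them all at once
all_data = pd.concat(
    (pd.read_excel(f) for f in glob.glob("../data/sales-*-2014.xlsx")),
    ignore_index=True)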

Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.


In [7]:
all_data.describe()


Out[7]:
       account number    quantity  unit price    ext price
count      384.000000  384.000000  384.000000   384.000000
mean    478125.989583   24.372396   56.651406  1394.517344
std     220902.947401   14.373219   27.075883  1117.809743
min     141962.000000   -1.000000   10.210000   -97.160000
25%     257198.000000   12.000000   32.612500   482.745000
50%     424914.000000   23.500000   58.160000  1098.710000
75%     714466.000000   37.000000   80.965000  2132.260000
max     786968.000000   49.000000   99.730000  4590.810000

A lot of these statistics may not make much sense for this data set, but I'm most interested in the count row to make sure the number of data elements makes sense.


In [8]:
all_data.head()


Out[8]:
account number name sku quantity unit price ext price date
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20

It is not critical in this example, but the best practice is to convert the date column to a datetime object.


In [9]:
all_data['date'] = pd.to_datetime(all_data['date'])
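
If any of the source files contain malformed dates, to_datetime will raise an error. Passing errors='coerce' converts unparseable values to NaT instead; a sketch:

all_data['date'] = pd.to_datetime(all_data['date'], errors='coerce')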

Combining Data

Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.

First, we read in the data.


In [10]:
status = pd.read_excel("../data/customer-status.xlsx")
status


Out[10]:
account number name status
0 740150 Barton LLC gold
1 714466 Trantow-Barrows silver
2 218895 Kulas Inc bronze
3 307599 Kassulke, Ondricka and Metz bronze
4 412290 Jerde-Hilpert bronze
5 729833 Koepp Ltd silver
6 146832 Kiehn-Spinka silver
7 688981 Keeling LLC silver
8 786968 Frami, Hills and Schmidt silver
9 239344 Stokes LLC gold
10 672390 Kuhn-Gusikowski silver
11 141962 Herman LLC gold
12 424914 White-Trantow silver
13 527099 Sanford and Sons bronze
14 642753 Pollich LLC bronze
15 257198 Cronin, Oberbrunner and Spencer gold

We want to merge this data with our concatenated data set of sales. We use pandas' merge function and tell it to do a left join, which is similar to Excel's vlookup function.


In [11]:
all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()


Out[11]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 NaN
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
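
By default, merge joins on every column the two DataFrames have in common (here, both account number and name). You can make the key explicit with the on parameter; a sketch, assuming we want to match on the account number alone:

all_data_st = pd.merge(all_data, status, how='left',
                       on='account number', suffixes=('', '_status'))
# the shared 'name' column now comes through twice: 'name' from the
# sales data and 'name_status' from the status file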

This looks pretty good but let's look at a specific account.


In [12]:
all_data_st[all_data_st["account number"]==737550].head()


Out[12]:
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 NaN
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 NaN
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 NaN
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 NaN
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 NaN

This account number was not in our status file, so we have a bunch of NaNs. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.


In [13]:
all_data_st['status'].fillna('bronze',inplace=True)
all_data_st.head()


Out[13]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
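
As an aside, calling fillna with inplace=True on a single column triggers warnings on newer pandas versions; the plain assignment form is equivalent:

all_data_st['status'] = all_data_st['status'].fillna('bronze')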

Check the data just to make sure we're all good.


In [14]:
all_data_st[all_data_st["account number"]==737550].head()


Out[14]:
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 bronze
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 bronze
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 bronze
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 bronze
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 bronze

Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability.

Using Categories

One of the relatively new capabilities in pandas is support for categorical data. From the pandas documentation:

"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales."

For our purposes, the status field is a good candidate for a category type.

You must make sure you have a recent version of pandas installed for this example to work.


In [15]:
pd.__version__


Out[15]:
'0.20.2'

First, we typecast it to a category using astype.


In [16]:
all_data_st["status"] = all_data_st["status"].astype("category")

This doesn't appear to change anything yet.


In [17]:
all_data_st.head()


Out[17]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver

But you can see that it is a new data type.


In [18]:
all_data_st.dtypes


Out[18]:
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
status                  category
dtype: object

Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically.


In [19]:
all_data_st.sort_values(by=["status"]).head()


Out[19]:
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
196 218895 Kulas Inc S2-83881 41 78.27 3209.07 2014-01-20 09:37:58 bronze
197 383080 Will LLC B1-33364 26 90.19 2344.94 2014-01-20 09:39:59 bronze
198 604255 Halvorson, Crona and Champlin S2-11481 37 96.71 3578.27 2014-01-20 13:07:28 bronze
200 527099 Sanford and Sons B1-05914 18 64.32 1157.76 2014-01-20 21:40:58 bronze

We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.


In [20]:
all_data_st["status"].cat.set_categories([ "gold","silver","bronze"],inplace=True)

Now, we can sort it so that gold shows on top.


In [21]:
all_data_st.sort_values(by=["status"]).head()


Out[21]:
account number name sku quantity unit price ext price date status
68 740150 Barton LLC B1-38851 17 81.22 1380.74 2014-02-17 17:12:16 gold
63 257198 Cronin, Oberbrunner and Spencer S1-27722 28 10.21 285.88 2014-02-15 17:27:44 gold
207 740150 Barton LLC B1-86481 20 30.41 608.20 2014-01-22 16:33:51 gold
61 740150 Barton LLC B1-20000 28 81.39 2278.92 2014-02-15 07:45:16 gold
60 239344 Stokes LLC S2-83881 30 43.00 1290.00 2014-02-15 02:13:23 gold

In [22]:
all_data_st["status"].describe()


Out[22]:
count        384
unique         3
top       bronze
freq         172
Name: status, dtype: object
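
describe on a categorical column only reports the most frequent value. To see the full distribution, use value_counts:

all_data_st["status"].value_counts()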

For instance, you can take a quick look at how your top tier customers are performing compared to the bottom. Use groupby to get the average of the values.


In [23]:
all_data_st.groupby(["status"])["quantity","unit price","ext price"].mean()


Out[23]:
         quantity  unit price    ext price
status
gold    24.375000   53.723889  1351.944583
silver  22.842857   57.272714  1320.032214
bronze  25.616279   57.371163  1472.965930

Of course, you can run multiple aggregation functions on the data to get really useful information.


In [24]:
all_data_st.groupby(["status"])["quantity","unit price","ext price"].agg([np.sum,np.mean, np.std])


Out[24]:
       quantity                        unit price                        ext price
            sum       mean        std        sum       mean        std         sum         mean          std
status
gold       1755  24.375000  14.575145    3868.12  53.723889  28.740080    97340.01  1351.944583  1182.657312
silver     3198  22.842857  14.512843    8018.18  57.272714  26.556242   184804.51  1320.032214  1086.384051
bronze     4406  25.616279  14.136071    9867.84  57.371163  26.857370   253350.14  1472.965930  1116.683843
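
On newer pandas versions, the column selection after a groupby must be a list (note the double brackets), and the aggregation functions can be passed as strings; an equivalent sketch:

all_data_st.groupby("status")[["quantity", "unit price", "ext price"]].agg(["sum", "mean", "std"])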

So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.

Maybe we should look at how many bronze customers we have and see what is going on.

What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.

I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean.


In [25]:
all_data_st.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1,7]].groupby(["status"])["name"].count()


Out[25]:
status
gold      4
silver    7
bronze    9
Name: name, dtype: int64
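
A more readable equivalent, as a sketch, is to drop the duplicate accounts and run value_counts directly on the status column:

all_data_st.drop_duplicates(subset=["account number", "name"])["status"].value_counts()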

OK. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers.

