Reading data from Excel

Let's get some data. Download Sample Superstore Sales .xls file or my local copy and open it in Excel to see what it looks like.

Data of interest that we want to process in Python often comes in the form of an Excel spreadsheet, but the data is in a special format that we can't read directly:



In [1]:

    
with open('data/SampleSuperstoreSales.xls', "rb") as f:
    txt = f.read()
    print(txt[0:100])









    



b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x13\x17\x00\x00\x01\x00\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00b\x00\x00\x00\xe3\x00\x00\x00d\x01\x00\x00\xe5\x01\x00\x00f\x02\x00\x00'

Converting Excel files with csvkit

There's a really useful tool kit called csvkit, which you can install with:

pip install csvkit

Unfortunately, at the moment, there is some kind of a weird bug at the moment, unrelated to csvkit, so we get lots of warnings even though it works.

/Users/parrt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__

So, the following command works without having to run or even own Excel on your laptop, but you get lots of warnings:

$ in2csv data/SampleSuperstoreSales.xls > /tmp/t.csv

Reading Excel files with Pandas

The easiest way to read Excel files with Python is to use Pandas:



In [9]:

    
import pandas
table = pandas.read_excel("data/SampleSuperstoreSales.xls")
table.head()









    Out[9]:







  
    
      
      Row ID
      Order ID
      Order Date
      Order Priority
      Order Quantity
      Sales
      Discount
      Ship Mode
      Profit
      Unit Price
      ...
      Customer Name
      Province
      Region
      Customer Segment
      Product Category
      Product Sub-Category
      Product Name
      Product Container
      Product Base Margin
      Ship Date
    
  
  
    
      0
      1
      3
      2010-10-13
      Low
      6
      261.5400
      0.04
      Regular Air
      -213.2500
      38.94
      ...
      Muhammed MacIntyre
      Nunavut
      Nunavut
      Small Business
      Office Supplies
      Storage & Organization
      Eldon Base for stackable storage shelf, platinum
      Large Box
      0.80
      2010-10-20
    
    
      1
      49
      293
      2012-10-01
      High
      49
      10123.0200
      0.07
      Delivery Truck
      457.8100
      208.16
      ...
      Barry French
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Appliances
      1.7 Cubic Foot Compact "Cube" Office Refrigera...
      Jumbo Drum
      0.58
      2012-10-02
    
    
      2
      50
      293
      2012-10-01
      High
      27
      244.5700
      0.01
      Regular Air
      46.7075
      8.69
      ...
      Barry French
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Binders and Binder Accessories
      Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl
      Small Box
      0.39
      2012-10-03
    
    
      3
      80
      483
      2011-07-10
      High
      30
      4965.7595
      0.08
      Regular Air
      1198.9710
      195.99
      ...
      Clay Rozendal
      Nunavut
      Nunavut
      Corporate
      Technology
      Telephones and Communication
      R380
      Small Box
      0.58
      2011-07-12
    
    
      4
      85
      515
      2010-08-28
      Not Specified
      19
      394.2700
      0.08
      Regular Air
      30.9400
      21.78
      ...
      Carlos Soltero
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Appliances
      Holmes HEPA Air Purifier
      Medium Box
      0.50
      2010-08-30
    
  

5 rows × 21 columns

CSV data

Grab the CSV version of the Excel file SampleSuperstoreSales.csv we've been playing with.

Dealing with commas double quotes in CSV

For the most part, CSV files are very simple, but they can get complicated when we need to embed a comma. One such case from the above file shows how fields with commas get quoted:

"Eldon Base for stackable storage shelf, platinum"

What happens when we want to encode a quote? Well, somehow people decided that "" double quotes was the answer (not!) and we get fields encoded like this:

"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators"

The good news is that Python's csv package knows how to read Excel-generated files that use such encoding. Here's a sample script that reads such a file into a list of lists:



In [10]:

    
import sys
import csv

table_file = "data/SampleSuperstoreSales.csv"
with open(table_file, "r") as csvfile:
    f = csv.reader(csvfile, dialect='excel')
    data = []
    for row in f:
        data.append(row)

print(data[:6])









    



[['Row ID', 'Order ID', 'Order Date', 'Order Priority', 'Order Quantity', 'Sales', 'Discount', 'Ship Mode', 'Profit', 'Unit Price', 'Shipping Cost', 'Customer Name', 'Province', 'Region', 'Customer Segment', 'Product Category', 'Product Sub-Category', 'Product Name', 'Product Container', 'Product Base Margin', 'Ship Date'], ['1.0', '3.0', '2010-10-13', 'Low', '6.0', '261.54', '0.04', 'Regular Air', '-213.25', '38.94', '35.0', 'Muhammed MacIntyre', 'Nunavut', 'Nunavut', 'Small Business', 'Office Supplies', 'Storage & Organization', 'Eldon Base for stackable storage shelf, platinum', 'Large Box', '0.8', '2010-10-20'], ['49.0', '293.0', '2012-10-01', 'High', '49.0', '10123.02', '0.07', 'Delivery Truck', '457.81', '208.16', '68.02', 'Barry French', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Appliances', '1.7 Cubic Foot Compact "Cube" Office Refrigerators', 'Jumbo Drum', '0.58', '2012-10-02'], ['50.0', '293.0', '2012-10-01', 'High', '27.0', '244.57', '0.01', 'Regular Air', '46.7075', '8.69', '2.99', 'Barry French', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Binders and Binder Accessories', 'Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl', 'Small Box', '0.39', '2012-10-03'], ['80.0', '483.0', '2011-07-10', 'High', '30.0', '4965.759499999999', '0.08', 'Regular Air', '1198.971', '195.99', '3.99', 'Clay Rozendal', 'Nunavut', 'Nunavut', 'Corporate', 'Technology', 'Telephones and Communication', 'R380', 'Small Box', '0.58', '2011-07-12'], ['85.0', '515.0', '2010-08-28', 'Not Specified', '19.0', '394.27', '0.08', 'Regular Air', '30.94', '21.78', '5.94', 'Carlos Soltero', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Appliances', 'Holmes HEPA Air Purifier', 'Medium Box', '0.5', '2010-08-30']]

Or add to a numpy array:



In [11]:

    
import numpy as np
np.array(data)









    Out[11]:





array([['Row ID', 'Order ID', 'Order Date', ..., 'Product Container',
        'Product Base Margin', 'Ship Date'],
       ['1.0', '3.0', '2010-10-13', ..., 'Large Box', '0.8',
        '2010-10-20'],
       ['49.0', '293.0', '2012-10-01', ..., 'Jumbo Drum', '0.58',
        '2012-10-02'],
       ...,
       ['7906.0', '56550.0', '2011-04-08', ..., 'Small Pack', '0.41',
        '2011-04-10'],
       ['7907.0', '56550.0', '2011-04-08', ..., 'Small Box', '0.56',
        '2011-04-09'],
       ['7914.0', '56581.0', '2009-02-08', ..., 'Medium Box', '0.65',
        '2009-02-11']], dtype='<U98')

Reading CSV into Pandas Data frames

In the end, the easiest way to deal with loading CSV files is probably with Pandas. For example, to load our sales CSV, we don't even have to manually open and close a file:



In [5]:

    
import pandas
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
df.head()









    Out[5]:







  
    
      
      Row ID
      Order ID
      Order Date
      Order Priority
      Order Quantity
      Sales
      Discount
      Ship Mode
      Profit
      Unit Price
      ...
      Customer Name
      Province
      Region
      Customer Segment
      Product Category
      Product Sub-Category
      Product Name
      Product Container
      Product Base Margin
      Ship Date
    
  
  
    
      0
      1.0
      3.0
      2010-10-13
      Low
      6.0
      261.5400
      0.04
      Regular Air
      -213.2500
      38.94
      ...
      Muhammed MacIntyre
      Nunavut
      Nunavut
      Small Business
      Office Supplies
      Storage & Organization
      Eldon Base for stackable storage shelf, platinum
      Large Box
      0.80
      2010-10-20
    
    
      1
      49.0
      293.0
      2012-10-01
      High
      49.0
      10123.0200
      0.07
      Delivery Truck
      457.8100
      208.16
      ...
      Barry French
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Appliances
      1.7 Cubic Foot Compact "Cube" Office Refrigera...
      Jumbo Drum
      0.58
      2012-10-02
    
    
      2
      50.0
      293.0
      2012-10-01
      High
      27.0
      244.5700
      0.01
      Regular Air
      46.7075
      8.69
      ...
      Barry French
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Binders and Binder Accessories
      Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl
      Small Box
      0.39
      2012-10-03
    
    
      3
      80.0
      483.0
      2011-07-10
      High
      30.0
      4965.7595
      0.08
      Regular Air
      1198.9710
      195.99
      ...
      Clay Rozendal
      Nunavut
      Nunavut
      Corporate
      Technology
      Telephones and Communication
      R380
      Small Box
      0.58
      2011-07-12
    
    
      4
      85.0
      515.0
      2010-08-28
      Not Specified
      19.0
      394.2700
      0.08
      Regular Air
      30.9400
      21.78
      ...
      Carlos Soltero
      Nunavut
      Nunavut
      Consumer
      Office Supplies
      Appliances
      Holmes HEPA Air Purifier
      Medium Box
      0.50
      2010-08-30
    
  

5 rows × 21 columns

Pandas hides all of the details. I also find that pulling out columns is nice with pandas. Here's how to print the customer name column:



In [6]:

    
df['Customer Name'].head()









    Out[6]:





0    Muhammed MacIntyre
1          Barry French
2          Barry French
3         Clay Rozendal
4        Carlos Soltero
Name: Customer Name, dtype: object



In [7]:

    
df.Profit.head()









    Out[7]:





0    -213.2500
1     457.8100
2      46.7075
3    1198.9710
4      30.9400
Name: Profit, dtype: float64

You can learn more about slicing and dicing data from our Boot Camp notes.

Exercise

Read the AAPL.csv file into a data frame using Pandas.

Exercise

From the sales CSV file, use pandas to read in the data and multiple the Order Quantity and Unit Price columns to get a new column.



In [8]:

    
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
(df['Order Quantity']*df['Unit Price']).head()









    Out[8]:





0      233.64
1    10199.84
2      234.63
3     5879.70
4      413.82
dtype: float64

	Row ID	Order ID	Order Date	Order Priority	Order Quantity	Sales	Discount	Ship Mode	Profit	Unit Price	...	Customer Name	Province	Region	Customer Segment	Product Category	Product Sub-Category	Product Name	Product Container	Product Base Margin	Ship Date
0	1	3	2010-10-13	Low	6	261.5400	0.04	Regular Air	-213.2500	38.94	...	Muhammed MacIntyre	Nunavut	Nunavut	Small Business	Office Supplies	Storage & Organization	Eldon Base for stackable storage shelf, platinum	Large Box	0.80	2010-10-20
1	49	293	2012-10-01	High	49	10123.0200	0.07	Delivery Truck	457.8100	208.16	...	Barry French	Nunavut	Nunavut	Consumer	Office Supplies	Appliances	1.7 Cubic Foot Compact "Cube" Office Refrigera...	Jumbo Drum	0.58	2012-10-02
2	50	293	2012-10-01	High	27	244.5700	0.01	Regular Air	46.7075	8.69	...	Barry French	Nunavut	Nunavut	Consumer	Office Supplies	Binders and Binder Accessories	Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl	Small Box	0.39	2012-10-03
3	80	483	2011-07-10	High	30	4965.7595	0.08	Regular Air	1198.9710	195.99	...	Clay Rozendal	Nunavut	Nunavut	Corporate	Technology	Telephones and Communication	R380	Small Box	0.58	2011-07-12
4	85	515	2010-08-28	Not Specified	19	394.2700	0.08	Regular Air	30.9400	21.78	...	Carlos Soltero	Nunavut	Nunavut	Consumer	Office Supplies	Appliances	Holmes HEPA Air Purifier	Medium Box	0.50	2010-08-30