Reading data from Excel

Let's get some data. Download Sample Superstore Sales .xls file or my local copy and open it in Excel to see what it looks like.

Data of interest that we want to process in Python often comes in the form of an Excel spreadsheet, but the data is in a special format that we can't read directly:


In [1]:
with open('data/SampleSuperstoreSales.xls', "rb") as f:
    txt = f.read()
    print(txt[0:100])


b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x13\x17\x00\x00\x01\x00\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00b\x00\x00\x00\xe3\x00\x00\x00d\x01\x00\x00\xe5\x01\x00\x00f\x02\x00\x00'

Converting Excel files with csvkit

There's a really useful tool kit called csvkit, which you can install with:

pip install csvkit

Unfortunately, at the moment, there is some kind of a weird bug at the moment, unrelated to csvkit, so we get lots of warnings even though it works.

/Users/parrt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__

So, the following command works without having to run or even own Excel on your laptop, but you get lots of warnings:

$ in2csv data/SampleSuperstoreSales.xls > /tmp/t.csv

Reading Excel files with Pandas

The easiest way to read Excel files with Python is to use Pandas:


In [9]:
import pandas
table = pandas.read_excel("data/SampleSuperstoreSales.xls")
table.head()


Out[9]:
Row ID Order ID Order Date Order Priority Order Quantity Sales Discount Ship Mode Profit Unit Price ... Customer Name Province Region Customer Segment Product Category Product Sub-Category Product Name Product Container Product Base Margin Ship Date
0 1 3 2010-10-13 Low 6 261.5400 0.04 Regular Air -213.2500 38.94 ... Muhammed MacIntyre Nunavut Nunavut Small Business Office Supplies Storage & Organization Eldon Base for stackable storage shelf, platinum Large Box 0.80 2010-10-20
1 49 293 2012-10-01 High 49 10123.0200 0.07 Delivery Truck 457.8100 208.16 ... Barry French Nunavut Nunavut Consumer Office Supplies Appliances 1.7 Cubic Foot Compact "Cube" Office Refrigera... Jumbo Drum 0.58 2012-10-02
2 50 293 2012-10-01 High 27 244.5700 0.01 Regular Air 46.7075 8.69 ... Barry French Nunavut Nunavut Consumer Office Supplies Binders and Binder Accessories Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl Small Box 0.39 2012-10-03
3 80 483 2011-07-10 High 30 4965.7595 0.08 Regular Air 1198.9710 195.99 ... Clay Rozendal Nunavut Nunavut Corporate Technology Telephones and Communication R380 Small Box 0.58 2011-07-12
4 85 515 2010-08-28 Not Specified 19 394.2700 0.08 Regular Air 30.9400 21.78 ... Carlos Soltero Nunavut Nunavut Consumer Office Supplies Appliances Holmes HEPA Air Purifier Medium Box 0.50 2010-08-30

5 rows × 21 columns

CSV data

Grab the CSV version of the Excel file SampleSuperstoreSales.csv we've been playing with.

Dealing with commas double quotes in CSV

For the most part, CSV files are very simple, but they can get complicated when we need to embed a comma. One such case from the above file shows how fields with commas get quoted:

"Eldon Base for stackable storage shelf, platinum"

What happens when we want to encode a quote? Well, somehow people decided that "" double quotes was the answer (not!) and we get fields encoded like this:

"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators"

The good news is that Python's csv package knows how to read Excel-generated files that use such encoding. Here's a sample script that reads such a file into a list of lists:


In [10]:
import sys
import csv

table_file = "data/SampleSuperstoreSales.csv"
with open(table_file, "r") as csvfile:
    f = csv.reader(csvfile, dialect='excel')
    data = []
    for row in f:
        data.append(row)

print(data[:6])


[['Row ID', 'Order ID', 'Order Date', 'Order Priority', 'Order Quantity', 'Sales', 'Discount', 'Ship Mode', 'Profit', 'Unit Price', 'Shipping Cost', 'Customer Name', 'Province', 'Region', 'Customer Segment', 'Product Category', 'Product Sub-Category', 'Product Name', 'Product Container', 'Product Base Margin', 'Ship Date'], ['1.0', '3.0', '2010-10-13', 'Low', '6.0', '261.54', '0.04', 'Regular Air', '-213.25', '38.94', '35.0', 'Muhammed MacIntyre', 'Nunavut', 'Nunavut', 'Small Business', 'Office Supplies', 'Storage & Organization', 'Eldon Base for stackable storage shelf, platinum', 'Large Box', '0.8', '2010-10-20'], ['49.0', '293.0', '2012-10-01', 'High', '49.0', '10123.02', '0.07', 'Delivery Truck', '457.81', '208.16', '68.02', 'Barry French', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Appliances', '1.7 Cubic Foot Compact "Cube" Office Refrigerators', 'Jumbo Drum', '0.58', '2012-10-02'], ['50.0', '293.0', '2012-10-01', 'High', '27.0', '244.57', '0.01', 'Regular Air', '46.7075', '8.69', '2.99', 'Barry French', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Binders and Binder Accessories', 'Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl', 'Small Box', '0.39', '2012-10-03'], ['80.0', '483.0', '2011-07-10', 'High', '30.0', '4965.759499999999', '0.08', 'Regular Air', '1198.971', '195.99', '3.99', 'Clay Rozendal', 'Nunavut', 'Nunavut', 'Corporate', 'Technology', 'Telephones and Communication', 'R380', 'Small Box', '0.58', '2011-07-12'], ['85.0', '515.0', '2010-08-28', 'Not Specified', '19.0', '394.27', '0.08', 'Regular Air', '30.94', '21.78', '5.94', 'Carlos Soltero', 'Nunavut', 'Nunavut', 'Consumer', 'Office Supplies', 'Appliances', 'Holmes HEPA Air Purifier', 'Medium Box', '0.5', '2010-08-30']]

Or add to a numpy array:


In [11]:
import numpy as np
np.array(data)


Out[11]:
array([['Row ID', 'Order ID', 'Order Date', ..., 'Product Container',
        'Product Base Margin', 'Ship Date'],
       ['1.0', '3.0', '2010-10-13', ..., 'Large Box', '0.8',
        '2010-10-20'],
       ['49.0', '293.0', '2012-10-01', ..., 'Jumbo Drum', '0.58',
        '2012-10-02'],
       ...,
       ['7906.0', '56550.0', '2011-04-08', ..., 'Small Pack', '0.41',
        '2011-04-10'],
       ['7907.0', '56550.0', '2011-04-08', ..., 'Small Box', '0.56',
        '2011-04-09'],
       ['7914.0', '56581.0', '2009-02-08', ..., 'Medium Box', '0.65',
        '2009-02-11']], dtype='<U98')

Reading CSV into Pandas Data frames

In the end, the easiest way to deal with loading CSV files is probably with Pandas. For example, to load our sales CSV, we don't even have to manually open and close a file:


In [5]:
import pandas
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
df.head()


Out[5]:
Row ID Order ID Order Date Order Priority Order Quantity Sales Discount Ship Mode Profit Unit Price ... Customer Name Province Region Customer Segment Product Category Product Sub-Category Product Name Product Container Product Base Margin Ship Date
0 1.0 3.0 2010-10-13 Low 6.0 261.5400 0.04 Regular Air -213.2500 38.94 ... Muhammed MacIntyre Nunavut Nunavut Small Business Office Supplies Storage & Organization Eldon Base for stackable storage shelf, platinum Large Box 0.80 2010-10-20
1 49.0 293.0 2012-10-01 High 49.0 10123.0200 0.07 Delivery Truck 457.8100 208.16 ... Barry French Nunavut Nunavut Consumer Office Supplies Appliances 1.7 Cubic Foot Compact "Cube" Office Refrigera... Jumbo Drum 0.58 2012-10-02
2 50.0 293.0 2012-10-01 High 27.0 244.5700 0.01 Regular Air 46.7075 8.69 ... Barry French Nunavut Nunavut Consumer Office Supplies Binders and Binder Accessories Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl Small Box 0.39 2012-10-03
3 80.0 483.0 2011-07-10 High 30.0 4965.7595 0.08 Regular Air 1198.9710 195.99 ... Clay Rozendal Nunavut Nunavut Corporate Technology Telephones and Communication R380 Small Box 0.58 2011-07-12
4 85.0 515.0 2010-08-28 Not Specified 19.0 394.2700 0.08 Regular Air 30.9400 21.78 ... Carlos Soltero Nunavut Nunavut Consumer Office Supplies Appliances Holmes HEPA Air Purifier Medium Box 0.50 2010-08-30

5 rows × 21 columns

Pandas hides all of the details. I also find that pulling out columns is nice with pandas. Here's how to print the customer name column:


In [6]:
df['Customer Name'].head()


Out[6]:
0    Muhammed MacIntyre
1          Barry French
2          Barry French
3         Clay Rozendal
4        Carlos Soltero
Name: Customer Name, dtype: object

In [7]:
df.Profit.head()


Out[7]:
0    -213.2500
1     457.8100
2      46.7075
3    1198.9710
4      30.9400
Name: Profit, dtype: float64

You can learn more about slicing and dicing data from our Boot Camp notes.

Exercise

Read the AAPL.csv file into a data frame using Pandas.

Exercise

From the sales CSV file, use pandas to read in the data and multiple the Order Quantity and Unit Price columns to get a new column.


In [8]:
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
(df['Order Quantity']*df['Unit Price']).head()


Out[8]:
0      233.64
1    10199.84
2      234.63
3     5879.70
4      413.82
dtype: float64