Let's get some data. Download Sample Superstore Sales .xls file or my local copy and open it in Excel to see what it looks like.
Data of interest that we want to process in Python often comes in the form of an Excel spreadsheet, but the data is in a special format that we can't read directly:
In [1]:
with open('data/SampleSuperstoreSales.xls', "rb") as f:
txt = f.read()
print(txt[0:100])
pip install csvkit
Unfortunately, at the moment, there is some kind of a weird bug at the moment, unrelated to csvkit, so we get lots of warnings even though it works.
/Users/parrt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
So, the following command works without having to run or even own Excel on your laptop, but you get lots of warnings:
$ in2csv data/SampleSuperstoreSales.xls > /tmp/t.csv
In [9]:
import pandas
table = pandas.read_excel("data/SampleSuperstoreSales.xls")
table.head()
Out[9]:
Grab the CSV version of the Excel file SampleSuperstoreSales.csv we've been playing with.
For the most part, CSV files are very simple, but they can get complicated when we need to embed a comma. One such case from the above file shows how fields with commas get quoted:
"Eldon Base for stackable storage shelf, platinum"
What happens when we want to encode a quote? Well, somehow people decided that ""
double quotes was the answer (not!) and we get fields encoded like this:
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators"
The good news is that Python's csv
package knows how to read Excel-generated files that use such encoding. Here's a sample script that reads such a file into a list of lists:
In [10]:
import sys
import csv
table_file = "data/SampleSuperstoreSales.csv"
with open(table_file, "r") as csvfile:
f = csv.reader(csvfile, dialect='excel')
data = []
for row in f:
data.append(row)
print(data[:6])
Or add to a numpy array
:
In [11]:
import numpy as np
np.array(data)
Out[11]:
In the end, the easiest way to deal with loading CSV files is probably with Pandas. For example, to load our sales CSV, we don't even have to manually open and close a file:
In [5]:
import pandas
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
df.head()
Out[5]:
Pandas hides all of the details. I also find that pulling out columns is nice with pandas. Here's how to print the customer name column:
In [6]:
df['Customer Name'].head()
Out[6]:
In [7]:
df.Profit.head()
Out[7]:
You can learn more about slicing and dicing data from our Boot Camp notes.
Read the AAPL.csv file into a data frame using Pandas.
In [8]:
df = pandas.read_csv("data/SampleSuperstoreSales.csv")
(df['Order Quantity']*df['Unit Price']).head()
Out[8]: