In [1]:
from __future__ import print_function

%matplotlib inline
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd

What is Pandas?

Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Numpy/Scipy/astropy.tables provide ways to store + manipulate data, but sometimes you have to hack things together when using those libraries. Pandas makes those interactions more straightforward. In my experience, the simplicity of working with Pandas helps keep code maintainable and understandable.

Where can I see some other examples of Pandas?

Numpy example


In [2]:
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv"
data = np.genfromtxt(url, delimiter=",", dtype=None, names=True)

data


Out[2]:
array([ (b'"1"', 0.23, b'"Ideal"', b'"E"', b'"SI2"', 61.5, 55.0, 326, 3.95, 3.98, 2.43),
       (b'"2"', 0.21, b'"Premium"', b'"E"', b'"SI1"', 59.8, 61.0, 326, 3.89, 3.84, 2.31),
       (b'"3"', 0.23, b'"Good"', b'"E"', b'"VS1"', 56.9, 65.0, 327, 4.05, 4.07, 2.31),
       ...,
       (b'"53938"', 0.7, b'"Very Good"', b'"D"', b'"SI1"', 62.8, 60.0, 2757, 5.66, 5.68, 3.56),
       (b'"53939"', 0.86, b'"Premium"', b'"H"', b'"SI2"', 61.0, 58.0, 2757, 6.15, 6.12, 3.74),
       (b'"53940"', 0.75, b'"Ideal"', b'"D"', b'"SI2"', 62.2, 55.0, 2757, 5.83, 5.87, 3.64)], 
      dtype=[('f0', 'S7'), ('carat', '<f8'), ('cut', 'S11'), ('color', 'S3'), ('clarity', 'S6'), ('depth', '<f8'), ('table', '<f8'), ('price', '<i8'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

data is now a "structured" numpy array. We can access it like a normal 2d array:


In [3]:
data[0][2]


Out[3]:
b'"Ideal"'

But we can also access columns using their names:


In [4]:
data[0]["cut"]


Out[4]:
b'"Ideal"'

And we can work with these columns just like a standard numpy array:


In [5]:
data["price"].mean()


Out[5]:
3932.7997219132371

But what happens if we try to add new columns? It'd be nice if this behaved like a dict:


In [6]:
data["price_per_carat"] = data["price"] / data["carat"]


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-fee25bb63cf9> in <module>()
----> 1 data["price_per_carat"] = data["price"] / data["carat"]

ValueError: no field of name price_per_carat

Okay, that's not a huge problem. There are alternatives (one is sketched below), but they can be ugly.
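
For example, numpy's recfunctions module can append a field by building a brand-new structured array. A minimal sketch (the column name price_per_carat is just for illustration):

from numpy.lib import recfunctions as rfn

# append_fields copies everything into a new structured array with the extra column
data = rfn.append_fields(data,
                         "price_per_carat",
                         data["price"] / data["carat"],
                         usemask=False)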

Let's try separating our data into subsets and plotting. It's not bad, but it's not great either.


In [7]:
cuts = set(data["cut"])
for cut in cuts:
    # one histogram of price per cut category
    plt.figure()
    plt.title(cut.decode())  # cut is a bytes object, so decode it for the title
    plt.hist(data[data["cut"] == cut]["price"])


Pandas version

Let's do some of the same things, except now using a pandas.DataFrame.

A DataFrame is like an extension of a structured array, but with more features.


In [8]:
url = 'https://github.com/vincentarelbundock/Rdatasets/raw/master/csv/ggplot2/diamonds.csv'
df = pd.read_csv(url, index_col=0)
type(df)


Out[8]:
pandas.core.frame.DataFrame

Similar to astropy tables, DataFrames have nice ways to display an overview of the data:


In [9]:
df.head()


Out[9]:
   carat      cut  color  clarity  depth  table  price     x     y     z
1   0.23    Ideal      E      SI2   61.5     55    326  3.95  3.98  2.43
2   0.21  Premium      E      SI1   59.8     61    326  3.89  3.84  2.31
3   0.23     Good      E      VS1   56.9     65    327  4.05  4.07  2.31
4   0.29  Premium      I      VS2   62.4     58    334  4.20  4.23  2.63
5   0.31     Good      J      SI2   63.3     58    335  4.34  4.35  2.75

In [10]:
df.describe()


Out[10]:
              carat         depth         table         price             x             y             z
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157      5.734526      3.538734
std        0.474011      1.432621      2.234491   3989.439738      1.121761      1.142135      0.705699
min        0.200000     43.000000     43.000000    326.000000      0.000000      0.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000      4.720000      2.910000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000      5.710000      3.530000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000      6.540000      4.040000
max        5.010000     79.000000     95.000000  18823.000000     10.740000     58.900000     31.800000

We can access columns by name (like with structured arrays), or we can access them as object attributes:


In [11]:
df["price"].std()


Out[11]:
3989.4397381463023

In [12]:
df.price.max()


Out[12]:
18823

We can add new columns:


In [13]:
df["price_per_carat"] = df["price"] / df.carat

Pandas also provides nice wrapper functions for visualizing an entire dataset:


In [14]:
df.hist(figsize=(15,15))


Out[14]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x109c23ac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1067ab2b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10be39160>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10a091eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x109e943c8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x109d109e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x109df44e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x109e30a20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x109d93630>]], dtype=object)

We can create groups based on the values of a column:


In [15]:
df.groupby("cut").price.std()


Out[15]:
cut
Fair         3560.386496
Good         3681.589467
Ideal        3808.401108
Premium      4349.204870
Very Good    3935.862040
Name: price, dtype: float64
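
As an aside, groupby isn't limited to one statistic at a time; a quick sketch of computing several aggregates per group:

# Mean, standard deviation, and count of price for each cut
df.groupby("cut").price.agg(["mean", "std", "count"])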

And we can plot using these groups. It's much less verbose than the numpy version above:


In [16]:
df.hist(column="price", by="cut", figsize=(20,20))


Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10701d588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10beea630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10bf342b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10ba81048>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x109b1f128>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10b444978>]], dtype=object)

You can also do fancier plots, but sometimes the defaults aren't very pretty (e.g. the plots below are missing x-axis labels and aren't labeled by cut); one workaround is sketched after the output below.

If you want to see better ways to use pandas with plotting, check out: http://pandas.pydata.org/pandas-docs/stable/visualization.html


In [17]:
df.groupby("cut").plot.hexbin("price", "carat", gridsize=20, title="")


Out[17]:
cut
Fair         Axes(0.125,0.125;0.62x0.775)
Good         Axes(0.125,0.125;0.62x0.775)
Ideal        Axes(0.125,0.125;0.62x0.775)
Premium      Axes(0.125,0.125;0.62x0.775)
Very Good    Axes(0.125,0.125;0.62x0.775)
dtype: object
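
One way to get per-cut titles (and one plot per group) is to loop over the groups explicitly; a rough sketch:

# One hexbin plot per cut, titled with the cut name
for cut_name, group in df.groupby("cut"):
    group.plot.hexbin("price", "carat", gridsize=20, title=cut_name)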

Pandas with SQL Databases

If you work with databases, you might notice that a DataFrame is similar to a table.

This makes it (somewhat) easy to interface with a table in an existing database. Here I'll read an existing table directly into a DataFrame:


In [18]:
df_from_table = pd.read_sql_table("status", "sqlite:///sample_database.db")
df_from_table = df_from_table.set_index("id")
df_from_table.head()


Out[18]:
data_dir status
id
cb33250c-7c9a-490a-be79-903e8bb8e338 ../cluster_parameter_study/ Complete
34ad9744-fd58-43f8-ab50-09b499cf7115 ../cluster_parameter_study/ Complete
59b7f2b4-ed74-48bc-ab18-7a860043a7de ../cluster_parameter_study/ Complete
6b63464e-5e6d-40c7-be47-d7429bfe0b97 ../cluster_parameter_study/ Complete
e153d17b-dfe2-4172-ba68-1c59862b57cc ../cluster_parameter_study/ Complete

In [19]:
df_from_table[df_from_table["status"] == "Running"]


Out[19]:
data_dir status
id
c668b684-b0ac-474f-bdc0-a985c1b695b7 ../cluster_parameter_study/ Running
e1d181bc-0095-47d0-98f8-54f3efd95833 ../cluster_parameter_study/ Running
58ea2992-5299-4598-b20d-1cdc36c5fb56 ../cluster_parameter_study/ Running
b597813d-cc8e-4323-97f1-e6f14c6a9e83 ../cluster_parameter_study/ Running
651fc6bd-38c9-4ed5-a2ee-ab0e3553ac1c ../cluster_parameter_study/ Running
6b33e24d-efbe-4849-bf02-3a717e772407 ../cluster_parameter_study/ Running
26f97021-7ea1-4db7-bcc2-a7899841ebac ../cluster_parameter_study/ Running
5e06e18c-5721-425d-8df2-ce92ad02520c ../cluster_parameter_study/ Running
5d571036-8dab-4e3f-923d-d058d88f7e96 ../cluster_parameter_study/ Running
d431dd27-ee94-4f72-b81d-e572f9f74266 ../cluster_parameter_study/ Running
c678165d-60cd-43da-926a-ecdab71d5e23 ../cluster_parameter_study/ Running
4185a715-14cc-416b-af2a-21cc1695e5a4 ../cluster_parameter_study/ Running
e730e083-9335-4a55-ba55-280c5bd0bcbb ../cluster_parameter_study/ Running
80d7cc84-d6e4-4db2-a20e-c3b932a65105 ../cluster_parameter_study/ Running
d4e175f6-8344-4a64-ad74-b7339c13c079 ../cluster_parameter_study/ Running
99c33d19-3e59-4254-9c05-4ee74b528612 ../cluster_parameter_study/ Running
1f232598-0cd3-4251-9876-8a0a92516482 ../cluster_parameter_study/ Running
395c9d8c-9092-4ba6-bf72-104ca7121838 ../cluster_parameter_study/ Running
60d6f520-b9ee-4031-b9b1-aaca431abf34 ../cluster_parameter_study/ Running
20111006-1840-4d4a-a88d-08af5855c132 ../cluster_parameter_study/ Running

In [20]:
df_from_table.loc["cb33250c-7c9a-490a-be79-903e8bb8e338"]


Out[20]:
data_dir    ../cluster_parameter_study/
status                         Complete
Name: cb33250c-7c9a-490a-be79-903e8bb8e338, dtype: object
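
Writing works in the other direction via DataFrame.to_sql. A minimal sketch, assuming SQLAlchemy is installed; the table name "status_copy" is just a placeholder:

from sqlalchemy import create_engine

engine = create_engine("sqlite:///sample_database.db")

# Write the DataFrame back out to a (new) table in the same database
df_from_table.to_sql("status_copy", engine, if_exists="replace")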

Time Domain

Pandas was also developed with time domain applications in mind, so it includes some extra power for dealing with DateTimes.


In [21]:
url = "http://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=WVI&data=tmpf&year1=2014&month1=1&day1=1&year2=2014&month2=12&day2=31&tz=America%2FLos_Angeles&format=comma&latlon=no&direct=no"


df = pd.read_csv(url,
                 comment="#",
                 names=["Station", "Time", "Temp"],
                 parse_dates=True,
                 header=1,
                 na_values="M",
                 index_col="Time")

df.head()


Out[21]:
                     Station   Temp
Time
2014-01-01 01:53:00      WVI  41.00
2014-01-01 02:53:00      WVI  42.08
2014-01-01 03:53:00      WVI  37.94
2014-01-01 04:53:00      WVI  37.94
2014-01-01 05:53:00      WVI  37.94
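
Because the index is now a DatetimeIndex, we also get conveniences like partial-string selection; a quick sketch:

# All observations from June 2014
df.loc["2014-06"].head()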

In [22]:
df.plot()


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1102bf3c8>

Let's take the average weekly temperature (resample on a weekly timescale):


In [23]:
df.resample("w").plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f344be0>
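
(Note: in newer pandas releases, resample returns a Resampler object and the aggregation has to be spelled out explicitly; something like the following should be the equivalent there:)

# Weekly mean of the temperature column, then plot
df["Temp"].resample("W").mean().plot()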

That's an ugly plot. It'd be nicer to smooth the data instead of down-sampling it:


In [24]:
pd.rolling_mean(df, freq="d", window=7).plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f3cea20>
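
(Again, newer pandas versions replace pd.rolling_mean with the .rolling() method; roughly the same smoothed curve should come from:)

# Resample to daily means, then take a 7-day rolling average
df["Temp"].resample("D").mean().rolling(window=7).mean().plot()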

Astropy Tables?

I can't provide much comparison to the astropy tables, since I don't use them.

Downsides

  • Indexing with Pandas can be ugly
    • Because you have an index column, you need to say whether you want the $0^{\mathrm{th}}$ row (df.iloc[0]) or the row with an index value of $0$ (df.loc[0]); see the sketch after this list
  • Pandas can be slow
    • This probably isn't an issue unless you're writing loops
    • There are ways to speed things up. The easiest is to look into how to index data correctly
  • Pandas Gotchas
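
A quick illustration of the label-versus-position distinction (this little frame is made up just for the example):

# An index whose labels don't start at zero
example = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 0, 7])

example.iloc[0]  # positional: the first row (index label 5)
example.loc[0]   # label-based: the row whose index label is 0 (the second row)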

Want more complex examples?

http://pandas.pydata.org/pandas-docs/stable/cookbook.html