In [1]:
from __future__ import print_function
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
Numpy/Scipy/astropy.tables provide ways to store and manipulate data, but sometimes you have to hack things together when using those libraries. Pandas makes those interactions more straightforward. In my experience, the simplicity of working with Pandas helps keep code maintainable and understandable.
In [2]:
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv"
data = np.genfromtxt(url, delimiter=",", dtype=None, names=True)
data
Out[2]:
data is now a "structured" numpy array. We can access it like a normal 2d array:
In [3]:
data[0][2]
Out[3]:
but we can also access columns using their names:
In [4]:
data[0]["cut"]
Out[4]:
And we can work with these columns just like a standard numpy array:
In [5]:
data["price"].mean()
Out[5]:
But what happens if we try to add new columns? It'd be nice if it behaved like a dict:
In [6]:
data["price_per_carat"] = data["price"] / data["carat"]
Okay, that fails. It's not a huge problem: there are alternatives, but they can be ugly.
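One such alternative, as a rough sketch: rebuild the structured array with an extra field via numpy.lib.recfunctions (the values below are made-up placeholders, not the diamonds data):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A tiny structured array standing in for the diamonds data (values made up).
data = np.array([(0.23, 326), (0.21, 326)],
                dtype=[("carat", "f8"), ("price", "i8")])

# Structured arrays can't grow in place; append_fields builds a new array
# with the extra column.
data = rfn.append_fields(data, "price_per_carat",
                         data["price"] / data["carat"], usemask=False)
print(data["price_per_carat"])
```

It works, but it copies the whole array just to add one column.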
Let's try separating our data into subsets, and plotting. It's not bad, but also not great.
In [7]:
cuts = set(data["cut"])
for cut in cuts:
    plt.figure()
    plt.title(cut.decode())
    plt.hist(data[data["cut"] == cut]["price"])
In [8]:
url = 'https://github.com/vincentarelbundock/Rdatasets/raw/master/csv/ggplot2/diamonds.csv'
df = pd.read_csv(url, index_col=0)
type(df)
Out[8]:
Similar to astropy tables, DataFrames have nice ways to display an overview of the data:
In [9]:
df.head()
Out[9]:
In [10]:
df.describe()
Out[10]:
We can access columns through their names (like with structured arrays), or we can access them as object attributes:
In [11]:
df["price"].std()
Out[11]:
In [12]:
df.price.max()
Out[12]:
We can add new columns:
In [13]:
df["price_per_carat"] = df["price"] / df.carat
It also provides nice wrapper functions for visualizing an entire dataset:
In [14]:
df.hist(figsize=(15,15))
Out[14]:
We can create groups based on the values of a column:
In [15]:
df.groupby("cut").price.std()
Out[15]:
And we can plot using these groups. It's much less verbose than when we did it with numpy:
In [16]:
df.hist(column="price", by="cut", figsize=(20,20))
Out[16]:
You can also do fancier plots. But sometimes the defaults aren't very pretty (e.g. the plots below are missing x-axis labels, and aren't labeled by cut).
But if you want to see better ways to use pandas with plotting, check out: http://pandas.pydata.org/pandas-docs/stable/visualization.html
In [17]:
df.groupby("cut").plot.hexbin("price", "carat", gridsize=20, title="")
Out[17]:
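The cells below read from a local SQLite file, sample_database.db, which isn't bundled with this notebook. A minimal sketch of building a compatible "status" table (the rows here are invented placeholders, not the real data):

```python
import sqlite3
import pandas as pd

# Hypothetical rows -- the real table's contents aren't shown in the notebook.
sample = pd.DataFrame({
    "id": ["job-1", "job-2", "job-3"],
    "status": ["Running", "Done", "Running"],
})

conn = sqlite3.connect("sample_database.db")
sample.to_sql("status", conn, index=False, if_exists="replace")

# Read it back. read_sql works with a plain sqlite3 connection;
# read_sql_table (used below) additionally requires SQLAlchemy.
roundtrip = pd.read_sql("SELECT * FROM status", conn)
print(roundtrip)
```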
In [18]:
df_from_table = pd.read_sql_table("status", "sqlite:///sample_database.db")
df_from_table = df_from_table.set_index("id")
df_from_table.head()
Out[18]:
In [19]:
df_from_table[df_from_table["status"] == "Running"]
Out[19]:
In [20]:
df_from_table.loc["cb33250c-7c9a-490a-be79-903e8bb8e338"]
Out[20]:
In [21]:
url = "http://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=WVI&data=tmpf&year1=2014&month1=1&day1=1&year2=2014&month2=12&day2=31&tz=America%2FLos_Angeles&format=comma&latlon=no&direct=no"
df = pd.read_csv(url,
                 comment="#",
                 names=["Station", "Time", "Temp"],
                 parse_dates=True,
                 header=1,
                 na_values="M",
                 index_col="Time")
df.head()
Out[21]:
In [22]:
df.plot()
Out[22]:
Let's take the average weekly temperature (resample on a weekly timescale):
In [23]:
df.resample("W").mean().plot()
Out[23]:
That's an ugly plot. It'd be nicer if we smoothed the data instead of down-sampling:
In [24]:
df.resample("D").mean().rolling(window=7).mean().plot()
Out[24]:
Some caveats:

Pandas can be ugly: for example, it's not always obvious whether you want the first row (df.iloc[0]) or the row with an index value of 0 (df.loc[0]).

Pandas can be slow.
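To make the iloc/loc distinction concrete, a small example with an index whose labels don't match positions:

```python
import pandas as pd

# Index labels deliberately don't line up with positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

print(s.iloc[0])  # positional access: "a"
print(s.loc[0])   # label-based access: "b"
```

With the default RangeIndex the two happen to agree, which is exactly why the distinction is easy to miss until an index gets shuffled or filtered.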