In [1]:
import pandas as pd
Note: this cell should run without raising a traceback. Assuming it runs, you can also try printing the value of pd.__version__ to see which version of pandas you have installed.
We'll be analysing a real-world dataset together. It's about my favourite thing in the world: coffee. This dataset was collected at the Mathematics Institute at the University of Warwick. It's a time-series dataset, describing the total number of coffees made by our espresso machine by a certain date.
Slides are available at https://github.com/QCaudron/pydata_pandas. One notebook contains solutions; beware of spoilers.
The notebooks contain notes on what we're doing; I'll skip them during the workshop, but they should help explain things along the way.
The pandas API is enormous. The documentation is excellent; don't hesitate to look things up.
In [2]:
import pandas as pd
%matplotlib inline
Note: the second line here tells matplotlib to plot directly under the cell in which any plotting code is called. pandas uses matplotlib to generate graphs, and without this, the graphs would appear outside the Jupyter notebook when you called plt.show() - but we just want them to appear without having to do that.
http://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1
Let's import the coffee data from CSV.
In [3]:
# Read data from data/coffees.csv
data = pd.read_csv("data/coffees.csv")
Note: pandas can read from many data formats: CSV, JSON, Excel, HDF5, SQL, and more.
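For instance, a sketch of a couple of the other readers - these particular files are hypothetical and not part of this workshop's data, so the lines are commented out:
# pd.read_json("data/coffees.json")
# pd.read_excel("data/coffees.xlsx")  # needs an Excel engine such as openpyxl installed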
In [4]:
data
Out[4]:
In [5]:
# .head()
data.head()
Out[5]:
We have an index, and three columns: timestamp, coffees, and contributor.
Uh-oh. Why is there a string of text, testing, in our coffee numbers? What's going on in the coffees column in the row after that?
Note: df.head(n=10) would show the first ten rows. The default is n=5.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
In [6]:
# .loc or .iloc
data.loc[2]
Out[6]:
Definitely a string. We'll note this as something to fix after we finish looking around.
Note: .loc uses a label-based lookup, which means that the value you pass into the square brackets must be in the index. Another method, .iloc, is integer-location-based, so .iloc[2] would return the third row. In this case they're the same, but had we changed our index, as we'll see later, things would work differently.
Indexing a dataframe with [] directly returns a pd.Series or pd.DataFrame by searching over columns, not rows. Indexing a pd.Series with [] is like indexing a dataframe with .iloc.
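To make the distinction concrete, here's a minimal sketch using a throwaway frame (not our coffee data) whose index is a set of string labels:
toy = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])
toy.loc["c"]   # label-based: looks up the index label "c"
toy.iloc[2]    # position-based: the third row, whatever its label
toy["x"][:2]   # [] selects the column "x" as a series, which then slices by position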
In [7]:
# [] indexing on a series
data.coffees[:5]
Out[7]:
Note: here, we're indexing a series (a pd.Series object). When you access a single column of a pd.DataFrame (here, data.coffees or data["coffees"]), the object returned is a pd.Series. From that, indexing directly with [] works in an integer-location-based manner, and as with numpy arrays, you can take slices ([:5]).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
In [8]:
print("Dataset length :")
# len()
print(len(data))
In [9]:
# .describe()
data.describe()
Out[9]:
Looks like we also have some missing data - we have 671 rows, but the coffees column only has 658 entries.
Note: .describe() returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns (strings or datetimes), it will return the most frequent entry, and the first and last items. For all columns, .describe() will return the count of objects in that column (not counting NaNs) and the number of unique entries. You can control what's returned using .describe()'s keyword arguments.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
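As a sketch of those keyword arguments (the percentile choices here are arbitrary):
data.describe(percentiles=[0.1, 0.5, 0.9])  # choose which percentiles are reported
data.describe(include=["object"])           # describe only the object columns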
In [10]:
# .isnull() and boolean indexing with []
data[data.coffees.isnull()]
Out[10]:
Note: .isnull() returns a boolean array (an array of Trues and Falses) that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the coffees column are null, and we use it to index against the full dataframe - so we get back every column of the dataframe, but only those rows where coffees is null.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html
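Two closely related idioms, as a quick sketch:
data.isnull().sum()           # count the null entries in each column
data[data.coffees.notnull()]  # the complement: only rows where coffees is present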
In [11]:
# .dtypes
data.dtypes
Out[11]:
The contributor column makes sense as object, because we expect strings there; but surely timestamp should be a timestamp type, and coffees should be numerical?
In [12]:
# print the first element of the series with [] indexing
print(data.timestamp[0])
# print its type()
print(type(data.timestamp[0]))
It looks like the timestamp field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have pandas interpret these strings as datetimes for us automatically.
Note: here's an example of using direct [] indexing on a pd.Series. We're accessing the first entry, just to see what type of object we have there.
In [13]:
# cast the coffees column using pd.to_numeric, and coerce errors
data.coffees = pd.to_numeric(data.coffees, errors="coerce")
data.head()
Out[13]:
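The errors keyword decides what happens to unparseable values; a small sketch with a throwaway series:
s = pd.Series(["1", "2", "testing"])
pd.to_numeric(s, errors="coerce")   # "testing" becomes NaN
# pd.to_numeric(s, errors="raise")  # the default: raises a ValueError instead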
In [14]:
# Use .dropna() using a subset, and pass inplace
data.dropna(subset=["coffees"], inplace=True)
data.head()
Out[14]:
In [15]:
# Cast to int using .astype()
data.coffees = data.coffees.astype(int)
data.head()
Out[15]:
In [16]:
# pd.to_datetime()
data.timestamp = pd.to_datetime(data.timestamp)
# Confirm dtypes
data.dtypes
Out[16]:
In [17]:
# .describe(), passing the include kwarg to see all information
data.describe(include="all")
Out[17]:
In [18]:
# What do the first few rows look like?
data.head()
Out[18]:
Note : .describe(include="all")
is describing all attributes of all columns, but some don't make sense based on the column's dtype
. For example, the contributor column has no first
and last
attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp - those have an intuitive definition of sorting - but not so much for strings ( alphabetical order doesn't really matter when they're arbitrary strings ). Similary, the timestamp column has no mean or other numerical traits. What does it mean to calculate the mean timestamp ?
In [19]:
# .plot() on the coffees series
data.coffees.plot()
Out[19]:
pandas is plotting the coffees against the index, which is just a series of integers.
Note: .plot() on a pd.Series will plot the data against the index. On a pd.DataFrame, the .plot() method allows plotting of one column against another.
By default, .plot() renders a line graph, but you can specify which type of plot you'd like - bar, line, histogram, area, scatter, and so on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
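For instance (these particular plot kinds are arbitrary choices, just to show the kwarg):
data.coffees.plot(kind="hist")                      # histogram of the coffee counts
data.plot(x="timestamp", y="coffees", kind="area")  # area plot of one column against another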
In [20]:
# .plot() on the dataframe, setting x to the timestamp column, with point-and-line style
data.plot(x="timestamp", style=".-")
Out[20]:
We have some very uneven spacing in places. We might start by cutting off the last few points of this time-series, where a lot of data is missing.
In [21]:
# .tail() with ten rows
data.tail(n=10)
Out[21]:
After mid-March, things start getting spaced rather erratically.
In [22]:
# Use conditional indexing against the timestamp
data = data[data.timestamp < "2013-03-01"]
data.tail()
Out[22]:
Note: this is another example of boolean indexing. data.timestamp < "2013-03-01" is a boolean array, and can be passed into the dataframe directly in [], much as with a np.ndarray.
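Boolean arrays can also be combined with & and |, wrapping each condition in parentheses; a sketch with arbitrary date bounds:
mask = (data.timestamp > "2012-06-01") & (data.timestamp < "2013-03-01")
data[mask].head()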
In [23]:
# Once again, plot the data against the timestamp
data.plot(x="timestamp", style=".-")
Out[23]:
In [24]:
# .value_counts()
data.contributor.value_counts()
Out[24]:
Note: .value_counts() counts the unique values in a series. It's similar to doing a .groupby() followed by a .count(), as we'll see soon.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
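To make the equivalence concrete, a sketch of the groupby spelling:
data.groupby("contributor").size()  # the same counts, sorted by label rather than by frequency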
In [25]:
# .plot() a bar chart from the value counts
data.contributor.value_counts().plot(kind="bar")
Out[25]:
In [26]:
# Create a series of the weekdays for each entry using .dt.weekday
weekdays = data.timestamp.dt.weekday
# assign() it to our dataframe
data = data.assign(weekdays=weekdays)
data.head()
Out[26]:
In [27]:
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_dict = {key: weekday_names[key] for key in range(7)}
# Use .apply() to apply a custom function to the weekdays column
data.weekdays = data.weekdays.apply(lambda x: weekday_dict[x])
data.head()
Out[27]:
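An aside: for a plain dictionary lookup like this, .map() is the more direct (and usually faster) spelling. A sketch, equivalent to the .apply() call above - to be run instead of it, not after it, since the column now holds names:
# data.weekdays = data.weekdays.map(weekday_dict)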
In [28]:
# .groupby() the weekdays and then .count() rows in each group
weekday_counts = data.groupby("weekdays").count()
# We can reorder this dataframe by our weekday_names list
weekday_counts = weekday_counts.loc[weekday_names]
weekday_counts
Out[28]:
Note: the first line could be replaced by weekday_counts = data.weekdays.value_counts(), the only difference being that it would return a series, whereas here we got back a dataframe.
In [29]:
# .plot() a bar chart of data in weekday_counts
weekday_counts.timestamp.plot(kind="bar", title="Datapoints added on each weekday")
Out[29]:
In [30]:
# Set the dataframe's .index property
data.index = data.timestamp
# Let's drop the timestamp column, as we no longer need it
data.drop(["timestamp"], axis=1, inplace=True)
data.head()
Out[30]:
In [31]:
# pd.date_range, with daily frequency, and normalisation
midnights = pd.date_range(data.index[0], data.index[-1], freq="D", normalize=True)
midnights
Out[31]:
Note: pd.date_range creates a fixed-frequency DatetimeIndex. normalize=True ensures these datetimes are at midnight, rather than at whatever time the starting point is.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
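A couple more sketches of the same function, with arbitrary dates:
pd.date_range("2018-01-01", periods=3, freq="D")     # three consecutive days
pd.date_range("2018-01-01", "2018-01-31", freq="W")  # the Sundays within January 2018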
In [32]:
# Take the union of the existing and new indices
new_index = midnights.union(data.index)
new_index
Out[32]:
Note: the union of these indices is just a new index where entries from both indices are present. It's sorted by time.
In [33]:
# .reindex() the dataframe
upsampled_data = data.reindex(new_index)
upsampled_data.head(10)
Out[33]:
Note: .reindex() keeps any values that conform to the new index, and inserts NaNs where we have no values.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
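.reindex() can also fill the holes instead of inserting NaNs; one option, as a sketch (we don't use it here, because interpolation suits this data better):
data.reindex(new_index, method="ffill")  # propagate the last valid observation forward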
In [34]:
# .interpolate the upsampled_data using the time method
upsampled_data = upsampled_data.interpolate(method="time")
upsampled_data.head(10)
Out[34]:
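Note: method="time" weights the interpolation by the actual time elapsed between observations, which matters here because our index is irregularly spaced; the default, method="linear", treats the points as evenly spaced instead. A sketch of that alternative, commented out so as not to overwrite our result:
# upsampled_data.interpolate(method="linear")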
In [35]:
# .resample() followed by .asfreq()
daily_data = upsampled_data.resample("D").asfreq()
# Drop the contributor column, we no longer need it
daily_data = daily_data.drop(["contributor"], axis=1)
# Generate a column of weekday names
daily_data["weekdays"] = daily_data.index.weekday_name # We did it the slow way before...
daily_data.head()
Out[35]:
In [36]:
# Let's plot the data once more, to see how we're doing
daily_data.plot(figsize=(15, 4), style=".")
Out[36]:
In [37]:
# Use .diff() on the coffees column; follow up with .shift()
coffees_made = daily_data.coffees.diff().shift(-1)
# Add this as a column to the dataframe
daily_data["coffees_made_today"] = coffees_made
daily_data.head()
Out[37]:
Note: we use .shift() here because if we look at the .diff() between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "this many coffees were made at some point on the Monday", so we shift the entire series up by one.
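A toy example of the pattern, with made-up numbers:
s = pd.Series([0, 3, 5], index=pd.to_datetime(["2013-01-07", "2013-01-08", "2013-01-09"]))
s.diff()            # NaN, 3, 2 : each day's increase, attributed to the later day
s.diff().shift(-1)  # 3, 2, NaN : the same increases, attributed to the day they happened on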
In [38]:
# .groupby weekdays, take the mean, and grab the coffees_made_today column
coffees_by_day = daily_data.groupby("weekdays").mean().coffees_made_today
coffees_by_day
Out[38]:
In [39]:
# Sort coffees_by_day by our list of weekday names
coffees_by_day = coffees_by_day[weekday_names]
# Plot a bar chart
coffees_by_day.plot(kind="bar")
Out[39]:
Wednesdays were seminar days...
In [40]:
# Bring in data/department_members.csv;
# have the first column be the index, and parse the dates
people = pd.read_csv("data/department_members.csv", index_col=[0], parse_dates=True)
people.head()
Out[40]:
In [41]:
# Use an outer join, then interpolate over missing values using nearest values
daily_data = daily_data.join(people, how="outer").interpolate(method="nearest")
daily_data.head()
Out[41]:
Note: .join() performs a left join by default - every row of our dataframe is kept, rows of the other dataframe whose index doesn't match are dropped, and any holes are filled with NaNs. Here, we asked for an outer join, which instead keeps the union of both indices. You can also specify inner or right joins, as well as plenty of other useful options; the pandas API for joining or merging datasets is very well developed.
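A toy example of the difference, with made-up frames:
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])
left.join(right)               # left join by default: keeps "x" and "y"
left.join(right, how="outer")  # keeps "x", "y", and "z", with NaNs where data is missing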
In [42]:
# New column is the ratio of coffees made on a given day to number of members in the department
daily_data["coffees_per_person"] = daily_data.coffees_made_today / daily_data.members
# Let's drop those remaining NaNs while we're at it
daily_data.dropna(inplace=True)
daily_data.head()
Out[42]:
In [43]:
# Plot the coffees_per_person column
daily_data.coffees_per_person.plot()
Out[43]:
In [44]:
# pd.read_csv(); try using data/coffee_status.csv
# parse_dates as kwarg; also pass index_col
machine_status = pd.read_csv("data/coffee_status.csv", parse_dates=["date"], index_col="date")
machine_status.head()
Out[44]:
Note: the parse_dates keyword argument accepts several kinds of values. By passing in a list of strings, we're telling pandas to attempt to parse the dates in the columns with those names.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
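As a sketch of another accepted form - equivalent here, since date is our index column:
pd.read_csv("data/coffee_status.csv", index_col="date", parse_dates=True)  # True parses the index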
In [45]:
# .value_counts()
machine_status.status.value_counts()
Out[45]:
In [46]:
# Make a pd.Series from the status series where things are OK
numerical_status = machine_status.status == "OK"
numerical_status.plot()
Out[46]:
Note: the first line here creates a boolean pd.Series, holding the value True when machine_status.status is "OK", and False otherwise. Because it's a pd.Series, its index stays the same as that of machine_status, which was a DatetimeIndex. We can then plot the boolean series (True appearing as 1, and False as 0), and quickly scan to see that there are long stretches where the coffee machine was operational, with short bouts (thankfully!) of the machine being broken.
In [47]:
# .join()
daily_data = daily_data.join(machine_status)
daily_data.head()
Out[47]:
In [48]:
# Column depicting when the status was "OK"
# Cast the series to ints as you create the new column in the dataframe
daily_data["numerical_status"] = (daily_data.status == "OK").astype(int)
daily_data.head()
Out[48]:
In [49]:
# Plot both columns on the same graph, using default args
daily_data[["coffees_per_person", "numerical_status"]].plot()
Out[49]:
In [50]:
# Resample weekly, taking the mean of each week to get a weekly value
# (numeric_only skips the string columns, which have no mean)
weekly_data = daily_data.resample("W").mean(numeric_only=True)
weekly_data[["coffees_per_person", "numerical_status"]].plot()
Out[50]:
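The aggregation applied after .resample() is up to you; a couple of alternative sketches:
daily_data.resample("W").max(numeric_only=True)    # weekly maxima instead of weekly means
daily_data.resample("2W").mean(numeric_only=True)  # fortnightly means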