The first part of this series was very well received so I thought I would continue the theme of showing how to do common Excel tasks in pandas.
In the first article, I focused on common math tasks in Excel and how to do them in pandas. In this article, I'll focus on some other Excel tasks related to data selection and how to map them to pandas.
Please refer to that post for the full details.
Import the pandas and numpy modules.
In [1]:
import pandas as pd
import numpy as np
Load in the Excel data that represents a year's worth of sales.
In [2]:
df = pd.read_excel("../data/sample-salesv3.xlsx")
Take a quick look at the data types to make sure everything came through as expected.
In [3]:
df.dtypes
Out[3]:
You'll notice that our date column is showing up as a generic object. We are going to convert it to a datetime object to make some selections a little easier.
In [4]:
df['date'] = pd.to_datetime(df['date'])
In [5]:
df.head()
Out[5]:
In [6]:
df.dtypes
Out[6]:
The date is now a datetime object which will be useful in future steps.
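As an aside, to_datetime can also take an explicit format string, which skips format inference and is faster on large columns. A minimal sketch on a made-up column of date strings (the values here are illustrative, not from the sales file):

```python
import pandas as pd

# Made-up date strings standing in for the file's date column.
raw = pd.Series(["2014-01-01 07:21:51", "2014-01-01 10:00:47"])

# format= tells pandas exactly how to parse each value;
# errors="coerce" turns any unparseable entry into NaT instead of raising.
dates = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
```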
Similar to the autofilter function in Excel, you can use pandas to filter and select certain subsets of data.
For instance, if we want to just see a specific account number, we can easily do that with pandas.
Note, I am going to use the head function to show the top results. This is purely to keep the article shorter.
In [7]:
df[df["account number"]==307599].head()
Out[7]:
You could also do the filtering based on numeric values.
In [8]:
df[df["quantity"] > 22].head()
Out[8]:
If we want to do more complex filtering, we can use map to filter. In this example, let's look for items with SKUs that start with B1.
In [9]:
df[df["sku"].map(lambda x: x.startswith('B1'))].head()
Out[9]:
It's easy to chain two statements together using the & operator.
In [10]:
df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head()
Out[10]:
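The same pattern works for OR logic by swapping & for |. A minimal sketch on a made-up frame (the column names mirror the article's data; the values are invented for illustration):

```python
import pandas as pd

# Small illustrative frame; not the article's actual sales data.
df = pd.DataFrame({
    "sku": ["B1-20000", "S1-27722", "B1-86481", "S2-16558"],
    "quantity": [7, 11, 35, 23],
})

# OR conditions chain with | instead of &; each condition still needs
# its own parentheses so the bitwise operator binds correctly.
either = df[(df["sku"].str.startswith("B1")) | (df["quantity"] > 22)]
```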
Another useful function that pandas supports is called isin. It allows us to define a list of values we want to look for.
In this case, we look for all records that include two specific account numbers.
In [11]:
df[df["account number"].isin([714466,218895])].head()
Out[11]:
Pandas supports another function called query which allows you to efficiently select subsets of data. It does require the numexpr package, so make sure you have it installed before trying this step.
If you would like to get a list of customers by name, you can do that with a query, similar to the python syntax shown above.
In [12]:
df.query('name == ["Kulas Inc","Barton LLC"]').head()
Out[12]:
The query function allows you to do more than just this simple example, but for the purposes of this discussion I'm showing it so you are aware that it is out there for you.
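For instance, query can reference local Python variables with the @ prefix and combine conditions with plain and/or keywords. A hedged sketch on a made-up frame:

```python
import pandas as pd

# Illustrative frame standing in for the sales data.
df = pd.DataFrame({
    "name": ["Kulas Inc", "Barton LLC", "Trantow-Barrows"],
    "quantity": [10, 45, 30],
})

# @threshold pulls in the local variable; conditions combine with "and".
threshold = 25
big_orders = df.query("quantity > @threshold and name != 'Barton LLC'")
```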
Using pandas, you can do complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results return what you are expecting.
In [13]:
df = df.sort_values(by='date')
df.head()
Out[13]:
The python filtering syntax shown before works with dates.
In [14]:
df[df['date'] >='20140905'].head()
Out[14]:
One of the really nice features of pandas is that it understands dates, so it will allow us to do partial filtering. If we want to only look for data more recent than a specific month, we can do so.
In [15]:
df[df['date'] >='2014-03'].head()
Out[15]:
Of course, you can chain the criteria.
In [16]:
df[(df['date'] >='20140701') & (df['date'] <= '20140715')].head()
Out[16]:
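Series.between is an inclusive shorthand for the two chained comparisons above. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Illustrative data; only the middle row falls inside the window.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-30", "2014-07-10", "2014-07-20"]),
    "quantity": [4, 9, 6],
})

# between is inclusive on both ends, matching >= and <= chained together.
july_window = df[df["date"].between("20140701", "20140715")]
```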
Because pandas understands date columns, you can express the date value in multiple formats and it will give you the results you expect.
In [17]:
df[df['date'] >= 'Oct-2014'].head()
Out[17]:
In [18]:
df[df['date'] >= '10-10-2014'].head()
Out[18]:
When working with time series data, if we convert the data to use the date as the index, we can do some more filtering. Set the new index using set_index.
In [19]:
df2 = df.set_index(['date'])
df2.head()
Out[19]:
We can slice the data to get a range.
In [20]:
df2["20140101":"20140201"].head()
Out[20]:
Once again, we can use various date representations to remove any ambiguity around date naming conventions.
In [21]:
df2["2014-Jan-1":"2014-Feb-1"].head()
Out[21]:
In [22]:
df2["2014-Jan-1":"2014-Feb-1"].tail()
Out[22]:
In [23]:
df2["2014"].head()
Out[23]:
In [24]:
df2["2014-Dec"].head()
Out[24]:
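Note that newer pandas versions prefer spelling these partial-date lookups through .loc, which makes the row-selection intent explicit. A small sketch on a made-up date-indexed frame:

```python
import pandas as pd

# Tiny frame indexed by date, mimicking df2 from the article.
df2 = pd.DataFrame(
    {"quantity": [5, 8, 3]},
    index=pd.to_datetime(["2014-01-15", "2014-12-02", "2015-03-01"]),
)

# .loc with a partial date string selects all rows in that period.
year_2014 = df2.loc["2014"]
dec_2014 = df2.loc["2014-12"]
```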
Pandas has support for vectorized string functions as well. If we want to identify all the SKUs that contain a certain value, we can use str.contains. In this case, we know that the SKU is always represented in the same way, so B1 only shows up at the front of the SKU.
In [25]:
df[df['sku'].str.contains('B1')].head()
Out[25]:
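When the match should be anchored to the start of the string, str.startswith is a vectorized alternative to the map(lambda ...) approach used earlier. A minimal sketch on made-up SKUs:

```python
import pandas as pd

# Illustrative SKUs, not the article's actual data.
df = pd.DataFrame({"sku": ["B1-20000", "S1-27722", "B1-86481"]})

# str.startswith anchors the match to the beginning of each string,
# unlike str.contains, which matches anywhere.
b1_only = df[df["sku"].str.startswith("B1")]
```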
We can string queries together and use sort_values to control how the data is ordered.
In [26]:
df[(df['sku'].str.contains('B1-531')) & (df['quantity'] > 40)].sort_values(by=['quantity', 'name'], ascending=[False, True])
Out[26]:
I frequently find myself trying to get a list of unique items in a long list within Excel. It is a multi-step process to do this in Excel but is fairly simple in pandas. We just use the unique function on a column to get the list.
In [27]:
df["name"].unique()
Out[27]:
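Two related helpers are worth knowing: nunique counts the distinct values, and value_counts tallies how often each one appears. A hedged sketch on a made-up name column:

```python
import pandas as pd

# Illustrative names; "Kulas Inc" appears three times.
df = pd.DataFrame({"name": ["Kulas Inc", "Barton LLC", "Kulas Inc", "Kulas Inc"]})

# nunique returns the number of distinct values;
# value_counts returns a per-value tally, most frequent first.
distinct = df["name"].nunique()
counts = df["name"].value_counts()
```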
If we wanted to include the account number, we could use drop_duplicates.
In [28]:
df.drop_duplicates(subset=["account number","name"]).head()
Out[28]:
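By default drop_duplicates keeps the first row of each duplicate group; the keep parameter lets you retain the last occurrence instead. A minimal sketch on made-up rows:

```python
import pandas as pd

# Two rows share an account number; keep="last" retains the later one.
df = pd.DataFrame({
    "account number": [307599, 307599, 714466],
    "quantity": [5, 9, 12],
})

last_seen = df.drop_duplicates(subset=["account number"], keep="last")
```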
We are obviously pulling in more data than we need and getting some non-useful information, so select only the first and second columns using iloc.
In [29]:
df.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1]]
Out[29]:
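Selecting columns by label is often more robust than positional iloc, since it keeps working if the column order ever changes. A small sketch on a made-up frame:

```python
import pandas as pd

# Illustrative frame with the same column names as the article's data.
df = pd.DataFrame({
    "account number": [307599, 714466],
    "name": ["Kulas Inc", "Trantow-Barrows"],
    "quantity": [7, 12],
})

# A list of column labels selects exactly those columns, in that order.
subset = df[["account number", "name"]]
```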
I hope you found this useful. I encourage you to try to apply these ideas to some of your own repetitive Excel tasks and streamline your workflow.