In [1]:
import pandas as pd
import matplotlib.pyplot as plt
When plotting charts in a Jupyter notebook, we can use a magic command to make sure that the plots render inline.
In [2]:
%matplotlib notebook
The dataset we are going to use is a variation of the Online Retail Dataset (source).
This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
In [3]:
data = pd.read_csv("../data/online_orders.csv")
In [4]:
data.head()
Out[4]:
The dataset consists of the following fields.
We generally use line charts when we want to see the trend of one (or many) variables over time. For example, let's say we want to compare sales in Germany with sales in France over time. The pandas plot method draws a line plot by default.
We can set the index of the dataframe to the date, and the pandas plot method will automatically use it as the x axis.
In [5]:
data_indexed = data.set_index('date')
In [6]:
data_indexed.head()
Out[6]:
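The notebook does not show how the date column was parsed. If it was read as plain text (the default for read_csv unless parse_dates is passed), the x axis is treated as a sequence of labels rather than as time. A minimal sketch of converting it first, using made-up rows in place of the real file:

```python
import pandas as pd

# Made-up rows standing in for the real file; column names follow the notebook.
toy = pd.DataFrame({
    "date": ["2011-01-03", "2011-01-04", "2011-01-05"],
    "country": ["Germany", "Germany", "Germany"],
    "sales": [120.0, 80.5, 210.0],
})

# Convert the text dates to real timestamps before using them as the index.
toy["date"] = pd.to_datetime(toy["date"])
toy_indexed = toy.set_index("date")

# The index is now a DatetimeIndex, so the x axis is drawn as a time axis.
ax = toy_indexed["sales"].plot()
```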
The pandas plotting library is just a thin wrapper around matplotlib. Matplotlib has a somewhat convoluted API, and pandas makes plotting common charts much easier.
The two can be mixed freely. For example, here we call matplotlib's plt.legend directly to display the legend on top of the pandas plots.
In [7]:
data_indexed[data_indexed.country=='Germany'].sales.plot(label="Germany")
data_indexed[data_indexed.country=='France'].sales.plot(label="France")
plt.legend();
We see that sales numbers in France and Germany are generally similar, and that each country has a spike in sales on a particular day in 2011.
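An alternative to filtering the dataframe once per country is to pivot it so that each country becomes a column, and let pandas draw one line per column. This is only a sketch on made-up rows, and it assumes each (date, country) pair appears at most once, which pivot requires:

```python
import pandas as pd

# Made-up rows; the real values come from online_orders.csv.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-03", "2011-01-03", "2011-01-04", "2011-01-04"]),
    "country": ["Germany", "France", "Germany", "France"],
    "sales": [120.0, 95.0, 80.5, 130.0],
})

# One row per date, one column per country.
wide = toy.pivot(index="date", columns="country", values="sales")

# pandas draws one line per column and adds the legend on its own.
ax = wide.plot()
```

The legend entries come from the column names, so no explicit label or plt.legend call is needed in this form.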
In [8]:
data.plot.scatter(x='n_items', y='sales');
No surprises here: we see that there is a linear relationship between the number of items purchased and the sales revenue.
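A scatter plot suggests a relationship visually. One way to put a number on it is the Pearson correlation between the two columns. The values below are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Invented values that grow roughly in proportion to each other.
toy = pd.DataFrame({
    "n_items": [10, 25, 40, 60, 80],
    "sales": [22.0, 49.5, 83.0, 118.0, 165.0],
})

# Pearson correlation: values close to 1 indicate a strong positive linear relationship.
r = toy["n_items"].corr(toy["sales"])
print(round(r, 3))
```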
We can use matplotlib's plt.plot function directly as well. This gives us much more flexibility in how we display the data.
In [9]:
for country in data.country.unique():
    plt.plot(data[data.country == country].n_items,  # plot this series as the x
             data[data.country == country].sales,    # plot this series as the y
             marker='o',     # make the markers circle shaped
             linestyle='',   # don't connect the dots with lines
             ms=3,           # size of the markers (in points)
             label=country   # use the country as the label of this plot
             )
plt.legend();
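The same loop can also be written with groupby, which hands us each country's rows directly instead of filtering the dataframe twice per country. A sketch on made-up rows, using an explicit Axes object:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows with the same column names as the notebook's dataframe.
toy = pd.DataFrame({
    "country": ["Germany", "Germany", "France", "France", "EIRE"],
    "n_items": [10, 25, 12, 30, 50],
    "sales": [22.0, 49.5, 25.0, 61.0, 140.0],
})

fig, ax = plt.subplots()
# groupby yields (key, sub-dataframe) pairs, sorted by key.
for country, group in toy.groupby("country"):
    ax.plot(group["n_items"], group["sales"],
            marker="o", linestyle="", ms=3, label=country)
ax.legend()
```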
For example, we can see how sales relate to the day of the week.
In [10]:
data.plot.scatter(x='day_of_week', y="sales");
We can see that sales are generally evenly distributed, except for a small uptick on Thursdays and no sales at all on Saturdays! This would be cause for an in-depth QA of the data (online sales don't usually stop on Saturdays).
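A first step in that kind of QA could be to count how many rows fall on each weekday. The notebook does not show how day_of_week is encoded, so this sketch derives the weekday name from a date column instead. The rows are made up and deliberately skip a Saturday:

```python
import pandas as pd

# Made-up week of data: 2011-01-03 is a Monday, and 2011-01-08 (Saturday) is absent.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-01-03", "2011-01-04", "2011-01-05",
        "2011-01-06", "2011-01-07", "2011-01-09",
    ]),
    "sales": [120.0, 80.5, 210.0, 260.0, 150.0, 90.0],
})

all_days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Count rows per weekday, keeping days with zero rows visible.
rows_per_day = (toy["date"].dt.day_name()
                .value_counts()
                .reindex(all_days, fill_value=0))

missing_days = [day for day in all_days if rows_per_day[day] == 0]
print(missing_days)
```

A weekday with zero rows points to missing records rather than to a day with low sales.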
In [11]:
data.groupby('country')['sales'].sum().plot.barh();
Here we can see that the top countries in terms of sales in the dataset are Netherlands, Germany and EIRE. This does not mean that they are the top countries in terms of sales overall because, remember, the dataset consists of country and date records.
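Bar charts of totals are easier to read when the bars are sorted. A small sketch with invented numbers that sorts the per-country totals before plotting:

```python
import pandas as pd

# Invented rows; the totals are chosen only to make the ordering obvious.
toy = pd.DataFrame({
    "country": ["Netherlands", "Netherlands", "Germany", "Germany",
                "EIRE", "EIRE", "EIRE"],
    "sales": [300.0, 250.0, 120.0, 80.0, 140.0, 60.0, 90.0],
})

# Sum per country, then sort ascending so the largest bar ends up at the top.
totals = toy.groupby("country")["sales"].sum().sort_values()
ax = totals.plot.barh()
```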
In [12]:
data.boxplot(column="sales", by="country");
Here we see that the top countries in terms of number of data points (France, Germany and EIRE) have a significant number of outliers in terms of sales. Netherlands, on the other hand, has sales that are more stable (probably a smaller number of bigger orders).
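By default, matplotlib boxplots mark as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that same rule to count outliers per country. The helper name iqr_outliers and the data are made up for illustration:

```python
import pandas as pd

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[outside]

# Made-up sales: one extreme day for EIRE, none for Netherlands.
toy = pd.DataFrame({
    "country": ["EIRE"] * 6 + ["Netherlands"] * 4,
    "sales": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0,
              50.0, 52.0, 51.0, 53.0],
})

# Number of outlying days per country.
outlier_counts = toy.groupby("country")["sales"].apply(
    lambda s: len(iqr_outliers(s))
)
print(outlier_counts)
```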
We can use histograms to make sure that nothing is fishy with the data, as well as to gain an understanding of its distribution. For example, we can use one to see how sales are distributed in EIRE.
In [13]:
data[data.country=='EIRE'].sales.hist();
Here we can see that, even though the large majority of sales days in EIRE are below 2500 GBP, on some days sales are significantly higher.
We can limit the extent of the x axis of a histogram by passing the parameter xlim to the plot.
In [14]:
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000));
We can also specify the number of groups (bins) of the histogram by using the parameter bins.
In [15]:
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000), bins=70);
So we see that the most common daily sales revenue in EIRE is around 1000 GBP.
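Note that xlim only changes the visible part of the axis: the bins are still computed over the full range of the data. To read the fullest bin as a number instead of from the picture, we can compute the histogram with numpy and fix the bin range explicitly. A sketch with invented values:

```python
import numpy as np
import pandas as pd

# Invented daily sales values, in GBP.
sales = pd.Series([950.0, 1020.0, 1050.0, 1080.0, 2500.0, 6400.0])

# 70 equal bins between 0 and 7000, so each bin is 100 GBP wide.
counts, edges = np.histogram(sales, bins=70, range=(0, 7000))

# Index of the fullest bin and its lower and upper edges.
fullest = counts.argmax()
modal_bin = (edges[fullest], edges[fullest + 1])
print(modal_bin)
```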
Using pandas .plot as it is is good enough when doing data analysis (remember, data understanding is one of the main goals of data visualization).
However, when we want to share a chart with someone else (whether that person is another data scientist or someone without a technical background), we need to take more steps in order to provide the most effective visualizations.
When producing charts for external use, always remember:
Pandas uses matplotlib as a plotting backend. Thus, we can use matplotlib styles and API to modify our plots.
We can change the style and make use of any of matplotlib's styles.
In [16]:
plt.style.available
Out[16]:
In [17]:
plt.style.use('seaborn')
In [18]:
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000), bins=70);
We can see that the plot now has a completely different set of fonts, colors and sizes.
Be warned!
Different styles might not render the same results.
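Style names depend on the installed matplotlib version: newer releases renamed the seaborn styles with a seaborn-v0_8 prefix, so the plain 'seaborn' name may not be available. A defensive sketch that picks the first name that exists and falls back to the default style otherwise:

```python
import matplotlib.pyplot as plt

# Candidate style names, in order of preference.
preferred = ["seaborn", "seaborn-v0_8"]

# Take the first candidate this matplotlib installation knows about.
chosen = next(
    (name for name in preferred if name in plt.style.available),
    "default",
)
plt.style.use(chosen)
print(chosen)
```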
In [19]:
data.boxplot(column="sales", by="country");
For example, we see that the seaborn style does not display the outliers on the boxplots, so you might be misled into believing that there are no outliers just because of how the plot style is set up!
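One way to avoid depending on the style for this is to set the outlier markers explicitly through flierprops. The sketch below calls matplotlib's boxplot directly on made-up values, and the marker settings are just one possible choice:

```python
import matplotlib.pyplot as plt

# Made-up sales per country; EIRE contains one extreme value.
groups = {
    "EIRE": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0],
    "Netherlands": [50.0, 52.0, 51.0, 53.0],
}

fig, ax = plt.subplots()
result = ax.boxplot(
    list(groups.values()),
    # Filled black circles stay visible regardless of the active style.
    flierprops={"marker": "o", "markerfacecolor": "black",
                "markeredgecolor": "black", "markersize": 4},
)
ax.set_xticklabels(list(groups.keys()))

# The outliers that were drawn, one list per box.
flier_values = [list(line.get_ydata()) for line in result["fliers"]]
print(flier_values)
```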
Finally, if we want to share our plot with someone else (to include in a paper or to share with a client), we can use matplotlib customization to make the plot more explicit and nicer looking.
In [20]:
plt.style.use('ggplot')
In [25]:
data_indexed[data_indexed.country=='Germany'].sales.plot(label="Germany")
data_indexed[data_indexed.country=='France'].sales.plot(label="France")
plt.legend()
plt.title("Daily sales in France and Germany")
plt.xlabel("Date")
plt.ylabel("Sales (GBP)")
plt.figtext(.3, .01, "Source: Online Retail Dataset. Transactions between 01/12/2010 and 09/12/2011");
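To actually hand the chart to someone, we also need to export it. The sketch below builds a similar chart on made-up rows with an explicit Figure and Axes, and saves it as a PNG. It writes to an in-memory buffer so that nothing is created on disk; passing a file name to savefig works the same way. The dpi and bbox_inches values are illustrative choices:

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-03", "2011-01-03", "2011-01-04", "2011-01-04"]),
    "country": ["Germany", "France", "Germany", "France"],
    "sales": [120.0, 95.0, 80.5, 130.0],
})
wide = toy.pivot(index="date", columns="country", values="sales")

# Draw on an explicit Axes so every label is set on a known object.
fig, ax = plt.subplots()
wide.plot(ax=ax)
ax.set_title("Daily sales in France and Germany")
ax.set_xlabel("Date")
ax.set_ylabel("Sales (GBP)")

# Export as PNG; replace the buffer with a path to write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")
```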