In [1]:

    
import pandas as pd
import matplotlib.pyplot as plt

When plotting charts in a Jupyter notebook, we can use a magic command to make sure that the plots render inline, inside the notebook itself.



In [2]:

    
%matplotlib notebook

The dataset we are going to use is a variation of the Online Retail Dataset (source)

This is a transnational dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.



In [3]:

    
data = pd.read_csv("../data/online_orders.csv")



In [4]:

    
data.head()









    Out[4]:

         country        date    sales  n_items  n_orders  day_of_week
    0  Australia  2010-12-01   358.25      107         1            2
    1  Australia  2010-12-08   258.90      214         1            2
    2  Australia  2010-12-17   415.70      146         1            4
    3  Australia  2011-01-06  7154.38     4802         2            3
    4  Australia  2011-01-10    81.60       96         1            0

The dataset consists of the following fields.

country: Country name
date: Date the row is showing a summary of
sales: Total sales revenue (in UK pounds) for that country and date
n_items: Number of items sold
n_orders: Number of different online orders
day_of_week: Day of the week as an integer (0 appears to be Monday: 2010-12-01, a Wednesday, is coded as 2)

Line Charts

We generally use line charts when we want to see the trend of one or more variables over time. For example, let's say we want to compare sales in Germany and France over time. The pandas plot method draws a line plot by default.

We can set the index of the dataframe to the date, and the pandas plot method will automatically use it as the x axis.



In [5]:

    
data_indexed = data.set_index('date')
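One thing worth checking here: pd.read_csv was called without parse_dates, so the date column may have been read as plain text rather than as timestamps. A minimal sketch, assuming that is the case, converts the column before indexing so that the x axis behaves as a real time axis. The rows are copied from the data.head() output; sample and sample_indexed are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Rows copied from the data.head() output; the notebook itself uses the full CSV.
sample = pd.DataFrame({
    "country": ["Australia"] * 5,
    "date": ["2010-12-01", "2010-12-08", "2010-12-17", "2011-01-06", "2011-01-10"],
    "sales": [358.25, 258.90, 415.70, 7154.38, 81.60],
})

# Convert the text dates into timestamps before using them as the index.
sample["date"] = pd.to_datetime(sample["date"])
sample_indexed = sample.set_index("date").sort_index()

print(type(sample_indexed.index).__name__)
```

The same effect can usually be had at load time by passing parse_dates=["date"] to pd.read_csv.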



In [6]:

    
data_indexed.head()









    Out[6]:

                  country    sales  n_items  n_orders  day_of_week
    date
    2010-12-01  Australia   358.25      107         1            2
    2010-12-08  Australia   258.90      214         1            2
    2010-12-17  Australia   415.70      146         1            4
    2011-01-06  Australia  7154.38     4802         2            3
    2011-01-10  Australia    81.60       96         1            0

The pandas plotting API is a thin wrapper around matplotlib. Matplotlib has a somewhat convoluted API, and pandas makes plotting common charts much easier.

Because both share the same figure, we can mix them. For example, here we call matplotlib's plt.legend directly to display the legend on top of the pandas plots.



In [7]:

    
data_indexed[data_indexed.country=='Germany'].sales.plot(label="Germany")
data_indexed[data_indexed.country=='France'].sales.plot(label="France")
plt.legend();

We see that sales numbers in France and Germany are generally similar, and that each country has a spike in sales on a particular day in 2011.
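An alternative to filtering the dataframe once per country is to reshape it so that each country becomes a column; a single plot call then draws one labelled line per column and adds the legend automatically. The sketch below assumes one row per country and date, and uses a tiny toy frame whose values are made up for illustration (they are not taken from the dataset).

```python
import pandas as pd

# Made-up placeholder values, NOT taken from the real dataset.
toy = pd.DataFrame({
    "country": ["Germany", "France", "Germany", "France"],
    "date": pd.to_datetime(["2011-01-04", "2011-01-04", "2011-01-05", "2011-01-05"]),
    "sales": [100.0, 80.0, 120.0, 90.0],
})

# Reshape to one row per date and one column per country.
wide = toy.pivot(index="date", columns="country", values="sales")

# One call draws a line per selected column, with a legend.
ax = wide[["Germany", "France"]].plot()
```

On the real data, the equivalent would be data.pivot(index="date", columns="country", values="sales"), provided each country and date pair appears only once.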

Scatter plots

We can use a scatter plot to see the relationship between the number of items sold and the total sales revenue



In [8]:

    
data.plot.scatter(x='n_items', y='sales');

No surprises here: we see a roughly linear relationship between the number of items purchased and the sales revenue.

We can also use matplotlib's plt.plot function directly. This gives us much more flexibility in how we display the data.
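A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation on the five rows shown in the data.head() output. With so few rows, one large order dominates the result, so treat it only as a demonstration of the call.

```python
import pandas as pd

# Rows copied from the data.head() output shown earlier.
sample = pd.DataFrame({
    "n_items": [107, 214, 146, 4802, 96],
    "sales": [358.25, 258.90, 415.70, 7154.38, 81.60],
})

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r = sample["n_items"].corr(sample["sales"])
print(f"Pearson r = {r:.3f}")
```

On the full dataset the call would be data["n_items"].corr(data["sales"]).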



In [9]:

    
for country in data.country.unique():
    subset = data[data.country == country]  # rows for this country only
    plt.plot(subset.n_items,   # plot this series as the x
             subset.sales,     # plot this series as the y
             marker='o',       # make the markers circle shaped
             linestyle='',     # don't connect the dots with lines
             ms=3,             # size of the markers (in points)
             label=country     # use the country as the label of this plot
            )

plt.legend();

We can also see, for example, how sales relate to the day of the week.



In [10]:

    
data.plot.scatter(x='day_of_week', y="sales");

We can see that sales are generally equally distributed across the week, except for a small uptick on Thursdays and no sales at all on Saturdays! This would be cause for an in-depth QA of the data (online sales don't usually stop on Saturdays).
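A gap like this is easy to miss on a chart, so it can help to check it in code. The sketch below lists the weekday codes that never appear in a frame. The helper name missing_weekdays is hypothetical, and the 0 = Monday coding is an assumption inferred from the dates in the sample rows.

```python
import pandas as pd

def missing_weekdays(frame):
    """Return the weekday codes (assumed 0 = Monday ... 6 = Sunday) that never appear."""
    present = set(frame["day_of_week"].unique())
    return sorted(set(range(7)) - present)

# day_of_week values copied from the data.head() output (five rows only).
sample = pd.DataFrame({"day_of_week": [2, 2, 4, 3, 0]})
print(missing_weekdays(sample))
```

Calling missing_weekdays(data) on the full dataset would show whether Saturday is the only day with no rows at all.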

Bar charts



In [11]:

    
data.groupby('country')['sales'].sum().plot.barh();

Here we can see that the top countries in terms of total sales in the dataset are Netherlands, Germany and EIRE. This does not necessarily mean that they are the company's top countries overall: remember that each row of the dataset is a summary for one country and date, so the totals only reflect the rows present here.
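Reading off the top countries is easier when the bars are ordered. A minimal sketch, sorting the totals before plotting; the toy frame uses made-up country labels and values purely for illustration.

```python
import pandas as pd

# Made-up placeholder values, NOT taken from the real dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C", "B"],
    "sales": [10.0, 40.0, 15.0, 5.0, 20.0],
})

# Sum sales per country, then sort ascending so the largest bar ends up on top.
totals = toy.groupby("country")["sales"].sum().sort_values()
ax = totals.plot.barh()
```

On the real data this would be data.groupby('country')['sales'].sum().sort_values().plot.barh().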

Box plots



In [12]:

    
data.boxplot(column="sales", by="country");

Here we see that the countries with the most data points (France, Germany and EIRE) have a significant number of outliers in terms of sales. Netherlands, on the other hand, has sales that are more stable (probably a smaller number of bigger orders).
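The points a box plot draws as outliers are, by default, those lying more than 1.5 times the interquartile range beyond the quartiles. A sketch of counting them directly, using the Australian sales values from the data.head() output; count_outliers is a hypothetical helper name.

```python
import pandas as pd

def count_outliers(values, whis=1.5):
    """Count points beyond whis * IQR from the quartiles (the default box plot rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return int(outside.sum())

# Sales values copied from the data.head() output (Australia, five rows).
australia_sales = pd.Series([358.25, 258.90, 415.70, 7154.38, 81.60])
print(count_outliers(australia_sales))
```

To get a count per country on the full data, one option is data.groupby("country")["sales"].apply(count_outliers).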

Histograms

We can use histograms to make sure that nothing is fishy with the data, as well as to gain an understanding of its distribution. For example, we can use one to see how sales are distributed in EIRE.



In [13]:

    
data[data.country=='EIRE'].sales.hist();

Here we can see that, even though the vast majority of sales days in EIRE are below 2500 GBP, on some days sales are significantly higher.

We can limit the extent of the x axis of a histogram by passing the parameter xlim to the plot.



In [14]:

    
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000));

We can also specify the number of bins of the histogram by using the parameter bins.



In [15]:

    
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000), bins=70);

So we see that the most common daily sales revenue in EIRE is around 1000 GBP.
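Reading the tallest bar off a chart is approximate. A sketch of finding the most populated bin numerically with numpy's histogram function follows. Note that the range argument restricts which values are binned, whereas xlim only crops the view of the axis. The values are the Australian rows from data.head() (not EIRE), and modal_bin is a hypothetical helper name.

```python
import numpy as np

def modal_bin(values, bins, value_range):
    """Return (left edge, right edge, count) of the most populated histogram bin."""
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    top = int(np.argmax(counts))
    return float(edges[top]), float(edges[top + 1]), int(counts[top])

# Sales values copied from the data.head() output (Australia, five rows).
australia_sales = [358.25, 258.90, 415.70, 7154.38, 81.60]
print(modal_bin(australia_sales, bins=7, value_range=(0, 7000)))
```

For the chart above, the call would be modal_bin(data[data.country == 'EIRE'].sales, bins=70, value_range=(0, 7000)).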

Customizing

Using pandas .plot as is works well enough when doing data analysis (remember, data understanding is one of the main goals of data visualization).

However, when we want to share a chart with someone else (whether that person is another data scientist or someone without a technical background), we need to take more steps in order to provide the most effective visualizations.

When producing charts for external use, always remember:

Add a title describing the chart
Add labels for all the axes.
Check the axes limits to make sure they are appropriate and help convey the right information
Add legends if necessary (when dealing with multiple groups)
Make sure the color palette you choose will display properly on the medium where it is going to be consumed (for example, the background color of the page where the chart will be inserted affects how the chart looks)
It is good practice, especially if the chart is to be displayed publicly (and thus probably isolated from its original document), to add a footnote to the chart stating the source of the data.

Styles

Pandas uses matplotlib as its plotting backend. Thus, we can use matplotlib's styles and API to modify our plots.

We can change the style and use any of the styles that ship with matplotlib.



In [16]:

    
plt.style.available









    Out[16]:





['ggplot',
 'seaborn',
 'classic',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-deep',
 'grayscale',
 'seaborn-dark-palette',
 'seaborn-talk',
 'bmh',
 'seaborn-paper',
 'seaborn-dark',
 'seaborn-bright',
 'dark_background',
 'seaborn-notebook',
 'seaborn-pastel',
 'seaborn-white',
 'seaborn-muted',
 'seaborn-colorblind',
 'seaborn-whitegrid',
 'seaborn-poster',
 'seaborn-darkgrid']



In [17]:

    
plt.style.use('seaborn')
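The style names depend on the installed matplotlib version. In more recent releases the seaborn styles were renamed with a seaborn-v0_8 prefix, so the plain 'seaborn' name may not be available. A defensive sketch that picks the first available candidate; pick_style is a hypothetical helper, not a matplotlib function.

```python
import matplotlib.pyplot as plt

def pick_style(candidates):
    """Return the first style name that this matplotlib install knows about."""
    for name in candidates:
        if name in plt.style.available:
            return name
    return "default"  # matplotlib's built-in default style

chosen = pick_style(["seaborn", "seaborn-v0_8"])
plt.style.use(chosen)
print(chosen)
```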



In [18]:

    
data[data.country=='EIRE'].sales.plot.hist(xlim=(0,7000), bins=70);

We can see that the plot now has a completely different set of fonts, colors and sizes.

Be warned!

Different styles may render the same plot differently.



In [19]:

    
data.boxplot(column="sales", by="country");

For example, we see that the seaborn style does not display the outliers on the box plots, so you might be misled into believing that there are no outliers just because of how the plot style is set up!
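Two precautions can limit this kind of surprise: apply a style only to the plot that needs it, and set the outlier markers explicitly instead of relying on the style. The sketch below uses matplotlib's style context manager and the flierprops argument of boxplot; the sales values are the Australian rows from data.head().

```python
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]

# The style applies only inside the with block; global settings are restored afterwards.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.boxplot(
        [[358.25, 258.90, 415.70, 7154.38, 81.60]],
        flierprops={"marker": "o", "markersize": 4},  # draw outliers explicitly
    )
    ax.set_title("Styled with ggplot only inside this block")

after = plt.rcParams["axes.facecolor"]
print(before == after)
```

The pandas boxplot method forwards extra keyword arguments to matplotlib, so flierprops can be passed to data.boxplot as well.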

Labels and titles

Finally, if we want to share our plot with someone else (to include in a paper or to share with a client), we can use matplotlib's customization options to make the plot more explicit and nicer looking.



In [20]:

    
plt.style.use('ggplot')



In [25]:

    
data_indexed[data_indexed.country=='Germany'].sales.plot(label="Germany")
data_indexed[data_indexed.country=='France'].sales.plot(label="France")
plt.legend()
plt.title("Daily sales in France and Germany")
plt.xlabel("Date")
plt.ylabel("Sales (GBP)")
plt.figtext(.3, .01, "Source: Online Retail Dataset Transactions between 01/12/2010 and 09/12/2011");
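The same checklist (title, axis labels, legend, source footnote) can also be applied through matplotlib's object-oriented interface, which keeps explicit handles on the figure and axes and is easier to wrap in a reusable function. A minimal sketch, with labelled_line_chart as a hypothetical helper and the Australian rows from data.head() as input:

```python
import matplotlib.pyplot as plt
import pandas as pd

def labelled_line_chart(series_by_label, title, ylabel, source_note):
    """Draw one line per series, with title, axis labels, legend and a source footnote."""
    fig, ax = plt.subplots()
    for label, series in series_by_label.items():
        ax.plot(series.index, series.values, label=label)
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel(ylabel)
    ax.legend()
    fig.text(0.01, 0.01, source_note, fontsize=8)  # footnote in figure coordinates
    return fig, ax

# Values copied from the data.head() output (Australia, five rows).
australia = pd.Series(
    [358.25, 258.90, 415.70, 7154.38, 81.60],
    index=pd.to_datetime(
        ["2010-12-01", "2010-12-08", "2010-12-17", "2011-01-06", "2011-01-10"]
    ),
)
fig, ax = labelled_line_chart(
    {"Australia": australia},
    title="Daily sales in Australia",
    ylabel="Sales (GBP)",
    source_note="Source: Online Retail Dataset",
)
```

The returned figure can then be written to disk with fig.savefig, for example fig.savefig("sales.png", bbox_inches="tight"), where the file name is only a placeholder.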


