Chefkoch Link Data: Exploratory Analysis

This is a first look at the chefkoch link data, using plain pandas rather than a graph database.

Note: these analyses currently use a subset of approx. 30k recipes (out of a total of 800k).

Setup: Libraries, data import and formatting


In [14]:
# Libraries and graphics settings
import pandas as pd
import re
import statsmodels as sm
import statsmodels.formula.api as smf  # submodules such as formula.api must be imported explicitly


from ggplot import *
theme_bw()

import matplotlib.pyplot as mpl
mpl.style.use('ggplot')
%matplotlib inline

from pylab import rcParams
rcParams['figure.figsize'] = 10, 5

In [2]:
# Read data
df = pd.read_csv("/Users/Leon/Documents/02_Research_Learning/Research/Recipes/03_Data/link_data.csv",
                index_col=0)

In [3]:
# Fix data formats
df['activationdate'] = pd.to_datetime(df['activationdate']) # Well that was easy!
df['difficulty'] = df['difficulty'].astype('category')

# Fix data format for preparation time
df['preptime'] = df['preptime'].astype('string')
df['prep_mins'] = df['preptime'].apply(lambda x: x.replace(" min.", "")).astype('float64')

# Fix data format for strings
df['subtitle'] = df['subtitle'].astype('str') # df.dtypes still shows 'object': pandas stores plain strings as object dtype
df['title'] = df['title'].astype('str')
# df.dtypes

# Define additional variable for yearmonth (for plotting)
df['yearmonth'] = df['activationdate'].map(lambda x: x.year*1000 + x.month)
df['year'] = df['activationdate'].map(lambda x: x.year)

# Add one to vote count for plotting with log axis
df['votes_n_plus1'] = df['votes_n'] + 1
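
The prep-time conversion above assumes that every value has the form "<n> min."; a value in any other form would make the float conversion fail. A minimal sketch of a more tolerant variant, using hypothetical sample values rather than the real data (only the "<n> min." pattern is taken from the data):

```python
import pandas as pd

# Hypothetical sample values; only the "<n> min." pattern comes from the data
sample = pd.Series(["5 min.", "40 min.", None, "unknown"])

# Pull out the leading number; rows that do not match become NaN
minutes = sample.str.extract(r"(\d+)\s*min", expand=False)

# Convert the extracted digits to floats, keeping NaN for unparseable rows
prep_mins = pd.to_numeric(minutes, errors="coerce")

print(prep_mins)
```

Rows that cannot be parsed end up as NaN instead of raising, so they can be counted or inspected afterwards.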

Basic descriptives


In [4]:
df.shape


Out[4]:
(29978, 14)

In [12]:
df.head(5)


Out[12]:
                  activationdate  category   category_list_page                        difficulty  preptime
1857691301034784  2011-03-25      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  simpel      5 min.
537901150889686   2006-06-21      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  pfiffig     40 min.
1341731239030939  2009-08-04      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  simpel      20 min.
541961151505565   2006-06-28      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  normal      30 min.
731571175843661   2007-06-04      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  normal      60 min.

                  subtitle                                           title
1857691301034784  erwärmt und mit Sahnehaube ein besonderer Genuß    Bratapfellikör
537901150889686   nach Original sardischem Nonna-Rezept              Crema di Limoncello
1341731239030939  Die Erdbeeren putzen und zusammen mit dem Zuck...  Erdbeer - Limes
541961151505565   der absolut weltbeste, leckerste, dickflüssigs...  Eierlikör nach DDR-Tradition
731571175843661   Pflaumen (entsteint) im Entsafter 2 Stunden ko...  Pflaumenlikör

                  url                                                votes_avg  votes_n  prep_mins  yearmonth  year  votes_n_plus1
1857691301034784  www.chefkoch.de/rezepte/1857691301034784/Brata...       4.65       15        5.0    2011003  2011             16
537901150889686   www.chefkoch.de/rezepte/537901150889686/Crema-...       4.64       65       40.0    2006006  2006             66
1341731239030939  www.chefkoch.de/rezepte/1341731239030939/Erdbe...       4.65       79       20.0    2009008  2009             80
541961151505565   www.chefkoch.de/rezepte/541961151505565/Eierli...       4.81     1085       30.0    2006006  2006           1086
731571175843661   www.chefkoch.de/rezepte/731571175843661/Pflaum...       4.67       40       60.0    2007006  2007             41

In [5]:
df.columns


Out[5]:
Index([u'activationdate', u'category', u'category_list_page', u'difficulty',
       u'preptime', u'subtitle', u'title', u'url', u'votes_avg', u'votes_n',
       u'prep_mins', u'yearmonth', u'year', u'votes_n_plus1'],
      dtype='object')

In [6]:
df.describe()


Out[6]:
          votes_avg       votes_n     prep_mins     yearmonth          year  votes_n_plus1
count  29978.000000  29978.000000  29978.000000  2.997800e+04  29978.000000   29978.000000
mean       3.075588      4.411735     19.716926  2.009556e+06   2009.549236       5.411735
std        0.479588     28.596995     27.003455  3.748080e+03      3.748266      28.596995
min        1.000000      0.000000      1.000000  2.000010e+06   2000.000000       1.000000
25%        3.000000      1.000000     10.000000  2.007005e+06   2007.000000       2.000000
50%        3.000000      1.000000     15.000000  2.009010e+06   2009.000000       2.000000
75%        3.000000      2.000000     30.000000  2.013001e+06   2013.000000       3.000000
max        4.840000   2982.000000   2400.000000  2.016012e+06   2016.000000    2983.000000

In [7]:
df[['category','votes_n']].groupby('category').count()


Out[7]:
           votes_n
category
Getraenke    12788
Menueart     17190

First descriptive analyses

Number of recipes over time


In [10]:
small = df[['year', 'votes_avg']]
small.groupby('year').count().plot(kind='bar', color='darkblue', title="Number of recipes added each year", legend=None)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1158671d0>

In the subsample parsed so far, the number of recipes added per year grew strongly from 2003 to 2008 and then declined, apart from a notable jump in 2015.
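
The yearly counts can also be inspected as numbers rather than bars. A minimal sketch on a hypothetical toy frame (the values are made up; only the column names year and votes_avg come from the notebook). It also shows that count() on a column skips missing values, whereas size() counts rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df[['year', 'votes_avg']]
toy = pd.DataFrame({
    "year": [2006, 2006, 2007, 2009, 2009, 2009],
    "votes_avg": [3.0, np.nan, 4.2, 3.0, 3.5, 3.0],
})

# Number of rows (recipes) per year, regardless of missing ratings
per_year_rows = toy.groupby("year").size()

# Number of non-missing ratings per year; differs from size() when NaN is present
per_year_rated = toy.groupby("year")["votes_avg"].count()

# Running total of recipes, the numeric counterpart of a cumulative bar chart
running_total = per_year_rows.cumsum()

print(per_year_rows)
print(per_year_rated)
print(running_total)
```

In the real data votes_avg has no missing values (its count equals the number of rows), so both variants should agree there.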


In [11]:
cumsum = small.sort_values(by='year', ascending=True).groupby('year').count().cumsum(axis=0)
cumsum.plot(kind='bar', color='darkblue', title='Total number of recipes on the platform', legend=None)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11632b210>

Recipe Scores


In [125]:
df.plot(kind='hist', y='votes_avg',
        color='darkblue', alpha=0.7, legend=None,
        title='Distribution of recipe scores')


Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x132811410>

Ratings are disproportionately clustered around 3: the 25%, 50% and 75% quantiles of votes_avg in the summary table above are all exactly 3.0.
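
One way to check whether the cluster around 3 is driven by recipes with few votes is to compare the share of ratings equal to 3.0 across vote-count groups. A minimal sketch on hypothetical toy values; the cut-off of two votes is an arbitrary assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; column names follow the notebook
toy = pd.DataFrame({
    "votes_avg": [3.0, 3.0, 3.0, 4.65, 4.81, 3.0, 2.5, 3.0],
    "votes_n":   [0,   1,   1,   15,   1085, 2,   4,   0],
})

# Boolean flag: is the average rating exactly 3.0?
at_three = toy["votes_avg"] == 3.0

# Overall share of recipes rated exactly 3.0
share_at_three = at_three.mean()

# Same share, split by whether a recipe has few votes (cut-off is an assumption)
group = np.where(toy["votes_n"] <= 2, "few", "many")
share_by_group = at_three.groupby(group).mean()

print(share_at_three)
print(share_by_group)
```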


In [55]:
small.boxplot(by='year')


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x12b14b6d0>

A few things are immediately visible from the boxplot:

  • Average ratings are strongly bunched around 3
  • Average ratings are substantially lower for recipes added later
  • Most of the recipes added in 2007-2012 had ratings around 3
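
The quartiles that the boxplot draws can be tabulated per year as well. A minimal sketch on hypothetical toy values (column names as in the notebook):

```python
import pandas as pd

# Hypothetical toy values standing in for df[['year', 'votes_avg']]
toy = pd.DataFrame({
    "year": [2005, 2005, 2005, 2010, 2010, 2010],
    "votes_avg": [3.0, 4.0, 4.5, 3.0, 3.0, 3.5],
})

# describe() per group returns count, mean, std, min, quartiles and max;
# keep only the quartiles, which correspond to the box in a boxplot
summary = toy.groupby("year")["votes_avg"].describe()[["25%", "50%", "75%"]]

print(summary)
```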

In [13]:
df.plot(style=".", x='activationdate', y='votes_avg', legend=None, color='darkblue', alpha=0.1)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x115766d10>

In [ ]:
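# lowess = sm.nonparametric.lowess(df[['votes_avg']], df[['activationdate']], frac=.3)
import statsmodels.formula.api as smf  # submodule must be imported explicitly

# Regress on the numeric year; a datetime column would likely be treated as categorical by the formula interface
mod = smf.ols(formula='votes_avg ~ year', data=df)
res = mod.fit()
print(res.summary())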
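
A linear time trend in the ratings can also be estimated without the formula interface. The sketch below uses numpy.polyfit instead of statsmodels, so it yields only a slope and an intercept, not standard errors. The toy values are hypothetical, and converting dates to fractional years is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; column names follow the notebook
toy = pd.DataFrame({
    "activationdate": pd.to_datetime(
        ["2005-01-01", "2006-01-01", "2007-01-01", "2008-01-01"]),
    "votes_avg": [3.4, 3.3, 3.2, 3.1],
})

# Turn the dates into a numeric regressor (approximate fractional years)
dates = toy["activationdate"]
years = dates.dt.year + (dates.dt.dayofyear - 1) / 365.25

# Centre the regressor so the intercept is the rating at the mean date
x = (years - years.mean()).to_numpy()
y = toy["votes_avg"].to_numpy()

# Least-squares straight line: slope is the change in rating per year
slope, intercept = np.polyfit(x, y, deg=1)

print(slope, intercept)
```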

In [115]:
df.plot(kind='scatter', x='votes_n_plus1', y='votes_avg', 
        color='darkblue', s=20, alpha=0.5, 
        title='Count and average of votes for recipes', logx=True).set_ylim([1,5])


Out[115]:
(1, 5)

Notes on the plot above:

  • The x-axis (vote count plus one) uses a log scale, which spreads out the many recipes with few votes
  • Discrete rating averages form clearly visible patterns (from rounding)
  • The distribution of vote counts is heavily right-skewed
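
Besides rounding, one possible contributor to the discrete pattern is that a recipe with only a few votes can reach only a few distinct averages. A minimal sketch, assuming votes are whole numbers from 1 to 5 (the scale is an assumption; the data only show averages between 1 and 5), with a hypothetical helper possible_averages:

```python
def possible_averages(n_votes, lowest=1, highest=5):
    """Distinct averages reachable with n_votes whole-number votes.

    The 1-5 scale is an assumption, not something stated in the data.
    """
    # Every total between lowest*n and highest*n is reachable with integer votes
    totals = range(lowest * n_votes, highest * n_votes + 1)
    return [round(total / float(n_votes), 2) for total in totals]


# With n votes there are only 4*n + 1 possible averages on a 1-5 scale
for n in (1, 2, 3):
    print(n, possible_averages(n))
```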

In [129]:
df.plot(kind='hist', y='votes_n', 
        bins=100, logy=True,
        legend=None, color='darkblue', alpha=0.7)


Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x134571d90>

Difficulty


In [144]:
# df.groupby(df['difficulty','year']).count()
df[['votes_n','difficulty','year']].groupby(['year','difficulty']).count().unstack().plot(kind='bar', stacked=True)


Out[144]:
<matplotlib.axes._subplots.AxesSubplot at 0x134245350>
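
The same year-by-difficulty table can be built with pd.crosstab, which also makes it easy to look at shares instead of raw counts. A minimal sketch on hypothetical toy rows (the difficulty labels are taken from the data shown earlier, the combinations are made up):

```python
import pandas as pd

# Hypothetical toy rows; difficulty labels as they appear in the data
toy = pd.DataFrame({
    "year": [2006, 2006, 2006, 2007, 2007],
    "difficulty": ["simpel", "normal", "simpel", "pfiffig", "simpel"],
})

# Rows are years, columns are difficulty levels; empty combinations appear as 0
counts = pd.crosstab(toy["year"], toy["difficulty"])

# Divide each row by its total to get the difficulty mix per year
shares = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(shares)
```

Plotting shares with kind='bar' and stacked=True would show how the difficulty mix shifts over time independently of the number of recipes added.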
