Chefkoch Link Data: Exploratory Analysis

This is a first look at the chefkoch link data, using plain pandas (without getting fancy about graph databases).

Note: these analyses currently use a subset of approx. 30k recipes (29,978 rows, out of a total of roughly 800k).

Setup: Libraries, data import and formatting



In [14]:

    
# Libraries and graphics settings
import pandas as pd
import re
import statsmodels as sm


from ggplot import *
theme_bw()

import matplotlib.pyplot as mpl
mpl.style.use('ggplot')
%matplotlib inline

from pylab import rcParams
rcParams['figure.figsize'] = 10, 5



In [2]:

    
# Read data
df = pd.read_csv("/Users/Leon/Documents/02_Research_Learning/Research/Recipes/03_Data/link_data.csv",
                index_col=0)



In [3]:

    
# Fix data formats
df['activationdate'] = pd.to_datetime(df['activationdate']) # Well that was easy!
df['difficulty'] = df['difficulty'].astype('category')

# Fix data format for preparation time: "5 min." -> 5.0
df['preptime'] = df['preptime'].astype(str)
df['prep_mins'] = df['preptime'].apply(lambda x: x.replace(" min.", "")).astype('float64')

# Fix data format for strings
# Note: pandas stores Python strings with dtype 'object', so df.dtypes
# still reports 'object' for these columns after the cast
df['subtitle'] = df['subtitle'].astype('str')
df['title'] = df['title'].astype('str')
# df.dtypes

# Define additional variables for year-month and year (for plotting)
# yearmonth is encoded as YYYY0MM, e.g. 2011003 for March 2011
df['yearmonth'] = df['activationdate'].map(lambda x: x.year*1000 + x.month)
df['year'] = df['activationdate'].map(lambda x: x.year)

# Add one to vote count for plotting with log axis
df['votes_n_plus1'] = df['votes_n'] + 1
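
The reason for the plus-one offset can be sketched on a few illustrative values (0 and 2982 are the minimum and maximum of votes_n in this subset, the other values are arbitrary). This is only a minimal illustration, not part of the original analysis: a recipe with 0 votes has no position on a log axis, while the shifted count maps it to 0.

```python
import numpy as np
import pandas as pd

# Illustrative vote counts, not the real data
votes_n = pd.Series([0, 1, 15, 2982])

# log10(0) is -inf, so recipes without votes would drop off a log axis
with np.errstate(divide='ignore'):
    raw_log = np.log10(votes_n)

# After adding one, a recipe with 0 votes sits at log10(1) = 0
shifted_log = np.log10(votes_n + 1)

print(pd.DataFrame({'votes_n': votes_n, 'raw_log': raw_log, 'shifted_log': shifted_log}))
```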

Basic descriptives



In [4]:

    
df.shape









    Out[4]:





(29978, 14)



In [12]:

    
df.head(5)









    Out[12]:






  
    
      
                  activationdate  category   category_list_page                        difficulty  preptime
1857691301034784  2011-03-25      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  simpel      5 min.
537901150889686   2006-06-21      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  pfiffig     40 min.
1341731239030939  2009-08-04      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  simpel      20 min.
541961151505565   2006-06-28      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  normal      30 min.
731571175843661   2007-06-04      Getraenke  www.chefkoch.de/rs/s0g102/Getraenke.html  normal      60 min.

                  subtitle                                           title
1857691301034784  erwärmt und mit Sahnehaube ein besonderer Genuß    Bratapfellikör
537901150889686   nach Original sardischem Nonna-Rezept              Crema di Limoncello
1341731239030939  Die Erdbeeren putzen und zusammen mit dem Zuck...  Erdbeer - Limes
541961151505565   der absolut weltbeste, leckerste, dickflüssigs...  Eierlikör nach DDR-Tradition
731571175843661   Pflaumen (entsteint) im Entsafter 2 Stunden ko...  Pflaumenlikör

                  url                                                votes_avg  votes_n  prep_mins  yearmonth  year  votes_n_plus1
1857691301034784  www.chefkoch.de/rezepte/1857691301034784/Brata...  4.65       15       5.0        2011003    2011  16
537901150889686   www.chefkoch.de/rezepte/537901150889686/Crema-...  4.64       65       40.0       2006006    2006  66
1341731239030939  www.chefkoch.de/rezepte/1341731239030939/Erdbe...  4.65       79       20.0       2009008    2009  80
541961151505565   www.chefkoch.de/rezepte/541961151505565/Eierli...  4.81       1085     30.0       2006006    2006  1086
731571175843661   www.chefkoch.de/rezepte/731571175843661/Pflaum...  4.67       40       60.0       2007006    2007  41



In [5]:

    
df.columns









    Out[5]:





Index([u'activationdate', u'category', u'category_list_page', u'difficulty',
       u'preptime', u'subtitle', u'title', u'url', u'votes_avg', u'votes_n',
       u'prep_mins', u'yearmonth', u'year', u'votes_n_plus1'],
      dtype='object')



In [6]:

    
df.describe()









    Out[6]:






  
    
      
       votes_avg     votes_n       prep_mins     yearmonth     year          votes_n_plus1
count  29978.000000  29978.000000  29978.000000  2.997800e+04  29978.000000  29978.000000
mean   3.075588      4.411735      19.716926     2.009556e+06  2009.549236   5.411735
std    0.479588      28.596995     27.003455     3.748080e+03  3.748266      28.596995
min    1.000000      0.000000      1.000000      2.000010e+06  2000.000000   1.000000
25%    3.000000      1.000000      10.000000     2.007005e+06  2007.000000   2.000000
50%    3.000000      1.000000      15.000000     2.009010e+06  2009.000000   2.000000
75%    3.000000      2.000000      30.000000     2.013001e+06  2013.000000   3.000000
max    4.840000      2982.000000   2400.000000   2.016012e+06  2016.000000   2983.000000



In [7]:

    
df[['category','votes_n']].groupby('category').count()









    Out[7]:






  
    
      
           votes_n
category
Getraenke  12788
Menueart   17190

First descriptive analyses

Number of recipes over time



In [10]:

    
small = df[['year', 'votes_avg']]
small.groupby('year').count().plot(kind='bar', color='darkblue', title="Number of recipes added each year", legend=None)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x1158671d0>

In the subsample of the data parsed so far, the number of recipes added per year grew strongly from 2003 to 2008 and then declined, apart from a notable jump in 2015.
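
To read the growth and decline off numbers rather than bar heights, the yearly counts can be differenced. The following is a minimal sketch on made-up years; the helper name recipes_per_year and the example values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def recipes_per_year(years):
    """Count recipes per year and the change relative to the previous year."""
    counts = pd.Series(years).value_counts().sort_index()
    return pd.DataFrame({'n_recipes': counts, 'change': counts.diff()})

# Made-up activation years, one entry per recipe
example_years = [2003, 2004, 2004, 2005, 2005, 2005, 2006]
summary = recipes_per_year(example_years)
print(summary)
```

On the real data, the same helper could be called as recipes_per_year(df['year']).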



In [11]:

    
cumsum = small.groupby('year').count().cumsum()  # groupby already sorts by year
cumsum.plot(kind='bar', color='darkblue', title='Total number of recipes on the platform', legend=None)









    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0x11632b210>

Recipe Scores



In [125]:

    
df.plot(kind='hist', y='votes_avg',
        color='darkblue', alpha=0.7, legend=None,
        title='Distribution of recipe scores')









    Out[125]:





<matplotlib.axes._subplots.AxesSubplot at 0x132811410>

Ratings for recipes are disproportionately clustered around 3: the 25th, 50th and 75th percentiles of votes_avg in the summary table above are all exactly 3.0.
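
One way to quantify this clustering is the share of recipes whose average rating is exactly 3. The sketch below uses a hypothetical helper, share_at, and made-up scores; it only illustrates the calculation and does not report a result for the real data.

```python
import pandas as pd

def share_at(scores, value=3.0):
    """Fraction of entries in `scores` that are exactly equal to `value`."""
    scores = pd.Series(scores, dtype='float64')
    return float((scores == value).mean())

# Made-up average ratings
example_scores = [3.0, 3.0, 3.0, 4.65, 2.5]
share = share_at(example_scores)
print("share of recipes rated exactly 3: %.2f" % share)
```

On the real data this would be share_at(df['votes_avg']).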



In [55]:

    
small.boxplot(by='year')









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x12b14b6d0>

A few things are immediately visible from the boxplot:

Average ratings are strongly bunched around 3
Average ratings are substantially lower for recipes added later
Most of the recipes added in 2007-2012 had ratings around 3
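
The same comparison across years can be made numerically with a grouped summary. This is a minimal sketch on a made-up frame that reuses the column names year and votes_avg; the values are illustrative only.

```python
import pandas as pd

# Made-up ratings for two years
toy = pd.DataFrame({
    'year': [2006, 2006, 2006, 2012, 2012, 2012],
    'votes_avg': [4.64, 4.81, 3.0, 3.0, 3.0, 2.5],
})

# Median, mean and number of recipes per year
by_year = toy.groupby('year')['votes_avg'].agg(['median', 'mean', 'count'])
print(by_year)
```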



In [13]:

    
df.plot(style=".", x='activationdate', y='votes_avg', legend=None, color='darkblue', alpha=0.1)









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x115766d10>



In [ ]:

    
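# lowess = sm.nonparametric.lowess(df[['votes_avg']], df[['activationdate']], frac=.3)
# OLS of average rating on time. A datetime column is not a numeric regressor,
# so the date is converted to days since the first recipe (an assumption of this fix)
import statsmodels.formula.api as smf
df['days_since_first'] = (df['activationdate'] - df['activationdate'].min()).dt.days
mod = smf.ols(formula='votes_avg ~ days_since_first', data=df)
res = mod.fit()
print(res.summary())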
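
As a dependency-light cross-check on a linear time trend like the one in the OLS cell above, a straight line can also be fitted with numpy's polyfit. This swaps in plain least squares without any of the statsmodels diagnostics, and the data below are hypothetical, constructed so that the true slope is known.

```python
import numpy as np

# Hypothetical data: ratings that fall by exactly 0.1 per year from 2006 on
years = np.array([2006.0, 2008.0, 2010.0, 2012.0, 2014.0])
ratings = 4.0 - 0.1 * (years - 2006.0)

# Shift the time axis to start at 0 so the intercept is the fitted first-year rating
slope, intercept = np.polyfit(years - years.min(), ratings, deg=1)
print("change in rating per year: %.3f" % slope)
```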



In [115]:

    
df.plot(kind='scatter', x='votes_n_plus1', y='votes_avg', 
        color='darkblue', s=20, alpha=0.5, 
        title='Count and average of votes for recipes', logx=True).set_ylim([1,5])









    Out[115]:





(1, 5)

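Notes on the plot above:

The x-axis (vote count plus one) uses a log scale
For recipes with few votes, one can clearly see the patterns of discrete rating averages (from rounding)
The distribution of vote counts is very skewed: the median recipe has 1 vote and the 75th percentile is 2, while the maximum is 2982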
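
The skew can be summarised by comparing the mean with the median and by the share of recipes with very few votes. The sketch below runs on illustrative vote counts, not the real data, and the threshold of 2 votes is an arbitrary choice.

```python
import pandas as pd

# Illustrative vote counts: many small values and one very large one
votes = pd.Series([0, 1, 1, 1, 2, 2, 15, 65, 1085])

median_votes = votes.median()
mean_votes = votes.mean()
share_few = (votes <= 2).mean()  # share of recipes with at most 2 votes

print("median: %.1f, mean: %.1f, share with <= 2 votes: %.2f"
      % (median_votes, mean_votes, share_few))
```

A mean far above the median, as here, is the typical signature of a right-skewed distribution.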



In [129]:

    
df.plot(kind='hist', y='votes_n', 
        bins=100, logy=True,
        legend=None, color='darkblue', alpha=0.7)









    Out[129]:





<matplotlib.axes._subplots.AxesSubplot at 0x134571d90>

Difficulty



In [144]:

    
# df.groupby(df['difficulty','year']).count()
df[['votes_n','difficulty','year']].groupby(['year','difficulty']).count().unstack().plot(kind='bar', stacked=True)









    Out[144]:





<matplotlib.axes._subplots.AxesSubplot at 0x134245350>
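
Because the number of recipes differs a lot between years, shares per year may be easier to compare than raw counts. A minimal sketch with pandas' crosstab follows, on a made-up frame that reuses the difficulty labels seen in the data (simpel, normal, pfiffig); it is an illustration and not part of the original analysis.

```python
import pandas as pd

# Made-up recipes with year and difficulty
toy = pd.DataFrame({
    'year': [2006, 2006, 2006, 2007, 2007, 2011],
    'difficulty': ['pfiffig', 'normal', 'simpel', 'normal', 'simpel', 'simpel'],
})

# normalize='index' turns each row (year) into shares that sum to 1
shares = pd.crosstab(toy['year'], toy['difficulty'], normalize='index')
print(shares.round(2))
```

Calling shares.plot(kind='bar', stacked=True) on the result would give a stacked bar chart comparable to the one above, with every bar scaled to 1.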


