In [101]:
from dds_lab import *
from collections import Counter
import pandas as pd  # imported explicitly, since the cells below refer to pd
Next, we read in a CSV (comma-separated values) file that is stored on the server. This is the week-by-week menu from JMCC. (You can also look at the same data via this GitHub file.)
In [102]:
menu_csv = pd.read_csv("../data/uoe_catering/JMCC_Student_Menu_2015-2016.csv")
In [103]:
menu_csv
Out[103]:
You'll notice lots of cells containing 'NaN'. This stands for Not a Number, and it is the marker pandas uses for missing values. It arises here because the menu file has lots of blanks in it: the file was designed for visual inspection rather than for any kind of processing.
We'll focus on the column that is currently labeled 'Unnamed: 4'.
In [104]:
food = menu_csv['Unnamed: 4']
Next, we use the function dropna to get rid of the NaN values, and also convert the result to a regular Python list. The notation [1:] just says that we want to drop the first item in the list.
In [105]:
food = food.dropna().tolist()[1:]
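To see each step of that one-liner separately, here is a minimal sketch on a toy column. The values are made up for illustration (they are not taken from the JMCC file), and the first entry is assumed to be a label we don't want to keep.

```python
import pandas as pd

# Hypothetical stand-in for the 'Unnamed: 4' column: an unwanted first entry,
# some blanks (NaN), and two menu items.
column = pd.Series(["MAIN COURSE", float("nan"), "BAKED POTATO",
                    float("nan"), "TOPPED WITH CHEESE"])

without_blanks = column.dropna()     # Series with the NaN entries removed
as_list = without_blanks.tolist()    # plain Python list of strings
items = as_list[1:]                  # everything except the first entry

print(items)
```

Chaining the three calls, as in the cell above, gives the same result without the intermediate names.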
Let's look at the first 10 items in the list:
In [106]:
food[:10]
Out[106]:
So this is beginning to look more interesting. One simple thing we might want to ask is: what are the most frequent words in the menu items?
We will use the split() method to convert a string like 'TOPPED WITH CHEESE' into a list of words by splitting on whitespace (a run of several spaces counts as a single separator).
In [107]:
'TOPPED WITH CHEESE'.split()
Out[107]:
And since allcaps is UGLY, we can convert words into lower case with the lower() method.
In [108]:
'TOPPED'.lower()
Out[108]:
In [109]:
food_items = [word.lower() for item in food for word in item.split()]
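The list comprehension in the previous cell packs two loops into one line. As a rough sketch, here is the same logic written out with explicit loops, using a made-up two-item menu rather than the real data:

```python
# Toy menu entries, invented for illustration.
food = ["TOPPED WITH CHEESE", "BAKED POTATO"]

# Explicit-loop version of the comprehension.
food_items = []
for item in food:                  # each menu entry, e.g. 'TOPPED WITH CHEESE'
    for word in item.split():      # each whitespace-separated word in that entry
        food_items.append(word.lower())

# The one-line comprehension builds the same list.
same = [word.lower() for item in food for word in item.split()]

print(food_items)
```

Note that the order of the for clauses in the comprehension matches the order of the nested loops, outermost first.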
Let's look at the first 10 items in our new list.
In [110]:
food_items[:10]
Out[110]:
It would be nice to count how many occurrences of each word appear in this list. Python has a convenient way of getting frequency counts using a Counter. It works like this:
In [111]:
c = Counter(food_items)
c.most_common(20)
Out[111]:
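If Counter is new to you, it may help to try it on a list small enough to check by hand. The words below are invented for this sketch:

```python
from collections import Counter

# A tiny hand-made word list.
words = ["potato", "with", "cheese", "potato", "and", "potato", "cheese"]

c = Counter(words)

print(c["potato"])        # how often a single word occurs
print(c.most_common(2))   # the two most frequent (word, count) pairs
```

A Counter behaves like a dictionary from items to counts, and most_common(n) returns the n highest counts as a list of pairs.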
Whoops, that's not so good. We don't care about boring words like 'with' and 'and'. So let's get rid of them. First, we make a list of words to ignore, which we call boring. Next, we redefine food_items. You can read [item for item in food_items if item not in boring] as saying: "construct the list of all items in food_items which are not in the boring list". Then we count again.
In [112]:
boring = ['a','all','with','on','of','and','item','menu','&','chefs','counters','bar','mixed','choice','assorted','selection']
food_items = [item for item in food_items if item not in boring]
c = Counter(food_items)
c.most_common(20)
Out[112]:
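One possible variation, not used in this notebook: the words to ignore can be stored in a set instead of a list. The filtering reads exactly the same, and membership tests on a set stay fast even if the ignore list grows long. A minimal sketch on made-up data:

```python
from collections import Counter

# Words to ignore, stored as a set (an alternative to the list used above).
boring = {"with", "and", "of"}

# Toy word list, invented for illustration.
food_items = ["potato", "with", "cheese", "and", "potato"]

kept = [item for item in food_items if item not in boring]
c = Counter(kept)

print(kept)
print(c.most_common(1))
```

For a list of sixteen words the difference is negligible, so this is a matter of taste here.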
Surprise! Potatoes are really popular with the JMCC chefs.
If we want, we can convert this frequency list back into a DataFrame, using the following commands.
In [127]:
df = pd.DataFrame.from_dict(c, orient='index')
df = df.rename(columns={0:'count'})
df
Out[127]:
And we can get a nice looking table by sorting and taking the top 20 items.
In [133]:
df_sorted = df.sort_values(by='count', ascending=False).head(20)
df_sorted
Out[133]:
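To make the last two steps easier to follow, here is the same conversion and sort on a three-word Counter with invented counts. With orient='index', the words become the row labels and the counts land in a single column named 0, which is why the rename is needed:

```python
import pandas as pd
from collections import Counter

# Toy counts, made up for illustration.
c = Counter({"rice": 1, "potato": 3, "cheese": 2})

# Words become the index; the single unnamed column 0 is renamed to 'count'.
df = pd.DataFrame.from_dict(c, orient="index").rename(columns={0: "count"})

# Largest counts first, keeping only the top two rows.
top = df.sort_values(by="count", ascending=False).head(2)

print(top)
```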
In [139]:
%matplotlib inline
df_sorted.plot(kind='bar')
Out[139]:
Finally, we can write this DataFrame back to a CSV file, which we'll call food_items.csv.
In [119]:
df.to_csv('food_items.csv')
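As a quick check on what to_csv produces, the sketch below writes a toy DataFrame to an in-memory buffer instead of a file and reads it back. The data is invented, and the use of io.StringIO is just a convenience for this example. Because to_csv writes the index (the words) as the first column, we pass index_col=0 when reading so that the words become the index again.

```python
import io
import pandas as pd

# Toy frequency table, made up for illustration.
df = pd.DataFrame({"count": [3, 2]}, index=["potato", "cheese"])

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer)
text = buffer.getvalue()
print(text)

# Read it back, treating the first column as the index.
restored = pd.read_csv(io.StringIO(text), index_col=0)
print(restored)
```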