USDA Food Data - Preliminary Analysis

USDA Food Data is obtained from a consolidated dataset published by the Open Food Facts organization (https://world.openfoodfacts.org/) and made available on the Kaggle website (https://www.kaggle.com/openfoodfacts/world-food-facts).

Open Food Facts is a free, open, collbarative database of food products from around the world, with ingredients, allergens, nutrition facts and all the tidbits of information we can find on product labels (source: ://www.kaggle.com/openfoodfacts/world-food-facts).

Link to the available data can be found here - https://www.kaggle.com/openfoodfacts/world-food-facts/downloads/en.openfoodfacts.org.products.tsv

For the purpose of our analysis we will only be looking at USDA data and not data sourced from other countries since the USDA data appears to be the dataset that is well populated with values.

Loading the data



In [3]:

    
# load pre-requisite imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from gensim import corpora, models, similarities



In [101]:

    
# load world food data into a pandas dataframe
world_food_facts =pd.read_csv("../w209finalproject_data/data/en.openfoodfacts.org.products.tsv", sep='\t',low_memory=False)

# extract USDA data from world data
usda_import = world_food_facts[world_food_facts.creator=="usda-ndb-import"]

# save the usda data to a csv file
usda_import.to_csv("../w209finalproject_data/data/usda_imports_v2.csv")

Preliminary look at the USDA data



In [ ]:

    
# Examining available fields
print("Number of records:",len(usda_import))
print("Number of columns:",len(list(usda_import)))

print("\nField Names:")
list(usda_import)

len(usda_import)

Quick look at a few of the rows

Each row contains fields that specify the value for a given nutrient. Note that only those fields with valid values are populated. The others are empty.



In [93]:

    
usda_import_subset = usda_import.head(1)
print "Code:",usda_import_subset['code'][1]
print "Product Name:",usda_import_subset['product_name'][1]
print "Ingredients:",usda_import_subset['ingredients_text'][1]
print "Sugar 100g",usda_import_subset['sugars_100g'][1]
print "Vitamin A 100g",usda_import_subset['vitamin-a_100g'][1]









    



Code: 0000000004530
Product Name: Banana Chips Sweetened (Whole)
Ingredients: Bananas, vegetable oil (coconut oil, corn oil and/or palm oil) sugar, natural banana flavor.
Sugar 100g 14.29
Vitamin A 100g 0.0

Quick look at ingredients

Ingredients are not broken down similar to nutrients into separate fields. Rather, all ingredients are grouped together into a single line of text.



In [8]:

    
usda_import['ingredients_text'].head(5)









    Out[8]:





1    Bananas, vegetable oil (coconut oil, corn oil ...
2    Peanuts, wheat flour, sugar, rice flour, tapio...
3    Organic hazelnuts, organic cashews, organic wa...
4                                      Organic polenta
5    Rolled oats, grape concentrate, expeller press...
Name: ingredients_text, dtype: object

In this step, we convert the ingredients text into a format that can be vectorized.



In [9]:

    
# Extracting ingredients for a particular product
pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 600)

print "Vectorizable ingredients text"

for x in range(3):
    ingredients = re.split(',|\(|\)',usda_import['ingredients_text'].iloc[x])
    ingredients = [w.strip().replace(' ','-') for w in ingredients]
    print(' '.join(ingredients))









    



Vectorizable ingredients text
Bananas vegetable-oil coconut-oil corn-oil-and/or-palm-oil sugar natural-banana-flavor.
Peanuts wheat-flour sugar rice-flour tapioca-starch salt leavening ammonium-bicarbonate baking-soda  soy-sauce water soybeans wheat salt  potato-starch.
Organic-hazelnuts organic-cashews organic-walnuts-almonds organic-sunflower-oil sea-salt.

Cleaning up the dataset

We now look at the available data in the dataset and look for possible issues with the data that could impact our analysis.

Notice that several entries are not full populated with all available nutrition.

Going by the results, we can limit the categories that we use for the analysis to just those over 100,000 values to ensure that we avoid having to work with columns that are not sufficiently populated.



In [10]:

    
# Looking for columns that are not sufficiently populated

# display count of all rows
print("Total rows in USDA dataset are:",len(usda_import))

# display count of all non-NAN entries in each column
print("\nCount of non-NaN values in each column")

print(usda_import.count().sort_values(ascending=False))









    



('Total rows in USDA dataset are:', 169868)

Count of non-NaN values in each column
code                                          169868
states_en                                     169868
countries                                     169868
url                                           169868
creator                                       169868
created_t                                     169868
created_datetime                              169868
states_tags                                   169868
last_modified_t                               169868
last_modified_datetime                        169868
countries_tags                                169868
countries_en                                  169868
states                                        169868
additives_n                                   169867
ingredients_text                              169867
ingredients_from_palm_oil_n                   169867
ingredients_that_may_be_from_palm_oil_n       169867
serving_size                                  169866
additives                                     169862
product_name                                  169823
carbohydrates_100g                            169594
energy_100g                                   169470
fat_100g                                      169468
sodium_100g                                   169074
salt_100g                                     169074
brands                                        169012
brands_tags                                   169012
proteins_100g                                 168974
sugars_100g                                   158497
saturated-fat_100g                            143925
cholesterol_100g                              141744
trans-fat_100g                                141117
fiber_100g                                    138463
nutrition_grade_fr                            137619
nutrition-score-uk_100g                       137619
nutrition-score-fr_100g                       137619
vitamin-c_100g                                137333
iron_100g                                     137182
calcium_100g                                  136010
vitamin-a_100g                                135007
additives_en                                  110783
additives_tags                                110783
potassium_100g                                 23872
monounsaturated-fat_100g                       20485
polyunsaturated-fat_100g                       20482
vitamin-pp_100g                                10476
vitamin-b1_100g                                 9673
vitamin-b2_100g                                 9593
vitamin-d_100g                                  5929
vitamin-b6_100g                                 5543
phosphorus_100g                                 4898
magnesium_100g                                  4883
vitamin-b12_100g                                4223
ingredients_that_may_be_from_palm_oil_tags      4060
vitamin-b9_100g                                 4030
zinc_100g                                       3415
folates_100g                                    3032
copper_100g                                     1929
pantothenic-acid_100g                           1761
manganese_100g                                  1460
selenium_100g                                    977
vitamin-k_100g                                   744
pnns_groups_2                                     70
pnns_groups_1                                     70
-lactose_100g                                     33
categories_tags                                   32
categories_en                                     32
main_category                                     32
categories                                        32
main_category_en                                  32
caffeine_100g                                     28
quantity                                          28
packaging                                         15
packaging_tags                                    15
purchase_places                                    6
labels_en                                          5
generic_name                                       5
manufacturing_places                               5
labels                                             5
manufacturing_places_tags                          5
labels_tags                                        5
traces_en                                          5
traces                                             5
traces_tags                                        5
stores                                             4
origins_tags                                       3
origins                                            3
allergens                                          3
fruits-vegetables-nuts_100g                        2
ingredients_from_palm_oil_tags                     2
-oleic-acid_100g                                   1
cities                                             0
first_packaging_code_geo                           0
emb_codes_tags                                     0
ingredients_that_may_be_from_palm_oil              0
allergens_en                                       0
ingredients_from_palm_oil                          0
no_nutriments                                      0
emb_codes                                          0
cities_tags                                        0
nutrition_grade_uk                                 0
water-hardness_100g                                0
image_url                                          0
image_small_url                                    0
-sucrose_100g                                      0
-glucose_100g                                      0
-fructose_100g                                     0
-maltose_100g                                      0
-maltodextrins_100g                                0
starch_100g                                        0
polyols_100g                                       0
casein_100g                                        0
serum-proteins_100g                                0
nucleotides_100g                                   0
alcohol_100g                                       0
beta-carotene_100g                                 0
vitamin-e_100g                                     0
biotin_100g                                        0
silica_100g                                        0
bicarbonate_100g                                   0
chloride_100g                                      0
fluoride_100g                                      0
chromium_100g                                      0
molybdenum_100g                                    0
iodine_100g                                        0
taurine_100g                                       0
ph_100g                                            0
collagen-meat-protein-ratio_100g                   0
cocoa_100g                                         0
chlorophyl_100g                                    0
carbon-footprint_100g                              0
-nervonic-acid_100g                                0
-erucic-acid_100g                                  0
-mead-acid_100g                                    0
-cerotic-acid_100g                                 0
energy-from-fat_100g                               0
-butyric-acid_100g                                 0
-caproic-acid_100g                                 0
-caprylic-acid_100g                                0
-capric-acid_100g                                  0
-lauric-acid_100g                                  0
-myristic-acid_100g                                0
-palmitic-acid_100g                                0
-stearic-acid_100g                                 0
-arachidic-acid_100g                               0
-behenic-acid_100g                                 0
-lignoceric-acid_100g                              0
-montanic-acid_100g                                0
-gondoic-acid_100g                                 0
glycemic-index_100g                                0
omega-3-fat_100g                                   0
-alpha-linolenic-acid_100g                         0
-eicosapentaenoic-acid_100g                        0
-docosahexaenoic-acid_100g                         0
omega-6-fat_100g                                   0
-linoleic-acid_100g                                0
-arachidonic-acid_100g                             0
-gamma-linolenic-acid_100g                         0
-dihomo-gamma-linolenic-acid_100g                  0
omega-9-fat_100g                                   0
-elaidic-acid_100g                                 0
-melissic-acid_100g                                0
dtype: int64

Looking for similar products based on ingredients

This section attempts to use item similarity to look for similar products based on ingredients present. We vectorize all ingredients and use the resulting vector to look for similar items.



In [68]:

    
# load the subsample USDA data
#usda_sample_data =pd.read_csv("./data/usda_imports_20k.csv", sep=',',low_memory=False)
#usda_sample_data =pd.read_csv("./data/usda_imports_v2_1000_hdr.csv", sep=',',low_memory=False)
usda_sample_data =pd.read_csv("./data/usda_imports_v2.csv", sep=',',low_memory=False)



In [69]:

    
# add a new column that includes a modified version of ingredients list that can be vectorized
ingredients_list=[]

index = 0
for x in range(len(usda_sample_data)):
    str_to_split = usda_import['ingredients_text'].iloc[x]

    try:
        ingredients = re.split(',|\(|\)|\[|\]',str_to_split)
    except:
        ingredients = re.split(',|\(|\)|\[|\]',"None")
        
    ingredients = [w.strip().replace(' ','-') for w in ingredients]
    ingredients_str = ' '.join(ingredients)
    
    ingredients_list.append(ingredients_str)
    
    index+=1 
    
# add the new column to the dataframe
usda_sample_data['ingredients_list'] = ingredients_list

print(usda_sample_data['ingredients_list'])









    



0         Bananas vegetable-oil coconut-oil corn-oil-and...
1         Peanuts wheat-flour sugar rice-flour tapioca-s...
2         Organic-hazelnuts organic-cashews organic-waln...
3                                           Organic-polenta
4         Rolled-oats grape-concentrate expeller-pressed...
5                             Organic-long-grain-white-rice
6         Org-oats org-hemp-granola org-oats evaporated-...
7         Organic-chocolate-liquor organic-raw-cane-suga...
8         Organic-expeller-pressed refined-high-oleic-su...
9                                      Organic-adzuki-beans
10               Organic-refined-durum-semolina-wheat-flour
11        Roasted-peanuts peanuts peanut-or-canola-oil s...
12                                Organic-golden-flax-seeds
13        Organic-dry-roasted-pumpkin-seeds tamari soybe...
14        Organic-rolled-oats honey raisins almonds sunf...
15                                   Organic-raw-hazelnuts.
16        Organic-bananas organic-coconut-oil organic-sugar
17                               Organic-brown-jasmine-rice
18                                       Organic-oat-groats
19        Yogurt-raisins tamari-roasted-almonds organic-...
20        Chocolate-stars dehydrated-cane-juice sweetene...
21        Organic-rolled-oats organic-evaporated-cane-ju...
22        Dry-roasted-almonds hatch-green-chile-seasonin...
23        Peanut-butter dry-roasted-peanuts palm-oil sal...
24             Ancient-sea-salt-with-natural-trace-minerals
25        Organic-whole-rolled-oats organic-expeller-pre...
26            Whole-cashews sunflower-oil salt black-pepper
27        Cashews curry-seasoning salt maltodextrin spic...
28        Almonds wasabi-spice salt rice-flour evaporate...
29                                          Organic-coconut
30                                       Organic-red-quinoa
31        Coconut-bar coconut brown-rice-syrup  dark-cho...
32        Organic-semi-sweet-chocolate-chips organic-sug...
33        Organic-whole-rolled-oates organic-raisins org...
34        Organic-rolled-oats organic-evaporated-cane-ju...
35                                 Raw-cane-demerara-sugar.
36        Organic-rolled-oats flame-raisins organic-coco...
37                                         Organic-coconut.
38                                      Organic-black-beans
39                 Bluebird-grain-organic-grain-emmer-farro
40                          Organic-hard-red-wheat-berries.
41        Organic-rolled-oats organic-brown-rice-syrup f...
42        Dark-chocolate-coating unsweetened-chocolate e...
43        Organic-rolled-oats organic-rolled-rye organic...
44                                 Organic-black-chia-seeds
45                             Organic-french-green-lentils
46                                   Organic-garbanzo-beans
47                                 Organic-green-split-peas
48                High-fiber low-fat vegetable-nourishment.
49                               Organic-small-white-beans.
50                               Organic-grey-green-lentils
51                                Organic-yellow-split-peas
52                                       Organic-mung-beans
53                                  Organic-baby-lima-beans
54                             Organic-dark-red-kidney-bean
55        Whole rolled-oats vegetable-oil canola-and/or-...
56        Whole-rolled-oats milled-cane-sugar vegetable-...
57        Whole-rolled-oats milled-cane-sugar vegetable-...
58        Whole-rolled-oats raisins raisins vegetable-gl...
59        Rolled-oats dates may-contain-date-pits  sunfl...
60        Organic-semolina-flour organic-spinach-powder ...
61        Egg-pasta refined-durum-semolina-wheat-flour p...
62        Refined-enriched-durum-semolina-wheat thiamine...
63               Organic-refined-durum-semolina-wheat-flour
64        Refined enriched-durum-semolina-wheat-flour th...
65                             Organic-refined-spelt-flour.
66        Wheat rye triticale oat corn barley soy-bean b...
67        New-zealand-sea-salt-with-natural-trace-elements.
68                           Organic-evaporated-cane-juice.
69        Organic-refined-wheat-flour niacin iron thiami...
70                           Organic-whole-brown-rice-flour
71                          Organic-whole-grain-spelt-flour
72                                 Organic-whole-rye-flour.
73        Whole-rolled-oats milled-cane-sugar vegetable-...
74                 Organic-whole-grain-hard-red-wheat-flour
75                          Organic-soft-white-wheat-flour.
76        Peanuts honey coating sucrose wheat-starch hon...
77                                         Organic-cashews.
78                           Organic-sesame-seed-with-hulls
79                              Organic-hulled-sesame-seeds
80        Almonds salt corn-starch hydrolyzed-corn-prote...
81        Spanish-peanuts expeller-pressed-high-monounsa...
82        Peanuts almonds cashews pecans sucrose honey w...
83                       Sunflower-kernels coconut-oil salt
84        Cashews almonds hazelnuts pecans brazil-nuts e...
85        Honey-roast-mixed-nuts peanuts almonds cashews...
86                                          Organic-walnuts
87                             Organic-raw-sunflower-seeds.
88        Organic-raisins organic-dry-roasted-peanuts or...
89                                 Us-grown-organic-pecans.
90                                     Dry-roasted-almonds.
91                                    Organic-pumpkin-seeds
92        100%-cold-pressed-unrefined-oil-from-the-first...
93                     Water bragg's-formulated-soy-protein
94              Organic-expeller-pressed-refined-canola-oil
95        Water organic-whole-soybeans organic-whole-whe...
96        Water organic-whole-soybeans salt-and-alcohol ...
97                                                    Maple
98        Precooked-lentils salt curry-spices-and-herbs ...
99        Precooked-green-split-peas carrots salt onion ...
100       Figs stone-ground-whole-wheat-flour honey pear...
101       Peanut-butter-chips evaported-cane-juice fract...
102       Peanut-butter-coating evaporated-cane-juice fr...
103       Organic-pumpkin-seeds organic-wheat-free-tamar...
104       Pretzels:-enriched-flour refined-wheat-flour m...
105                                    Organic-kamut-flakes
106                                             Rolled-oats
107                                       Quick-rolled-oats
108                                         Organic-bulgur.
109                        Organic-toasted-buckwheat-groats
110                                           Rolled-barley
111       Red-wheat-flakes white-wheat-flakes barley-fla...
112       Whole-grain-corn oats brown-rice soybeans oat-...
113       Rolled-oats rolled-wheat rolled-rye date-piece...
114                                  Organic-hulless-barley
115                                          Organic-millet
116                                         Organic-popcorn
117                        Organic-soft-white-wheat-berries
118                                 Organic-pearled-barley.
119                                      Organic-wheat-bran
120                            Organic-raw-buckwheat-groats
121                                        Organic-oat-bran
122                           Organic-thick-cut-rolled-oats
123                             Organic-regular-rolled-oats
124                               Organic-rolled-rye-flakes
125                      Steel-cut-are-whole-groats-sliced.
126                                           Spelt-berries
127                                             Rolled-oats
128       Unmilled-wehani japonica-black and-long-grain-...
129                                    Organic-arborio-rice
130                   Organic-golden-rose-medium-brown-rice
131       Brown-rice wehani-rice black-japonica-rice bla...
132                          Organic-short-grain-brown-rice
133                                Organic-sweet-brown-rice
134                     Organic-california-white-sushi-rice
135       Organic-basmati-rice organic-green-and-yellow-...
136                                       Organic-wild-rice
137             Water alcohol-and-vanilla-bean-extractives.
138                                Organic-brown-flax-seeds
139                        Cranberries sugar sunflower-oil.
140                                Apricots sulfur-dioxide.
141       Bananas vegetable-oil coconut-oil corn-oil and...
142       Whole-rolled-oats milled-cane-sugar whole-roll...
143       Dried-mango paprika sugar salt citric-acid and...
144       Milk-chocolate sugar cocoa-butter chocolate-li...
145       Peanuts wheat-flour sugar rice-flour tapioca-s...
146                            Organic-pistachios sea-salt.
147                                        Organic-apricots
148                                     Roasted-pistachios.
149                                  Organic-medjool-dates.
150                                     Organic-dried-plums
151                               Organic-deglet-noor-dates
152       Apple-juice-concentrate cranberries sunflower-oil
153                                  Organic-zante-currants
154       Organic-select-thompson-seedless-raisins and-o...
155             Organic-unrefined-extra-virgin-coconuts-oil
156                              Organic-brown-basmati-rice
157                              Organic-white-basmati-rice
158                      Organic-unrefined-mascobado-sugar.
159                            Organic-dry-roasted-peanuts.
160                                                 Almonds
161                                            Pecan-halves
162                                        Organic-cashews.
163                                    Organic-pearl-quinoa
164                                   Fresh-organic-carrots
165       Wheat-flour butter cream  water yeast sugar sa...
166       Enriched-wheat-flour wheat-flour malted-barley...
167       Enriched-wheat-flour wheat-flour-niacin reduce...
168       Enriched-wheat-flour niacin reduced-iron thiam...
169       Enriched-wheat-flour niacin reduced-iron thiam...
170       Enriched-flour bleached-wheat-flour niacin red...
171       Unbleached-enriched-flour wheat-flour barley-m...
172       Enriched-wheatflour wheatflour malted-barley-f...
173       Unbleached-wheat-flour sugar vegetable-fat-non...
174       Enriched-flour bleached-wheat-flour niacin red...
175       Wheatflour sugar eggs soybean-oil water chocol...
176       Enriched-wheat-flour wheat-flour malted-barley...
177       Enriched-wheatflour wheat-flour niacin reduced...
178       Apples sugar unbleached-enriched-wheat-flour n...
179       Enriched-flour bleached-wheat-flour niacin red...
180       Pastry:-unbleached-wheat-flour wheat-flour mal...
181       Pastry:-unbleached-wheat-flour wheat-flour mal...
182       Sugar wheat-flour wheat-flour malted-barley-fl...
183       Pastry-dough unbleached-enriched-wheat-flour w...
184       Wheat-flour butter water chocolate sugar cocoa...
185       Enriched-wheat-flour wheat-flour malted-barley...
186       Unbleached-enriched-flour wheat-flour malted-b...
187       Unbleached-wheat-flour vegetable-fat-non-hydro...
188       Enriched-wheatflour wheatflour malted-barley-f...
189       Apple-juice-from-concentrate 43%  water black-...
190                                      Chamomile-flowers.
191                                             Peppermint.
192                                         Linden-flowers.
193                                       Hibiscus-flowers.
194                    Tea cinnamon-&-natural-apple-flavor.
195                                              Green-tea.
196                                            Shave-grass.
197                                    Chamomile-spearmint.
198       Artichoke-malva-senna-leaf-hibiscus-chamomile-...
199       Andropogon-citratus uva-ursi hibiscus-flowers ...
200       Shave-grass-corn-silk-uva-ursi-juliana-astring...
201       Eucalyptus-licorice-ginger-elder-mullein-cinna...
202                                       100%-canola-oil-.
203       Canola-oil water palm-oil palm-kernel-oil salt...
204       Canola-oil water palm-oil flax-oil palm-kernel...
205       Rye-flour water wheat flour malt molasses suga...
206       Sugar cocoa-butter skimmed-milk-powder cocoa-m...
207       Sugar cocoa-butter skimmed-milk-powder cocoa-m...
208       Fortified-wheat-flour with-calcium iron niacin...
209       Flour wheat-flour calcium iron niacin thiamin ...
210       Organic-sprouted-wheat filtered-water organic-...
211       Black-chickpea-gram-flour salt raising-agent c...
212                                      100%-tea-&-passion
213       Pasteurized-milk cheese-cultures salt-and-enzy...
214       Enriched-bleached-flour wheat-flour malt-barle...
215       Enriched-bleached-flour wheat-flour malt-barle...
216       Glucose-fructose-syrup sugar cocoa-mass cocoa-...
217       Sugar wheat-starch glucose-syrup whole-soya-fl...
218       Banana-peppers onions high-fructose-corn-syrup...
219       Catsup vinegar brown-sugar apricots enzymes-pe...
220       Sorbitol gum-base maltitol-syrup xylitol dical...
221       Sorbitol gum-base maltitol-syrup xylitol natur...
222       Sugar cocoa-butter cream-powder 16%  corn 11% ...
223       Sugar cocoa-butter raisins cocoa-mass whole-mi...
224       Sugar cocoa-butter cocoa-mass whole-milk-powde...
225       Sugar almonds water sorbitol invert-sugar food...
226       Only-the-best:-un-bleached-flour organic-steel...
227       Only-the-best:-un-bleached-flour 100%-organic-...
228       Bart-&-judy's-proprietary-rice-flour-blend but...
229       Bart-&-judy's-proprietary-rice-flour-blend but...
230       Milk cream sugar corn-syrup guar-gum locust-be...
231       Milk cream sugar corn-syrup pecans pecans cott...
232       Wheat-flour salt 4 8%  palm-oil acidity-regula...
233       Sugar cocoa-butter whole-milk-powder williams-...
234       Milk-chocolate sugar milk cocoa-butter chocola...
235       Milk-chocolate sugar milk cocoa-butter chocola...
236       Milk-chocolate sugar milk cocoa-butter chocola...
237       Semi-sweet-chocolate: sugar chocolate-liquor-p...
238       White-chocolate: sugar cocoa-butter milk vanil...
239       Milk-chocolate sugar milk cocoa-butter chocola...
240       Milk-chocolate sugar milk cocoa-butter chocola...
241       Bittersweet-chocolate chocolate sugar cocoa-bu...
242       Milk-chocolate sugar milk cocoa-butter chocola...
243       Semi-sweet-chocolate: sugar chocolate-processe...
244       Milk-chocolate sugar milk cocoa-butter chocola...
245       Milk-chocolate sugar milk cocoa-butter chocola...
246       Milk-chocolate sugar milk cocoa-butter chocola...
247       Semi-sweet-chocolate: sugar chocolate-processe...
248       Semi-sweet-chocolate: sugar chocolate-processe...
249       Milk-chocolate sugar milk cocoa-butter chocola...
250       Milk-chocolate sugar milk cocoa-butter chocola...
251       Semi-sweet-chocolate sugar chocolate-processed...
252       Milk-chocolate sugar milk cocoa-butter chocola...
253       Mix-chocolate sugar milk cocoa-butter chocolat...
254       Gherkins cucumbers  water vinegar salt onions ...
255       Water xylitol citric-acid lactic-acid apple-ju...
256       Water glycerin citric-acid xylitol apple-juice...
257                                         Romaine-hearts.
258                                                Romaine.
259                                             Green-leaf.
260                                                Spinach.
261                                         Cooking-spinach
262                                       Brussels-sprouts.
263                                                  Celery
264                                 Kalettes kale-sprouts .
265       Sugar vegetable-oil coconut tree-nut  palm-ker...
266       Sugar vegetable-oil coconut tree-nut  palm-ker...
267       Sugar puffed-rice 12%  vegetable-oil-&-fat pal...
268         Fuji-apple-cider apple-cider pomegranate-juice.
269             Filtered-water lemon-juice pure-cane-sugar.
270                   Apple-cider natural-flavor-and-spice.
271                      Pink-lady-apple-cider apple-cider.
272                               100%-organic-apple-juice.
273                           Opal-apple-cider apple-cider.
274       Concentrated-apple-puree concentrated-apple-ju...
275       Concentrated-apple-puree concentrated-apple-ju...
276       Corn-syrup sugar water gelatin natural-&-artif...
277       Corn-syrup sugar water gelatin dextrose citric...
278       Corn-syrup sugar water gelatin dextrose citric...
279       Corn-syrup sugar water gelatin dextrose citric...
280       Corn-syrup sugar water gelatin dextrose citric...
281       Corn-syrup sugar water gelatin dextrose citric...
282       Corn-syrup sugar water gelatin dextrose citric...
283       Maltitol chocolate-liquor cocoa-butter inulin ...
284       Sugar assorted-nonpareils sugar corn-starch co...
285       Sugar corn-syrup modified-corn-starch citric-a...
286       Corn-syrup sugar water apple-juice-concentrate...
287                                            Fuji-apples.
288                                   Bananas strawberries.
289                                                 Mangos.
290                                                 Grapes.
291                  Fuji-apples ascorbic-acid citric-acid.
292                                   Bananas strawberries.
293                                                 Mangos.
294                                                 Grapes.
295        Cranberries sugar natural-flavors sunflower-oil.
296        Cranberries sugar natural-flavors sunflower-oil.
297        Cranberries sugar natural-flavors sunflower-oil.
298        Cranberries sugar natural-flavors sunflower-oil.
299       Dried-apricots natural-flavors sulfur-dioxide ...
                                ...                        
169568    Red-chilli-20% onion-12% garlic-12% galanga-10...
169569    Water red-curry-paste-22% red-chili lemongrass...
169570    Water green-curry-paste-3% fresh-green-chilli ...
169571                             100%-raw-buckwheat-honey
169572                                      Tamarind water 
169573    Water sugar pickled-chili red-chili salt aceti...
169574    Water sugar carrot pickled-chili red-chili sal...
169575    Tapioca-starch coconut-milk sugar black-sesame...
169576    Tapioca-starch coconut-milk sugar black-sesame...
169577                              Organic-mulberry-leaves
169578                Organic-brown-rice organic-moroheiya.
169579    Coconut-milk-powder coconut-extract maltodextr...
169580                             Pineapple vegetable-oil.
169581                                  Mango vegetable-oil
169582                               Broccoli vegetable-oil
169583                          Cherry-tomato vegetable-oil
169584                   Apple-50% strawberry-49% sugar-1%.
169585    Banana-89.18% strawberry-powder-9.00%-rice-bra...
169586    Rice palm-oil salt black-sesame crab-curry-sea...
169587    Rice palm-oil salt black-sesame sugar tom-yum-...
169588    Young-coconut sugar sriracha-sauce-seasoning s...
169589                           Glutinous-rice sweet-rice 
169590                               Tamarinde-98% salt-2%.
169591                                       Tamarind-100%.
169592                                         Coconut-milk
169593                    Pineapple water sugar citric-acid
169594                    Pineapple water sugar citric-acid
169595                    Pineapple water sugar citric-acid
169596    Green-chili lemongrass garlic salt shallot gal...
169597    Red-chili dried lemongrass garlic salt shallot...
169598    Red-chili dried lemongrass galangal garlic sal...
169599    Sugar shallot fish-sauce anchovy salt sugar pa...
169600                                          100%-durian
169601                                    100%-baby-octopus
169602    Cashew-nuts black-sesame-seeds raw-cane-sugar ...
169603                                Natural-coconut-water
169604    Beef water seasoning fructose soy-sauce wheat ...
169605    Potatoes vegetable-oil contains-one-or-more-of...
169606    Whole-grain-rolled-oat-flakes sugar whole-grai...
169607    Shrimp salt citric-acid sodium-citrate-and-sod...
169608    Enriched-flour wheat-flour niacin iron thiamin...
169609    Sugar water enriched-wheat-flour ferrous-sulfa...
169610    Beef water contains-2%-or-less-of-salt sorbito...
169611    Beef water contains-2%-or-less-of-salt sorbito...
169612    Fresh-coconut-milk 99.9%  stabilizers xanthan-...
169613    Turkey water onions mechanically-separated-tur...
169614    Chicken chicken-breast-with-rib-meat chicken-t...
169615    Whet-flour vegetables-oils palm soya-bean hydr...
169616    Water freshly-brewed-coffee milk sucrose emuls...
169617    Water aloe-vera-juice-and-aloe-vera-pulp-bits ...
169618    Corn whole-grain-wheat sugar whole-grain-rolle...
169619    Mahi-mahi fire-roasted-green-chiles potato-sta...
169620                                           Black-tea.
169621                       Orange-pekoe-and-cut-black-tea
169622                                            Milk-fat.
169623    Wheat-flour 70%  sugar vegetable-fat milk-and-...
169624    Wheat-flour 55%  sugar palm-oil butter skim-mi...
169625    Wheat-flour 56%  vegetable-oil palm-oil palm-k...
169626    Wheat-flour 33%  whole-wheat-flour 28%  palm-o...
169627    Wheat-flour 55%  sugar palm-oil butter 4%  ski...
169628                                 Hard-wheat-semolina.
169629       Fresh-garlic salt acetic-acid sodium-benzoate.
169630    Fresh-ginger fresh-garlic salt acetic-acid sod...
169631    Ginger water salt acetic-acid xanthan-gum sodi...
169632       Fresh-garlic salt acetic-acid sodium-benzoate.
169633    Ginger garlic water salt acetic-acid xanthan-g...
169634    Garlic water salt acetic-acid xanthan-gum sodi...
169635    Ginger garlic water salt acetic-acid xanthan-g...
169636                                Water organic-prunes.
169637    Shrimp salt sodium-tripolyphosphate to-retain-...
169638    Wheat-flour vegetable-palm-oil-with-tbhq sugar...
169639    Wheat-flour sugar vegetable-oil glucose-syrup ...
169640    Water mango-pulp 16.2%  sugar acidity-regulato...
169641    Cake:-sugar bleached-enriched-wheat-flour flou...
169642                                         Jasmine-rice
169643    Cultured-pasteurized-grade-a-skim-milk nonfat-...
169644                                         Basmati-rice
169645    Potato 35%  wheat-flour 32%  vegetable-oil con...
169646    Refined-wheat-flour 57%  vegetable-oil palmole...
169647    Split-mung-bean 70%  vegetable-oil peanut corn...
169648    Tepary-beans-flour 43%  vegetable-oil peanut c...
169649    Okra 45%  onion sunflower-oil tomato butter ca...
169650    Mustard-leaves 50%  butter sunflower-oil spina...
169651    Rice 28%  tomato 19%  onion 15%  sunflower-oil...
169652    Water butter 13.28-%  milk whole-black-lentils...
169653    Water tomato red-kidney-beans 15.34%  clarifie...
169654    Water yogurt 20% bengal-gram-fleur 6.3%  sunfl...
169655    Spinach-paste 33.95-%-water indian-cottage-che...
169656    Spinach 60%  indian-cottage-cheese 25%  onions...
169657    Chickpeas-flour 50%  vegetable-oil peanuts cor...
169658    Potatoes 45%  whole-wheat-flour 38%  refined-s...
169659    Water chickpeas 16%  tomatoes onions refined-s...
169660    Spinach 47%  indian-cottage-cheese 18%  butter...
169661    Tomatoes onions indian-cottage-cheese 22%  mil...
169662    Shrimp salt-sodium-tripolyphosphate to-retain-...
169663    Shrimp salt sodium-tripolyphosphate to-retain-...
169664    Shrimp salt sodium-tripolyphosphate to-retain-...
169665    Shrimp salt sodium-tripolyphosphate to-retain-...
169666    Shrimp salt sodium-tripolyphosphate to-retain-...
169667    Shrimp salt sodium-tripolyphosphate to-retain-...
169668                                    Riz-basmati-rice.
169669    Coriander cumin turmeric black-pepper garlic d...
169670    Mango-powder black-pepper iodised-salt chilli ...
169671    Fine-besan sugar iodised-salt turmeric citric-...
169672    Mung-dal sodium-bicarbonate iodised-salt citri...
169673    Cumin cassia caraway black-pepper cardamom nut...
169674    Chilli-powder garlic iodised-salt citric-acid ...
169675    Cumin-seeds black-salt black-pepper cinamon cl...
169676    Mango-powder coconut-powder sugar iodised-salt...
169677    Spices cumin-crushed coriander-powder  onion v...
169678    Shrimp water salt sodium-tripolyphosphate to-r...
169679    Refined-wheat-flour 57%  water salt sugar yeas...
169680                                         Rice-&-salt.
169681    Partially-skimmed-dry-milk sugar black-tea-ext...
169682    Partially-skimmed-dry-milk sugar black-tea-ext...
169683    Partially-skimmed-dry-milk sugar black-tea-ext...
169684    Shrimps water salt sodium-tripoly-phosphate to...
169685    Purified-water sucrose citric-acid sodium-citr...
169686    Wheat-flour eggs sugar water canola-oil wheat-...
169687    Brewed-tea filtered-water organic-fair-trade-b...
169688                Coffee natural-and-artificial-flavors
169689                                            Turmeric.
169690                                            Galangal.
169691                                            Cinnamon.
169692                                         Shrimp salt.
169693    Shrimp salt-sodium-phosphates to-retain-moisture 
169694    Oat-bran unprocessed-wheat-bran quinoa-flakes ...
169695    Cocoa-&-cocoa-butter 74%-cocoa-solids  cane-su...
169696     Cocoa-&-cocoa-butter 85%-cocoa  cane-sugar 15% .
169697    Cocoa-&-cocoa-butter 80%-cocoa-solids  cane-su...
169698    Soy gluten-&-diary-free cocoa-&-cocoa-butter 7...
169699    Pistachios almonds dried-apples unsulfured-app...
169700    Pretzel:-wheat-flour-blend wheat-flour enriche...
169701    Flower-honey italian-black-summer-truffle tube...
169702    Concentrated-grape-must* apple-juice* apple-vi...
169703                            Durum-wheat-semolina eggs
169704                  Cooked-grape-must-and-wine-vinegar.
169705    Sharp-cheddar-cheese cultured-pasteurized-milk...
169706    Garlic sugar vegetable-oil blend-of-canola soy...
169707                                         Basmati-rice
169708    Enriched-wheat-flour flour niacin reduced-iron...
169709                                        Pitted-dates.
169710    Made-with-pork salt natural-flavors organic-sp...
169711                                 Pink-salmon-fillets.
169712    Jim-beam-pulled-chicken-with-bourbon-bbq-sauce...
169713    Tomatoes crabmeat olive-oil fresh-garlic onion...
169714    Organic-oats organic expeller-pressed-high-oie...
169715    Strained-yogurt:-pasteurized-skim-milk live-ac...
169716         Palm-sugar black-soy-bean-extract water salt
169717    Wheat-flour cinnamon sugar honey yeast olive-oil.
169718    Sugar glucose vegetable-fat gelling-agent beef...
169719    Shallots garlic turmeric-root galangal ginger ...
169720    Shallots garlic turmeric-root coriander chilli...
169721     Shallots garlic pepper nutmeg salt vegetable-oil
169722    Wheat-flour sugar baking-powder raising-agent ...
169723                                             Galangal
169724                                               Ginger
169725                                           Coriander.
169726    Glucose emulsifier cocopandan-flavor food-colo...
169727    Glucose-syrup sorbitol-syrup nature-identical-...
169728    Glucose-syrup sorbitol-syrup artificial-pandan...
169729    Chili sugar water salt garlic monosodium gluta...
169730    Ginger coconut-sugar cane-sugar sea-salt cinna...
169731    Ginger coconut-sugar cane-sugar cinnamon sea-s...
169732    Ginger coconut-sugar cane-sugar lemongrass sea...
169733    Raw-chili green-tomato shallot garlic salt tam...
169734    Chili shallot garlic sugar salt vinegar fermen...
169735               Chili soursop palm-sugar tamarind salt
169736    Chili shattot garlic turmeric salt coriander f...
169737    Chili tomato shallot bay-leaf fermented-shrimp...
169738                   Red-chili salt water sugar vinegar
169739    Shrimp salt sodium-tripolyphosphate to-retain-...
169740     Shrimp salt sodium-phosphate to-retain-moisture 
169741     Shrimp salt sodium-phosphate to-retain-moisture 
169742     Shrimp salt sodium-phosphate to-retain-moisture 
169743     Shrimp salt sodium-phosphate to-retain-moisture 
169744    Sugar salt onion-powder garlic-powder coriande...
169745    Sugar non-dairy-cream glucose-syrup-solid hydr...
169746    Black-peppercorns sea-salt onion lemon-rind ga...
169747    Wheat-flour refined-palm-fat olive-oil sugar t...
169748    Water sugar carbon-dioxide acidifier-citric-ac...
169749                    Scallops sodium-tripolyphosphate.
169750    Whole-wheat-flour water soybean-oil contains-l...
169751              Cultured-pasteurized-milk salt enzymes.
169752    Filtered-water sugar mango-puree pear-juice-fr...
169753    Chia-seed-gel filtered-water chia-seed  coconu...
169754    Black-raspberries sugar cane-sugar water pecti...
169755    Salt sodium-silicoaluminate dextrose potassium...
169756    Organic-coconut-sugar organic-japanese-matcha-...
169757    Nonfat-yogurt cultured-pasteurized-nonfat-milk...
169758    Sugar wheat-flour palm-oil cocoa-mass milk-sol...
169759    Caramel 30%  sugar condensed-milk  skim-milk s...
169760    Sugar wheat-flour caramel 15.5%  sugar condens...
169761    Made-from:-sugar chocolate wheat-flour palm-oi...
169762    Wheat-flour sugar oats desiccated-coconut pres...
169763    Carbonated-water cane-sugar root-beer-brew wat...
169764    Carbonated-water cane-sugar lime-juice-from-co...
169765    Carbonated-water cane-sugar blood-orange-juice...
169766    Carbonated-water cane-sugar peach-juice-from-c...
169767    Carbonated-water cane-sugar peach-juice-from-c...
169768    Carbonated-water cane-sugar passionfruit-juice...
169769    Carbonated-water cane-sugar peach-juice-from-c...
169770    Sugar foaming-creamer glucose-syrup hydrogenat...
169771    Peanuts sugar glucose-syrup wheat  coconut but...
169772                     100%-pure-manuka tea-tree honey.
169773                      100%-pure-french-lavender-honey
169774    Feta-cheese 60%   cow's-milk mineral-salt 509 ...
169775    Raspberries-55% sugar apple-stock granny-smith...
169776    Cream sugar corn-syrup whole-milk condensed-sk...
169777    Apples water ascorbic-acid vitamin-c  erythorb...
169778    Organic-grape-or-apple organic-guava 25% and-c...
169779    Organic-apple 50%  organic-grape-or-apple-and-...
169780    Organic-orange organic-blood-orange 20%  organ...
169781    Organic-orange 30%  organic-grape-or-apple-and...
169782    Sun-dried-wild-hibiscus-flowers 60%  australia...
169783    Cane-sugar butterfly-pea-flower clitoria-terna...
169784    Hibiscus-flower 60%  hibiscus-sabdariffa  cane...
169785    Cane-sugar hibiscus-flower-juice 60%  bulgaria...
169786    Wild-hibiscus-flowers-hibiscus-sabdariffa 38% ...
169787    Paprika black-pepper garlic onion cumin corian...
169788    Cooked-white-rice cooked-white-chicken chicken...
169789                    100%-chia-seed salvia-hispanica .
169790                    100%-chia-seed salvia-hispanica .
169791    Chia-seed-gel filtered-water chia-seed  strawb...
169792    Chia-seed-gel filtered-water chia-seed  coconu...
169793    Chia-seed-gel filtered-water chia-seed  coconu...
169794    Chia-seed-gel filtered-water chia-seed  almond...
169795    Chia-seed-gel filtered-water chia-seed  whole-...
169796    Chia-seed-gel filtered-water chia-seed  whole-...
169797    Almond-milk  filtered-water almond-paste  appl...
169798    Almond-milk filtered-water-and-almond-paste  a...
169799                                          Shrimp salt
169800    Water milk cultures salt sugar blueberries str...
169801    Water milk culture salt sugar corn-syrup brown...
169802    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169803    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169804    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169805    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169806    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169807    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169808    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169809    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169810    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169811    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169812    Cultured-pasteurized-grade-a-nonfat-milk sugar...
169813          99.6%-coconut-oil 0.4%-lemon-essential-oil.
169814    99.89%-coconut-oil 0.1%-basil 0.01%-garlic-ess...
169815    Wild-hibiscus-flowers-hibiscus-sabdariffa 38% ...
169816    Ground-chicken water salt spices dextrose natu...
169817                             100%-natural-peppermint.
169818    Butternut-squash-puree water tomato-paste butt...
169819    Fresh-onion habanero-peppers food-starch musta...
169820    Organic-cultured-pasteurized-milk salt vegetar...
169821    Ingredients:-organic-cultured-unpasteurized-mi...
169822    Beef sugar salt tomato-paste garlic spices mon...
169823    Invert-sugar-syrup reconstituted-dried-skimmed...
169824                        Peanuts-&-salt less-than-1% .
169825                                             Peanuts.
169826    Corn-grits rice-grits palm-olein preserved-wit...
169827                   Artesian-water naturally-alkaline.
169828                                   Greenshell-mussels
169829    Organic-mountain-berry-spinach-puree organic-a...
169830    Organic-cinnamon-spiced-beet-puree organic-app...
169831    Organic-sweet-potato-pie-puree organic-apples ...
169832    Organic-tropikale-puree organic-apples organic...
169833    Organic-carrot-ginger-puree organic-apples org...
169834    Diced-tomatoes onions jalapeno-peppers cilantr...
169835    Grilled-vegetables zucchini eggplant red-peppe...
169836                        Tuna-fish vegetable-oil salt.
169837    Fried-sardines-fish sardines-fish palm-oil  wa...
169838    Fried-sardines-fish sardines-fish palm-oil  ch...
169839    Fried-sardines-fish sardines-fish palm-oil  wa...
169840        Mackerel water tomato-paste salt xanthan-gum.
169841      Sardines-fish extra-virgin-olive-oil water salt
169842    Water australian-peanut lemongrass sugar cocon...
169843    Water us-grown-great-northern-beans sugar toma...
169844    Water australian-peanuts onions sugar coconut-...
169845    Water australian-peanuts sugar onions lemongra...
169846    Sugar coconut-milk coconut-extract water  33% ...
169847    Water shallots soybean-oil garlic curry-powder...
169848    Water soybean-oil garlic toasted-desiccated-co...
169849    Sardines water tomato-paste sugar dried-chili ...
169850                        Chrysanthemum-and-rock-sugar.
169851    Sugar glucose-syrup vegetable-fat hydrogenated...
169852    Wheat-flour vegetable-oil palm-olein  sugar ve...
169853    Wheat-flour sugar vegetable-fat palm-base  cor...
169854    Gluten-free-oats organic-honey coconut blueber...
169855    Chipotle-peppers water vinegar tomatoes onions...
169856    Tomato black-gram water milk-cream cottonseed-...
169857    Pasteurized-part-skim-milk salt cheese-culture...
169858            Beetroot water sugar spirit-vinegar salt.
169859    Vegetables mushroom shiitake tomato garlic sha...
169860    Concentrated-grape-juice balsamic-vinegar-of-m...
169861    Organic-wheat-flour organic-sugar corn-syrup o...
169862    Spices including-chili-pepper cumin paprika or...
169863    Organic-oat-blend organic-rolled-oats organic-...
169864                                             Cassava.
169865    Water high-fructose-corn-syrup contains-2%-or-...
169866    Organic-peppermint organic-lemon-grass organic...
169867    Citric-acid maltodextrin instant-tea aspartame...
Name: ingredients_list, dtype: object



In [70]:

    
## Generate a word cloud for the ingredients

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# create a new column using a modified version of ingredients list that can be vectorized
vectorizer = CountVectorizer()

corpus_data=usda_sample_data['ingredients_list']
count_matrix = vectorizer.fit_transform(corpus_data)

# display the features/tokens 
all_feature_names = vectorizer.get_feature_names()
print(" ".join(list(all_feature_names[:50])))

%matplotlib inline

# generate wordcloud
from os import path
from scipy.misc.pilutil import imread
import matplotlib.pyplot as plt
import random

from wordcloud import WordCloud, STOPWORDS

wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                      relative_scaling = 1.0,
                      stopwords = 'to of the ,'
                      ).generate("".join(usda_sample_data['ingredients_list']))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()









    



00 000 000079 0009 001 002 003 005 0051 006 008 0091 01 010 010000 011 0152 017 02 025 026212 03 04 045 046 05 054 055 05mg 06 07 071008 075 08 08g 0965 0fauta 0g 0il 10 100 1000ppm 1005 100g 100mg 100ml 100ppm 101 102 103



In [71]:

    
# remove common words and tokenize the ingredients_list values
documents = usda_sample_data['ingredients_list']
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
          for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
         frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
          for text in texts]

# display first 10 entries
from pprint import pprint  # pretty-printer
pprint(texts[:10])









    



[['bananas',
  'vegetable-oil',
  'coconut-oil',
  'corn-oil-and/or-palm-oil',
  'sugar',
  'natural-banana-flavor.'],
 ['peanuts',
  'wheat-flour',
  'sugar',
  'rice-flour',
  'tapioca-starch',
  'salt',
  'leavening',
  'ammonium-bicarbonate',
  'baking-soda',
  'soy-sauce',
  'water',
  'soybeans',
  'wheat',
  'salt',
  'potato-starch.'],
 ['organic-hazelnuts',
  'organic-cashews',
  'organic-sunflower-oil',
  'sea-salt.'],
 ['organic-polenta'],
 ['rolled-oats',
  'grape-concentrate',
  'expeller-pressed-canola-oil',
  'sunflower-seeds',
  'almonds',
  'sesame-seeds',
  'cashews',
  'natural-vitamin-e.'],
 ['organic-long-grain-white-rice'],
 ['org-oats',
  'org-oats',
  'evaporated-cane-juice',
  'crispy-rice',
  'org-evaporated-cane-juice',
  'sea-salt',
  'org-flax-seeds',
  'hemp-seeds',
  'org-coconut',
  'org-flax-seeds',
  'org-corn-meal',
  'sea-salt',
  'org-corn-meal',
  'org-evaporated-cane-juice',
  'sea-salt',
  'tocopherols',
  'natural-vitamin-e',
  '.'],
 ['organic-chocolate-liquor',
  'organic-raw-cane-sugar',
  'organic-cocoa-butter',
  'organic-ground-vanilla-beans.'],
 ['organic-expeller-pressed'],
 ['organic-adzuki-beans']]



In [72]:

    
# generate and persist the dictionary
dictionary = corpora.Dictionary(texts)
dictionary.save('./data/ingredients.dict')  # store the dictionary, for future reference

# generate and persist the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./data/ingredients.mm', corpus)  # store to disk, for later use
print(corpus[:10])

# generate and persist the index
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=-1)
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
index.save('./data/ingredients.index')









    



[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)], [(2, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2)], [(19, 1), (20, 1), (21, 1), (22, 1)], [(23, 1)], [(24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)], [(32, 1)], [(33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 3), (39, 1), (40, 2), (41, 1), (42, 1), (43, 2), (44, 2)], [(45, 1), (46, 1), (47, 1), (48, 1)], [(49, 1)], [(50, 1)]]



In [73]:

    
# load the dictionary and matrix representation of similarity and the index
dictionary = corpora.Dictionary.load('./data/ingredients.dict')
corpus = corpora.MmCorpus('./data/ingredients.mm') 

# load the index
index = similarities.MatrixSimilarity.load('./data/ingredients.index')



In [98]:

    
# convert query to vector
max_count=3
def displaySimilarProducts(query):

    vec_bow = dictionary.doc2bow(query.lower().split())
    vec_lsi = lsi[vec_bow] # convert the query to LSI space
    #print(vec_lsi)

    sims = index[vec_lsi]
    #print(list(enumerate(sims)))

    print "\nQuery String:",query

    sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])

    #print(sims_sorted)

    count=0
    print("Top 3 matches:")
    for sim in sims_sorted:
        print "\nCode:",usda_sample_data['code'][sim[0]]
        print "Product Name:",usda_sample_data['product_name'][sim[0]]
        print "Text:",usda_sample_data['ingredients_list'][sim[0]]
        print "Match:",sim[1]

        if count==max_count-1:
            break
        else:
            count+=1



In [99]:

    
query=raw_input("Enter search text:")
displaySimilarProducts(query)









    



Enter search text:chocolate

Query String: chocolate
Top 3 matches:

Code: 736211506344
Product Name: Yaffa's Joy, Natural Sesame Treat
Text: Sesame-&-pumpkin-seeds cherries chocolate date-honey tahini-&-rice-flakes
Match: 0.755326

Code: 799137041313
Product Name: Chocolate Cashews
Text: Dry-roasted-cashews chocolate sugar chocolate liquor cocoa-butter purified-water vanilla-flavor  confectioners-glaze gum-arabic.
Match: 0.720112

Code: 797945810305
Product Name: Co Yo, Coconut Milk Yogurt Alternative, Chocolate
Text: Coconut-cream chocolate organic-raw-cacao apple-juice-concentrate stevia tapioca pectin-and-probiotic-cultures.
Match: 0.600538



In [ ]: