In [1]:
import pandas as pd

Panel data sets: Category_PANEL_outlet_startweek_endweek.dat

Panel data is provided for two BehaviorScan markets, Eau Claire, Wisconsin and Pittsfield, Massachusetts. The naming convention for these is category name then “panel” then outlet then start week and then end week, all separated by underscores, with the extension DAT, so salted snacks drug data for the earliest year would be saltsnck_PANEL_DR_1114_1165

PANID panelist number within a market 

UNITS Total number of units purchased by the Buying households. The sum of total units purchased by the households buying the Product. 

OUTLET Channel to which the store/chain belongs MA=Mass GR=Grocery DR=drug DOLLARS 

Total Paid dollars This is drawn from the store data, not entered by the panelist, in cases where IRI has store data. In cases where IRI does not receive store data, some panelists do

IRI_KEY Masked store number 

WEEK IRI WEEK  

COLUPC (Collapsed UPC). This is the UPC which matches the internal form (e.g. private label collapsed). The information in COLUPC is the same as in the combination of SY, GE, VEND, ITE. This is the combination of a upc’s system (2 digits), generation (1 digit), vendor (5 digits) and item (5 digits) fields. See product description section for an explanation of these fields. No leading zeroes are shown.


In [3]:
panel = pd.read_csv("IRI_Data//Year1/External/saltsnck/saltsnck_PANEL_DR_1114_1165.dat", delimiter="\t")
panel.head()


Out[3]:
PANID WEEK UNITS OUTLET DOLLARS IRI_KEY COLUPC
0 1212076 1133 1 DR 0.59 8003042 11600012250
1 1229641 1157 1 DR 0.59 8003042 11600012250
2 1800060 1137 1 DR 0.59 8003042 11600012250
3 1104018 1152 1 DR 0.99 8003043 11600012530
4 1106229 1123 1 DR 0.99 8003043 11600012530

Panel Demographics

Panel demographic files have been standardized and are called ads demoN.csv, where N is the year number: ads demo1.csv, ads demo2.csv ... ads demo5.csv.

The panelists included are those who satisfied IRI’s standard 52 week reporting static. This means that (1) the panelists included reported all year, and (2) the panelists are different between years. For the initial set of data provided, the panelist demos reflect data current at that time. So, for the year 1, 2, and 3 (2001-2003) data, the panelist demos are from early 2007, not 2001. For this reason, there may be panelist records without demographics. For years 4 and 5 (2004-2005) the panelist demos are from later in 2007 and may be slightly different due to the demographic updates. Similarly for years 8-11: the demos reflect information pulled in summer, 2012.

The field names and the first two panelist values are shown below. Due to the demographic updates, there are minor differences in the values for the two panelists. For example, the male head in household in 1100180 is now listed as “some college” rather than post-secondary “technical school”, and the male head occupation from laborer to machine operator.

In these files, a missing value may appear as an empty field, a blank, a period, or a zero.


In [99]:
demos1 =  pd.read_csv("IRI_Data/demos trips external 1_11 may 13/ads demo1.csv")
demos11 = pd.read_csv("IRI_Data/demos trips external 1_11 may 13/ads demos11.CSV")

a=set(demos1.columns.values)
b=set(demos11.columns.values)

In [112]:
[col.replace(" ", "_") for col in demos11.columns]


Out[112]:
['Panelist_ID',
 'Panelist_Type',
 'Combined_Pre-Tax_Income_of_HH',
 'Family_Size',
 'Household_Head_Race',
 'Type_of_Residential_Possession',
 'COUNTY',
 'Age_Group_Applied_to_Household_Head',
 'Education_Level_Reached_by_Household_Head',
 'Occupation_Code_of_Household_Head',
 'Age_Group_Applied_to_Male_HH',
 'Education_Level_Reached_by_Male_HH',
 'Occupation_Code_of_Male_HH',
 'Male_Working_Hour_Code',
 'MALE_SMOKE',
 'Age_Group_Applied_to_Female_HH',
 'Education_Level_Reached_by_Female_HH',
 'Occupation_Code_of_Female_HH',
 'Female_Working_Hour_Code',
 'FEM_SMOKE',
 'Number_of_Dogs',
 'Number_of_Cats',
 'Children_Group_Code',
 'Marital_Status',
 'HH_LANG',
 'ALL_TVS',
 'CABL_TVS',
 'Hispanic_Flag',
 'HISP_CAT',
 'RACE2',
 'RACE3',
 'MICROWAVE',
 'device_type',
 'ZIPCODE',
 'FIPSCODE',
 'market_based_upon_zipcode',
 'IRI_Geography_Number',
 'EXT_FACT']

In [101]:
intr=a.intersection(b)

In [106]:
[col.replace(" ", "_") for col in intr]


Out[106]:
['Education_Level_Reached_by_Female_HH',
 'ZIPCODE',
 'Age_Group_Applied_to_Female_HH',
 'Male_Working_Hour_Code',
 'Type_of_Residential_Possession',
 'EXT_FACT',
 'Marital_Status',
 'IRI_Geography_Number',
 'HISP_CAT',
 'Children_Group_Code',
 'Combined_Pre-Tax_Income_of_HH',
 'Panelist_Type',
 'COUNTY',
 'Panelist_ID',
 'Number_of_Cats',
 'FIPSCODE',
 'FEM_SMOKE',
 'Female_Working_Hour_Code',
 'MALE_SMOKE',
 'Family_Size',
 'Occupation_Code_of_Male_HH',
 'Education_Level_Reached_by_Male_HH',
 'Age_Group_Applied_to_Male_HH',
 'market_based_upon_zipcode',
 'Number_of_Dogs',
 'Occupation_Code_of_Female_HH']

In [97]:
#old columns removed...
a-b


Out[97]:
set()

In [98]:
#new columns added
b-a


Out[98]:
set()

Store File

Naming convention: The naming convention for these is category name then outlet then start week and then end week, all separated by underscores, with no extension, so salted snacks drug data for the earliest year would be saltsnck_drug_1114_1165.

RI_KEY Masked Store number, keyed to delivery_stores file.

WEEK IRI Week: see IRI Week Translation.xls file for calendar week translation

SY

UPC - System

GE UPC – Generation

VEND UPC - Vendor

ITEM UPC - Item

UNITS Total Unit sales

DOLLARS Total Dollar sales

F Feature: see table below

D Display: (0=NO, 1=MINOR, 2=MAJOR. MAJOR includes codes lobby and end- aisle)

PR Price Reduction flag: (1 if TPR is 5% or greater, 0 otherwise)


In [16]:
store = pd.read_fwf("../dse-iri-dataset/Year1/External/saltsnck/saltsnck_drug_1114_1165")
store.head()


Out[16]:
IRI_KEY WEEK SY GE VEND ITEM UNITS DOLLARS F D PR
0 664497 1114 0 1 41633 2154 3 2.97 NONE 0 0
1 664497 1114 0 2 41633 145 6 13.14 NONE 0 0
2 664497 1114 0 1 41633 2119 3 2.97 NONE 0 0
3 664497 1114 0 2 41633 135 3 6.57 NONE 0 0
4 664497 1114 0 1 41633 2140 1 0.99 NONE 0 0

Product Attributes

The improved file format, which incorporates further information, is prod_category.xls, 16 There are three sets of files.

The first set of files are applicable to years 1-6 and are provided in a directory called “parsed stub files”.

The second set of files are applicable to year 7 and are provided in a directory called “parsed stub files 2007”.

The third set of files are applicable to years 8-11 are are provided in a directory called “parsed stub files 2008-2011”


In [20]:
pd.read_


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-b8fa7a4c1fc3> in <module>()
----> 1 pd.read_

AttributeError: 'module' object has no attribute 'read_'

Panel Trips

These files represent the trips made by panelists who purchased at least one item. These files have been standardized in format from the way they were originally constructed, and placed in the directory “parsed stub files”. The naming convention is 14 tripsN jul08.csv, where N is the year . Fields are listed below. These files contain the following fields:


In [14]:
trips = pd.read_csv("../dse-iri-dataset/demos trips external 1_11 may13/trips1 jul08.csv")
trips.head()


Out[14]:
PANID WEEK IRI_Key MINUTE CENTS998 CENTS999
0 1100016 1114 234140 9701 NaN 4470.835938
1 1100016 1115 234140 8088 NaN 7287.500000
2 1100016 1116 8046669 5332 NaN 1398.000000
3 1100016 1116 234140 8165 NaN 9771.839844
4 1100016 1116 8003043 9432 NaN 1848.000000

Delivery Stores

This is a flat file with information about the stores.

This file also contains outlet, estimated acv, the market name so data can be aggregated by market, an open and close week, and finally a “chain” number representing a particular retailer. All the stores belonging to Chain8 are part of the same retailer that year.


In [18]:
stores = pd.read_fwf("../dse-iri-dataset/Year1/External/saltsnck/Delivery_Stores")
stores.head()


Out[18]:
IRI_KEY OU EST_ACV Market_Name Open Clsd MskdName
0 200039 GR 9.602997 BUFFALO/ROCHESTER 539 1219 Chain87
1 200171 GR 28.454990 MILWAUKEE 522 9998 Chain97
2 200197 GR 11.314990 PEORIA/SPRINGFLD. 903 9998 Chain59
3 200233 GR 7.473000 OKLAHOMA CITY 1122 1150 Chain102
4 200272 GR 13.266000 LOS ANGELES 873 9998 Chain124

In [16]:
%load_ext sql


The sql extension is already loaded. To reload it, use:
  %reload_ext sql

In [17]:
%%sql postgresql://sharknado:sharknado@sharknado-dse.ceg3hdkdq8l0.us-east-1.rds.amazonaws.com/sharknado
        select * from test


3 rows affected.
Out[17]:
name
alex
john
david

In [18]:
names = _

In [19]:
names


Out[19]:
name
alex
john
david

In [20]:
result = %sql select * from stores

dataframe = result.DataFrame()


2053 rows affected.

In [21]:
dataframe.head()


Out[21]:
iri_key ou est_acv marketname open clsd mskdname
0 200039 GR 9.602997 BUFFALO/ROCHESTER 539 1219 Chain87
1 200171 GR 28.454990 MILWAUKEE 522 9998 Chain97
2 200197 GR 11.314990 PEORIA/SPRINGFLD. 903 9998 Chain59
3 200233 GR 7.473000 OKLAHOMA CITY 1122 1150 Chain102
4 200272 GR 13.266000 LOS ANGELES 873 9998 Chain124

In [22]:
transactions = %sql select * from transactions limit 10
transactions.DataFrame().head()


10 rows affected.
Out[22]:
iri_key _year _type week sy ge vend item units dollars f d pr
0 664497 None drug 1114 0 1 41633 2154 3 2.97 NONE 0 0
1 664497 None drug 1114 0 2 41633 145 6 13.14 NONE 0 0
2 664497 None drug 1114 0 1 41633 2119 3 2.97 NONE 0 0
3 664497 None drug 1114 0 2 41633 135 3 6.57 NONE 0 0
4 664497 None drug 1114 0 1 41633 2140 1 0.99 NONE 0 0

In [23]:
products_query = %sql select * from products limit 10
products_query.DataFrame().head()


10 rows affected.
Out[23]:
l1 l2 l3 l4 l5 l9 level upc sy ge ... item stubspec_1431rc vol_eq product_type package flavor_scent fat_content cooking_method salt_sodium_content type_of_cut
0 CATEGORY - SALTY SNACKS POTATO CHIPS ACTON CO INC ACTON CO INC GORDONS +GORDN BBQ PTCHP THN 6OZ 9 00-02-36300-04650 0 2 ... 4650 +GORDN BBQ PTCHP THN 6OZ 0 2 36300 46... 0.3750 POTATO CHIP BAG BARBECUE MISSING MISSING MISSING THIN
1 CATEGORY - SALTY SNACKS POTATO CHIPS ACTON CO INC ACTON CO INC GORDONS +GORDN ORGL PTCHP FLAT 14.5OZ 9 00-01-36300-04632 0 1 ... 4632 +GORDN ORGL PTCHP FLAT 14.5OZ 0 1 36300 46... 0.9063 POTATO CHIP PLASTIC BAG ORIGINAL MISSING MISSING MISSING FLAT
2 CATEGORY - SALTY SNACKS POTATO CHIPS ACTON CO INC ACTON CO INC GORDONS +GORDN REDHT PTCHP RPLD 6OZ 9 00-02-36300-04652 0 2 ... 4652 +GORDN REDHT PTCHP RPLD 6OZ 0 2 36300 46... 0.3750 POTATO CHIP BAG RED HOT MISSING MISSING MISSING RIPPLED
3 CATEGORY - SALTY SNACKS POTATO CHIPS ACTON CO INC ACTON CO INC GORDONS +GORDN REG PTCHP RPLD 6OZ 9 00-03-36300-04641 0 3 ... 4641 +GORDN REG PTCHP RPLD 6OZ 0 3 36300 46... 0.3750 POTATO CHIP BAG REGULAR MISSING MISSING MISSING RIPPLED
4 CATEGORY - SALTY SNACKS POTATO CHIPS ACTON CO INC ACTON CO INC GORDONS +GORDN REG PTCHP THN 6OZ 9 00-02-36300-04640 0 2 ... 4640 +GORDN REG PTCHP THN 6OZ 0 2 36300 46... 0.3750 POTATO CHIP BAG REGULAR MISSING MISSING MISSING THIN

5 rows × 21 columns

Join transactions on products

The transactions table is more interesting if you join products on the item column


In [31]:
%%sql
select * from transactions
join products on products.item = transactions.item
limit 5


5 rows affected.
Out[31]:
iri_key _year _type week sy ge vend item units dollars f d pr l1 l2 l3 l4 l5 l9 level upc sy_1 ge_1 vend_1 item_1 stubspec_1431rc vol_eq product_type package flavor_scent fat_content cooking_method salt_sodium_content type_of_cut
44017 None drug 1114 0 2 78907 0 16 15.84 NONE 0 0 CATEGORY - SALTY SNACKS TORTILLA/TOSTADA CHIPS HACIENDA MEXC FOOD PROD HACIENDA MEXC FOOD PROD HACIENDA +HCNDA ORGL TRCHP TRNGL 12OZ 9 01-01-86443-00000 1 1 86443 0 +HCNDA ORGL TRCHP TRNGL 12OZ 1 1 86443 0 1 1 0.7500RP 08699 0.75 TORTILLA CHIP BAG ORIGINAL MISSING MISSING MISSING MISSING
44017 None drug 1114 0 2 78907 0 16 15.84 NONE 0 0 CATEGORY - SALTY SNACKS PRETZELS AUNTIE ANNES INC AUNTIE ANNES INC AUNTIE ANNES +ATANS SGS#1 PRTZL KNOT 14OZ 9 08-01-95072-00000 8 1 95072 0 +ATANS SGS#1 PRTZL KNOT 14OZ 8 1 95072 0 1 1 0.8750RP 12393 0.875 PRETZEL BAG MISSING MISSING MISSING MISSING MISSING
44017 None drug 1114 0 2 78907 0 16 15.84 NONE 0 0 CATEGORY - SALTY SNACKS PRETZELS BEIGEL & BEIGEL BEIGEL & BEIGEL BEIGEL & BEIGEL +BG&BG SESME PRTZL KNOT 7OZ 9 00-01-08982-00000 0 1 8982 0 +BG&BG SESME PRTZL KNOT 7OZ 0 1 8982 0 1 1 0.4375RP 12463 0.4375 PRETZEL BAG SESAME MISSING MISSING MISSING MISSING
44017 None drug 1114 0 2 78907 0 16 15.84 NONE 0 0 CATEGORY - SALTY SNACKS READY-TO-EAT POPCORN/CARAMEL COR KETTLE KORN BY THE SEA KETTLE KORN BY THE SEA KETTLE KORN +KLKR1 KTCRN RTEPC 3OZ 9 08-01-50278-00000 8 1 50278 0 +KLKR1 KTCRN RTEPC 3OZ 8 1 50278 0 1 1 0.1875RP 19661 0.1875 RTE POPCORN BAG KETTLE CORN MISSING MISSING MISSING MISSING
44017 None drug 1114 0 2 78907 0 16 15.84 NONE 0 0 CATEGORY - SALTY SNACKS READY-TO-EAT POPCORN/CARAMEL COR MARCIS FUN FOODS MARCIS FUN FOODS MARCI'S +MRC'S KTCRN RTEPC 8OZ 9 08-01-33841-00000 8 1 33841 0 +MRC'S KTCRN RTEPC 8OZ 8 1 33841 0 1 1 0.5000RP 19854 0.5 RTE POPCORN BAG KETTLE CORN MISSING MISSING MISSING MISSING

This query takes a long time:

  • either precompute a join table
  • optimize the group by

In [30]:
%%sql
SELECT products.product_type, sum(transactions.dollars) as dollar_sum
FROM transactions
JOIN products on products.item = transactions.item
GROUP BY products.product_type
ORDER BY dollar_sum desc
LIMIT 10


10 rows affected.
Out[30]:
product_type dollar_sum
POTATO CHIP 18465564852.6
TORTILLA CHIP 9192142590.74
RTE POPCORN 8437904236.52
PRETZEL 5301534431.15
FRIED PORK RIND 4118536541.46
CHEESE SNACK 3167251187.26
CORN CHIP 1505557590.38
POTATO CRISP 1043139151.73
ASTSS SALTED SNACKS 905159722.764
PLANTAIN CHIP 536445576.722

In [ ]: