In [1]:
import pandas as pd
Panel data is provided for two BehaviorScan markets, Eau Claire, Wisconsin and Pittsfield, Massachusetts. The naming convention for these is category name then “panel” then outlet then start week and then end week, all separated by underscores, with the extension DAT, so salted snacks drug data for the earliest year would be saltsnck_PANEL_DR_1114_1165
PANID panelist number within a market 
UNITS Total number of units purchased by the Buying households. The sum of total units purchased by the households buying the Product. 
OUTLET Channel to which the store/chain belongs MA=Mass GR=Grocery DR=drug DOLLARS 
Total Paid dollars This is drawn from the store data, not entered by the panelist, in cases where IRI has store data. In cases where IRI does not receive store data, some panelists do
IRI_KEY Masked store number 
WEEK IRI WEEK  
COLUPC (Collapsed UPC). This is the UPC which matches the internal form (e.g. private label collapsed). The information in COLUPC is the same as in the combination of SY, GE, VEND, ITE. This is the combination of a upc’s system (2 digits), generation (1 digit), vendor (5 digits) and item (5 digits) fields. See product description section for an explanation of these fields. No leading zeroes are shown.
In [3]:
panel = pd.read_csv("IRI_Data//Year1/External/saltsnck/saltsnck_PANEL_DR_1114_1165.dat", delimiter="\t")
panel.head()
Out[3]:
Panel demographic files have been standardized and are called ads demoN.csv, where N is the year number: ads demo1.csv, ads demo2.csv ... ads demo5.csv.
The panelists included are those who satisfied IRI’s standard 52 week reporting static. This means that (1) the panelists included reported all year, and (2) the panelists are different between years. For the initial set of data provided, the panelist demos reflect data current at that time. So, for the year 1, 2, and 3 (2001-2003) data, the panelist demos are from early 2007, not 2001. For this reason, there may be panelist records without demographics. For years 4 and 5 (2004-2005) the panelist demos are from later in 2007 and may be slightly different due to the demographic updates. Similarly for years 8-11: the demos reflect information pulled in summer, 2012.
The field names and the first two panelist values are shown below. Due to the demographic updates, there are minor differences in the values for the two panelists. For example, the male head in household in 1100180 is now listed as “some college” rather than post-secondary “technical school”, and the male head occupation from laborer to machine operator.
In these files, a missing value may appear as an empty field, a blank, a period, or a zero.
In [99]:
demos1 = pd.read_csv("IRI_Data/demos trips external 1_11 may 13/ads demo1.csv")
demos11 = pd.read_csv("IRI_Data/demos trips external 1_11 may 13/ads demos11.CSV")
a=set(demos1.columns.values)
b=set(demos11.columns.values)
In [112]:
[col.replace(" ", "_") for col in demos11.columns]
Out[112]:
In [101]:
intr=a.intersection(b)
In [106]:
[col.replace(" ", "_") for col in intr]
Out[106]:
In [97]:
#old columns removed...
a-b
Out[97]:
In [98]:
#new columns added
b-a
Out[98]:
Naming convention: The naming convention for these is category name then outlet then start week and then end week, all separated by underscores, with no extension, so salted snacks drug data for the earliest year would be saltsnck_drug_1114_1165.
RI_KEY Masked Store number, keyed to delivery_stores file.
WEEK IRI Week: see IRI Week Translation.xls file for calendar week translation
SY
UPC - System
GE UPC – Generation
VEND UPC - Vendor
ITEM UPC - Item
UNITS Total Unit sales
DOLLARS Total Dollar sales
F Feature: see table below
D Display: (0=NO, 1=MINOR, 2=MAJOR. MAJOR includes codes lobby and end- aisle)
PR Price Reduction flag: (1 if TPR is 5% or greater, 0 otherwise)
In [16]:
store = pd.read_fwf("../dse-iri-dataset/Year1/External/saltsnck/saltsnck_drug_1114_1165")
store.head()
Out[16]:
The improved file format, which incorporates further information, is prod_category.xls, 16 There are three sets of files.
The first set of files are applicable to years 1-6 and are provided in a directory called “parsed stub files”.
The second set of files are applicable to year 7 and are provided in a directory called “parsed stub files 2007”.
The third set of files are applicable to years 8-11 are are provided in a directory called “parsed stub files 2008-2011”
In [20]:
pd.read_
These files represent the trips made by panelists who purchased at least one item. These files have been standardized in format from the way they were originally constructed, and placed in the directory “parsed stub files”. The naming convention is 14 tripsN jul08.csv, where N is the year . Fields are listed below. These files contain the following fields:
In [14]:
trips = pd.read_csv("../dse-iri-dataset/demos trips external 1_11 may13/trips1 jul08.csv")
trips.head()
Out[14]:
This is a flat file with information about the stores.
This file also contains outlet, estimated acv, the market name so data can be aggregated by market, an open and close week, and finally a “chain” number representing a particular retailer. All the stores belonging to Chain8 are part of the same retailer that year.
In [18]:
stores = pd.read_fwf("../dse-iri-dataset/Year1/External/saltsnck/Delivery_Stores")
stores.head()
Out[18]:
In [16]:
%load_ext sql
In [17]:
%%sql postgresql://sharknado:sharknado@sharknado-dse.ceg3hdkdq8l0.us-east-1.rds.amazonaws.com/sharknado
select * from test
Out[17]:
In [18]:
names = _
In [19]:
names
Out[19]:
In [20]:
result = %sql select * from stores
dataframe = result.DataFrame()
In [21]:
dataframe.head()
Out[21]:
In [22]:
transactions = %sql select * from transactions limit 10
transactions.DataFrame().head()
Out[22]:
In [23]:
products_query = %sql select * from products limit 10
products_query.DataFrame().head()
Out[23]:
In [31]:
%%sql
select * from transactions
join products on products.item = transactions.item
limit 5
Out[31]:
In [30]:
%%sql
SELECT products.product_type, sum(transactions.dollars) as dollar_sum
FROM transactions
JOIN products on products.item = transactions.item
GROUP BY products.product_type
ORDER BY dollar_sum desc
LIMIT 10
Out[30]:
In [ ]: