Explore data

This project, participant of Data Mining Cup 2017, analyzes the set in order to recognize existing relationships of dynamic prices strategies and creates the model that can predict revenues given certain training set. Let's start by importing some libraries and examining the data.



In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
import os

#chanhe this manualy
path = '/media/roman/Main/Programing/contest/dmc2017/dmc-2017/'
os.chdir(path)



In [2]:

    
items = pd.read_csv('data/raw/items.csv',sep='|')
train = pd.read_csv('data/raw/train.csv',sep='|')



In [3]:

    
train.head(5)









    Out[3]:







  
    
      
      lineID
      day
      pid
      adFlag
      availability
      competitorPrice
      click
      basket
      order
      price
      revenue
    
  
  
    
      0
      1
      1
      6570
      0
      2
      14.60
      1
      0
      0
      16.89
      0.00
    
    
      1
      2
      1
      14922
      1
      1
      8.57
      0
      1
      0
      8.75
      0.00
    
    
      2
      3
      1
      16382
      0
      1
      14.77
      0
      1
      0
      16.06
      0.00
    
    
      3
      4
      1
      1145
      1
      1
      6.59
      0
      0
      1
      6.55
      6.55
    
    
      4
      5
      1
      3394
      0
      1
      4.39
      0
      0
      1
      4.14
      4.14



In [4]:

    
items.head(5)









    Out[4]:







  
    
      
      pid
      manufacturer
      group
      content
      unit
      pharmForm
      genericProduct
      salesIndex
      category
      campaignIndex
      rrp
    
  
  
    
      0
      1
      1
      2FOI
      80
      ST
      TAB
      0
      40
      NaN
      NaN
      10.89
    
    
      1
      2
      1
      2FOI
      80
      ST
      Tab
      0
      40
      NaN
      NaN
      10.89
    
    
      2
      3
      1
      2FOI
      10
      G
      GLO
      0
      40
      NaN
      NaN
      16.45
    
    
      3
      4
      1
      2FOI
      80
      ST
      TAB
      0
      40
      NaN
      NaN
      10.89
    
    
      4
      5
      2
      10OJ03JS
      8
      ST
      PUL
      0
      40
      1.0
      NaN
      22.53



In [5]:

    
train.info(null_counts=True)









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756003 entries, 0 to 2756002
Data columns (total 11 columns):
lineID             2756003 non-null int64
day                2756003 non-null int64
pid                2756003 non-null int64
adFlag             2756003 non-null int64
availability       2756003 non-null int64
competitorPrice    2655316 non-null float64
click              2756003 non-null int64
basket             2756003 non-null int64
order              2756003 non-null int64
price              2756003 non-null float64
revenue            2756003 non-null float64
dtypes: float64(3), int64(8)
memory usage: 231.3 MB



In [6]:

    
items.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22035 entries, 0 to 22034
Data columns (total 11 columns):
pid               22035 non-null int64
manufacturer      22035 non-null int64
group             22035 non-null object
content           22035 non-null object
unit              22035 non-null object
pharmForm         19708 non-null object
genericProduct    22035 non-null int64
salesIndex        22035 non-null int64
category          17408 non-null float64
campaignIndex     1338 non-null object
rrp               22035 non-null float64
dtypes: float64(2), int64(4), object(5)
memory usage: 1.8+ MB

Items dataframe stores additional information about product and can be connected with main dataframe train using column 'pid', after which we can delete train



In [7]:

    
train_items=pd.merge(train,items,on='pid')
del train
gc.collect









    Out[7]:





<function gc.collect>

At this stage we can see some problems: - some categorical feautures stored as int64 or object - missing data in some columns

Preprocessing

Let's proceed by transforming some columns to categories.



In [8]:

    
def toCategorical(df):
    columns=['availability','group','content','unit','pharmForm',
             'campaignIndex','salesIndex', 'category', 'manufacturer']
    for col in columns:
        if col in df.columns:
            df[col]=df[col].astype('category')
    return df



In [9]:

    
train_items=toCategorical(train_items)

Next step - dealing with missing data competitorPrice have missing data, it can be filled simply by mean, but rrp(recomended retail price) or price can describe rrp better than simple mean. Due to higher correlation betwen rrp and competitorPrice, we'll choose this pair for filling in missing data.



In [10]:

    
t1=train_items['rrp'].as_matrix()
t2=train_items['competitorPrice'].as_matrix()
t2_div_t1=t2/t1
t2_div_t1=t2_div_t1[np.logical_not(np.isnan(t2_div_t1))]
coef_competitorPrice_to_rrp=t2_div_t1.mean()
print coef_competitorPrice_to_rrp









    



0.713654175692

So we have mean ration between this parameters. Function below can work with both items and train_items, by adding certain flags as parameter. It's designed for RAM usage optimizarion. To fill in missing competitorPrice we multiply rrp by our precalculated ratio. For missing categories we'll add new ones.



In [11]:

    
def solveNA(df,df2,coef,flag):
    if flag==1:  
        df['pharmForm'] = df['pharmForm'].fillna('no_pharmForm')   
        df['category'] = df['category'].fillna(410)
        df['campaignIndex'] = df['campaignIndex'].fillna('D')
    else:  
        df['competitorPrice'] = df['competitorPrice'].fillna(df2['rrp']*coef)
        if 'pharmForm' in df.columns:
            df['pharmForm'] = df['pharmForm'].cat.add_categories(['no_pharmForm'])
            df['pharmForm'] = df['pharmForm'].fillna('no_pharmForm')
        if 'category' in df.columns:
            df['category'] = df['category'].cat.add_categories([410])
            df['category'] = df['category'].fillna(410)
        if 'campaignIndex' in df.columns:
            df['campaignIndex'] = df['campaignIndex'].cat.add_categories(['D'])
            df['campaignIndex'] = df['campaignIndex'].fillna('D')
    columns2=['category', 'manufacturer']
    for col2 in columns2:
        if col2 in df.columns:
            df[col2]=df[col2].astype('int')
            df[col2]=df[col2].astype('category')
    return df



In [12]:

    
items=solveNA(items,train_items,coef_competitorPrice_to_rrp,1)
train_items=solveNA(train_items,train_items,coef_competitorPrice_to_rrp,0)



In [13]:

    
train_items.info(null_counts=True)









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 2756003 entries, 0 to 2756002
Data columns (total 21 columns):
lineID             2756003 non-null int64
day                2756003 non-null int64
pid                2756003 non-null int64
adFlag             2756003 non-null int64
availability       2756003 non-null category
competitorPrice    2756003 non-null float64
click              2756003 non-null int64
basket             2756003 non-null int64
order              2756003 non-null int64
price              2756003 non-null float64
revenue            2756003 non-null float64
manufacturer       2756003 non-null category
group              2756003 non-null category
content            2756003 non-null category
unit               2756003 non-null category
pharmForm          2756003 non-null category
genericProduct     2756003 non-null int64
salesIndex         2756003 non-null category
category           2756003 non-null category
campaignIndex      2756003 non-null category
rrp                2756003 non-null float64
dtypes: category(9), float64(4), int64(8)
memory usage: 310.3 MB

So all categories have right datatypes and there is no missing data Next step is to encode categories with low cardinality as dummies



In [14]:

    
def Dummies(df):
    columns=['availability','unit','salesIndex','campaignIndex']
    dumm=pd.get_dummies(df[columns])
    df=pd.concat([df, dumm], axis=1)
    return df



In [15]:

    
train_items=Dummies(train_items)

Let's create some new feautures from existing



In [16]:

    
def moreFeautures(df):
    df['day_of_week']=df['day']%7
    df['discount']=df['price']/df['rrp']
    df['compDiscount']=df['competitorPrice']/df['price']
    return df



In [17]:

    
train_items=moreFeautures(train_items)



In [18]:

    
train_items.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 2756003 entries, 0 to 2756002
Data columns (total 44 columns):
lineID             int64
day                int64
pid                int64
adFlag             int64
availability       category
competitorPrice    float64
click              int64
basket             int64
order              int64
price              float64
revenue            float64
manufacturer       category
group              category
content            category
unit               category
pharmForm          category
genericProduct     int64
salesIndex         category
category           category
campaignIndex      category
rrp                float64
availability_1     uint8
availability_2     uint8
availability_3     uint8
availability_4     uint8
unit_CM            uint8
unit_G             uint8
unit_KG            uint8
unit_L             uint8
unit_M             uint8
unit_ML            uint8
unit_P             uint8
unit_ST            uint8
salesIndex_40      uint8
salesIndex_44      uint8
salesIndex_52      uint8
salesIndex_53      uint8
campaignIndex_A    uint8
campaignIndex_B    uint8
campaignIndex_C    uint8
campaignIndex_D    uint8
day_of_week        int64
discount           float64
compDiscount       float64
dtypes: category(9), float64(6), int64(9), uint8(20)
memory usage: 425.9 MB

So we've extended the number of collumns from 21 to 44 But we have some categories with high cardinality, it can't be transform the same way. It's possible to encode it using 'hashing trick' or scipy.sparse.coo_matrix but after several attempts I stopped replacing categorical data with average of target variable with same categories or number of times it appears in the train set and not only for one category but for intersection of two or three categories. Due to competition limitation click, basket, order can't be used in train, so we try to replace them too. Next function is the slowest in this project so I saved results into seperate file



In [19]:

    
import itertools as it
def solveCategorical(c1,df1,df2,flag):
    def f(x):
        tuples = [tuple(i) for i in x.values]
        tuples.reverse()
        res=[]
        for i in xrange(len(tuples)):
            try:
                res.append(t2[tuples.pop()])
            except:
                res.append(np.nan)
        return res
    columns=['group','content','pharmForm','category','manufacturer']
    for L in range(1, 4):
        for col in it.combinations(columns, L):
            print col
            t1=df1.groupby(list(col))
            t2=dict(t1[c1].mean()) 
            str1='_'.join(col)
            if len(col)==1:
                df2[c1+'_'+str1+'_mean']=df2[col[0]].map(t2)
            else:
                df2[c1+'_'+str1+'_mean']=f(df2[list(col)])
            if flag==1:
                t2=dict(t1[c1].count())
                if len(col)==1:
                    df2[c1+'_'+str1+'_count']=df2[col[0]].map(t2)
                else:
                    df2[c1+'_'+str1+'_count']=np.array(f(df2[list(col)]))/2756003.0
    return df2
#items = solveCategorical('revenue',train_items,items,1)
#items = solveCategorical('click',train_items,items,0)
#items = solveCategorical('basket',train_items,items,0)
#items = solveCategorical('order',train_items,items,0)
#items.to_csv('data/interim/items_v1')



In [20]:

    
items = pd.read_csv(data/interim/items_v1.csv')



In [21]:

    
items.info(max_cols=136)









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 22035 entries, 0 to 22034
Data columns (total 136 columns):
pid                                              22035 non-null int64
manufacturer                                     22035 non-null int64
group                                            22035 non-null object
content                                          22035 non-null object
unit                                             22035 non-null object
pharmForm                                        22035 non-null object
genericProduct                                   22035 non-null int64
salesIndex                                       22035 non-null int64
category                                         22035 non-null int64
campaignIndex                                    22035 non-null object
rrp                                              22035 non-null float64
revenue_group_mean                               22035 non-null float64
revenue_group_count                              22035 non-null int64
revenue_content_mean                             22035 non-null float64
revenue_content_count                            22035 non-null int64
revenue_pharmForm_mean                           22035 non-null float64
revenue_pharmForm_count                          22035 non-null int64
revenue_category_mean                            22035 non-null float64
revenue_category_count                           22035 non-null int64
revenue_manufacturer_mean                        22033 non-null float64
revenue_manufacturer_count                       22033 non-null float64
revenue_group_content_mean                       22034 non-null float64
revenue_group_content_count                      22034 non-null float64
revenue_group_pharmForm_mean                     22031 non-null float64
revenue_group_pharmForm_count                    22031 non-null float64
revenue_group_category_mean                      22033 non-null float64
revenue_group_category_count                     22033 non-null float64
revenue_group_manufacturer_mean                  22023 non-null float64
revenue_group_manufacturer_count                 22023 non-null float64
revenue_content_pharmForm_mean                   22032 non-null float64
revenue_content_pharmForm_count                  22032 non-null float64
revenue_content_category_mean                    22033 non-null float64
revenue_content_category_count                   22033 non-null float64
revenue_content_manufacturer_mean                22025 non-null float64
revenue_content_manufacturer_count               22025 non-null float64
revenue_pharmForm_category_mean                  22034 non-null float64
revenue_pharmForm_category_count                 22034 non-null float64
revenue_pharmForm_manufacturer_mean              22021 non-null float64
revenue_pharmForm_manufacturer_count             22021 non-null float64
revenue_category_manufacturer_mean               22026 non-null float64
revenue_category_manufacturer_count              22026 non-null float64
revenue_group_content_pharmForm_mean             22024 non-null float64
revenue_group_content_pharmForm_count            22024 non-null float64
revenue_group_content_category_mean              22024 non-null float64
revenue_group_content_category_count             22024 non-null float64
revenue_group_content_manufacturer_mean          22009 non-null float64
revenue_group_content_manufacturer_count         22009 non-null float64
revenue_group_pharmForm_category_mean            22027 non-null float64
revenue_group_pharmForm_category_count           22027 non-null float64
revenue_group_pharmForm_manufacturer_mean        22010 non-null float64
revenue_group_pharmForm_manufacturer_count       22010 non-null float64
revenue_group_category_manufacturer_mean         22011 non-null float64
revenue_group_category_manufacturer_count        22011 non-null float64
revenue_content_pharmForm_category_mean          22021 non-null float64
revenue_content_pharmForm_category_count         22021 non-null float64
revenue_content_pharmForm_manufacturer_mean      22004 non-null float64
revenue_content_pharmForm_manufacturer_count     22004 non-null float64
revenue_content_category_manufacturer_mean       22007 non-null float64
revenue_content_category_manufacturer_count      22007 non-null float64
revenue_pharmForm_category_manufacturer_mean     22006 non-null float64
revenue_pharmForm_category_manufacturer_count    22006 non-null float64
click_group_mean                                 22035 non-null float64
click_content_mean                               22035 non-null float64
click_pharmForm_mean                             22035 non-null float64
click_category_mean                              22035 non-null float64
click_manufacturer_mean                          22033 non-null float64
click_group_content_mean                         22034 non-null float64
click_group_pharmForm_mean                       22031 non-null float64
click_group_category_mean                        22033 non-null float64
click_group_manufacturer_mean                    22023 non-null float64
click_content_pharmForm_mean                     22032 non-null float64
click_content_category_mean                      22033 non-null float64
click_content_manufacturer_mean                  22025 non-null float64
click_pharmForm_category_mean                    22034 non-null float64
click_pharmForm_manufacturer_mean                22021 non-null float64
click_category_manufacturer_mean                 22026 non-null float64
click_group_content_pharmForm_mean               22024 non-null float64
click_group_content_category_mean                22024 non-null float64
click_group_content_manufacturer_mean            22009 non-null float64
click_group_pharmForm_category_mean              22027 non-null float64
click_group_pharmForm_manufacturer_mean          22010 non-null float64
click_group_category_manufacturer_mean           22011 non-null float64
click_content_pharmForm_category_mean            22021 non-null float64
click_content_pharmForm_manufacturer_mean        22004 non-null float64
click_content_category_manufacturer_mean         22007 non-null float64
click_pharmForm_category_manufacturer_mean       22006 non-null float64
basket_group_mean                                22035 non-null float64
basket_content_mean                              22035 non-null float64
basket_pharmForm_mean                            22035 non-null float64
basket_category_mean                             22035 non-null float64
basket_manufacturer_mean                         22033 non-null float64
basket_group_content_mean                        22034 non-null float64
basket_group_pharmForm_mean                      22031 non-null float64
basket_group_category_mean                       22033 non-null float64
basket_group_manufacturer_mean                   22023 non-null float64
basket_content_pharmForm_mean                    22032 non-null float64
basket_content_category_mean                     22033 non-null float64
basket_content_manufacturer_mean                 22025 non-null float64
basket_pharmForm_category_mean                   22034 non-null float64
basket_pharmForm_manufacturer_mean               22021 non-null float64
basket_category_manufacturer_mean                22026 non-null float64
basket_group_content_pharmForm_mean              22024 non-null float64
basket_group_content_category_mean               22024 non-null float64
basket_group_content_manufacturer_mean           22009 non-null float64
basket_group_pharmForm_category_mean             22027 non-null float64
basket_group_pharmForm_manufacturer_mean         22010 non-null float64
basket_group_category_manufacturer_mean          22011 non-null float64
basket_content_pharmForm_category_mean           22021 non-null float64
basket_content_pharmForm_manufacturer_mean       22004 non-null float64
basket_content_category_manufacturer_mean        22007 non-null float64
basket_pharmForm_category_manufacturer_mean      22006 non-null float64
order_group_mean                                 22035 non-null float64
order_content_mean                               22035 non-null float64
order_pharmForm_mean                             22035 non-null float64
order_category_mean                              22035 non-null float64
order_manufacturer_mean                          22033 non-null float64
order_group_content_mean                         22034 non-null float64
order_group_pharmForm_mean                       22031 non-null float64
order_group_category_mean                        22033 non-null float64
order_group_manufacturer_mean                    22023 non-null float64
order_content_pharmForm_mean                     22032 non-null float64
order_content_category_mean                      22033 non-null float64
order_content_manufacturer_mean                  22025 non-null float64
order_pharmForm_category_mean                    22034 non-null float64
order_pharmForm_manufacturer_mean                22021 non-null float64
order_category_manufacturer_mean                 22026 non-null float64
order_group_content_pharmForm_mean               22024 non-null float64
order_group_content_category_mean                22024 non-null float64
order_group_content_manufacturer_mean            22009 non-null float64
order_group_pharmForm_category_mean              22027 non-null float64
order_group_pharmForm_manufacturer_mean          22010 non-null float64
order_group_category_manufacturer_mean           22011 non-null float64
order_content_pharmForm_category_mean            22021 non-null float64
order_content_pharmForm_manufacturer_mean        22004 non-null float64
order_content_category_manufacturer_mean         22007 non-null float64
order_pharmForm_category_manufacturer_mean       22006 non-null float64
dtypes: float64(122), int64(9), object(5)
memory usage: 23.0+ MB

Not every intersection is present for all products but it's also true for test dataset, so there is no need to fix this. After some test revenue group turned out to be better for data representation so we add it to train_items in the following code.



In [22]:

    
items_pred=list(items.columns)
t1=['pid']
for p in items_pred:
    if 'revenue' in p:
        t1.append(p)
items_pred=t1



In [23]:

    
train_items=pd.merge(train_items,items[items_pred],on='pid')

This way we get the columns for the next stage



In [24]:

    
predictors=[ #'lineID',
             #'day',
             #'pid',
             'adFlag',
             #'availability',
             #'competitorPrice',
             #'click',
             #'basket',
             #'order',
             #'price',
             #'revenue',
             #'manufacturer',
             #'group',
             #'content',
             #'unit',
             #'pharmForm',
             'genericProduct',
             #'salesIndex',
             #'category',
             #'campaignIndex',
             'rrp',
             'availability_1',
             'availability_2',
             'availability_3',
             'availability_4',
             'unit_CM',
             'unit_G',
             'unit_KG',
             'unit_L',
             'unit_M',
             'unit_ML',
             'unit_P',
             'unit_ST',
             'salesIndex_40',
             'salesIndex_44',
             'salesIndex_52',
             'salesIndex_53',
             'campaignIndex_A',
             'campaignIndex_B',
             'campaignIndex_C',
             'day_of_week',
             'discount',
             'compDiscount',
             'revenue_group_mean',
             'revenue_group_count',
             'revenue_content_mean',
             'revenue_content_count',
             'revenue_pharmForm_mean',
             'revenue_pharmForm_count',
             'revenue_category_mean',
             'revenue_category_count',
             'revenue_manufacturer_mean',
             'revenue_manufacturer_count',
             'revenue_group_content_mean',
             'revenue_group_content_count',
             'revenue_group_pharmForm_mean',
             'revenue_group_pharmForm_count',
             'revenue_group_category_mean',
             'revenue_group_category_count',
             'revenue_group_manufacturer_mean',
             'revenue_group_manufacturer_count',
             'revenue_content_pharmForm_mean',
             'revenue_content_pharmForm_count',
             'revenue_content_category_mean',
             'revenue_content_category_count',
             'revenue_content_manufacturer_mean',
             'revenue_content_manufacturer_count',
             'revenue_pharmForm_category_mean',
             'revenue_pharmForm_category_count',
             'revenue_pharmForm_manufacturer_mean',
             'revenue_pharmForm_manufacturer_count',
             'revenue_category_manufacturer_mean',
             'revenue_category_manufacturer_count',
             'revenue_group_content_pharmForm_mean',
             'revenue_group_content_pharmForm_count',
             'revenue_group_content_category_mean',
             'revenue_group_content_category_count',
             'revenue_group_content_manufacturer_mean',
             'revenue_group_content_manufacturer_count',
             'revenue_group_pharmForm_category_mean',
             'revenue_group_pharmForm_category_count',
             'revenue_group_pharmForm_manufacturer_mean',
             'revenue_group_pharmForm_manufacturer_count',
             'revenue_group_category_manufacturer_mean',
             'revenue_group_category_manufacturer_count',
             'revenue_content_pharmForm_category_mean',
             'revenue_content_pharmForm_category_count',
             'revenue_content_pharmForm_manufacturer_mean',
             'revenue_content_pharmForm_manufacturer_count',
             'revenue_content_category_manufacturer_mean',
             'revenue_content_category_manufacturer_count',
             'revenue_pharmForm_category_manufacturer_mean',
             'revenue_pharmForm_category_manufacturer_count']

Due to the limited RAM we reduce our dataframe



In [25]:

    
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(train_items, train_size=0.1, test_size=0, random_state=42)



In [26]:

    
y_train = train_set['revenue']
x_train = train_set[predictors]

del train_items, items, test_set
gc.collect









    Out[26]:





<function gc.collect>



In [27]:

    
x_train.head(1)









    Out[27]:







  
    
      
      adFlag
      genericProduct
      rrp
      availability_1
      availability_2
      availability_3
      availability_4
      unit_CM
      unit_G
      unit_KG
      ...
      revenue_group_category_manufacturer_mean
      revenue_group_category_manufacturer_count
      revenue_content_pharmForm_category_mean
      revenue_content_pharmForm_category_count
      revenue_content_pharmForm_manufacturer_mean
      revenue_content_pharmForm_manufacturer_count
      revenue_content_category_manufacturer_mean
      revenue_content_category_manufacturer_count
      revenue_pharmForm_category_manufacturer_mean
      revenue_pharmForm_category_manufacturer_count
    
  
  
    
      309086
      0
      0
      0.66
      1
      0
      0
      0
      0
      1
      0
      ...
      0.847468
      0.000198
      0.847468
      0.000198
      0.847468
      0.000198
      0.847468
      0.000198
      0.847468
      0.000198
    
  

1 rows × 75 columns

The range of every column is very different, so we need to use scaler.



In [28]:

    
scaler = MinMaxScaler()
x_train=scaler.fit_transform(x_train)

Models

Now we are going to use regularized linear regression models from the scikit learn module.



In [29]:

    
from sklearn.model_selection import cross_val_score
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, x_train, y_train, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
model_ridge = linear_model.Ridge()
alphas = [1, 4, 5, 6, 7, 8, 9, 10, 11, 15]
cv_ridge = [rmse_cv(linear_model.Ridge(alpha = alpha)).mean() for alpha in alphas]



In [30]:

    
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot()
plt.xlabel("alpha")
plt.ylabel("rmse")









    Out[30]:





<matplotlib.text.Text at 0x7f5d14d20850>

A value of alpha = 6 is about right based on the plot above. Let' try out the Lasso model. We will do a slightly different approach here and use the built in Lasso CV to figure out the best alpha for us. For some reason the alphas in Lasso CV are really the inverse or the alphas in Ridge.



In [31]:

    
model_ridge = linear_model.Ridge(alpha=6, fit_intercept=True, max_iter=10000)
model_ridge.fit(x_train, y_train)
model_lasso = linear_model.LassoCV(alphas = [1, 0.16, 0.1, 0.001, 0.0005]).fit(x_train, y_train)



In [32]:

    
rmse_cv(model_lasso).mean()









    



/home/roman/anaconda2/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)






    Out[32]:





9.7217194633925317



In [33]:

    
cv_ridge.min()









    Out[33]:





9.7217060828070796

Ridge shows better result, but they are very close.



In [34]:

    
c = pd.Series(model_lasso.coef_, index = train_set[predictors].columns)
print("Lasso picked " + str(sum(c != 0)) + " variables and eliminated the other " +  str(sum(c == 0)) + " variables")









    



Lasso picked 50 variables and eliminated the other 25 variables



In [35]:

    
imp_coef = pd.concat([c.sort_values().head(10), c.sort_values().tail(10)])
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")









    Out[35]:





<matplotlib.text.Text at 0x7f5ce13e1890>



In [36]:

    
c = pd.Series(model_ridge.coef_, index = train_set[predictors].columns)
imp_coef = pd.concat([c.sort_values().head(10), c.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Ridge Model")









    Out[36]:





<matplotlib.text.Text at 0x7f5ce124d110>



In [37]:

    
del train_set, x_train, y_train
gc.collect









    Out[37]:





<function gc.collect>

Now let's return to full dataset and perform preprocessing using previous functions



In [38]:

    
items = pd.read_csv('data/raw/items.csv',sep='|')
train = pd.read_csv('data/raw/train.csv',sep='|')
train_items=pd.merge(train,items,on='pid')
del train
gc.collect









    Out[38]:





<function gc.collect>



In [39]:

    
train_items=toCategorical(train_items)
train_items=solveNA(train_items,train_items,coef_competitorPrice_to_rrp,0)
train_items=Dummies(train_items)
train_items=moreFeautures(train_items)

items = pd.read_csv('data/interim/items_v1.csv')
train_items=pd.merge(train_items,items[items_pred],on='pid')
y_train = train_items['revenue']
x_train = train_items[predictors]

del train_items, items
gc.collect









    Out[39]:





<function gc.collect>



In [41]:

    
scaler = MinMaxScaler()
x_train=scaler.fit_transform(x_train)

After preprocessing we can start fitting our models with best alpha from previous step



In [42]:

    
model_lasso = linear_model.LassoCV(alphas = [1, 0.16, 0.1, 0.001, 0.0005]).fit(x_train, y_train)



In [43]:

    
model_ridge = linear_model.Ridge(alpha=6, fit_intercept=True, max_iter=10000)
model_ridge.fit(x_train, y_train)









    Out[43]:





Ridge(alpha=6, copy_X=True, fit_intercept=True, max_iter=10000,
   normalize=False, random_state=None, solver='auto', tol=0.001)



In [44]:

    
del y_train,x_train
gc.collect









    Out[44]:





<function gc.collect>

So now we have fitted models, let's load and preprocess our class data.



In [45]:

    
items = pd.read_csv(
'data/raw/items.csv',sep='|')
clas = pd.read_csv(
'data/raw/class.csv',sep='|')
clas_items=pd.merge(clas,items,on='pid')
clas_items=toCategorical(clas_items)
clas_items=solveNA(clas_items,clas_items,coef_competitorPrice_to_rrp,0)
clas_items=Dummies(clas_items)
clas_items=moreFeautures(clas_items)
items = pd.read_csv(
'data/interim/items_v1.csv')
clas_items=pd.merge(clas_items,items[items_pred],on='pid')

submission = pd.DataFrame({
        "lineID": clas_items["lineID"],
        "revenue": np.zeros(shape=(1210767,))
})

x_test = clas_items[predictors]
del clas_items,clas,items
gc.collect









    Out[45]:





<function gc.collect>

In case of missing data appearance we can conclude that it's the new manufacturer that was not in the training set but it can appear in test



In [46]:

    
x_test = x_test.fillna(x_test.mean())
x_test = scaler.fit_transform(x_test)

And finaly we can make prediction of revenue



In [47]:

    
lasso_preds = model_lasso.predict(x_test)
ridge_preds = model_ridge.predict(x_test)

In postprocessing we fill in all negative values with zero, because of business logic. After that we mix output from two models, sort and save our submision.



In [49]:

    
predictions = pd.DataFrame({"ridge":ridge_preds, "lasso":lasso_preds})
predictions[predictions['ridge']<0]=0
predictions[predictions['lasso']<0]=0

lasso_preds = predictions['lasso']
ridge_preds = predictions['ridge']
com_pred = (lasso_preds + ridge_preds) / 2.0

submission['revenue']=com_pred
submission_sorted=submission.sort_values("lineID");
submission_sorted.to_csv("data/Uni_Polytechnic_Lviv_1.csv", index=False, sep='|')



In [50]:

    
submission_sorted









    Out[50]:







  
    
      
      lineID
      revenue
    
  
  
    
      0
      1
      2.918315
    
    
      314
      2
      0.085515
    
    
      688
      3
      1.153730
    
    
      1137
      4
      0.289552
    
    
      1292
      5
      1.545264
    
    
      1779
      6
      2.567740
    
    
      1883
      7
      0.551230
    
    
      2950
      8
      1.073107
    
    
      3333
      9
      3.508954
    
    
      3701
      10
      4.274228
    
    
      4317
      11
      0.913259
    
    
      5045
      12
      1.136892
    
    
      5435
      13
      1.164143
    
    
      5838
      14
      2.212649
    
    
      6114
      15
      1.134349
    
    
      8729
      16
      11.082580
    
    
      8807
      17
      3.028414
    
    
      9076
      18
      5.441399
    
    
      10345
      19
      0.462155
    
    
      10595
      20
      2.880210
    
    
      10956
      21
      3.223039
    
    
      11011
      22
      6.188713
    
    
      11041
      23
      2.005597
    
    
      11063
      24
      8.900715
    
    
      11130
      25
      1.641799
    
    
      13209
      26
      1.250267
    
    
      13299
      27
      3.518609
    
    
      15137
      28
      5.461279
    
    
      18499
      29
      5.780450
    
    
      19023
      30
      2.679875
    
    
      ...
      ...
      ...
    
    
      483406
      1210738
      1.601418
    
    
      1046815
      1210739
      2.388170
    
    
      43369
      1210740
      0.220250
    
    
      111016
      1210741
      1.419405
    
    
      610455
      1210742
      2.097951
    
    
      1159101
      1210743
      1.811689
    
    
      678894
      1210744
      2.091872
    
    
      97966
      1210745
      2.416438
    
    
      807746
      1210746
      1.729460
    
    
      1193039
      1210747
      1.695809
    
    
      131637
      1210748
      8.831485
    
    
      368371
      1210749
      3.000849
    
    
      204785
      1210750
      0.200165
    
    
      283733
      1210751
      0.000000
    
    
      524147
      1210752
      6.826647
    
    
      3332
      1210753
      1.156671
    
    
      675250
      1210754
      7.610855
    
    
      225324
      1210755
      5.195631
    
    
      125268
      1210756
      2.151450
    
    
      634274
      1210757
      5.014182
    
    
      203936
      1210758
      3.267051
    
    
      43370
      1210759
      0.220250
    
    
      513052
      1210760
      1.856578
    
    
      122835
      1210761
      0.354526
    
    
      1068349
      1210762
      3.249229
    
    
      737912
      1210763
      3.362747
    
    
      266596
      1210764
      4.080373
    
    
      955362
      1210765
      2.334744
    
    
      131163
      1210766
      3.869969
    
    
      111017
      1210767
      1.419394
    
  

1210767 rows × 2 columns

	lineID	day	pid	adFlag	availability	competitorPrice	click	basket	order	price	revenue
0	1	1	6570	0	2	14.60	1	0	0	16.89	0.00
1	2	1	14922	1	1	8.57	0	1	0	8.75	0.00
2	3	1	16382	0	1	14.77	0	1	0	16.06	0.00
3	4	1	1145	1	1	6.59	0	0	1	6.55	6.55
4	5	1	3394	0	1	4.39	0	0	1	4.14	4.14

	pid	manufacturer	group	content	unit	pharmForm	salesIndex	category	campaignIndex	rrp
0	1	1	2FOI	80	ST	TAB	40	NaN	NaN	10.89
1	2	1	2FOI	80	ST	Tab	40	NaN	NaN	10.89
2	3	1	2FOI	10	G	GLO	40	NaN	NaN	16.45
3	4	1	2FOI	80	ST	TAB	40	NaN	NaN	10.89
4	5	2	10OJ03JS	8	ST	PUL	40	1.0	NaN	22.53

	lineID	revenue
0	1	2.918315
314	2	0.085515
688	3	1.153730
1137	4	0.289552
1292	5	1.545264
1779	6	2.567740
1883	7	0.551230
2950	8	1.073107
3333	9	3.508954
3701	10	4.274228
4317	11	0.913259
5045	12	1.136892
5435	13	1.164143
5838	14	2.212649
6114	15	1.134349
8729	16	11.082580
8807	17	3.028414
9076	18	5.441399
10345	19	0.462155
10595	20	2.880210
10956	21	3.223039
11011	22	6.188713
11041	23	2.005597
11063	24	8.900715
11130	25	1.641799
13209	26	1.250267
13299	27	3.518609
15137	28	5.461279
18499	29	5.780450
19023	30	2.679875
...	...	...
483406	1210738	1.601418
1046815	1210739	2.388170
43369	1210740	0.220250
111016	1210741	1.419405
610455	1210742	2.097951
1159101	1210743	1.811689
678894	1210744	2.091872
97966	1210745	2.416438
807746	1210746	1.729460
1193039	1210747	1.695809
131637	1210748	8.831485
368371	1210749	3.000849
204785	1210750	0.200165
283733	1210751	0.000000
524147	1210752	6.826647
3332	1210753	1.156671
675250	1210754	7.610855
225324	1210755	5.195631
125268	1210756	2.151450
634274	1210757	5.014182
203936	1210758	3.267051
43370	1210759	0.220250
513052	1210760	1.856578
122835	1210761	0.354526
1068349	1210762	3.249229
737912	1210763	3.362747
266596	1210764	4.080373
955362	1210765	2.334744
131163	1210766	3.869969
111017	1210767	1.419394