Explore data

This project, participant of Data Mining Cup 2017, analyzes the set in order to recognize existing relationships of dynamic prices strategies and creates the model that can predict revenues given certain training set. Let's start by importing some libraries and examining the data.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
import os

#chanhe this manualy
path = '/media/roman/Main/Programing/contest/dmc2017/dmc-2017/'
os.chdir(path)

In [2]:
items = pd.read_csv('data/raw/items.csv',sep='|')
train = pd.read_csv('data/raw/train.csv',sep='|')

In [3]:
train.head(5)


Out[3]:
lineID day pid adFlag availability competitorPrice click basket order price revenue
0 1 1 6570 0 2 14.60 1 0 0 16.89 0.00
1 2 1 14922 1 1 8.57 0 1 0 8.75 0.00
2 3 1 16382 0 1 14.77 0 1 0 16.06 0.00
3 4 1 1145 1 1 6.59 0 0 1 6.55 6.55
4 5 1 3394 0 1 4.39 0 0 1 4.14 4.14

In [4]:
items.head(5)


Out[4]:
pid manufacturer group content unit pharmForm genericProduct salesIndex category campaignIndex rrp
0 1 1 2FOI 80 ST TAB 0 40 NaN NaN 10.89
1 2 1 2FOI 80 ST Tab 0 40 NaN NaN 10.89
2 3 1 2FOI 10 G GLO 0 40 NaN NaN 16.45
3 4 1 2FOI 80 ST TAB 0 40 NaN NaN 10.89
4 5 2 10OJ03JS 8 ST PUL 0 40 1.0 NaN 22.53

In [5]:
train.info(null_counts=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756003 entries, 0 to 2756002
Data columns (total 11 columns):
lineID             2756003 non-null int64
day                2756003 non-null int64
pid                2756003 non-null int64
adFlag             2756003 non-null int64
availability       2756003 non-null int64
competitorPrice    2655316 non-null float64
click              2756003 non-null int64
basket             2756003 non-null int64
order              2756003 non-null int64
price              2756003 non-null float64
revenue            2756003 non-null float64
dtypes: float64(3), int64(8)
memory usage: 231.3 MB

In [6]:
items.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22035 entries, 0 to 22034
Data columns (total 11 columns):
pid               22035 non-null int64
manufacturer      22035 non-null int64
group             22035 non-null object
content           22035 non-null object
unit              22035 non-null object
pharmForm         19708 non-null object
genericProduct    22035 non-null int64
salesIndex        22035 non-null int64
category          17408 non-null float64
campaignIndex     1338 non-null object
rrp               22035 non-null float64
dtypes: float64(2), int64(4), object(5)
memory usage: 1.8+ MB
Items dataframe stores additional information about product and can be connected with main dataframe train using column 'pid', after which we can delete train

In [7]:
train_items=pd.merge(train,items,on='pid')
del train
gc.collect


Out[7]:
<function gc.collect>
At this stage we can see some problems: - some categorical feautures stored as int64 or object - missing data in some columns

Preprocessing

Let's proceed by transforming some columns to categories.

In [8]:
def toCategorical(df):
    columns=['availability','group','content','unit','pharmForm',
             'campaignIndex','salesIndex', 'category', 'manufacturer']
    for col in columns:
        if col in df.columns:
            df[col]=df[col].astype('category')
    return df

In [9]:
train_items=toCategorical(train_items)
Next step - dealing with missing data competitorPrice have missing data, it can be filled simply by mean, but rrp(recomended retail price) or price can describe rrp better than simple mean. Due to higher correlation betwen rrp and competitorPrice, we'll choose this pair for filling in missing data.

In [10]:
t1=train_items['rrp'].as_matrix()
t2=train_items['competitorPrice'].as_matrix()
t2_div_t1=t2/t1
t2_div_t1=t2_div_t1[np.logical_not(np.isnan(t2_div_t1))]
coef_competitorPrice_to_rrp=t2_div_t1.mean()
print coef_competitorPrice_to_rrp


0.713654175692
So we have mean ration between this parameters. Function below can work with both items and train_items, by adding certain flags as parameter. It's designed for RAM usage optimizarion. To fill in missing competitorPrice we multiply rrp by our precalculated ratio. For missing categories we'll add new ones.

In [11]:
def solveNA(df,df2,coef,flag):
    if flag==1:  
        df['pharmForm'] = df['pharmForm'].fillna('no_pharmForm')   
        df['category'] = df['category'].fillna(410)
        df['campaignIndex'] = df['campaignIndex'].fillna('D')
    else:  
        df['competitorPrice'] = df['competitorPrice'].fillna(df2['rrp']*coef)
        if 'pharmForm' in df.columns:
            df['pharmForm'] = df['pharmForm'].cat.add_categories(['no_pharmForm'])
            df['pharmForm'] = df['pharmForm'].fillna('no_pharmForm')
        if 'category' in df.columns:
            df['category'] = df['category'].cat.add_categories([410])
            df['category'] = df['category'].fillna(410)
        if 'campaignIndex' in df.columns:
            df['campaignIndex'] = df['campaignIndex'].cat.add_categories(['D'])
            df['campaignIndex'] = df['campaignIndex'].fillna('D')
    columns2=['category', 'manufacturer']
    for col2 in columns2:
        if col2 in df.columns:
            df[col2]=df[col2].astype('int')
            df[col2]=df[col2].astype('category')
    return df

In [12]:
items=solveNA(items,train_items,coef_competitorPrice_to_rrp,1)
train_items=solveNA(train_items,train_items,coef_competitorPrice_to_rrp,0)

In [13]:
train_items.info(null_counts=True)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2756003 entries, 0 to 2756002
Data columns (total 21 columns):
lineID             2756003 non-null int64
day                2756003 non-null int64
pid                2756003 non-null int64
adFlag             2756003 non-null int64
availability       2756003 non-null category
competitorPrice    2756003 non-null float64
click              2756003 non-null int64
basket             2756003 non-null int64
order              2756003 non-null int64
price              2756003 non-null float64
revenue            2756003 non-null float64
manufacturer       2756003 non-null category
group              2756003 non-null category
content            2756003 non-null category
unit               2756003 non-null category
pharmForm          2756003 non-null category
genericProduct     2756003 non-null int64
salesIndex         2756003 non-null category
category           2756003 non-null category
campaignIndex      2756003 non-null category
rrp                2756003 non-null float64
dtypes: category(9), float64(4), int64(8)
memory usage: 310.3 MB
So all categories have right datatypes and there is no missing data Next step is to encode categories with low cardinality as dummies

In [14]:
def Dummies(df):
    columns=['availability','unit','salesIndex','campaignIndex']
    dumm=pd.get_dummies(df[columns])
    df=pd.concat([df, dumm], axis=1)
    return df

In [15]:
train_items=Dummies(train_items)
Let's create some new feautures from existing

In [16]:
def moreFeautures(df):
    df['day_of_week']=df['day']%7
    df['discount']=df['price']/df['rrp']
    df['compDiscount']=df['competitorPrice']/df['price']
    return df

In [17]:
train_items=moreFeautures(train_items)

In [18]:
train_items.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2756003 entries, 0 to 2756002
Data columns (total 44 columns):
lineID             int64
day                int64
pid                int64
adFlag             int64
availability       category
competitorPrice    float64
click              int64
basket             int64
order              int64
price              float64
revenue            float64
manufacturer       category
group              category
content            category
unit               category
pharmForm          category
genericProduct     int64
salesIndex         category
category           category
campaignIndex      category
rrp                float64
availability_1     uint8
availability_2     uint8
availability_3     uint8
availability_4     uint8
unit_CM            uint8
unit_G             uint8
unit_KG            uint8
unit_L             uint8
unit_M             uint8
unit_ML            uint8
unit_P             uint8
unit_ST            uint8
salesIndex_40      uint8
salesIndex_44      uint8
salesIndex_52      uint8
salesIndex_53      uint8
campaignIndex_A    uint8
campaignIndex_B    uint8
campaignIndex_C    uint8
campaignIndex_D    uint8
day_of_week        int64
discount           float64
compDiscount       float64
dtypes: category(9), float64(6), int64(9), uint8(20)
memory usage: 425.9 MB
So we've extended the number of collumns from 21 to 44 But we have some categories with high cardinality, it can't be transform the same way. It's possible to encode it using 'hashing trick' or scipy.sparse.coo_matrix but after several attempts I stopped replacing categorical data with average of target variable with same categories or number of times it appears in the train set and not only for one category but for intersection of two or three categories. Due to competition limitation click, basket, order can't be used in train, so we try to replace them too. Next function is the slowest in this project so I saved results into seperate file

In [19]:
import itertools as it
def solveCategorical(c1,df1,df2,flag):
    def f(x):
        tuples = [tuple(i) for i in x.values]
        tuples.reverse()
        res=[]
        for i in xrange(len(tuples)):
            try:
                res.append(t2[tuples.pop()])
            except:
                res.append(np.nan)
        return res
    columns=['group','content','pharmForm','category','manufacturer']
    for L in range(1, 4):
        for col in it.combinations(columns, L):
            print col
            t1=df1.groupby(list(col))
            t2=dict(t1[c1].mean()) 
            str1='_'.join(col)
            if len(col)==1:
                df2[c1+'_'+str1+'_mean']=df2[col[0]].map(t2)
            else:
                df2[c1+'_'+str1+'_mean']=f(df2[list(col)])
            if flag==1:
                t2=dict(t1[c1].count())
                if len(col)==1:
                    df2[c1+'_'+str1+'_count']=df2[col[0]].map(t2)
                else:
                    df2[c1+'_'+str1+'_count']=np.array(f(df2[list(col)]))/2756003.0
    return df2
#items = solveCategorical('revenue',train_items,items,1)
#items = solveCategorical('click',train_items,items,0)
#items = solveCategorical('basket',train_items,items,0)
#items = solveCategorical('order',train_items,items,0)
#items.to_csv('data/interim/items_v1')

In [20]:
items = pd.read_csv(data/interim/items_v1.csv')

In [21]:
items.info(max_cols=136)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 22035 entries, 0 to 22034
Data columns (total 136 columns):
pid                                              22035 non-null int64
manufacturer                                     22035 non-null int64
group                                            22035 non-null object
content                                          22035 non-null object
unit                                             22035 non-null object
pharmForm                                        22035 non-null object
genericProduct                                   22035 non-null int64
salesIndex                                       22035 non-null int64
category                                         22035 non-null int64
campaignIndex                                    22035 non-null object
rrp                                              22035 non-null float64
revenue_group_mean                               22035 non-null float64
revenue_group_count                              22035 non-null int64
revenue_content_mean                             22035 non-null float64
revenue_content_count                            22035 non-null int64
revenue_pharmForm_mean                           22035 non-null float64
revenue_pharmForm_count                          22035 non-null int64
revenue_category_mean                            22035 non-null float64
revenue_category_count                           22035 non-null int64
revenue_manufacturer_mean                        22033 non-null float64
revenue_manufacturer_count                       22033 non-null float64
revenue_group_content_mean                       22034 non-null float64
revenue_group_content_count                      22034 non-null float64
revenue_group_pharmForm_mean                     22031 non-null float64
revenue_group_pharmForm_count                    22031 non-null float64
revenue_group_category_mean                      22033 non-null float64
revenue_group_category_count                     22033 non-null float64
revenue_group_manufacturer_mean                  22023 non-null float64
revenue_group_manufacturer_count                 22023 non-null float64
revenue_content_pharmForm_mean                   22032 non-null float64
revenue_content_pharmForm_count                  22032 non-null float64
revenue_content_category_mean                    22033 non-null float64
revenue_content_category_count                   22033 non-null float64
revenue_content_manufacturer_mean                22025 non-null float64
revenue_content_manufacturer_count               22025 non-null float64
revenue_pharmForm_category_mean                  22034 non-null float64
revenue_pharmForm_category_count                 22034 non-null float64
revenue_pharmForm_manufacturer_mean              22021 non-null float64
revenue_pharmForm_manufacturer_count             22021 non-null float64
revenue_category_manufacturer_mean               22026 non-null float64
revenue_category_manufacturer_count              22026 non-null float64
revenue_group_content_pharmForm_mean             22024 non-null float64
revenue_group_content_pharmForm_count            22024 non-null float64
revenue_group_content_category_mean              22024 non-null float64
revenue_group_content_category_count             22024 non-null float64
revenue_group_content_manufacturer_mean          22009 non-null float64
revenue_group_content_manufacturer_count         22009 non-null float64
revenue_group_pharmForm_category_mean            22027 non-null float64
revenue_group_pharmForm_category_count           22027 non-null float64
revenue_group_pharmForm_manufacturer_mean        22010 non-null float64
revenue_group_pharmForm_manufacturer_count       22010 non-null float64
revenue_group_category_manufacturer_mean         22011 non-null float64
revenue_group_category_manufacturer_count        22011 non-null float64
revenue_content_pharmForm_category_mean          22021 non-null float64
revenue_content_pharmForm_category_count         22021 non-null float64
revenue_content_pharmForm_manufacturer_mean      22004 non-null float64
revenue_content_pharmForm_manufacturer_count     22004 non-null float64
revenue_content_category_manufacturer_mean       22007 non-null float64
revenue_content_category_manufacturer_count      22007 non-null float64
revenue_pharmForm_category_manufacturer_mean     22006 non-null float64
revenue_pharmForm_category_manufacturer_count    22006 non-null float64
click_group_mean                                 22035 non-null float64
click_content_mean                               22035 non-null float64
click_pharmForm_mean                             22035 non-null float64
click_category_mean                              22035 non-null float64
click_manufacturer_mean                          22033 non-null float64
click_group_content_mean                         22034 non-null float64
click_group_pharmForm_mean                       22031 non-null float64
click_group_category_mean                        22033 non-null float64
click_group_manufacturer_mean                    22023 non-null float64
click_content_pharmForm_mean                     22032 non-null float64
click_content_category_mean                      22033 non-null float64
click_content_manufacturer_mean                  22025 non-null float64
click_pharmForm_category_mean                    22034 non-null float64
click_pharmForm_manufacturer_mean                22021 non-null float64
click_category_manufacturer_mean                 22026 non-null float64
click_group_content_pharmForm_mean               22024 non-null float64
click_group_content_category_mean                22024 non-null float64
click_group_content_manufacturer_mean            22009 non-null float64
click_group_pharmForm_category_mean              22027 non-null float64
click_group_pharmForm_manufacturer_mean          22010 non-null float64
click_group_category_manufacturer_mean           22011 non-null float64
click_content_pharmForm_category_mean            22021 non-null float64
click_content_pharmForm_manufacturer_mean        22004 non-null float64
click_content_category_manufacturer_mean         22007 non-null float64
click_pharmForm_category_manufacturer_mean       22006 non-null float64
basket_group_mean                                22035 non-null float64
basket_content_mean                              22035 non-null float64
basket_pharmForm_mean                            22035 non-null float64
basket_category_mean                             22035 non-null float64
basket_manufacturer_mean                         22033 non-null float64
basket_group_content_mean                        22034 non-null float64
basket_group_pharmForm_mean                      22031 non-null float64
basket_group_category_mean                       22033 non-null float64
basket_group_manufacturer_mean                   22023 non-null float64
basket_content_pharmForm_mean                    22032 non-null float64
basket_content_category_mean                     22033 non-null float64
basket_content_manufacturer_mean                 22025 non-null float64
basket_pharmForm_category_mean                   22034 non-null float64
basket_pharmForm_manufacturer_mean               22021 non-null float64
basket_category_manufacturer_mean                22026 non-null float64
basket_group_content_pharmForm_mean              22024 non-null float64
basket_group_content_category_mean               22024 non-null float64
basket_group_content_manufacturer_mean           22009 non-null float64
basket_group_pharmForm_category_mean             22027 non-null float64
basket_group_pharmForm_manufacturer_mean         22010 non-null float64
basket_group_category_manufacturer_mean          22011 non-null float64
basket_content_pharmForm_category_mean           22021 non-null float64
basket_content_pharmForm_manufacturer_mean       22004 non-null float64
basket_content_category_manufacturer_mean        22007 non-null float64
basket_pharmForm_category_manufacturer_mean      22006 non-null float64
order_group_mean                                 22035 non-null float64
order_content_mean                               22035 non-null float64
order_pharmForm_mean                             22035 non-null float64
order_category_mean                              22035 non-null float64
order_manufacturer_mean                          22033 non-null float64
order_group_content_mean                         22034 non-null float64
order_group_pharmForm_mean                       22031 non-null float64
order_group_category_mean                        22033 non-null float64
order_group_manufacturer_mean                    22023 non-null float64
order_content_pharmForm_mean                     22032 non-null float64
order_content_category_mean                      22033 non-null float64
order_content_manufacturer_mean                  22025 non-null float64
order_pharmForm_category_mean                    22034 non-null float64
order_pharmForm_manufacturer_mean                22021 non-null float64
order_category_manufacturer_mean                 22026 non-null float64
order_group_content_pharmForm_mean               22024 non-null float64
order_group_content_category_mean                22024 non-null float64
order_group_content_manufacturer_mean            22009 non-null float64
order_group_pharmForm_category_mean              22027 non-null float64
order_group_pharmForm_manufacturer_mean          22010 non-null float64
order_group_category_manufacturer_mean           22011 non-null float64
order_content_pharmForm_category_mean            22021 non-null float64
order_content_pharmForm_manufacturer_mean        22004 non-null float64
order_content_category_manufacturer_mean         22007 non-null float64
order_pharmForm_category_manufacturer_mean       22006 non-null float64
dtypes: float64(122), int64(9), object(5)
memory usage: 23.0+ MB
Not every intersection is present for all products but it's also true for test dataset, so there is no need to fix this. After some test revenue group turned out to be better for data representation so we add it to train_items in the following code.

In [22]:
items_pred=list(items.columns)
t1=['pid']
for p in items_pred:
    if 'revenue' in p:
        t1.append(p)
items_pred=t1

In [23]:
train_items=pd.merge(train_items,items[items_pred],on='pid')
This way we get the columns for the next stage

In [24]:
predictors=[ #'lineID',
             #'day',
             #'pid',
             'adFlag',
             #'availability',
             #'competitorPrice',
             #'click',
             #'basket',
             #'order',
             #'price',
             #'revenue',
             #'manufacturer',
             #'group',
             #'content',
             #'unit',
             #'pharmForm',
             'genericProduct',
             #'salesIndex',
             #'category',
             #'campaignIndex',
             'rrp',
             'availability_1',
             'availability_2',
             'availability_3',
             'availability_4',
             'unit_CM',
             'unit_G',
             'unit_KG',
             'unit_L',
             'unit_M',
             'unit_ML',
             'unit_P',
             'unit_ST',
             'salesIndex_40',
             'salesIndex_44',
             'salesIndex_52',
             'salesIndex_53',
             'campaignIndex_A',
             'campaignIndex_B',
             'campaignIndex_C',
             'day_of_week',
             'discount',
             'compDiscount',
             'revenue_group_mean',
             'revenue_group_count',
             'revenue_content_mean',
             'revenue_content_count',
             'revenue_pharmForm_mean',
             'revenue_pharmForm_count',
             'revenue_category_mean',
             'revenue_category_count',
             'revenue_manufacturer_mean',
             'revenue_manufacturer_count',
             'revenue_group_content_mean',
             'revenue_group_content_count',
             'revenue_group_pharmForm_mean',
             'revenue_group_pharmForm_count',
             'revenue_group_category_mean',
             'revenue_group_category_count',
             'revenue_group_manufacturer_mean',
             'revenue_group_manufacturer_count',
             'revenue_content_pharmForm_mean',
             'revenue_content_pharmForm_count',
             'revenue_content_category_mean',
             'revenue_content_category_count',
             'revenue_content_manufacturer_mean',
             'revenue_content_manufacturer_count',
             'revenue_pharmForm_category_mean',
             'revenue_pharmForm_category_count',
             'revenue_pharmForm_manufacturer_mean',
             'revenue_pharmForm_manufacturer_count',
             'revenue_category_manufacturer_mean',
             'revenue_category_manufacturer_count',
             'revenue_group_content_pharmForm_mean',
             'revenue_group_content_pharmForm_count',
             'revenue_group_content_category_mean',
             'revenue_group_content_category_count',
             'revenue_group_content_manufacturer_mean',
             'revenue_group_content_manufacturer_count',
             'revenue_group_pharmForm_category_mean',
             'revenue_group_pharmForm_category_count',
             'revenue_group_pharmForm_manufacturer_mean',
             'revenue_group_pharmForm_manufacturer_count',
             'revenue_group_category_manufacturer_mean',
             'revenue_group_category_manufacturer_count',
             'revenue_content_pharmForm_category_mean',
             'revenue_content_pharmForm_category_count',
             'revenue_content_pharmForm_manufacturer_mean',
             'revenue_content_pharmForm_manufacturer_count',
             'revenue_content_category_manufacturer_mean',
             'revenue_content_category_manufacturer_count',
             'revenue_pharmForm_category_manufacturer_mean',
             'revenue_pharmForm_category_manufacturer_count']
Due to the limited RAM we reduce our dataframe

In [25]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(train_items, train_size=0.1, test_size=0, random_state=42)

In [26]:
y_train = train_set['revenue']
x_train = train_set[predictors]

del train_items, items, test_set
gc.collect


Out[26]:
<function gc.collect>

In [27]:
x_train.head(1)


Out[27]:
adFlag genericProduct rrp availability_1 availability_2 availability_3 availability_4 unit_CM unit_G unit_KG ... revenue_group_category_manufacturer_mean revenue_group_category_manufacturer_count revenue_content_pharmForm_category_mean revenue_content_pharmForm_category_count revenue_content_pharmForm_manufacturer_mean revenue_content_pharmForm_manufacturer_count revenue_content_category_manufacturer_mean revenue_content_category_manufacturer_count revenue_pharmForm_category_manufacturer_mean revenue_pharmForm_category_manufacturer_count
309086 0 0 0.66 1 0 0 0 0 1 0 ... 0.847468 0.000198 0.847468 0.000198 0.847468 0.000198 0.847468 0.000198 0.847468 0.000198

1 rows × 75 columns

The range of every column is very different, so we need to use scaler.

In [28]:
scaler = MinMaxScaler()
x_train=scaler.fit_transform(x_train)

Models

Now we are going to use regularized linear regression models from the scikit learn module.

In [29]:
from sklearn.model_selection import cross_val_score
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, x_train, y_train, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
model_ridge = linear_model.Ridge()
alphas = [1, 4, 5, 6, 7, 8, 9, 10, 11, 15]
cv_ridge = [rmse_cv(linear_model.Ridge(alpha = alpha)).mean() for alpha in alphas]

In [30]:
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot()
plt.xlabel("alpha")
plt.ylabel("rmse")


Out[30]:
<matplotlib.text.Text at 0x7f5d14d20850>
A value of alpha = 6 is about right based on the plot above. Let' try out the Lasso model. We will do a slightly different approach here and use the built in Lasso CV to figure out the best alpha for us. For some reason the alphas in Lasso CV are really the inverse or the alphas in Ridge.

In [31]:
model_ridge = linear_model.Ridge(alpha=6, fit_intercept=True, max_iter=10000)
model_ridge.fit(x_train, y_train)
model_lasso = linear_model.LassoCV(alphas = [1, 0.16, 0.1, 0.001, 0.0005]).fit(x_train, y_train)

In [32]:
rmse_cv(model_lasso).mean()


/home/roman/anaconda2/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
Out[32]:
9.7217194633925317

In [33]:
cv_ridge.min()


Out[33]:
9.7217060828070796
Ridge shows better result, but they are very close.

In [34]:
c = pd.Series(model_lasso.coef_, index = train_set[predictors].columns)
print("Lasso picked " + str(sum(c != 0)) + " variables and eliminated the other " +  str(sum(c == 0)) + " variables")


Lasso picked 50 variables and eliminated the other 25 variables

In [35]:
imp_coef = pd.concat([c.sort_values().head(10), c.sort_values().tail(10)])
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")


Out[35]:
<matplotlib.text.Text at 0x7f5ce13e1890>

In [36]:
c = pd.Series(model_ridge.coef_, index = train_set[predictors].columns)
imp_coef = pd.concat([c.sort_values().head(10), c.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Ridge Model")


Out[36]:
<matplotlib.text.Text at 0x7f5ce124d110>

In [37]:
del train_set, x_train, y_train
gc.collect


Out[37]:
<function gc.collect>
Now let's return to full dataset and perform preprocessing using previous functions

In [38]:
items = pd.read_csv('data/raw/items.csv',sep='|')
train = pd.read_csv('data/raw/train.csv',sep='|')
train_items=pd.merge(train,items,on='pid')
del train
gc.collect


Out[38]:
<function gc.collect>

In [39]:
train_items=toCategorical(train_items)
train_items=solveNA(train_items,train_items,coef_competitorPrice_to_rrp,0)
train_items=Dummies(train_items)
train_items=moreFeautures(train_items)

items = pd.read_csv('data/interim/items_v1.csv')
train_items=pd.merge(train_items,items[items_pred],on='pid')
y_train = train_items['revenue']
x_train = train_items[predictors]

del train_items, items
gc.collect


Out[39]:
<function gc.collect>

In [41]:
scaler = MinMaxScaler()
x_train=scaler.fit_transform(x_train)
After preprocessing we can start fitting our models with best alpha from previous step

In [42]:
model_lasso = linear_model.LassoCV(alphas = [1, 0.16, 0.1, 0.001, 0.0005]).fit(x_train, y_train)

In [43]:
model_ridge = linear_model.Ridge(alpha=6, fit_intercept=True, max_iter=10000)
model_ridge.fit(x_train, y_train)


Out[43]:
Ridge(alpha=6, copy_X=True, fit_intercept=True, max_iter=10000,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [44]:
del y_train,x_train
gc.collect


Out[44]:
<function gc.collect>
So now we have fitted models, let's load and preprocess our class data.

In [45]:
items = pd.read_csv(
'data/raw/items.csv',sep='|')
clas = pd.read_csv(
'data/raw/class.csv',sep='|')
clas_items=pd.merge(clas,items,on='pid')
clas_items=toCategorical(clas_items)
clas_items=solveNA(clas_items,clas_items,coef_competitorPrice_to_rrp,0)
clas_items=Dummies(clas_items)
clas_items=moreFeautures(clas_items)
items = pd.read_csv(
'data/interim/items_v1.csv')
clas_items=pd.merge(clas_items,items[items_pred],on='pid')

submission = pd.DataFrame({
        "lineID": clas_items["lineID"],
        "revenue": np.zeros(shape=(1210767,))
})

x_test = clas_items[predictors]
del clas_items,clas,items
gc.collect


Out[45]:
<function gc.collect>
In case of missing data appearance we can conclude that it's the new manufacturer that was not in the training set but it can appear in test

In [46]:
x_test = x_test.fillna(x_test.mean())
x_test = scaler.fit_transform(x_test)
And finaly we can make prediction of revenue

In [47]:
lasso_preds = model_lasso.predict(x_test)
ridge_preds = model_ridge.predict(x_test)
In postprocessing we fill in all negative values with zero, because of business logic. After that we mix output from two models, sort and save our submision.

In [49]:
predictions = pd.DataFrame({"ridge":ridge_preds, "lasso":lasso_preds})
predictions[predictions['ridge']<0]=0
predictions[predictions['lasso']<0]=0

lasso_preds = predictions['lasso']
ridge_preds = predictions['ridge']
com_pred = (lasso_preds + ridge_preds) / 2.0

submission['revenue']=com_pred
submission_sorted=submission.sort_values("lineID");
submission_sorted.to_csv("data/Uni_Polytechnic_Lviv_1.csv", index=False, sep='|')

In [50]:
submission_sorted


Out[50]:
lineID revenue
0 1 2.918315
314 2 0.085515
688 3 1.153730
1137 4 0.289552
1292 5 1.545264
1779 6 2.567740
1883 7 0.551230
2950 8 1.073107
3333 9 3.508954
3701 10 4.274228
4317 11 0.913259
5045 12 1.136892
5435 13 1.164143
5838 14 2.212649
6114 15 1.134349
8729 16 11.082580
8807 17 3.028414
9076 18 5.441399
10345 19 0.462155
10595 20 2.880210
10956 21 3.223039
11011 22 6.188713
11041 23 2.005597
11063 24 8.900715
11130 25 1.641799
13209 26 1.250267
13299 27 3.518609
15137 28 5.461279
18499 29 5.780450
19023 30 2.679875
... ... ...
483406 1210738 1.601418
1046815 1210739 2.388170
43369 1210740 0.220250
111016 1210741 1.419405
610455 1210742 2.097951
1159101 1210743 1.811689
678894 1210744 2.091872
97966 1210745 2.416438
807746 1210746 1.729460
1193039 1210747 1.695809
131637 1210748 8.831485
368371 1210749 3.000849
204785 1210750 0.200165
283733 1210751 0.000000
524147 1210752 6.826647
3332 1210753 1.156671
675250 1210754 7.610855
225324 1210755 5.195631
125268 1210756 2.151450
634274 1210757 5.014182
203936 1210758 3.267051
43370 1210759 0.220250
513052 1210760 1.856578
122835 1210761 0.354526
1068349 1210762 3.249229
737912 1210763 3.362747
266596 1210764 4.080373
955362 1210765 2.334744
131163 1210766 3.869969
111017 1210767 1.419394

1210767 rows × 2 columns