In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn
In [2]:
print pd.__version__
print sklearn.__version__
In this notebook, we'll tackle a Kaggle problem. The objective is to predict annual restaurant sales from objective measurements.
This problem interests me because it is set up as a classic regression task, so all the classic regression techniques should apply. The trick, as with any machine learning problem, is to carefully test the various models and select the most generalizable one.
With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.
Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.
New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.
Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.
In [3]:
train_df = pd.read_csv('train.csv', parse_dates=[1])
#print train_df.head(n=5)
In [4]:
train_df.City = train_df.City.astype('category')
train_df.Type = train_df.Type.astype('category')
train_df['City Group'] = train_df['City Group'].astype('category')
In [21]:
#train_df.dtypes
Out[21]:
In [5]:
#d = train_df['Open Date']
#d.map( lambda x : x.year )
train_df['Open_Year'] = train_df['Open Date'].map( lambda x : x.year)
train_df['Open_Month'] = train_df['Open Date'].map( lambda x : x.month)
print train_df.head()
In [6]:
train_df['City Group'].unique()
Out[6]:
In [7]:
train_df_grouped_by_city = train_df.groupby(by='City')
city_avg = train_df_grouped_by_city.revenue.aggregate(np.average)
city_avg.plot(kind='bar')
Out[7]:
In [8]:
train_df_grouped_by_city = train_df.groupby(by='City')
city_count = train_df_grouped_by_city.revenue.aggregate(np.count_nonzero)
#print city_count
city_count.plot(kind='bar')
Out[8]:
In [9]:
train_df_grouped_by_city_group = train_df.groupby(by='City Group')
city_group_avg = train_df_grouped_by_city_group.revenue.aggregate(np.average)
print city_group_avg
print train_df_grouped_by_city_group.City.value_counts()
city_group_avg.plot(kind='bar')
Out[9]:
In [25]:
train_df_grouped_by_year = train_df.groupby(by='Open_Year')
year_avg = train_df_grouped_by_year.revenue.aggregate(np.average)
print train_df_grouped_by_year.size()
year_avg.plot(kind='line')
Out[25]:
In [11]:
print train_df[ train_df.Open_Year == 2000 ]
In [18]:
print train_df.iloc[[16, 85]]['P29']
In [19]:
train_df.P29.plot(kind='box')
Out[19]:
In [22]:
train_df.P29.plot()
Out[22]:
In [23]:
train_df.P29.plot(kind='hist')
Out[23]:
In [27]:
train_df.plot(x='P29', y='revenue', kind='scatter')
Out[27]:
Now this is a very interesting observation: records 16 and 85 are identical in their predictors (except for ...) yet produce very different revenues.
The question we face here is: do we exclude the outlier, or do we include it?
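Either choice is a one-liner. A minimal sketch of both options (the positional row index and the winsorizing quantile below are illustrative assumptions, not tuned values):
In [ ]:
# Option 1: drop the suspect row by positional index
train_df_dropped = train_df.drop(train_df.index[[16]])

# Option 2: keep every row but cap extreme revenues at the 99th percentile
cap = train_df.revenue.quantile(0.99)
train_df_capped = train_df.copy()
train_df_capped['revenue'] = train_df_capped.revenue.clip(upper=cap)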
In [44]:
train_df[ train_df.City == 'İstanbul' ].revenue.plot(kind='box')
Out[44]:
In [35]:
train_df_grouped_by_type = train_df.groupby(by='Type')
type_avg = train_df_grouped_by_type.revenue.aggregate(np.average)
type_avg.plot(kind='bar')
Out[35]:
In [7]:
trainingdf = pd.read_csv('train.csv')
print trainingdf.columns
Convert the date strings to days elapsed since the first restaurant opening.
In [8]:
import datetime
opendate = trainingdf['Open Date']
dates = [datetime.datetime.strptime(date, '%m/%d/%Y') for date in opendate]
mindate = min(dates)
ageByDays = [(item-mindate).days for item in dates]
print ageByDays[:10]
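Equivalently, since the earlier train_df already has Open Date parsed as datetimes, pandas can do the same conversion directly (a sketch; the .dt accessor needs pandas >= 0.15):
In [ ]:
# days elapsed since the earliest opening, using the already-parsed column
parsed = train_df['Open Date']
age_by_days = (parsed - parsed.min()).dt.days
print age_by_days.head(10)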
Convert the string categories to integer codes.
In [9]:
import numpy as np
city = trainingdf['City']
b,c = np.unique(city, return_inverse=True)
print c[:10]
citygroup = trainingdf['City Group']
b,citygroupcat = np.unique(citygroup, return_inverse=True)
print citygroupcat[:10]
t = trainingdf['Type']
b,tcat = np.unique(t, return_inverse=True)
print tcat[:10]
In [11]:
trainingdf['Open Date'] = ageByDays
trainingdf['City'] = c
trainingdf['City Group'] = citygroupcat
trainingdf['Type'] = tcat
X_train = trainingdf[trainingdf.columns[1:-1]]
y_train = trainingdf['revenue']
print X_train.head()
In [6]:
train_df_without_id = pd.read_csv('train.csv', parse_dates=[1], index_col=0)
print train_df_without_id
In [12]:
train_df_without_id_revenue = train_df_without_id.drop('revenue', axis=1)
print train_df_without_id_revenue.shape
In [11]:
train_df_without_id_revenue_unique = train_df_without_id_revenue.drop_duplicates()
print train_df_without_id_revenue_unique.shape
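To see which rows collapsed, duplicated can flag every copy (a sketch; keep=False, which marks all copies of a duplicate, needs pandas >= 0.17, where older versions spell it take_last):
In [ ]:
# mark all rows that have at least one exact twin among the predictors
dup_mask = train_df_without_id_revenue.duplicated(keep=False)
print train_df_without_id_revenue[dup_mask].index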
In [28]:
two_row = train_df_without_id_revenue.iloc[[16,85],:]
p_list = ['P' + str(i + 1) for i in xrange(37)]
print two_row.drop_duplicates(subset=['Open Date', 'City', 'Type'] + p_list)
Sure enough, they are **duplicates**: identical in every predictor, differing only in revenue.
In [17]:
from sklearn import linear_model
In [18]:
lin_reg = linear_model.LinearRegression()
In [19]:
lin_reg_fit = lin_reg.fit(X_train, y_train)
In [25]:
in_sample_diff = (lin_reg_fit.predict(X_train) - y_train)
In [27]:
in_sample_diff2 = in_sample_diff.map(lambda x : x * x)
In [28]:
np.sqrt(in_sample_diff2.mean())
Out[28]:
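That figure is in-sample, so it is optimistic. Since the goal is the most generalizable model, an out-of-sample estimate is more honest. A minimal cross-validation sketch (the 5-fold split and seed are arbitrary choices; on sklearn >= 0.18 the import moves to sklearn.model_selection with a slightly different API):
In [ ]:
from sklearn.cross_validation import KFold

# 5-fold CV estimate of the out-of-sample RMSE for the linear model
kf = KFold(len(y_train), n_folds=5, shuffle=True, random_state=0)
cv_rmses = []
for train_idx, test_idx in kf:
    m = linear_model.LinearRegression()
    m.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    resid = m.predict(X_train.iloc[test_idx]) - y_train.iloc[test_idx]
    cv_rmses.append(np.sqrt((resid ** 2).mean()))
print np.mean(cv_rmses)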
In [12]:
from sklearn.linear_model import LinearRegression

# forward stepwise selection: at each step, greedily add the candidate
# feature that yields the highest in-sample R^2 alongside those already chosen
k = 5
features = []
columns = list(X_train.columns)
for i in range(k):
    rSquare = []
    for col in columns:
        model = LinearRegression()
        f = pd.DataFrame(X_train, columns=features)
        f['newX'] = X_train[col]
        model.fit(f, y_train)
        rSquare.append(model.score(f, y_train))
    # index into the shrinking candidate list, not the full X_train.columns
    feature = columns[np.argmax(rSquare)]
    features.append(feature)
    columns.remove(feature)
print features
In [13]:
import statsmodels.formula.api as sm
df = pd.DataFrame({'y':y_train, 'x':X_train['Open Date']})
result = sm.ols(formula="y ~ x", data = df).fit()
print result.summary()
In [14]:
import matplotlib.pyplot as plt

def standardize(s):
    # center to zero mean and scale to unit variance
    return (s - s.mean()) / s.std()

# regress standardized revenue on the standardized 'Open Date' feature
model = LinearRegression()
X = standardize(X_train[['Open Date']])
model.fit(X, standardize(y_train))
# map the standardized predictions back onto the original revenue scale
predy = model.predict(X) * y_train.std() + y_train.mean()
plt.scatter(X_train['Open Date'], y_train, color='black')
plt.plot(X_train['Open Date'], predy, color='blue', linewidth=3)
plt.title('regress revenue on open date')
plt.xlabel('days since first opening')
plt.ylabel('revenue')
In [15]:
from sklearn.decomposition import PCA
import statsmodels.formula.api as sm
pcolumns = ['P'+str(i) for i in range(1,38)]
X_P = pd.DataFrame(X_train, columns=pcolumns)
pca = PCA(n_components=3)
train_xp = pca.fit_transform(X_P)
print 'the explained variance ratio of pca is ', pca.explained_variance_ratio_
df = pd.DataFrame({'y':y_train, 'x1':train_xp[:,0], 'x2':train_xp[:,1], 'x3':train_xp[:,2]})
result = sm.ols(formula='y~x1+x2+x3',data=df).fit()
print result.summary()
In [ ]:
# todo:
# - it looks like P1-P37 can be represented by two or three PCA components
# - column City has 37 categories, which should likewise be grouped into two
#   or three buckets so that not so many dummy variables are needed
# - select five features by forward stepwise selection (AIC? BIC? adjusted R^2?)
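On the last point, statsmodels exposes aic and bic on fitted OLS results, so a stepwise criterion is cheap to compute. A minimal sketch that scores a few single-feature candidates by AIC (the candidate list is illustrative; a full stepwise search would wrap this in the same greedy loop used above):
In [ ]:
import statsmodels.api as smapi

# score each candidate single-feature model by AIC (lower is better)
aic_by_feature = {}
for col in ['Open Date', 'City Group', 'Type']:
    X = smapi.add_constant(X_train[[col]])
    aic_by_feature[col] = smapi.OLS(y_train, X).fit().aic
print sorted(aic_by_feature.items(), key=lambda kv: kv[1])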
Feature selection with the scikit-learn package
In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
rfe = rfe.fit(X_train, y_train)
print rfe.support_
print rfe.ranking_
The five features RFE keeps: Type, P6, P8, P9, P13.
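Those names can be read off the support mask programmatically rather than by eye (a small sketch over the same X_train passed to rfe.fit):
In [ ]:
# columns where the RFE support mask is True are the selected features
print [col for col, kept in zip(X_train.columns, rfe.support_) if kept]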
In [ ]: