In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
customers = pd.read_csv('Ecommerce Customers')
Checking the head of customers, and checking out its info() and describe() methods
In [3]:
customers.head()
Out[3]:
In [4]:
customers.describe()
Out[4]:
In [5]:
customers.info()
In [6]:
sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=customers, color='green')
Out[6]:
The same jointplot, this time for Time on App
In [7]:
sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=customers, color='green')
Out[7]:
Using jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership
In [8]:
sns.jointplot(x='Time on App', y='Length of Membership', data=customers, kind='hex')
Out[8]:
Using a pairplot to explore patterns in the entire dataset
In [9]:
sns.pairplot(customers)
Out[9]:
Length of membership seems to be the most correlated feature with Yearly Amount Spent
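The visual impression from the pairplot can be checked numerically by ranking each feature's Pearson correlation with the target. The sketch below is a minimal illustration, not the notebook's own analysis: the `demo` DataFrame and the relationship used to generate it are made up, so the printed numbers say nothing about the real dataset. Running the same last step on `customers` would give the actual ranking.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `customers` (hypothetical values, NOT the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Time on App': rng.normal(12, 1, n),
    'Time on Website': rng.normal(37, 1, n),
    'Length of Membership': rng.normal(3.5, 1, n),
})
# Assumed relationship, chosen only so the example has a clear answer.
demo['Yearly Amount Spent'] = (
    60 * demo['Length of Membership']
    + 40 * demo['Time on App']
    + rng.normal(0, 10, n)
)

# Keep numeric columns only (the real file also has text columns),
# then rank features by their correlation with the target.
corr_with_target = (
    demo.select_dtypes('number')
    .corr()['Yearly Amount Spent']
    .drop('Yearly Amount Spent')
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only measures the pairwise linear relationship; the regression fitted later accounts for all features at once.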
Creating a linear model plot of Yearly Amount Spent vs. Length of Membership
In [10]:
sns.lmplot(x='Length of Membership', y='Yearly Amount Spent', data=customers)
Out[10]:
In [ ]:
customers.columns
In [24]:
X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
In [25]:
y = customers['Yearly Amount Spent']
In [26]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
In [28]:
from sklearn.linear_model import LinearRegression
In [29]:
lm = LinearRegression()
In [30]:
lm.fit(X_train, y_train)
Out[30]:
Print out the coefficients of the model
In [31]:
print(lm.coef_)
In [34]:
predictions = lm.predict(X_test)
In [35]:
predictions
Out[35]:
Create a scatterplot of the real test values versus the predicted values.
In [39]:
# Actual values on the x-axis, predictions on the y-axis, matching the labels
plt.scatter(y_test, predictions)
plt.xlabel('Y test')
plt.ylabel('Predicted Y')
Out[39]:
In [40]:
from sklearn import metrics
In [41]:
print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
print('MSE: ', metrics.mean_squared_error(y_test, predictions))
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
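To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values standing in for `y_test` and `predictions`; they are not results from this dataset.

```python
import numpy as np

# Made-up values standing in for y_test and predictions.
y_true = np.array([500.0, 450.0, 600.0, 550.0])
y_pred = np.array([510.0, 440.0, 590.0, 570.0])

errors = y_true - y_pred           # [-10, 10, 10, -20]

mae = np.mean(np.abs(errors))      # mean of absolute errors
mse = np.mean(errors ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                # square root of MSE, same units as y

print('MAE: ', mae)
print('MSE: ', mse)
print('RMSE: ', rmse)
```

MAE weights all errors equally, while MSE and RMSE penalize the single larger error (20) more heavily, which is why RMSE comes out above MAE here.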
In [46]:
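# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(y_test - predictions, bins=50, kde=True)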
Out[46]:
In [47]:
df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
In [48]:
df
Out[48]:
How can you interpret these coefficients?
Should the company focus more on their mobile app or on their website?
There are two ways to think about this: develop the website so it catches up to the performance of the mobile app, or develop the app further since that is what is working better. The answer depends on the company's circumstances. As a data scientist, it is worth exploring the relationship between Length of Membership and the mobile app (or the website) before drawing a conclusion.
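One way to read a linear regression coefficient: holding all other features fixed, a one-unit increase in a feature changes the predicted target by that feature's coefficient. The sketch below demonstrates this on synthetic data. Everything in it is an assumption for illustration: the two feature columns, the `true_coef` values, and the intercept are made up, and it uses NumPy's least squares directly instead of the `LinearRegression` model fitted above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Hypothetical features; the columns loosely stand in for
# 'Time on App' and 'Time on Website'.
X_demo = rng.normal(size=(n, 2))
true_coef = np.array([40.0, 0.5])     # made-up coefficients
y_demo = 100.0 + X_demo @ true_coef   # noise-free, so the fit recovers them

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coef = beta[0], beta[1:]

# Raise the first feature by one unit while holding the second fixed.
base = np.array([1.0, 12.0, 37.0])    # [intercept term, feature 1, feature 2]
bumped = base + np.array([0.0, 1.0, 0.0])
delta = bumped @ beta - base @ beta

print('coefficients:', coef)
print('change in prediction:', delta)
```

The change in the prediction equals the first coefficient. Comparing coefficients across features this way is only meaningful when the features are measured in comparable units.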