In [1]:
# While exploring the data, the team notices something unexpected: there appears to be a
# positive correlation between shoe sales and customer rating. You might expect the shoes
# with the highest ratings to have the highest sales. The data science team notices,
# however, that shoes with any rating at all had higher sales. The lowest-selling shoes
# were the ones with no rating.
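import random as rd  # assumed source of the `rd` alias (inclusive randint)
import pandas as pd

data = []
for i in range(1000):
    # Roughly 30% of shoes get a rating from 1 to 5; the rest are unrated (0)
    if rd.random() <= .3:
        rating = rd.randint(1, 5)
    else:
        rating = 0
    # Unrated shoes sell up to 500; rated shoes sell up to 5000, whatever the rating
    if rating == 0:
        sales = rd.random()*500
    else:
        sales = rd.random()*5000
    data.append([i, rating, sales])
df = pd.DataFrame(data, columns=['shoe-id', 'rating', 'sales'])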
# df.to_csv('csv_output/ch14_fig1.csv', index=False)
df = pd.read_csv('csv_output/ch14_fig1.csv')
df.head()
Out[1]:
In [2]:
import seaborn as sns

df = pd.read_csv('csv_output/ch14_fig1.csv')
%matplotlib inline
sns.set_style("white")
# seaborn 0.9 renamed pairplot's `size` argument to `height`
f = sns.pairplot(df.iloc[:,1:], hue="rating", palette="husl", height=4)
f.savefig('svg_output/ch14_fig1.svg', format='svg')
Assume a rating of 0 means no rating and the rest are real ratings. Most items with total sales close to 0 have no rating. Among the rated items, however, the plot does not indicate that higher ratings yield higher sales. In fact, the causality might run the other way: because sales are higher, more people have bought the item and are willing to leave feedback on the website. In the bottom row of the plot, if we ignore the items with no rating, there is almost no correlation at all.
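
The observation above can be checked numerically. The following is a minimal sketch, not the notebook's own analysis: it regenerates data with the same rules as the first cell instead of reading the CSV, and the fixed seed and the names `rng`, `sim`, `rated`, `corr_all`, and `corr_rated` are assumptions introduced here for illustration. It compares the Pearson correlation between rating and sales over all shoes with the correlation over rated shoes only.

```python
import random

import pandas as pd

# Fixed seed chosen only so the sketch is repeatable (an assumption, not from the notebook)
rng = random.Random(0)

rows = []
for shoe_id in range(1000):
    # About 30% of shoes are rated 1 to 5; the rest are unrated (0)
    rating = rng.randint(1, 5) if rng.random() <= 0.3 else 0
    # Rated shoes draw sales from [0, 5000), unrated shoes from [0, 500)
    sales = rng.random() * (5000 if rating else 500)
    rows.append([shoe_id, rating, sales])
sim = pd.DataFrame(rows, columns=["shoe-id", "rating", "sales"])

# Pearson correlation over every shoe, with unrated shoes counted as rating 0
corr_all = sim["rating"].corr(sim["sales"])

# Pearson correlation over rated shoes only
rated = sim[sim["rating"] > 0]
corr_rated = rated["rating"].corr(rated["sales"])

# Average sales for rated versus unrated shoes
mean_rated = rated["sales"].mean()
mean_unrated = sim.loc[sim["rating"] == 0, "sales"].mean()

print(f"correlation, all shoes:   {corr_all:.2f}")
print(f"correlation, rated only:  {corr_rated:.2f}")
print(f"mean sales, rated shoes:  {mean_rated:.0f}")
print(f"mean sales, unrated:      {mean_unrated:.0f}")
```

Under these assumptions, the correlation over all shoes should come out clearly positive, driven entirely by the gap between rated and unrated shoes, while the correlation among rated shoes should sit near zero. That matches the reading of the pair plot: having a rating goes along with higher sales, but a higher rating does not.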