Ch14 Figure1


In [1]:
# While exploring the data, the team notices something unexpected: there
# appears to be a positive correlation between shoe sales and the customer's
# rating. You might expect the shoes with the highest ratings to have the
# highest sales. What the data science team actually sees, however, is that
# shoes with any rating at all have higher sales. The lowest-selling shoes
# are the ones with no rating.

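# Assumption: rd is the standard-library random module and pd is pandas;
# the original cell does not show its imports.
import random as rd
import pandas as pd

data = []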

for i in range(1000):
    
    if rd.random() <= .3:
        rating = rd.randint(1,5)
    else:
        rating = 0
        
    if rating == 0:
        sales = rd.random()*500
        
    else:
        sales = rd.random()*5000
    
    data.append([i, rating, sales])
    
df = pd.DataFrame(data, columns=['shoe-id', 'rating', 'sales'])
# df.to_csv('csv_output/ch14_fig1.csv', index=False)
df = pd.read_csv('csv_output/ch14_fig1.csv')
df.head()


Out[1]:
   shoe-id  rating       sales
0        0       0  363.446179
1        1       0  382.957004
2        2       0  388.188213
3        3       0  262.239804
4        4       0  320.690896

In [2]:
df = pd.read_csv('csv_output/ch14_fig1.csv')

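import seaborn as sns  # assumed alias; the import is not shown in the original

%matplotlib inline
sns.set_style("white")

# `height` replaces the `size` argument, which seaborn renamed in version 0.9
f = sns.pairplot(df.iloc[:,1:], hue="rating", palette="husl", height=4)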
f.savefig('svg_output/ch14_fig1.svg', format='svg')


Assume a rating of 0 means no rating and the values 1 to 5 are real ratings. Most items with total sales close to 0 have no rating. Among the rated items, however, nothing indicates that higher ratings yield higher sales. In fact, the causation might run the other way: because sales are higher, more people have bought the item and are willing to leave feedback on the website. In the bottom row of plots, if we ignore the items that have no rating, there is almost no correlation at all.
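The claim above can be checked numerically. The following is a minimal sketch, not the notebook's own analysis: it regenerates data with the same rules as the first cell instead of reading the CSV, uses only the standard library, and introduces a fixed seed plus the helper names `pearson` and `simulate` purely for illustration. It compares the correlation between rating and sales across all shoes with the correlation among rated shoes only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=1000, seed=0):
    """Regenerate (rating, sales) pairs using the rules from the first cell."""
    rng = random.Random(seed)  # the seed is an assumption, added for repeatability
    rows = []
    for _ in range(n):
        # About 30% of shoes get a rating from 1 to 5; the rest get 0 (no rating).
        rating = rng.randint(1, 5) if rng.random() <= 0.3 else 0
        # Unrated shoes sell up to 500, rated shoes up to 5000.
        sales = rng.random() * (5000 if rating else 500)
        rows.append((rating, sales))
    return rows


rows = simulate()
rated = [(r, s) for r, s in rows if r > 0]

# zip(*rows) splits the pairs into a ratings sequence and a sales sequence.
corr_all = pearson(*zip(*rows))
corr_rated = pearson(*zip(*rated))

print(f"all shoes:   r = {corr_all:.2f}")
print(f"rated shoes: r = {corr_rated:.2f}")
```

With the unrated shoes included, the coefficient should come out clearly positive, because rating 0 is tied to the low sales range. Once they are dropped, it should sit near zero, since within the rated group the sales figure is drawn independently of the rating.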