In [1]:
# Think about your running shoe website. A data analyst should have little trouble finding the websites that referred customers to the store. Let's say most of your customers came from Twitter, Google, and Facebook, with quite a few more coming from running shoe websites. A good data analyst can easily produce a report of the top 50 referring sites — the sites people visited just before buying. Finding out where people come from is a good analytics question: it's about gathering the data, counting it, and displaying it in a clear report.
import random as rd
import pandas as pd

referrals = ['facebook', 'google', 'amazon', 'twitter', 'slickdeals', 'instagram', 'pinterest', 'ebates', 'fitness magazine', 'discovery', 'youtube', 'messenger']
referral_type = ['organic', 'paid']
data = []
for i in range(1000):
    # random_date() is a helper defined earlier in the chapter that returns a random timestamp
    data.append([i, random_date(), referrals[rd.randint(0, len(referrals)-1)], referral_type[rd.randint(0, len(referral_type)-1)]])
df = pd.DataFrame(data, columns = ['id', 'timestamp', 'referral-site', 'referral-type'])
# df.to_csv('csv_output/ch6_fig1.csv', index=False)
df = pd.read_csv('csv_output/ch6_fig1.csv')
df.head()
Out[1]:
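The "top 50 referring sites" report described above reduces to a single value count. A minimal sketch with pandas, using a small synthetic frame in place of the real referral log (column names assumed to match the CSV above):

```python
import pandas as pd

# Small stand-in for the referral log loaded above.
df = pd.DataFrame({
    'id': range(6),
    'referral-site': ['google', 'facebook', 'google', 'twitter', 'google', 'facebook'],
})

# Count visits per referring site, most frequent first, and keep the top N (50 in the text).
top_sites = df['referral-site'].value_counts().head(50)
print(top_sites)
```

`value_counts()` already sorts in descending order, so `head(50)` is the whole report; on the real 1,000-row frame it is equivalent to the `groupby('referral-site').id.count()` used in the next cell, plus a sort.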
In [2]:
df = pd.read_csv('csv_output/ch6_fig1.csv')
site_count = df.groupby('referral-site').id.count().reset_index()
site_type_count = df.groupby(['referral-site', 'referral-type']).id.count().reset_index()
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
f, ax = plt.subplots(1,2, figsize=(8,6))
sns.barplot(x='id', y='referral-site', data=site_count, ax=ax[0], color='cornflowerblue');
ax[0].set_title('referral-site total visits')
ax[0].set_xlabel('')
sns.barplot(x='id', y='referral-site', hue='referral-type', data=site_type_count, ax=ax[1]);
ax[1].legend(loc='center right', bbox_to_anchor=(1.5, .9));
ax[1].set_title('referral-site by referral-type')
ax[1].set_xlabel('')
f.tight_layout()
f.savefig('svg_output/ch6_fig1.svg', format='svg', bbox_inches='tight')
Facebook, Twitter, and Instagram appear to bring the most traffic. Pinterest drives a comparable amount of traffic to the other sites, but when we split by referral type, about half of its referrals come from paid advertising rather than organic clicks.
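The paid-versus-organic comparison in the right panel can also be read off directly as a table of row-normalized shares. A sketch with `pd.crosstab`, again using a few stand-in rows in place of the real frame:

```python
import pandas as pd

# Stand-in rows; the real data has 1,000 referrals with a referral-type column.
df = pd.DataFrame({
    'referral-site': ['pinterest', 'pinterest', 'facebook', 'facebook', 'pinterest', 'facebook'],
    'referral-type': ['paid', 'organic', 'organic', 'organic', 'paid', 'paid'],
})

# Fraction of each site's traffic that is paid vs organic: crosstab normalized per row.
paid_share = pd.crosstab(df['referral-site'], df['referral-type'], normalize='index')
print(paid_share)
```

With `normalize='index'` each row sums to 1, so a site like pinterest with roughly half its traffic paid shows up immediately, without eyeballing bar lengths.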
In [3]:
%load_ext rpy2.ipython
In [4]:
%%R -w 600 -h 300 -u px
require(dplyr)
df = read.csv('csv_output/ch6_fig1.csv')
head(df)
df$timestamp.formated = strptime(df$timestamp, "%Y-%m-%d %H:%M:%S")
df$timestamp.day = as.Date(format(df$timestamp.formated, "%Y-%m-%d"), "%Y-%m-%d")
df = df %>% select(-timestamp.formated)
by_day = df %>% group_by(timestamp.day, referral.type)
by_day_by_type = summarise(by_day, count=n())
require(ggplot2)
ggplot(by_day_by_type, aes(x=timestamp.day, y=count, group=referral.type, colour=referral.type)) +
    geom_line(size=1) +
    geom_point(size=3) +
    scale_x_date(date_labels = "%b %d") +
    ggtitle('referral click over time by referral type') +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 30, hjust= 1))
# ggsave("svg_output/ch6_fig1_R.svg")
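The dplyr `group_by`/`summarise` by day and referral type above has a direct pandas equivalent, sketched here on a few stand-in timestamps (the real frame comes from `ch6_fig1.csv`):

```python
import pandas as pd

# Stand-in timestamps; the real log spans many days of referrals.
df = pd.DataFrame({
    'timestamp': ['2017-01-01 09:00:00', '2017-01-01 17:30:00', '2017-01-02 08:15:00'],
    'referral-type': ['paid', 'organic', 'paid'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Clicks per day per referral type, mirroring the dplyr group_by + summarise(count=n()).
by_day = (df.groupby([df['timestamp'].dt.date, 'referral-type'])
            .size()
            .rename('count')
            .reset_index())
print(by_day)
```

`.dt.date` plays the role of the `strptime`/`as.Date` conversion in the R cell, and `.size()` is the pandas analogue of `n()`; the resulting long-format frame could be fed to a seaborn `lineplot` to reproduce the ggplot figure.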