The impetus for the analysis came from planning a road trip around Europe. This time I’m going to take my dog with me. Will it be a headache to find a place to stay? From this question I moved further and enquired whether pets are equally welcomed in different parts of the world.
This notebook describes an analysis of data I gathered from Airbnb.com. The goal was to find the most pet friendly countries to travel to, judged by how easy it is to find accommodation that accepts pets.
I'll begin with a summary of what I think I've learned about countries' attitude to pets based on their willingness to host one. Then I'll walk you through the analysis process.
To put countries on an equal footing, I used the ratio of pet friendly listings (PFL) to all listings as the main criterion for comparison. The map shows how countries vary in their attitude to pets.
The bar chart shows the distribution of listings (total and PFL) and ratio of PFL in each region.
Asia and Oceania would be less suitable for travelling pets. Only 14% of listings there are pet friendly.
North America doesn't stand out as a friendly region for pet owners, but it does stand out as the most expensive one.
The box-and-whisker plot shows the distribution of prices for PFL in regions.
Prices for a PFL in 50% of countries in North America (above median) would be higher than in the most expensive countries in Africa, Europe and Oceania.
In Asia, hosts could be accused of price discrimination. Allowing pets raises the average price by 41%, to $110.
I'll keep these findings in mind when planning my road trip!
Now I'll show you the code behind the above revelations.
In [2]:
from __future__ import division
from IPython.display import display
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import humanize
I'll skip the data gathering step. I used a CSV file with the list of countries and regions to parse the Airbnb.com website and saved the data as a sqlite dictionary. That is why I use the sqlitedict library here.
In [3]:
from sqlitedict import SqliteDict
db = SqliteDict('./pet_friendly.sqlite')
I'll also format floating point numbers.
In [4]:
pd.set_option('float_format', '{:.2f}'.format)
In [5]:
def read_db_in_pandas(db):
    """ Read the DB."""
    # Transpose data to have keys as index.
    df = pd.DataFrame(dict(db)).T
    # Remove rows with no data for a country.
    df = df[df['apt_total'] != 0]
    # Convert columns into numbers.
    df = df.convert_objects(convert_numeric=True)
    return df
df = read_db_in_pandas(db)
print ("There are {0} countries in the dataset."
.format(len(db)))
print ("{0} countries have no listings published."
.format(len(db) - len(df)))
print ("We are left with {0} countries for closer examination."
.format(len(df)))
print "\nBelow is a random sample of the DataFrame:"
df.sample(n=5)
Out[5]:
I collected data on the total number of listings located in each country ('apt_total') and the number of pet friendly listings ('apt_pets'), or PFL for short.
I also parsed the average price per night in both categories ('apt_total_price' & 'apt_pets_price').
Prices are in USD.
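`convert_objects` has since been removed from pandas, so on a recent version the loading step needs a different call. The sketch below is one possible replacement, not the code that produced the results in this notebook. It assumes the four count and price columns described above are the only numeric ones, and the two sample records (including the country names) are invented for illustration.

```python
import pandas as pd

# Invented records shaped like the data described above (values are not real).
records = {
    'Atlantis': {'apt_total': '120', 'apt_pets': '30',
                 'apt_total_price': '80', 'apt_pets_price': '90',
                 'region': 'Europe'},
    'Nowhere': {'apt_total': '0', 'apt_pets': '0',
                'apt_total_price': None, 'apt_pets_price': None,
                'region': 'Asia'},
}

# Assumption: these are the only columns that should become numbers.
NUMERIC_COLS = ['apt_total', 'apt_pets', 'apt_total_price', 'apt_pets_price']


def read_records(records):
    """Build a DataFrame with countries as the index and numeric columns."""
    df = pd.DataFrame(records).T
    # Coerce only the known numeric columns; text columns stay untouched.
    df[NUMERIC_COLS] = df[NUMERIC_COLS].apply(pd.to_numeric, errors='coerce')
    # Drop countries without any listings.
    return df[df['apt_total'] != 0]


sample_df = read_records(records)
print(sample_df)
```

Converting a fixed list of columns keeps text columns such as 'region' from being turned into missing values by `errors='coerce'`.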
In [6]:
# Number of listings.
apt_sum = df['apt_total'].sum()
# Number of PFL.
pets_sum = df['apt_pets'].sum()
print ("There are {0} listings all over the world on Airbnb."
.format(humanize.intword(apt_sum)))
print ("Pets would be welcomed only in {0:.0%} of all listings."
.format(pets_sum / apt_sum))
In [7]:
# Average price per night.
price_mean = round(df['apt_total_price'].mean())
pet_mean = round(df['apt_pets_price'].mean())
# Difference in price.
diff = (price_mean - pet_mean) / price_mean
print ("The average price for a listing is ${0} per night."
.format(int(price_mean)))
print ("A PFL would cost {0:.0%} less, or ${1}."
.format(diff, int(pet_mean)))
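Note that the averages in this cell are means of per-country averages, so a country with a handful of listings weighs as much as one with thousands. The sketch below, using two invented countries, shows how far that can drift from a listing-weighted mean. It is only an illustration of the difference, not a recalculation of the figures above.

```python
# Invented (country, number of listings, average price per night in USD).
countries = [
    ('A', 1000, 100.0),
    ('B', 10, 300.0),
]

# Unweighted mean: every country counts once, like a plain column mean.
unweighted = sum(price for _, _, price in countries) / len(countries)

# Weighted mean: every listing counts once.
total_listings = sum(n for _, n, _ in countries)
weighted = sum(n * price for _, n, price in countries) / total_listings

print("Unweighted mean of country averages: ${:.2f}".format(unweighted))
print("Listing-weighted mean: ${:.2f}".format(weighted))
```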
A quick glance at the whole dataset reveals that countries with few listings do not tell much about their attitude towards pets. Let's look at the top-5 friendliest countries.
In [8]:
# Ratio of PFL
compare = df['apt_pets']/df['apt_total']
compare.sort(ascending=False)
print "The friendliest countries:"
compare[:5]
Out[8]:
This table tells us that 100% of listings in Western Sahara and Wallis and Futuna are pet friendly. But that's only because each of them has a single listing, and pets are allowed there.
In [9]:
df.loc[['Western Sahara', 'Wallis and Futuna']][['apt_total', 'apt_pets']]
Out[9]:
Is Antarctica in the top-5 chart? Do penguins rent out their nests?
Apparently, there's some mistake on Airbnb: 17 listings located in different parts of the globe were wrongly marked as located in Antarctica.
I'll fix the dataset a bit for further analysis. I'll keep only countries with more than 20 listings and remove Antarctica.
In [10]:
# Add a threshold of 20 listings.
df_cut = df[df['apt_total'] > 20]
# Remove Antarctica.
df_cut = df_cut[df_cut.index != 'Antarctica']
print ("These changes leave us {0} countries to examine."
.format(len(df_cut)))
# Calculate ratio of PFL and add to the DataFrame.
df_cut['apt_ratio'] = (df_cut['apt_pets'] / df_cut['apt_total'])
print "\nTop-5 countries with the highest ratio of PFL:"
df_cut.sort('apt_ratio', ascending=False)[['apt_ratio', 'apt_total', 'apt_pets', 'region']][:5]
Out[10]:
Andorra is the most welcoming country with 46% of 405 listings willing to accommodate a pet.
Though Italy lags behind in the ratio comparison with a 37% share of PFL, in raw numbers it overtakes Andorra: Italy offers almost 400 times as many PFL. But Italy is also about 640 times larger in area.
To put countries on an equal footing, I'll use the ratio of PFL to all listings as the main criterion for comparison.
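To make the comparison above concrete, here is the arithmetic behind it as a small sketch. It uses only the rounded figures quoted in the text, so the results are rough estimates rather than values read from the dataset.

```python
# Rounded figures quoted in the text above.
andorra_listings = 405
andorra_ratio = 0.46
italy_ratio = 0.37
italy_vs_andorra_pfl = 400  # "almost 400 times", so this is an upper estimate

# PFL in Andorra implied by its ratio.
andorra_pfl = andorra_ratio * andorra_listings

# Rough estimate of PFL and of all listings in Italy.
italy_pfl = italy_vs_andorra_pfl * andorra_pfl
italy_listings = italy_pfl / italy_ratio

print("Andorra PFL: about {:.0f}".format(andorra_pfl))
print("Italy PFL: about {:,.0f}".format(italy_pfl))
print("Italy listings: about {:,.0f}".format(italy_listings))
```

Raw counts differ by orders of magnitude simply because of country size, which is why the ratio is the fairer yardstick.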
Let's look at the bottom-5 countries.
In [11]:
print "Bottom-5 countries with the lowest ratio of PFL: "
df_cut.sort('apt_ratio')[['apt_ratio', 'apt_total', 'apt_pets', 'region']][:5]
Out[11]:
Japan might be popular for the lovely Shiba Inu dog breed but not for willingness to accommodate a pet. Only 4% of listings would suit a pet owner.
It seems that in Oceania and Asia travelling with a pet can be challenging. Let's examine regions in detail.
In [12]:
print ("Number of countries in the dataset "
"grouped by region:")
df_cut.groupby(['region']).size()
Out[12]:
First, I'll look at the distribution of accommodation listings among regions.
In [13]:
# All listings
region_total = (df_cut['apt_total']
.groupby(df_cut['region']).sum()
/ df_cut['apt_total'].sum())
# PFL listings
region_pets = (df_cut['apt_pets']
.groupby(df_cut['region']).sum()
/ df_cut['apt_pets'].sum())
# PFL ratio
region_ratio = (df_cut['apt_pets']
.groupby(df_cut['region']).sum()
/ df_cut['apt_total']
.groupby(df_cut['region']).sum())
region_listings = pd.concat([region_total, region_pets],
axis=1)
region_listings['ratio'] = region_ratio
print "Share of regions in Airbnb listings: "
region_listings.sort('apt_total', ascending=False)
Out[13]:
It seems Airbnb is very popular in Europe. I'll plot the data.
In [14]:
# Make a bar chart
matplotlib.style.use('bmh')
plt.figure();
region_listings.plot(kind='bar', rot=0, figsize=(8, 6))
plt.xlabel('')
plt.ylabel('Percentage')
plt.title('The distribution of accommodation '
'listings among regions \n')
plt.show()
The plot and the table above reveal that the largest share of Airbnb listings (54%) is located in Europe, along with 63% of PFL. The highest ratio of PFL (0.24) is recorded in Europe as well.
North America lags behind with a 22% share of listings. 18% of hosts in this region would be happy to accommodate a pet.
South America and Africa, though poorly represented on Airbnb, are very welcoming in hosting pets. 20% of hosts open their doors to pet owners.
Asia and Oceania prove themselves as less suitable for travelling pets. Just 14% of their listings are pet friendly.
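The table mixes three different quantities, which are easy to confuse: a region's share of all listings, its share of all PFL, and the PFL ratio inside the region. The sketch below recomputes them in plain Python for two invented regions. The region names and counts are made up for illustration.

```python
# Invented listing counts per region: (all listings, pet friendly listings).
regions = {
    'North': (600, 150),
    'South': (400, 50),
}

total_all = sum(total for total, _ in regions.values())
total_pets = sum(pets for _, pets in regions.values())

summary = {}
for name, (total, pets) in regions.items():
    summary[name] = {
        'share_of_listings': total / total_all,  # like the 'apt_total' column
        'share_of_pfl': pets / total_pets,       # like the 'apt_pets' column
        'ratio': pets / total,                   # PFL ratio inside the region
    }

for name in sorted(summary):
    row = summary[name]
    print("{0}: {1:.0%} of listings, {2:.0%} of PFL, ratio {3:.2f}".format(
        name, row['share_of_listings'], row['share_of_pfl'], row['ratio']))
```

The first two columns each add up to 100% across regions, while the ratio column does not, because it is computed within each region separately.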
Listings are not the only thing distributed unevenly: the average price per night also differs greatly between regions.
In [15]:
# Mean price per region
region_price = df_cut.groupby(df_cut['region']).mean()[['apt_total_price',
'apt_pets_price']]
# Calculate ratio of PFL price to general listing price.
region_price['ratio'] = (region_price['apt_pets_price'] /
region_price['apt_total_price'])
print "Average price in region, USD."
print "Ratio of PFL price to average price."
region_price.sort('apt_total_price')
Out[15]:
The cheapest region in terms of accommodation is South America, where a night costs \$66 on average. The cheapest stay with a pet, however, is in Africa: just \$67 a night.
In Asia, the price for a listing is rather low at \$78 a night; however, allowing pets adds 41% to the average price, which reaches \$110. That's more expensive than a PFL in Europe.
In Europe, a stay with a pet would cost \$100, or 15% more than the average price per night. Even so, this price is lower than in Oceania and North America.
North America is the most expensive region to rent an apartment, with an average price of \$268.5. Luxury apartments in this region seem to close their doors to pet owners, which lowers the average price for a PFL by 36%, but it is still the highest compared to other regions.
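The percentages quoted above can be checked by hand. The sketch below redoes the arithmetic for Asia and North America from the rounded figures in the text, so small differences from the table are possible. The helper `pct_change` is a name introduced here for illustration.

```python
def pct_change(base, new):
    """Relative change from base to new, e.g. 0.41 for +41%."""
    return (new - base) / base


# Rounded regional figures quoted in the text above (USD per night).
asia_avg, asia_pfl = 78, 110
north_america_avg = 268.5
north_america_discount = 0.36

# Markup for a PFL in Asia relative to the average listing.
asia_markup = pct_change(asia_avg, asia_pfl)

# PFL price in North America implied by the 36% discount.
north_america_pfl = north_america_avg * (1 - north_america_discount)

print("Asia PFL markup: {:.0%}".format(asia_markup))
print("North America PFL: about ${:.0f}".format(north_america_pfl))
```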
In [16]:
# Make a boxplot
color = dict(boxes='DarkGreen', whiskers='DarkOrange',
medians='DarkBlue', caps='Gray')
plt.figure();
# Split the data on 'apt_pets_price' into regional groups
region_groups = df_cut.groupby(['region'])[['apt_pets_price']]
# List of regions
regions = df_cut['region'].unique()
data = []
# Add the data from the DF
for item in regions:
    a = region_groups.get_group(item)
    data.append(a)
# Make a new DF for regions
box_df = pd.concat(data, ignore_index=True, axis=1)
box_df.columns = regions
# Plot the data
box_df.plot(kind='box', sym='r+', color=color,
figsize=(8, 6)).set_ylim([0,300])
plt.ylabel('Average price per night, USD')
plt.title('The distribution of PFL prices in regions')
plt.show()
The box-and-whisker plot shows the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. This plot does not include outliers.
North America stands out as the most expensive region. Prices in this region are spread out over a wide range of values from \$40 to almost \$300.
Prices for a PFL in 50% of countries in North America (above median) would be higher than in the most expensive countries in Africa, Europe and Oceania.
Africa and Oceania are the most homogeneous regions. Prices do not differ greatly within each region.
Another observation is that in South America the 50% of countries below the median are very uniform in listing prices, while the half above the median is, on the contrary, widely spread.
The mentioned wide distribution in the top (or most expensive) part of the box (above the third quartile) is true for all regions except Oceania. Oceania is uniform when it comes to high prices.
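For readers who want the numbers behind a box, the five-number summary can be computed directly. This is a minimal sketch with invented prices, assuming Python 3.8 or newer for `statistics.quantiles`. The plotting library may use a slightly different quartile rule, so treat the values as approximate counterparts of what the plot draws.

```python
import statistics

# Invented average PFL prices per country in one region (USD per night).
prices = [40, 55, 60, 72, 85, 90, 110, 150, 290]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method='inclusive' interpolates between the observed data points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method='inclusive')
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary(prices)
print(summary)
```

The distance between Q1 and Q3 is the height of the box, so a tall box means prices in that region vary a lot from country to country.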
In [17]:
# Share of each country in total listings
apt_share = (df_cut['apt_total'] / df_cut['apt_total'].sum() * 100).round(2)
apt_share.sort(ascending=False)
# Count 80% of countries
quantile_count = int(len(df_cut) * 0.80)
print ("80%, or {0} countries in the dataset "
"account for {1}% of all listings available."
.format(quantile_count, apt_share[-1*quantile_count: ].sum()))
In [18]:
print ("These 5 countries account for {0}% "
"of all listings published on Airbnb: "
.format(apt_share[:5].sum()))
apt_share[:5]
Out[18]:
The same calculations for PFL reveal even lower numbers.
In [19]:
# Share of each country in PFL
pet_share = (df_cut['apt_pets'] /
df_cut['apt_pets'].sum() * 100).round(2)
pet_share.sort(ascending=False)
print ("80%, or {0} countries represent just "
"{1}% of PFL.".format(quantile_count,
pet_share[-1 * quantile_count: ].sum()))
In [20]:
print ("These 5 countries account for {0}% "
"of all PFL published on Airbnb: "
.format(pet_share[:5].sum()))
pet_share[:5]
Out[20]:
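The "80% of countries hold X% of listings" calculation used in the cells above can be sketched without pandas as follows. The listing counts are invented, and `tail_share` is a hypothetical helper name, not something from the notebook.

```python
def tail_share(counts, fraction):
    """Share of the total held by the smallest `fraction` of entries."""
    ordered = sorted(counts, reverse=True)
    tail_size = int(len(ordered) * fraction)
    # Guard: a slice like ordered[-0:] would return the whole list.
    if tail_size == 0:
        return 0.0
    return sum(ordered[-tail_size:]) / sum(ordered)


# Invented listing counts for ten countries.
listings = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]

share = tail_share(listings, 0.80)
print("The smallest 80% of countries hold {:.1%} of listings.".format(share))
```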
I'll plot countries that account for more than 1% of PFL.
In [21]:
# Add country codes to the DF.
pet_share_df = pd.concat(
[pet_share, df['country_code']],
axis=1).set_index('country_code').sort('apt_pets',
ascending=False)
# Make a bar chart
matplotlib.style.use('ggplot')
plt.figure();
share_plot = pet_share_df[pet_share_df['apt_pets']>1].plot(kind='bar',
legend=False,
rot=45,
title='Countries with share of all PFL > 1%\n')
share_plot.set_ylabel('Share, %')
share_plot.set_xlabel('')
plt.show()
But, again, these are raw numbers. I'll show the top 5 most friendly countries in each region.
In [22]:
for region in regions:
    print ("Top-5 friendliest countries in {}:"
           .format(region))
    display(df_cut[df_cut['region'] == region]
            .sort('apt_ratio', ascending=False)[['apt_ratio']][:5])
I'll map the ratio of PFL. First, I'll split the data into clusters using k-means. I have a module that helps to choose the value of k.
Here I plot the sum of squared errors (between each point and the mean of its cluster) as a function of k and look at where the graph “bends”.
In [23]:
import choosing_k as chk
data = df_cut['apt_ratio'].values.tolist()
chk.plot_errors(data)
It looks like 5 would be the right number of clusters.
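The `choosing_k` module is not shown in this notebook, so the sketch below is an independent illustration of the idea rather than its actual code: run a simple one-dimensional k-means for several values of k and record the sum of squared errors. The ratios are invented, and the deterministic choice of starting centroids is an assumption made to keep the example reproducible.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)]
                 for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def sum_squared_errors(centroids, clusters):
    """Sum of squared distances between each point and its cluster centre."""
    return sum((v - centre) ** 2
               for centre, cluster in zip(centroids, clusters)
               for v in cluster)


# Invented PFL ratios forming three loose groups.
ratios = [0.04, 0.05, 0.06, 0.18, 0.20, 0.22, 0.40, 0.42, 0.46]

errors = {}
for k in range(1, 6):
    centroids, clusters = kmeans_1d(ratios, k)
    errors[k] = sum_squared_errors(centroids, clusters)
    print("k={0}: SSE={1:.4f}".format(k, errors[k]))
```

With this toy data the error drops sharply up to k=3 and only slightly afterwards, which is the kind of bend the plot is used to spot.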
Now I use another module to split the data into 5 clusters.
In [29]:
import kmeans_calc as kmn
threshold_scale = [0.0] + kmn.split_into_groups(data, 5)
threshold_scale
Out[29]:
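How `kmn.split_into_groups` turns clusters into thresholds is not shown here. One plausible approach, sketched below purely as an assumption, is to take the largest value of each cluster as the upper boundary of its colour band. The grouped ratios are invented, and `cluster_upper_bounds` is a hypothetical helper.

```python
def cluster_upper_bounds(clusters):
    """Upper boundary of each non-empty cluster, in ascending order."""
    return sorted(max(cluster) for cluster in clusters if cluster)


# Invented PFL ratios, already grouped by a clustering step.
clusters = [
    [0.04, 0.05, 0.06],
    [0.18, 0.20, 0.22],
    [0.40, 0.42, 0.46],
]

# Prepend 0.0 so the scale starts at zero, as in the cell above.
example_scale = [0.0] + cluster_upper_bounds(clusters)
print(example_scale)
```

Midpoints between neighbouring cluster centres would be an equally reasonable choice; the module may well use something else.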
To map the data I'll use the folium library.
In [30]:
import json
import folium
from folium.utilities import split_six
In [31]:
columns = ['country_code', 'apt_ratio']
color_data = df_cut.set_index(columns[0])[columns[1]].to_dict()
In [32]:
geo_json_data = json.load(open('countries.geo.json'))
In [36]:
from folium.utilities import split_six, color_brewer
from folium.features import ColorScale
m = folium.Map([32, -45], tiles='Mapbox',
API_key='wrobstory.map-12345678',
zoom_start=2)
# Pass own threshold scale
color_domain = threshold_scale
# Choose color
fill_color = 'YlGnBu'
color_range = color_brewer(fill_color, n=len(color_domain))
key_on = 'id'
# I made a fix for the folium library to colour with white countries with no data.
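def get_by_key(obj, key):
    # Only plain (non-dotted) keys such as 'id' are handled here.
    if len(key.split('.')) <= 1:
        return obj.get(key, None)
    return None
def color_scale_fun(x):
    try:
        # Thresholds that the country's ratio has reached.
        r = [u for u in color_domain if u <= color_data[get_by_key(x, key_on)]]
        return color_range[len(r)]
    except KeyError:
        # No data for this country: colour it white.
        return '#FFFFFF'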
# Make a map
folium.GeoJson(geo_json_data,
               style_function=lambda feature: {
                   'fillColor': color_scale_fun(feature),
                   'color': 'black',
                   'weight': 1,
                   'fillOpacity': 0.8
               }).add_to(m)
color_scale = ColorScale(color_domain, fill_color, caption="Ratio of pet friendly listings")
m.add_children(color_scale)
print "Ratio of pet friendly listings to all listings"
m
Out[36]:
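The colouring rule behind the map boils down to finding which band of the threshold scale a ratio falls into. The sketch below is a variation on that idea using `bisect` from the standard library; it is not the notebook's `color_scale_fun`. The thresholds, palette hex values, and country codes are chosen for illustration only.

```python
from bisect import bisect_right

# Hypothetical thresholds and a five-colour palette (illustrative values).
thresholds = [0.0, 0.1, 0.2, 0.3, 0.4]
palette = ['#ffffcc', '#a1dab4', '#41b6c4', '#2c7fb8', '#253494']
NO_DATA_COLOR = '#FFFFFF'


def color_for(value, thresholds, palette):
    """Pick the palette entry for the band that contains `value`."""
    if value is None:
        return NO_DATA_COLOR
    # Count thresholds <= value, then shift to a 0-based band index.
    index = bisect_right(thresholds, value) - 1
    # Clamp so values outside the scale still get the first or last colour.
    index = max(0, min(index, len(palette) - 1))
    return palette[index]


# Invented country codes and ratios; 'ZZZ' has no data.
ratios_by_country = {'AAA': 0.05, 'BBB': 0.37, 'CCC': 0.95}
for code in ['AAA', 'BBB', 'CCC', 'ZZZ']:
    print(code, color_for(ratios_by_country.get(code), thresholds, palette))
```

Clamping the index avoids an out-of-range lookup when a ratio is at or above the last threshold, a case worth checking against the real threshold scale.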
This is the map I'm going to use when planning my road trip!