In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('train.csv')
In [3]:
df.head()
Out[3]:
In [4]:
df['AnimalType'].unique()
Out[4]:
In [5]:
# Number of cats in the training set
(df['AnimalType'] == 'Cat').sum()
Out[5]:
In [6]:
# Number of dogs in the training set
(df['AnimalType'] == 'Dog').sum()
Out[6]:
In [7]:
df['OutcomeType'].unique()
Out[7]:
In [8]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(x="OutcomeType", data=df, ax=ax1)
sns.countplot(x="AnimalType", hue="OutcomeType", data=df, ax=ax2)
Out[8]:
Overall, it seems that few animals died of natural causes.
Unfortunately, cats don't seem to have nine lives: they are likely to get transferred. Dogs are returned to their owners more often, perhaps thanks to their sad puppy faces and their reputation for loyalty.
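Raw counts can be hard to compare when one group is larger than the other, so a table of within-group shares may make the cat/dog comparison easier to read. The helper below is a minimal sketch, not part of the original analysis; `outcome_shares` is a hypothetical name, and it only assumes the `AnimalType` and `OutcomeType` columns used above.

```python
import pandas as pd


def outcome_shares(frame, by, outcome="OutcomeType"):
    """Share of each outcome within every group of `by` (each row sums to 1).

    `outcome_shares` is an illustrative helper, not part of the notebook.
    """
    # normalize="index" divides each row of counts by that row's total
    return pd.crosstab(frame[by], frame[outcome], normalize="index")
```

In the notebook this could be called as `outcome_shares(df, "AnimalType")`; the same call with a different `by` column gives the corresponding table for any other categorical feature.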
In [9]:
sns.countplot(x="SexuponOutcome", hue="OutcomeType", data=df)
Out[9]:
Overall, sex likely does not play a big role in the outcome, but the spayed/neutered population is larger and more likely to get adopted.
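Since `SexuponOutcome` appears to mix two pieces of information (sex and sterilization status), splitting it may make the two effects easier to examine separately. The sketch below assumes the values look like "Neutered Male", "Spayed Female", "Intact Male", "Intact Female" or "Unknown"; that format is an assumption about the data, and `split_sex_upon_outcome` is a hypothetical helper name.

```python
import pandas as pd


def split_sex_upon_outcome(values: pd.Series) -> pd.DataFrame:
    """Split strings such as 'Neutered Male' into a status and a sex column.

    Assumes (not verified here) values of the form '<Neutered|Spayed|Intact> <Male|Female>'
    or 'Unknown'; anything that does not match is labeled 'Unknown'.
    """
    text = values.fillna("Unknown").astype(str)
    # First word: sterilization status; last word: sex
    status = text.str.extract(r"^(Neutered|Spayed|Intact)", expand=False)
    sex = text.str.extract(r"(Male|Female)$", expand=False)
    status = status.map(
        {"Neutered": "Sterilized", "Spayed": "Sterilized", "Intact": "Intact"}
    )
    return pd.DataFrame(
        {"Status": status.fillna("Unknown"), "Sex": sex.fillna("Unknown")},
        index=values.index,
    )
```

The two resulting columns could then be passed to `sns.countplot` one at a time, with `hue="OutcomeType"`, in the same way as the plots above.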
In [10]:
# Split the data by animal type with boolean masks
dfCat = df[df['AnimalType'] == 'Cat']
dfDog = df[df['AnimalType'] == 'Dog']
In [11]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(x="SexuponOutcome", hue="OutcomeType", data=dfCat, ax=ax1)
sns.countplot(x="SexuponOutcome", hue="OutcomeType", data=dfDog, ax=ax2)
Out[11]:
Cats and dogs have different outcome distributions.
In [12]:
dfCat['Color'].describe()
Out[12]:
In [13]:
dfDog['Color'].describe()
Out[13]:
As expected, there are too many colors to visualize properly without discarding the majority of them. On reflection, it is more likely the combination of color and breed that makes a pet appealing.
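One possible way to plot a high-cardinality column without dropping rows is to keep the most frequent values and lump the rest into a single bucket. This is a sketch of that idea rather than something the analysis above does; `keep_top_categories`, the cutoff `n` and the label "Other" are all illustrative choices.

```python
import pandas as pd


def keep_top_categories(values: pd.Series, n: int = 10, other: str = "Other") -> pd.Series:
    """Keep the `n` most frequent values and replace everything else with `other`.

    Missing values are not counted by value_counts, so they also end up as `other`.
    """
    top = values.value_counts().nlargest(n).index
    # where() keeps entries for which the condition holds and substitutes `other` elsewhere
    return values.where(values.isin(top), other)
```

For example, `keep_top_categories(dfCat['Color'], n=10)` would return a column with at most eleven distinct values that can be passed to `sns.countplot`.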
In [14]:
df['AgeuponOutcome'].unique()
Out[14]:
As expected, the animals span a wide range of ages. Age should play a major role in deciding the outcome.
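To use age as a numeric feature, the strings would first need to be converted to a common unit. The sketch below assumes the values look like "2 weeks", "3 months" or "1 year" (an assumption about the column format) and uses approximate conversions of 30 days per month and 365 days per year; `age_to_days` is a hypothetical helper.

```python
import pandas as pd

# Approximate conversion factors (assumption: 30-day months, 365-day years)
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def age_to_days(values: pd.Series) -> pd.Series:
    """Convert strings such as '2 weeks' to an approximate age in days.

    Values that are missing or do not match '<number> <unit>' become NaN.
    """
    parts = values.str.extract(r"^(?P<count>\d+)\s+(?P<unit>day|week|month|year)s?$")
    count = pd.to_numeric(parts["count"], errors="coerce")
    days_per_unit = parts["unit"].map(DAYS_PER_UNIT)
    return count * days_per_unit
```

In the notebook this might be used as `df['AgeInDays'] = age_to_days(df['AgeuponOutcome'])`, where `AgeInDays` is an illustrative column name.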
In [15]:
# True when the animal has a name (notnull, not isnull, so the flag matches its name)
df['NameIsPresent'] = df['Name'].notnull()
In [16]:
sns.countplot(x="NameIsPresent", hue="OutcomeType", data=df)
Out[16]:
As the graph above shows, animals that didn't have names, or whose names were lost, have a very different outcome distribution. Named animals seem to be more popular for adoption. A name could mean that the animal had a previous owner and possibly a story.
In [17]:
# Number of rows where NameIsPresent is True
df['NameIsPresent'].sum()
Out[17]:
In [18]:
# Number of rows where NameIsPresent is False
(~df['NameIsPresent']).sum()
Out[18]:
We can see that, of the animals in the training set, more than 2/3 had names, and roughly half of those got adopted.
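The two fractions quoted above can also be computed directly instead of being read off the counts. This is a minimal sketch that works from the raw `Name` column; it assumes adopted animals are labeled "Adoption" in `OutcomeType`, and `name_summary` is a hypothetical helper name.

```python
import pandas as pd


def name_summary(frame: pd.DataFrame) -> dict:
    """Share of named animals and adoption rates with and without a name.

    Assumes (not verified here) that adoptions are labeled 'Adoption' in OutcomeType.
    """
    has_name = frame["Name"].notna()
    adopted = frame["OutcomeType"] == "Adoption"
    return {
        "share_named": has_name.mean(),  # mean of a boolean column is a proportion
        "adoption_rate_named": adopted[has_name].mean(),
        "adoption_rate_unnamed": adopted[~has_name].mean(),
    }
```

Calling `name_summary(df)` in the notebook would return the three proportions as a dictionary.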
In [19]:
df['OutcomeSubtype'].unique()
Out[19]:
In [20]:
sns.set_context("poster")
sns.countplot(x="OutcomeSubtype", hue="AnimalType", data=df)
Out[20]:
In [25]:
df['DateTime']
Out[25]: