New to Kaggle and data science in general. While creating my first kernel for the Titanic survival prediction model (in Python) I wrote down everything that I found unclear at first. The goal I set for myself was to get everything working, create a model that predicts at least somewhat better than chance alone, and upload it to Kaggle.
I hope this is useful to fellow newbie Kagglers.
-First: the decision to either work 'within' Kaggle (the Kaggle kernel) or use your own downloadable platform. -Second: 'Python or R?'
What is meant by 'own platform'? You can do all of this outside of Kaggle and then upload (or copy-paste) your code to Kaggle when done. Obviously this requires some setup, but the advantage is that you know where you stand when doing data science projects outside of Kaggle. The setup is made quite easy by an application called Anaconda Navigator, which also has other benefits, so I strongly suggest using it if you go for your own setup.
This discussion has been going on forever, rivalling the tabs vs. spaces debate :P #piedpiper Being horrible at JavaScript syntax, I chose Python for its syntactic ease, but both languages have their advantages and disadvantages. Most important take-away: either will be fine; if you're starting out, choose the language your (desired) job/company uses.
First things first. Even before looking at your data you want to orient yourself (at least if you are new to Jupyter Notebook and Kaggle kernels, like me). See where you are working from and, if necessary, change your working directory to wherever you have saved your data files (the csv's).
To do this:
#If you are working fully within a Kaggle kernel you can skip this. But it might be good to do it anyway, for potential troubleshooting purposes later on.
In [3]:
import os
os.getcwd()
Out[3]:
So we see we are currently in Users/steven. This is not where I want to be, because I have not saved my data files (csv's) here, and do not want to.
So I look up where the folder is that contains the csv's (train.csv & test.csv, downloaded from Kaggle) I intend to use.
You do this outside Jupyter, by just browsing your computer and noting the path. For me it is /Users/steven/Documents/Kaggle/Titanic, so that is what will be used in the following command.
# This is case sensitive, so pay attention to whether the folders on your computer start with or without an uppercase letter!
In [4]:
os.chdir('/Users/steven/Documents/Kaggle/Titanic')
Now we check using the same command as before (and we see it worked, because it prints the directory we wanted):
In [5]:
os.getcwd()
Out[5]:
So now you are ready to start. You want to look a bit at the data first. Two basic ways (there are a lot more) are:
1. Open the csv's in Excel and browse through the data.
2. Look at the data inside your Jupyter notebook / Kaggle kernel.
Let's assume you have already done number 1 (opening the data in Excel and looking around, ideally using a pivot table); if you are starting at Kaggle I am going to assume you are familiar with Excel basics. For number 2 (data in the Jupyter notebook / Kaggle kernel) the first step is to import a library that helps you work with csv files, called pandas (this is your 'csv-reader' and allows you to create dataframes):
In [6]:
import pandas as pd
In [7]:
%pylab inline
# the %pylab statement here just makes sure the visualizations we create later on
# are rendered within the notebook itself (%matplotlib inline is the more commonly recommended alternative)
In [8]:
train_df = pd.read_csv('train.csv', header=0)
#above is the basic command to 'import' your csv data into a dataframe
#(df is short for dataframe; you can name it anything you want, but including 'df' in the name is convention)
test_df = pd.read_csv('test.csv', header=0)
#you don't have to load the test set yet, but I am doing it now to evaluate the model later without uploading. You can skip this.
train_df.head(2)
#with .head(2) we 'test' by previewing the first (head) 2 rows of the dataframe
#you can see the final x rows by using train_df.tail(x) (replace x with a number of rows)
Out[8]:
In [9]:
train_df
#show the full dataset (if very large this can be very inconvenient, but with our train set it's ok)
#notice that it prints the totals of rows and columns underneath (troubleshooting: if you do not see
#these totals you can get them separately by using train_df.shape)
Out[9]:
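As an aside, since the comment above mentions it: train_df.shape returns those totals directly, as a (rows, columns) tuple:
train_df.shape
# (891, 12) for the Titanic train set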
In [10]:
#let's get slightly more meta (data about the data, such as the type of each variable ('column')):
train_df.info()
#especially the information on the right is useful at this point (the columns with values like 'int64' and 'object')
# These values describing each variable should be identical to those of the test set, which in this case
# (the Titanic datasets from Kaggle) they are. To verify this you could repeat this procedure with the
# test set instead of the train set.
About these datatype names:
There is a lot of ambiguity when expressing what type a variable is. In statistics there is a measurement level hierarchy (nominal, ordinal, interval, ratio) which I think is quite helpful. The confusion arises because in some fields 'categorical' is an umbrella term covering (among others) the nominal and ordinal levels, while in other fields 'categorical' is used as a synonym for just one of those measurement levels.
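To make this concrete in pandas terms, a small sketch using the Titanic columns (dtype tells you how pandas stores a column, which does not always match the statistical measurement level):
print(train_df.Sex.dtype)     # object  -> nominal (categories without order)
print(train_df.Pclass.dtype)  # int64   -> stored as a number, but really ordinal (1st > 2nd > 3rd class)
print(train_df.Fare.dtype)    # float64 -> ratio (true zero point, meaningful differences)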
In [11]:
# More, more, more!
#Let's dive fully into this meta-description of the variables:
train_df.describe()
# Notice that the describe function only returns the numeric (non-'object') variables.
Out[11]:
Looking at our 'dependent variable' (i.e. Survived) we see an average (mean) survival rate of .38.
We know from the introduction to the problem (the Kaggle description) "killing 1502 out of 2224", so this seems about right, because 1 - 1502/2224 = roughly 1/3 (.32).
Knowing our .38 is based on the training part of the data while the given description is based on the full set, it is close enough to state this is an honest sample.
Let's say you want the actual total of people who survived and the total who died (without calculating it back from the mean):
In [12]:
train_df.Survived.value_counts()
#the variable name (Survived) is capitalized because it is capitalized in the data set.
#value_counts is the 'smart' part, the function doing the counting.
Out[12]:
(So we know that in our train data set the overall survival chance = 342/891 = .38)
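As a side note, value_counts can also return these proportions directly instead of raw counts; a quick sketch:
train_df.Survived.value_counts(normalize=True)
# 0    0.62
# 1    0.38  (approximately)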
In [13]:
train_df.Sex.value_counts().plot(kind='bar')
#you can replace the variable with any of the 12 columns (for some with more visual success than for others..)
Out[13]:
So we have come quite far now.
It is time to really start segmenting, and we have already made a start with gender (Sex).
We chose this to start with because A) it's easy and B) by looking at the data in Excel (see the beginning) we should have some reasonable suspicion that the survival rate is not equal for men and women. So this is a sensible place to start segmenting.
Let's show the data for women only:
In [14]:
train_df[train_df.Sex=='female']
# note the double ==: we are making a comparison, not an assignment (a single = would try to set a value)
Out[14]:
In [15]:
#before continuing, let's do a quick check for missing values (rows where gender is unknown)
# by using the built-in isnull function:
train_df[train_df.Sex.isnull()]
Out[15]:
This shows up empty, so that is good news; it saves us time.
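Since other columns are not as complete as Sex, here is a one-line sketch to count the missing values in every column at once:
train_df.isnull().sum()
# for this data set you should see missing values in Age, Cabin and Embarked, and 0 everywhere else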
In [16]:
# Let's visualize the number of survivors among women, and later among men, to compare.
train_df[train_df.Sex=='female'].Survived.value_counts().plot(kind='bar', title='Survival among female')
Out[16]:
Now we copy and paste the command from above and delete the 'fe' from 'female' (don't forget the title):
In [17]:
train_df[train_df.Sex=='male'].Survived.value_counts().plot(kind='bar', title='Survival among male')
Out[17]:
In [18]:
# The same can be done for age. Here it can also be interesting to combine age with sex:
train_df[(train_df.Age<11) & (train_df.Sex=='female')].Survived.value_counts().plot(kind='bar')
# '11' is just an arbitrarily chosen number for age.
Out[18]:
As you can see, the combination of a low age and female (i.e. little girls) has a quite different (higher) survival rate compared to the total train set average (.38).
Let's see if children regardless of gender also have better chances than .38:
In [19]:
train_df[(train_df.Age<11)].Survived.value_counts().plot(kind='bar')
Out[19]:
We can clearly see that children (< 11 years) in general have better chances than the overall population (train set), but not as good as children (< 11 years) who are also girls.
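To put exact numbers on these bar-plot comparisons, a short sketch using groupby and boolean filtering (the mean of a 0/1 column is the survival rate):
print(train_df.groupby('Sex').Survived.mean())      # survival rate per gender
print(train_df[train_df.Age < 11].Survived.mean())  # survival rate for children under 11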
Import the python library seaborn (as sns).
You could also do this at the beginning, together with the pandas import, so you get a single list with every library to be imported. Doing it at the beginning is convention (and makes it easy for outsiders to see all used libraries at once), but for the purpose of this getting-started tutorial I think it is better like this.
In [20]:
import seaborn as sns
# I don't know why seaborn is abbreviated as sns, but you can choose any alias you like as long as it is not
# used by anything else. sns is the convention.
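For reference, had we followed the imports-at-the-top convention, the very first cell of this notebook would have looked something like this (the sklearn imports from the modelling part further down would also go here):
import os
import pandas as pd
import seaborn as sns
%pylab inline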
With seaborn we can more easily make barplots that show us more at once. For example we can take the variable Pclass and define it as the x-axis, make our dependent variable Survived the y-axis, and differentiate between men and women (defining Sex as hue):
In [21]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df);
In [22]:
#If we don't mind stereotypes ;p we could change the colors so that we don't have to look at the legend
# to remind us of the color coding for Sex:
#Just use the same command but add palette={"male": "blue", "female": "pink"}
In [23]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df, palette={"male": "blue", "female": "pink"});
In reality we would have to repeat these visualisation commands for all variables to see which ones might be interesting for our model (see the loop sketch below). For now, let's say we have done this for every variable, and that it resulted in wanting to keep: Sex, Age, Pclass, Cabin & Fare (and PassengerId).
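A small sketch of how you could loop over several variables instead of copy-pasting the barplot command (plt.show() forces each figure to render inside the loop):
import matplotlib.pyplot as plt
for col in ['Pclass', 'SibSp', 'Parch', 'Embarked']:
    sns.barplot(x=col, y='Survived', hue='Sex', data=train_df)
    plt.show()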
Now that we have familiarized ourselves with the data, we have to get it into a shape we can actually use.
This means dropping some features (variables) we don't want to use (Ticket, Name and Embarked), creating bins for other variables (Age and Fare), and reducing the values of one variable to just their first letter (Cabin).
In [24]:
# Let's first remove the variables we don't want:
def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)
In [25]:
# make bins for ages and name them for ease:
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)  # fill missing ages with -0.5 so they fall into the 'Unknown' bin
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df
#keep only the first letter of the cabin (similar effect to making bins/clusters):
def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')  # 'N' for missing cabins
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df
# make bins for fare prices and name them:
def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)  # same trick: missing fares land in the 'Unknown' bin
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df
# combine all of the above in a transform_features function to be called later:
def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = drop_features(df)
    return df
#create new dataframes under different names (note: the simplify functions also modify the
#original dataframes in place, since pandas dataframes are passed by reference):
train_df2 = transform_features(train_df)
test_df2 = transform_features(test_df)
Let's see what it looks like and whether everything has gone as planned:
In [26]:
train_df2
Out[26]:
In [27]:
sns.barplot(x="Age", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [28]:
sns.barplot(x="Cabin", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [29]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [30]:
sns.barplot(x="Fare", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [31]:
from sklearn import preprocessing

def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex']
    # fit the encoder on train and test combined, so both sets share the same value-to-integer mapping
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

train_df2, test_df2 = encode_features(train_df2, test_df2)
train_df2.head()
Out[31]:
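To see what LabelEncoder actually does (and why we fit it on train and test combined: both sets then share one mapping), a tiny standalone sketch:
le = preprocessing.LabelEncoder()
le.fit(['female', 'male'])
print(le.classes_)                       # ['female' 'male'] -> female becomes 0, male becomes 1
print(le.transform(['male', 'female']))  # [1 0]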
In [32]:
train_df2.info()
In [33]:
X_train = train_df2.drop(["Survived", "PassengerId"], axis=1)
Y_train = train_df2["Survived"]
X_test = test_df2.drop("PassengerId", axis=1).copy()
# I initially did not drop PassengerId, keeping 8 variables ('features') in X_train and X_test. However, later on
# (during the modelling part) this resulted in an accuracy of 1.00 for the random forest and the decision tree.
# Most likely keeping PassengerId in this manner caused some form of label leakage (the trees can simply
# memorize rows by their id). After dropping it in both sets the accuracy results were more realistic.
X_train.shape, Y_train.shape, X_test.shape
Out[33]:
In [34]:
X_train.head()
Out[34]:
In [35]:
Y_train.head()
Out[35]:
In [36]:
# Logistic Regression
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[36]:
In [37]:
# Decision Tree
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[37]:
In [38]:
# Random Forest
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[38]:
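A caveat on all three accuracy numbers above: they are computed on the same data the models were trained on, so they flatter the models (the trees especially). A more honest in-notebook estimate is cross-validation; a minimal sketch (depending on your sklearn version the import may live in sklearn.cross_validation instead of sklearn.model_selection):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
print(scores.mean())  # typically noticeably lower than the training accuracy above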
In [39]:
#Creating a csv with the predicted scores (Y as 0's and 1's for survival)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
# But let's inspect it first to check that nothing looks weird:
In [40]:
submission.describe()
Out[40]:
In [57]:
submission.to_csv('../submission.csv', index=False)
Aaaaaaand we have a 0.7512 score. Not too bad for a first upload.
Not too good either, because the baseline is not .5 (random guessing) but .62 ('mean as model').
But we haven't done any further optimization like feature engineering yet. Also keep in mind that the >.95 scores on the leaderboard are most likely overfitted, either by uploading on a trial-and-error basis, or by using the test set data to train the model (it is publicly available data, after all..)
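About that .62 baseline ('mean as model' = always predict the majority class, i.e. everybody dies): you can verify it in one line:
(train_df2.Survived == 0).mean()  # fraction of non-survivors, roughly 0.62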
Now it is time to start improving this 'base' model (the titles in the passenger Name variable would be a good start; see the sketch below).
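As a taste of that next step, a sketch of extracting titles from the Name column (we dropped Name from train_df2 above, so this re-reads the original csv):
raw = pd.read_csv('train.csv')
raw['Title'] = raw.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(raw.Title.value_counts().head())  # Mr, Miss, Mrs, Master, ...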
Please feel free to comment below; I will read everything and incorporate it where possible. Please give tips where you think they are needed, I will try to learn from users' suggestions :)