Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python.
While Scikit-learn does a lot of the heavy lifting, it is equally important to ensure that raw data is processed in such a way that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.
Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.
Learning machine learning via Kaggle problems allows us to take a highly directed approach: each competition gives us a concrete problem, a ready-made data set, and immediate feedback on how well our predictions perform.
In the following set of exercises, we will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.
We will start by processing the training data, which we will then use to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we output our predictions to a .csv file to make a submission to Kaggle and see how well they perform.
It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps simplistic) approach of ignoring the whole row if any part of it contains a NaN value. We will build on this approach in later sections.
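As a quick illustration of this approach on a made-up toy frame (not our actual data), dropna() discards a row if any of its columns contains a NaN:

import pandas as pd
import numpy as np

# Toy example: dropna() keeps only rows with no missing values at all.
toy = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4, np.nan, 6]})
print(toy.dropna())  # only the first row remains; the other two each contain a NaN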
First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../data/train.csv')
We then review a selection of the data.
In [2]:
df.head(10)
Out[2]:
We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).
Exercise:
We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. We proceed to remove them from our data set.
In [3]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
Next, we review the type of data in the columns, and their respective counts.
In [4]:
df.info()
We notice that the columns Age and Embarked have NaNs or missing values. As previously discussed, we take the approach of simply removing the rows with missing values.
In [5]:
df = df.dropna()
Question
Scikit-learn only takes numerical arrays as inputs. As such, we would need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and create a new column that represents the data as numbers.
In [6]:
df['Sex'].unique()
Out[6]:
In [7]:
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
Similarly for Embarked, we review the range of values and create a new column called Port that represents, as a numerical value, where each passenger embarked from.
In [8]:
df['Embarked'].unique()
Out[8]:
In [9]:
df['Port'] = df['Embarked'].map({'C': 1, 'S': 2, 'Q': 3}).astype(int)
Question
Now that we have numerical columns that encapsulate the information provided by the columns Sex and Embarked, we can proceed to drop them from our data set.
In [10]:
df = df.drop(['Sex', 'Embarked'], axis=1)
We review the columns of our final, processed data set.
In [11]:
cols = df.columns.tolist()
print(cols)
For convenience, we move the column Survived to the left-most column. We note that the left-most column is indexed as 0.
In [12]:
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]
In our final review of our training data, we check that (1) the column Survived is the left-most column, (2) there are no NaN values, and (3) all the values are in numerical form.
In [13]:
df.head(10)
Out[13]:
In [14]:
df.info()
Finally, we convert the processed training data from a Pandas dataframe into a numerical (NumPy) array.
In [15]:
train_data = df.values
In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections.
Here we'll be using the Random Forest model. The intuition is as follows: a decision tree splits the data step by step, at each step choosing the feature that best separates the outcomes; each split is a 'branch', and a collection of branches is a 'tree'. The Random Forest model, broadly speaking, grows a 'forest' of such trees, each trained on a random sample of the data, and aggregates their predictions by majority vote.
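To make this intuition concrete, here is a minimal sketch on made-up toy data (not our Titanic data) of how a 'forest' of individually trained trees votes on a prediction:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                 # 100 samples, 3 features
y = (X[:, 0] > 0.5).astype(int)      # outcome driven by the first feature

trees = []
for _ in range(10):
    idx = rng.randint(0, len(X), len(X))  # bootstrap sample, drawn with replacement
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.array([t.predict(X[:5]) for t in trees])  # each tree votes on 5 samples
print(votes.mean(axis=0).round())    # majority vote, i.e. the 'forest' prediction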
In [16]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
We use the processed training data to 'train' (or 'fit') our model. The set of features (with the columns Survived and PassengerId omitted) will be our first input, and the column Survived our second.
In [17]:
model = model.fit(train_data[:, 2:], train_data[:, 0])
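As an optional aside (not needed for the submission), the fitted model exposes feature_importances_, which gives a rough sense of how much each feature contributed to the trees' splits. Here, cols[2:] lines up with the feature columns train_data[:, 2:]:

# Optional: rough importance of each feature in the fitted forest.
for name, score in zip(cols[2:], model.feature_importances_):
    print(name, round(score, 3))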
We first load the test data.
In [18]:
df_test = pd.read_csv('../data/test.csv')
We then review a selection of the data.
In [19]:
df_test.head(10)
Out[19]:
We notice that the test data has columns similar to our training data, but not the column Survived. We'll use our trained model to predict values for the column Survived.
As before, we process the test data in a similar fashion to what we did to the training data.
In [20]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)  # drop the irrelevant columns
df_test = df_test.dropna()                                   # drop rows with missing values
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1})
df_test['Port'] = df_test['Embarked'].map({'C': 1, 'S': 2, 'Q': 3})
df_test = df_test.drop(['Sex', 'Embarked'], axis=1)          # drop the original categorical columns
test_data = df_test.values                                   # convert to a NumPy array
We now apply the trained model to the test data (omitting the column PassengerId) to produce an output of predictions.
In [21]:
output = model.predict(test_data[:,1:])
We create a Pandas dataframe by combining the PassengerId column from the test data with the output of predictions.
In [22]:
result = np.c_[test_data[:, 0].astype(int), output.astype(int)]
df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])
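Equivalently, as a stylistic alternative, the same dataframe can be built directly from the two columns, without the intermediate NumPy concatenation:

# Alternative construction: build the dataframe directly from the columns.
df_result = pd.DataFrame({
    'PassengerId': test_data[:, 0].astype(int),
    'Survived': output.astype(int),
})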
We briefly review our predictions.
In [23]:
df_result.head(10)
Out[23]:
Finally, we output our results to a .csv file.
In [24]:
df_result.to_csv('../results/titanic_1-0.csv', index=False)
However, it appears that we have a problem. The Kaggle submission website expects "the solution file to have 418 predictions."
https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach
We compare this to our result.
In [25]:
df_result.shape
Out[25]:
Since we eliminated the rows containing NaNs, we end up with fewer predictions than there are rows in the test data. As Kaggle requires all 418 predictions, we are unable to make a submission.
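To see where the rows went, we can count the missing values per column in the raw test file (re-reading the same file we loaded earlier):

# Rows with any missing value were dropped by dropna(), shrinking our predictions.
df_test_raw = pd.read_csv('../data/test.csv')
print(len(df_test_raw))            # 418 rows, as Kaggle expects
print(df_test_raw.isnull().sum())  # missing values per column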
In this section, we took the simplest approach of ignoring missing values, but failed to produce a complete set of predictions. We will build on this approach in Section 1-1.