We'll be practicing with the Kaggle 'Titanic: Machine Learning from Disaster' dataset, available here: http://www.kaggle.com/c/titanic-gettingStarted/data. Download the 'train.csv' file and use the Python commands described below to start exploring the dataset!
You will need to tell Python where to read the CSV file from. Locate where you saved train.csv and copy the entire path, starting with "C:/..." and ending with ".../train.csv". If you put the file in the same directory as this notebook, you can simply load it as "train.csv".
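The path handling described above can be sketched as follows. This is a minimal illustration rather than part of the original notebook: `resolve_csv_path` is a hypothetical helper name, and it only turns a bare filename into a full path relative to the directory the notebook is running in.

```python
from pathlib import Path


def resolve_csv_path(location="train.csv"):
    """Return a full path for the CSV file (hypothetical helper, for illustration)."""
    path = Path(location).expanduser()
    # A bare filename such as "train.csv" is looked up next to the notebook,
    # i.e. in the current working directory.
    if not path.is_absolute():
        path = Path.cwd() / path
    return path


# Prints the location Python would read from when given just "train.csv"
print(resolve_csv_path("train.csv"))
```

The resulting path can be passed straight to `pd.read_csv`, which accepts both plain strings and `Path` objects.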
This version of the example uses an IPython notebook. You can run the code in the cell the cursor is in by pressing Shift-Enter, or you can run everything in one go by selecting "Run All" from the Cell menu above.
First we import the libraries we need for data analysis; these are all included with Anaconda.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [39]:
titanicdata = pd.read_csv('train.csv')
titanicdata
Out[39]:
In [3]:
titanicdata[0:10]
Out[3]:
In [4]:
titanicdata.shape
Out[4]:
In [5]:
titanicdata.describe()
Out[5]:
In [8]:
# DataFrame.sort() has been removed from pandas; sort_values() is the current equivalent
titanicdata = titanicdata.sort_values(by='Fare', ascending=False)
titanicdata[0:5]
Out[8]:
Looks like PC 17755 was a really swanky ticket!
Now let's create a new column with a boolean value indicating whether the passenger was a minor (under 18) and view a few of them.
In [38]:
# Rows with a missing Age compare as False here, so they are not flagged as minors
titanicdata['isMinor'] = (titanicdata['Age'] < 18.0)
titanicdata[titanicdata['isMinor']][0:5]
Out[38]:
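One detail worth keeping in mind with the boolean column above: the Age column in train.csv has missing values, and a comparison against a missing value evaluates to False. The sketch below shows the effect on a small made-up frame (the `toy` data is invented for illustration and is not taken from the Titanic file).

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value (NaN)
toy = pd.DataFrame({"Age": [4.0, 30.0, np.nan, 17.0]})

# Same comparison as in the notebook: NaN < 18.0 evaluates to False
toy["isMinor"] = toy["Age"] < 18.0

# Count how many rows have no recorded age, and how many were flagged as minors
n_unknown = int(toy["Age"].isna().sum())
n_minor = int(toy["isMinor"].sum())
print(toy)
print("unknown ages:", n_unknown, "flagged minors:", n_minor)
```

In other words, `isMinor == False` means "adult or age unknown", so it may be worth checking `titanicdata['Age'].isna().sum()` before reading too much into the counts.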
In [122]:
# Start with the default boxplot style
ax = titanicdata.boxplot(column='Age', by='Survived', return_type='axes', figsize=(10,5))
# Now add the actual data points, with some jitter to reduce overplotting, color coded by sex
color = titanicdata.Sex.apply(dict(male='red', female='green').get)
x = titanicdata.Survived + np.random.normal(0, 0.01, len(titanicdata.Age)) + 1
plt.scatter(x, titanicdata.Age, c=color, alpha=0.2)
ax['Age'].set_xlabel('Passenger Status')
ax['Age'].set_ylabel('Passenger Age')
ax['Age'].set_title('Boxplot of Age vs. Survival Status')
ax['Age'].xaxis.set_ticklabels(['Died', 'Survived'])
Out[122]:
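The `+ 1` in the x-coordinates above is there because matplotlib draws boxplots at positions 1, 2, ... by default, while Survived is coded 0 and 1. A minimal sketch of the jitter calculation on its own, using made-up survival flags and a seeded generator (both are assumptions for illustration only):

```python
import numpy as np

# Made-up survival flags: 0 = died, 1 = survived
survived = np.array([0, 1, 1, 0, 1])

# Seeded generator so the sketch is repeatable
rng = np.random.default_rng(0)

# Shift 0/1 onto the boxplot positions 1/2, then add small horizontal noise
jitter = rng.normal(0, 0.01, size=survived.size)
x = survived + 1 + jitter

# Each point stays very close to the box it belongs to
print(np.round(x))
```

A larger standard deviation spreads the points further across each box, at the cost of points drifting toward the neighbouring group.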