New to Kaggle and data science in general. While creating my first kernel for the Titanic survival prediction model (in Python) I wrote down everything that I found unclear at first. The goal I set for myself was to get everything working, create a model that predicts at least somewhat better than chance alone, and upload it to Kaggle.
I hope this is useful to fellow newbie Kagglers.
-First: the decision to either work 'within' Kaggle (the Kaggle kernel) or use your own downloadable platform. -Second: 'Python or R?'
What is meant by 'own platform'? You can do all of this outside of Kaggle and then upload (or copy-paste) your code to Kaggle when done. Obviously this requires some setup, but the advantage is that you know where you stand when doing data science projects outside of Kaggle. The setup is made quite easy by an application called Anaconda Navigator, which also has other benefits, so I strongly suggest using it if you go for your own setup.
This discussion has been going on forever, rivalling the tabs vs. spaces debate :P #piedpiper Being horrible at JavaScript syntax, I chose Python for its syntactic ease, but both languages have their advantages and disadvantages. Most important take-away: either will be fine; if you're starting out, choose the language your (desired) job/company uses.
First things first. Even before looking at your data you want to orient yourself (at least if you are new to Jupyter Notebook and Kaggle kernels, like me). See where you are working from and, if necessary, change your working directory to wherever you have saved your data files (the csv's).
To do this:
#If you are working fully within a Kaggle kernel you can skip this. But it might be good to do it anyway, for potential troubleshooting purposes later on.
In [3]:
import os
os.getcwd()
Out[3]:
So we see we are currently in Users/steven. This is not where I want to be, because I have not saved my data files (csv's) here, and do not want to.
So I look up where the folder is that contains the csv's (train.csv & test.csv, downloaded from Kaggle) I intend to use.
You do this outside Jupyter, by just browsing your computer and noting the path. For me it is /Users/steven/Documents/Kaggle/Titanic, so that is what will be used in the following command.
# This is case sensitive, so pay attention to whether the folders on your computer start with or without an uppercase letter!
In [4]:
os.chdir('/Users/steven/Documents/Kaggle/Titanic')
Now we check using the same command as before (and we see it worked, because it prints the directory we wanted):
In [5]:
os.getcwd()
Out[5]:
So now you are ready to start. You want to look a bit at the data first. Two basic ways (there are a lot more) are:
1. Open the csv's in Excel and browse through the data.
2. Look at the data inside your Jupyter notebook / Kaggle kernel.
Let's assume you have already done number 1 (opening the data in Excel and looking around, ideally using a pivot table); if you are starting at Kaggle I am going to assume you are familiar with Excel basics. For number 2 (data in the Jupyter notebook / Kaggle kernel) the first step is to import a library that helps you work with csv files, called pandas (this is your 'csv-reader' and allows you to create dataframes):
In [6]:
import pandas as pd
In [7]:
%pylab inline
# the %pylab statement here just makes sure the visualizations we create later on
# are rendered within the notebook itself (%matplotlib inline is the more commonly recommended alternative)
In [8]:
train_df = pd.read_csv('train.csv', header=0)
#above is the basic command to 'import' your csv data into a dataframe
#(df is short for dataframe; you can name it anything you want, but including 'df' in the name is convention)
test_df = pd.read_csv('test.csv', header=0)
#you don't have to load the test set yet, but I am doing it now to evaluate the model later without uploading. You can skip this.
train_df.head(2)
#with .head(2) we 'test' by previewing the first (head) 2 rows of the dataframe
#you can see the final x rows by using train_df.tail(x) (replace x with a number of rows)
Out[8]:
In [9]:
train_df
#show the full dataset (if very large this can be very inconvenient, but with our train set it's ok)
#notice that it prints the totals of rows and columns underneath (troubleshooting: if you do not see
#these totals you can get them separately by using train_df.shape)
Out[9]:
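As an aside, since the comment above mentions it: train_df.shape returns those totals directly, as a (rows, columns) tuple:
train_df.shape
# (891, 12) for the Titanic train set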
In [10]:
#let's get slightly more meta (data about the data, such as the type of each variable ('column')):
train_df.info()
#especially the information on the right is useful at this point (the columns with values like 'int64' and 'object')
# These values describing each variable should be identical to those of the test set, which in this case
# (the Titanic datasets from Kaggle) they are. To verify this you could repeat this procedure with the
# test set instead of the train set.
About these datatype names:
There is a lot of ambiguity when expressing what type a variable is. In statistics there is a measurement level hierarchy (nominal, ordinal, interval, ratio) which I think is quite helpful. The confusion arises because in some fields 'categorical' is an umbrella term covering (among others) the nominal and ordinal levels, while in other fields 'categorical' is used as a synonym for just one of those measurement levels.
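To make this concrete in pandas terms, a small sketch using the Titanic columns (dtype tells you how pandas stores a column, which does not always match the statistical measurement level):
print(train_df.Sex.dtype)     # object  -> nominal (categories without order)
print(train_df.Pclass.dtype)  # int64   -> stored as a number, but really ordinal (1st > 2nd > 3rd class)
print(train_df.Fare.dtype)    # float64 -> ratio (true zero point, meaningful differences)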
In [11]:
# More, more, more!
#Let's dive fully into this meta-description of the variables:
train_df.describe()
# Notice that the describe function only returns the numeric (non-'object') variables.
Out[11]:
Looking at our 'dependent variable' (i.e. Survived) we see an average (mean) survival rate of .38.
We know from the introduction to the problem (the Kaggle description) "killing 1502 out of 2224", so this seems about right, because 1 - 1502/2224 = roughly 1/3 (.32).
Knowing our .38 is based on the training part of the data while the given description is based on the full set, it is close enough to state this is an honest sample.
Let's say you want the actual total of people who survived and the total who died (without calculating it back from the mean):
In [12]:
train_df.Survived.value_counts()
#the variable name (Survived) is capitalized because it is capitalized in the data set.
#value_counts is the 'smart' part, the function doing the counting.
Out[12]:
(So we know that in our train data set the overall survival chance = 342/891 = .38)
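As a side note, value_counts can also return these proportions directly instead of raw counts; a quick sketch:
train_df.Survived.value_counts(normalize=True)
# 0    0.62
# 1    0.38  (approximately)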
In [13]:
train_df.Sex.value_counts().plot(kind='bar')
#you can replace the variable with any of the 12 columns (for some with more visual success than for others..)
Out[13]:
So we have come quite far now.
It is time to really start segmenting, and we have already made a start with gender (Sex).
We chose this to start with because A) it's easy and B) by looking at the data in Excel (see the beginning) we should have some reasonable suspicion that the survival rate is not equal for men and women. So this is a sensible place to start segmenting.
Let's show the data for women only:
In [14]:
train_df[train_df.Sex=='female']
# note the double ==: we are making a comparison, not an assignment (a single = would try to set a value)
Out[14]:
In [15]:
#before continuing, let's do a quick check for missing values (rows where gender is unknown)
# by using the built-in isnull function:
train_df[train_df.Sex.isnull()]
Out[15]:
This shows up empty, so that is good news; it saves us time.
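Since other columns are not as complete as Sex, here is a one-line sketch to count the missing values in every column at once:
train_df.isnull().sum()
# for this data set you should see missing values in Age, Cabin and Embarked, and 0 everywhere else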
In [16]:
# Let's visualize the number of survivors among women, and later among men, to compare.
train_df[train_df.Sex=='female'].Survived.value_counts().plot(kind='bar', title='Survival among female')
Out[16]:
Now we copy and paste the command from above and delete the 'fe' from 'female' (don't forget the title):
In [17]:
train_df[train_df.Sex=='male'].Survived.value_counts().plot(kind='bar', title='Survival among male')
Out[17]:
In [18]:
# The same can be done for age. Here it can also be interesting to combine age with sex:
train_df[(train_df.Age<11) & (train_df.Sex=='female')].Survived.value_counts().plot(kind='bar')
# '11' is just an arbitrarily chosen number for age.
Out[18]:
As you can see, the combination of a low age and female (i.e. little girls) has a quite different (higher) survival rate compared to the total train set average (.38).
Let's see if children regardless of gender also have better chances than .38:
In [19]:
train_df[(train_df.Age<11)].Survived.value_counts().plot(kind='bar')
Out[19]:
We can clearly see that children (< 11 years) in general have better chances than the overall population (train set), but not as good as children (< 11 years) who are also girls.
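To put exact numbers on these bar-plot comparisons, a short sketch using groupby and boolean filtering (the mean of a 0/1 column is the survival rate):
print(train_df.groupby('Sex').Survived.mean())      # survival rate per gender
print(train_df[train_df.Age < 11].Survived.mean())  # survival rate for children under 11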
Import the python library seaborn (as sns).
You could also do this at the beginning, together with the pandas import, so you get a single list with every library to be imported. Doing it at the beginning is convention (and makes it easy for outsiders to see all used libraries at once), but for the purpose of this getting-started tutorial I think it is better like this.
In [20]:
import seaborn as sns
# I don't know why seaborn is abbreviated as sns, but you can choose any alias you like as long as it is not
# used by anything else. sns is the convention.
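For reference, had we followed the imports-at-the-top convention, the very first cell of this notebook would have looked something like this (the sklearn imports from the modelling part further down would also go here):
import os
import pandas as pd
import seaborn as sns
%pylab inline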
With seaborn we can more easily make barplots that show us more at once. For example we can take the variable Pclass and define it as the x-axis, make our dependent variable Survived the y-axis, and differentiate between men and women (defining Sex as hue):
In [21]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df);
In [22]:
#If we don't mind stereotypes ;p we could change the colors so that we don't have to look at the legend
# to remind us of the color coding for Sex:
#Just use the same command but add palette={"male": "blue", "female": "pink"}
In [23]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df, palette={"male": "blue", "female": "pink"});
In reality we would have to repeat these visualisation commands for all variables to see which ones might be interesting for our model (see the loop sketch below). For now, let's say we have done this for every variable, and that it resulted in wanting to keep: Sex, Age, Pclass, Cabin & Fare (and PassengerId).
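A small sketch of how you could loop over several variables instead of copy-pasting the barplot command (plt.show() forces each figure to render inside the loop):
import matplotlib.pyplot as plt
for col in ['Pclass', 'SibSp', 'Parch', 'Embarked']:
    sns.barplot(x=col, y='Survived', hue='Sex', data=train_df)
    plt.show()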
Now that we have familiarized ourselves with the data, we have to get it into a shape we can actually use.
This means dropping some features (variables) we don't want to use (Ticket, Name and Embarked), creating bins for other variables (Age and Fare), and reducing the values of one variable to just their first letter (Cabin).
In [24]:
# Let's first remove the variables we don't want:
def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)
In [25]:
# make bins for ages and name them for ease:
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)  # fill missing ages with -0.5 so they fall into the 'Unknown' bin
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df
#keep only the first letter of the cabin (similar effect to making bins/clusters):
def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')  # 'N' for missing cabins
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df
# make bins for fare prices and name them:
def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)  # same trick: missing fares land in the 'Unknown' bin
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df
# combine all of the above in a transform_features function to be called later:
def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = drop_features(df)
    return df
#create new dataframes under different names (note: the simplify functions also modify the
#original dataframes in place, since pandas dataframes are passed by reference):
train_df2 = transform_features(train_df)
test_df2 = transform_features(test_df)
Let's see what it looks like and whether everything has gone as planned:
In [26]:
train_df2
Out[26]:
In [27]:
sns.barplot(x="Age", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [28]:
sns.barplot(x="Cabin", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [29]:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [30]:
sns.barplot(x="Fare", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
In [31]:
from sklearn import preprocessing

def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex']
    # fit the encoder on train and test combined, so both sets share the same value-to-integer mapping
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

train_df2, test_df2 = encode_features(train_df2, test_df2)
train_df2.head()
Out[31]:
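To see what LabelEncoder actually does (and why we fit it on train and test combined: both sets then share one mapping), a tiny standalone sketch:
le = preprocessing.LabelEncoder()
le.fit(['female', 'male'])
print(le.classes_)                       # ['female' 'male'] -> female becomes 0, male becomes 1
print(le.transform(['male', 'female']))  # [1 0]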
In [32]:
train_df2.info()
In [33]:
X_train = train_df2.drop(["Survived", "PassengerId"], axis=1)
Y_train = train_df2["Survived"]
X_test = test_df2.drop("PassengerId", axis=1).copy()
# I initially did not drop PassengerId, keeping 8 variables ('features') in X_train and X_test. However, later on
# (during the modelling part) this resulted in an accuracy of 1.00 for the random forest and the decision tree.
# Most likely keeping PassengerId in this manner caused some form of label leakage (the trees can simply
# memorize rows by their id). After dropping it in both sets the accuracy results were more realistic.
X_train.shape, Y_train.shape, X_test.shape
Out[33]:
In [34]:
X_train.head()
Out[34]:
In [35]:
Y_train.head()
Out[35]:
In [36]:
# Logistic Regression
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[36]:
In [37]:
# Decision Tree
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[37]:
In [38]:
# Random Forest
# Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[38]:
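A caveat on all three accuracy numbers above: they are computed on the same data the models were trained on, so they flatter the models (the trees especially). A more honest in-notebook estimate is cross-validation; a minimal sketch (depending on your sklearn version the import may live in sklearn.cross_validation instead of sklearn.model_selection):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
print(scores.mean())  # typically noticeably lower than the training accuracy above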
In [39]:
#Creating a csv with the predicted scores (Y as 0's and 1's for survival)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
# But let's inspect it first to check that nothing looks weird:
In [40]:
submission.describe()
Out[40]:
In [57]:
submission.to_csv('../submission.csv', index=False)
Aaaaaaand we have a 0.7512 score. Not too bad for a first upload.
Not too good either, because the baseline is not .5 (random guessing) but .62 ('mean as model').
But we haven't done any further optimization like feature engineering yet. Also keep in mind that the >.95 scores on the leaderboard are most likely overfitted, either by uploading on a trial-and-error basis, or by using the test set data to train the model (it is publicly available data, after all..)
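About that .62 baseline ('mean as model' = always predict the majority class, i.e. everybody dies): you can verify it in one line:
(train_df2.Survived == 0).mean()  # fraction of non-survivors, roughly 0.62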
Now it is time to start improving this 'base' model (the titles in the passenger Name variable would be a good start; see the sketch below).
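As a taste of that next step, a sketch of extracting titles from the Name column (we dropped Name from train_df2 above, so this re-reads the original csv):
raw = pd.read_csv('train.csv')
raw['Title'] = raw.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(raw.Title.value_counts().head())  # Mr, Miss, Mrs, Master, ...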
Please feel free to comment below; I will read everything and incorporate it where possible. Please give tips where you think they are needed, I will try to learn from users' suggestions :)