Newbie to Kaggle and data science in general. While creating my first kernel for the Titanic survival prediction model (in Python) I wrote down everything that I thought was unclear at first. The goal I set for myself was to get everything working, create a model that predicts at least somewhat better than chance alone, and upload it to Kaggle.
I hope this is useful to newbie Kagglers like myself.


1 Decisions to make before starting your first data science project (Kaggle Titanic..)

-First: the decision to either work 'within' Kaggle (the Kaggle kernel) or to use your own downloadable platform. -Second: 'Python or R?'

First decision:

What is meant by 'own platform'? You can do all of this outside of Kaggle and then upload (or copy-paste the code) to Kaggle when done. Obviously this requires some setup, but the advantage is that you know where you stand when doing data science projects outside of Kaggle. The setup is made quite easy by an application called 'Anaconda Navigator', which also has other benefits, so I strongly suggest using it if you go for your own setup.

Second decision:

This discussion has been going on forever, rivalling the tabs vs. spaces debate :P #piedpiper. Being horrible at JavaScript-like syntax, I chose Python for its syntactic ease, but both languages have their advantages and disadvantages. Most important take-away: either will be fine; if you are just starting, choose the language your (desired) job/company uses.

2 Getting started:

  • 2A: Orienting Yourself
  • 2B: Having a looksy at the data (or Exploratory Data Analysis (EDA))

3 Data prepping & further visualization

4 Modelling

5 Submission



2A: Getting started: Orienting Yourself

First things first. Even before looking at your data you want to orient yourself (at least you want to do this if you are new to Jupyter notebooks and Kaggle kernels, like me). See where you are working from and, if necessary, change your working directory to where you have saved your data files (CSVs).
To do this you need:

  • import os  # to be able to use the os commands
  • os.getcwd()  # to find where you are working from now (like 'pwd' in Unix)
  • os.chdir('....path..')  # to change from your current directory to your desired directory (e.g. where your data CSVs are). Like 'cd' in Unix.
  • # If you are working fully within a Kaggle kernel you can skip this. But it might be good to do this anyway for potential troubleshooting purposes in the future.

    
    
    In [3]:
    import os
    os.getcwd()
    
    
    
    
    Out[3]:
    '/Users/steven'


    So we see we are currently in /Users/steven. This is not where I want to be, because I have not saved -and do not want to save- my data files (CSVs) here.

    So I look up where the folder is that contains the CSVs (train.csv & test.csv, downloaded from Kaggle) I intend to use.

    You do this outside Jupyter, by just browsing your computer and looking at the path. For me it is /Users/steven/Documents/Kaggle/Titanic, so this will be used in the following command.

    # This is case sensitive, so pay attention to whether your folder names start with an uppercase letter or not!

    
    
    In [4]:
    os.chdir('/Users/steven/Documents/Kaggle/Titanic')
    


    Now we check by using the same command as before (and we see it is correct because it prints out the directory we wanted):

    
    
    In [5]:
    os.getcwd()
    
    
    
    
    Out[5]:
    '/Users/steven/Documents/Kaggle/Titanic'


    2B: Getting started: Having a looksy at the data ("EDA")

    So now you are ready to start. You want to look a bit at the data. Two (there are a lot more) basic ways are:

    • 1: open your data files (CSVs) with Excel, save a copy as .xls, and in Excel go to the Data tab and use the text-to-columns wizard to split the fields on the commas.
      The advantage of this is that after having a quick look you can use the pivot table functionality in Excel to dig deeper.
    • 2: show the data in your Jupyter notebook

    Let's assume you have already done number 1 (opening the data in Excel and looking at it, ideally using a pivot table). If you are starting at Kaggle I am going to assume you are familiar with Excel basics. For number 2 (data in a Jupyter notebook / Kaggle kernel) the first step is to import a library that helps you work with CSV files, called 'pandas' (this is your 'CSV reader' and it allows you to create dataframes):

    
    
    In [6]:
    import pandas as pd
    
    
    
    In [7]:
    %pylab inline  
    # the %pylab statement here is just something to make sure the visualizations we will make later on
    # are shown within this notebook (it also pulls numpy and matplotlib into the namespace).
    
    
    
    
    Populating the interactive namespace from numpy and matplotlib
    
    
    
    In [8]:
    train_df = pd.read_csv('train.csv', header=0)
    # above is the basic command to 'import' your csv data. train_df is the name for your 'imported' data
    # (df is short for dataframe; you can name this anything you want, but including 'df' in the name is convention)
    
    test_df = pd.read_csv('test.csv', header=0)
    # you don't have to load the test set yet, but I am doing this to evaluate the model without uploading. You can skip this.
    
    train_df.head(2)
    # with df.head(2) we can 'test' by previewing the first (head) 2 rows of the dataframe (df)
    # You can see the final x rows by using 'df.tail(x)' (replace x with the number of rows)
    
    
    
    
    Out[8]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
    0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
    
    
    In [9]:
    train_df
    # show the full dataset (df). (if very large this can be very inconvenient, but with our training set it's ok)
    # notice it adds a row total and column total underneath (troubleshooting: if you do not see these totals
    # you can separately get them by using 'df.shape')
    
    
    
    
    Out[9]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
    0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
    2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
    3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
    4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
    5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
    6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
    7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
    8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
    9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
    10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
    11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
    12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
    13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
    14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
    15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
    16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
    17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
    18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
    19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
    20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
    21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
    22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
    23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
    24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
    25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
    26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
    27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
    28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
    29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
    862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
    863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
    864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
    865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
    866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
    867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
    868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
    869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
    870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
    871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
    872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
    873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
    874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
    875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
    876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
    877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
    878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
    879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
    880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
    881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
    882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
    883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
    884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
    885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
    886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
    887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
    888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
    889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
    890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

    891 rows × 12 columns

    
    
    In [10]:
    # let's get slightly more meta (data about the data, like what type is each variable ('column')?)
    train_df.info()
    # especially the information on the right is useful at this point (the columns with values 'int64', 'object' etc.)
    # These types should be identical to those of the test set (which, for the Titanic datasets
    # from Kaggle, they are). To check this you could repeat this command on the test set instead
    # of the train set.
    
    
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.6+ KB
    


    About these datatype names:

    • int64 => whole numbers (can still be categorical)
    • object => string (can be categorical)
    • float64 => numeric with decimals (continuous)

    There is a lot of ambiguity in how people name the type of a variable. In statistics there is a measurement-level hierarchy which I think is quite helpful. The confusion arises because in some fields 'categorical' is an umbrella term covering (among others) the nominal and ordinal levels, while in other fields 'categorical' refers to just one of these levels. The hierarchy (a small pandas sketch follows this list):

    • nominal (groups)
    • ordinal (groups with a hierarchy)
    • interval (numbers with equal differences between them)
    • ratio (numbers with equal differences and also an absolute zero point)
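
    As a small illustration (my own addition, not part of the original kernel): pandas can store an ordinal variable explicitly as an ordered categorical instead of leaving it as int64. This works on a copy, so the dataframe used in the rest of the notebook is untouched.

    tmp = train_df.copy()
    # Pclass is really ordinal: 3rd < 2nd < 1st class
    tmp['Pclass'] = pd.Categorical(tmp['Pclass'], categories=[3, 2, 1], ordered=True)
    tmp['Pclass'].dtype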
    
    
    In [11]:
    # More, more, more!
    # Let's fully dive into this meta description of the variables:
    train_df.describe()
    
    # Notice that the describe function only returns the non-'object' variables..
    
    
    
    
    Out[11]:
    PassengerId Survived Pclass Age SibSp Parch Fare
    count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
    mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
    std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
    min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
    25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
    50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
    75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
    max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200


    Looking at our 'dependent variable' (i.e. 'Survived') we see an average (mean) survival rate of .38. We know from the introduction to the problem (the Kaggle description), "killing 1502 out of 2224", so this seems about right because 1 - 1502/2224 is roughly one third (.32). Knowing that our .38 is based on the training part of the data while the given description is based on the total set, it is close enough to say this is an honest sample.

    Let's say you'd want the actual total of people who survived and the total of people who died (without calculating this back from the mean):

    
    
    In [12]:
    train_df.Survived.value_counts()
    # the variable name (Survived) is written with a capital letter because it is capitalized in the data set.
    # 'value_counts' is the 'smart' part, the function.
    
    
    
    
    Out[12]:
    0    549
    1    342
    Name: Survived, dtype: int64

    (So we know that in our training set the overall survival chance = 342/891 = .38)
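
    As a side note (my own addition, not in the original kernel): value_counts can also return proportions directly, which saves the manual division.

    train_df.Survived.value_counts(normalize=True)
    # gives 0: ~.62 and 1: ~.38, matching the mean we saw in describe()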

    
    
    In [13]:
    train_df.Sex.value_counts().plot(kind='bar')
    # you can replace the variable with any of the 12 (for some with more visual success than for others..)
    
    
    
    
    Out[13]:
    <matplotlib.axes._subplots.AxesSubplot at 0x10b7f3f98>


    So we have come quite far now. It is time to really start segmenting, and we have already made a start with gender (Sex). We chose this to start with because A) it's easy and B) by looking at the data in Excel (see the beginning) we should have some reasonable suspicion that the survival rate is not equal for men and women. So this makes a sensible start for segmenting.

    Let's show the data for women only:

    
    
    In [14]:
    train_df[train_df.Sex=='female']
    # double == because we are making a comparison, not an assignment
    
    
    
    
    Out[14]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
    2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
    3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
    8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
    9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
    10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
    11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
    14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
    15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
    18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
    19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
    22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
    24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
    25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
    28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
    31 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
    32 33 1 3 Glynn, Miss. Mary Agatha female NaN 0 0 335677 7.7500 NaN Q
    38 39 0 3 Vander Planke, Miss. Augusta Maria female 18.0 2 0 345764 18.0000 NaN S
    39 40 1 3 Nicola-Yarred, Miss. Jamila female 14.0 1 0 2651 11.2417 NaN C
    40 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.4750 NaN S
    41 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann ... female 27.0 1 0 11668 21.0000 NaN S
    43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 NaN C
    44 45 1 3 Devaney, Miss. Margaret Delia female 19.0 0 0 330958 7.8792 NaN Q
    47 48 1 3 O'Driscoll, Miss. Bridget female NaN 0 0 14311 7.7500 NaN Q
    49 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 NaN S
    52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
    53 54 1 2 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin... female 29.0 1 0 2926 26.0000 NaN S
    56 57 1 2 Rugg, Miss. Emily female 21.0 0 0 C.A. 31026 10.5000 NaN S
    58 59 1 2 West, Miss. Constance Mirium female 5.0 1 2 C.A. 34651 27.7500 NaN S
    61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 B28 NaN
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    807 808 0 3 Pettersson, Miss. Ellen Natalia female 18.0 0 0 347087 7.7750 NaN S
    809 810 1 1 Chambers, Mrs. Norman Campbell (Bertha Griggs) female 33.0 1 0 113806 53.1000 E8 S
    813 814 0 3 Andersson, Miss. Ebba Iris Alfrida female 6.0 4 2 347082 31.2750 NaN S
    816 817 0 3 Heininen, Miss. Wendla Maria female 23.0 0 0 STON/O2. 3101290 7.9250 NaN S
    820 821 1 1 Hays, Mrs. Charles Melville (Clara Jennings Gr... female 52.0 1 1 12749 93.5000 B69 S
    823 824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 392096 12.4750 E121 S
    829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
    830 831 1 3 Yasbeck, Mrs. Antoni (Selini Alexander) female 15.0 1 0 2659 14.4542 NaN C
    835 836 1 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C
    842 843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C
    849 850 1 1 Goldenberg, Mrs. Samuel L (Edwiga Grabowska) female NaN 1 0 17453 89.1042 C92 C
    852 853 0 3 Boulos, Miss. Nourelain female 9.0 1 1 2678 15.2458 NaN C
    853 854 1 1 Lines, Miss. Mary Conover female 16.0 0 1 PC 17592 39.4000 D28 S
    854 855 0 2 Carter, Mrs. Ernest Courtenay (Lilian Hughes) female 44.0 1 0 244252 26.0000 NaN S
    855 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 392091 9.3500 NaN S
    856 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 36928 164.8667 NaN S
    858 859 1 3 Baclini, Mrs. Solomon (Latifa Qurban) female 24.0 0 3 2666 19.2583 NaN C
    862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
    863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
    865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
    866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
    871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
    874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
    875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
    879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
    880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
    882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
    885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
    887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
    888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

    314 rows × 12 columns

    
    
    In [15]:
    #before continuing let's do a quick check for 'missing values' (rows where gender is unknown)
    # by using the 'isnull' built-in function:
    train_df[train_df.Sex.isnull()]
    
    
    
    
    Out[15]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked


    The result is empty, which is good news: there are no missing values for Sex, so that saves us time.
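
    To check all columns at once (my own addition), you can sum the missing values per column; this matches what info() already hinted at: only Age, Cabin and Embarked have missing values.

    train_df.isnull().sum()
    # number of missing values per column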

    
    
    In [16]:
    # Let's visualize the number of survivors among women, and later the number among men, to compare.
    train_df[train_df.Sex=='female'].Survived.value_counts().plot(kind='bar', title='Survival among female')
    
    
    
    
    Out[16]:
    <matplotlib.axes._subplots.AxesSubplot at 0x10ed76710>


    Now we copy and paste the command from above and delete the 'fe' from 'female' (don't forget the title):

    
    
    In [17]:
    train_df[train_df.Sex=='male'].Survived.value_counts().plot(kind='bar', title='Survival among male')
    
    
    
    
    Out[17]:
    <matplotlib.axes._subplots.AxesSubplot at 0x10ed65a90>

    Note that the bars are swapped. Our suspicion is confirmed (women are more likely to have survived).

    (Meaning that in the female segment survival occurred far more often than death; in the male segment it is the other way around.)
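
    To put numbers on these two bar charts (my own addition), a normalized crosstab shows the survival rate per sex directly.

    pd.crosstab(train_df.Sex, train_df.Survived, normalize='index')
    # each row sums to 1, so the column labelled 1 is the survival rate per sex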

    
    
    In [18]:
    # The same can be done for age. Here it can also be interesting to combine age with sex:
    train_df[(train_df.Age<11) & (train_df.Sex=='female')].Survived.value_counts().plot(kind='bar')
    # '11' is just an arbitrarily chosen number for age.
    
    
    
    
    Out[18]:
    <matplotlib.axes._subplots.AxesSubplot at 0x10eec4550>


    As you can see, the combination of a low age and female (i.e. little girls) gives quite a different (higher) survival rate compared to the total training set average survival rate (.38).

    Let's see if children regardless of sex also have better chances than .38:

    
    
    In [19]:
    train_df[(train_df.Age<11)].Survived.value_counts().plot(kind='bar')
    
    
    
    
    Out[19]:
    <matplotlib.axes._subplots.AxesSubplot at 0x10ef286a0>

    We can clearly see that children (< 11 years) in general have better chances than the overall population (training set), but not as good as children (< 11 years) who are also girls.
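
    Reading rates from bar heights gets tedious. Since Survived is coded as 0/1, the mean of any selection is its survival rate; here is a small check of the claims above (my own addition):

    print(train_df.Survived.mean())                                      # everyone (~.38)
    print(train_df[train_df.Age < 11].Survived.mean())                   # children under 11
    print(train_df[(train_df.Age < 11) & (train_df.Sex == 'female')].Survived.mean())  # girls under 11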


    Visualize further using Seaborn

    Import the Python library seaborn (as sns).
    You can also choose to do this at the beginning, when importing pandas, so you get one list at the top with every library to be imported. Doing it at the beginning is the convention (and makes it easy for outsiders to see all used libraries at once), but for the purpose of this getting-started tutorial I think it is better like this.

    
    
    In [20]:
    import seaborn as sns
    # I don't know why seaborn is abbreviated as sns, but you can choose any name you like as long as it is not used
    # by anything else. 'sns' seems to be the convention.
    


    With seaborn we can more easily make barplots that show us more at once. For example, we can take the variable Pclass and define it as the x-axis, make our dependent variable 'Survived' the y-axis, and differentiate between men and women (defining Sex as the hue):

    
    
    In [21]:
    sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df);
    
    
    
    
    
    
    In [22]:
    # If we don't mind stereotypes ;p we could change the colors so that we don't have to look at the legend
    # to remind us of the color coding for Sex:
    # Just use the same command but add ' palette={"male": "blue", "female": "pink"} '
    
    
    
    In [23]:
    sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df, palette={"male": "blue", "female": "pink"});
    
    
    
    

    In reality we would have to repeat these visualisation commands for all variables to see which ones might be interesting for our model (a small loop sketch is given below). For now, let's say we have repeated this for all variables. This will result in you wanting to keep: Sex, Age, Pclass, Cabin & Fare (and PassengerId).
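
    A minimal sketch (my own addition) of how you could loop over a few candidate columns instead of typing one barplot command per variable; the column list is just an illustrative choice.

    import matplotlib.pyplot as plt

    for col in ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']:
        sns.barplot(x=col, y='Survived', data=train_df)
        plt.show()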

    3: Data prepping & further visualization

    Now that we have gotten familiarized with our data, we have to get the data in such a shape that we can use it.

    This means dropping some features (variables) we don't want to use (Ticket, Name and Embarked), creating bins for other variables (Age and Fare), and reducing the values of one variable to only their first letter (Cabin).

    
    
    In [24]:
    # Let's first remove the variables we don't want:
    def drop_features(df):
        return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)
    
    
    
    In [25]:
    # make bins for ages and name them for ease:
    def simplify_ages(df):
        df.Age = df.Age.fillna(-0.5)
        bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
        group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
        categories = pd.cut(df.Age, bins, labels=group_names)
        df.Age = categories
        return df
    
    #keep only the first letter (similar effect as making bins/clusters):
    def simplify_cabins(df):
        df.Cabin = df.Cabin.fillna('N')
        df.Cabin = df.Cabin.apply(lambda x: x[0])
        return df
    
    # make bins for fare prices and name them:
    def simplify_fares(df):
        df.Fare = df.Fare.fillna(-0.5)
        bins = (-1, 0, 8, 15, 31, 1000)
        group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
        categories = pd.cut(df.Fare, bins, labels=group_names)
        df.Fare = categories
        return df
    
    # combine all of the above into a transform_features function to be called later:
    def transform_features(df):
        df = simplify_ages(df)
        df = simplify_cabins(df)
        df = simplify_fares(df)
        df = drop_features(df)
        return df
    
    # create new dataframes with different names:
    # (note: the simplify_* functions modify the Age/Cabin/Fare columns of the dataframe you pass in,
    # so train_df and test_df themselves are changed as well)
    train_df2 = transform_features(train_df)
    
    
    test_df2 = transform_features(test_df)
    


    Let's see what it looks like, see if everything has gone as planned:

    
    
    In [26]:
    train_df2
    
    
    
    
    Out[26]:
    PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin
    0 1 0 3 male Student 1 0 1_quartile N
    1 2 1 1 female Adult 1 0 4_quartile C
    2 3 1 3 female Young Adult 0 0 1_quartile N
    3 4 1 1 female Young Adult 1 0 4_quartile C
    4 5 0 3 male Young Adult 0 0 2_quartile N
    5 6 0 3 male Unknown 0 0 2_quartile N
    6 7 0 1 male Adult 0 0 4_quartile E
    7 8 0 3 male Baby 3 1 3_quartile N
    8 9 1 3 female Young Adult 0 2 2_quartile N
    9 10 1 2 female Teenager 1 0 3_quartile N
    10 11 1 3 female Baby 1 1 3_quartile G
    11 12 1 1 female Adult 0 0 3_quartile C
    12 13 0 3 male Student 0 0 2_quartile N
    13 14 0 3 male Adult 1 5 4_quartile N
    14 15 0 3 female Teenager 0 0 1_quartile N
    15 16 1 2 female Adult 0 0 3_quartile N
    16 17 0 3 male Baby 4 1 3_quartile N
    17 18 1 2 male Unknown 0 0 2_quartile N
    18 19 0 3 female Young Adult 1 0 3_quartile N
    19 20 1 3 female Unknown 0 0 1_quartile N
    20 21 0 2 male Young Adult 0 0 3_quartile N
    21 22 1 2 male Young Adult 0 0 2_quartile D
    22 23 1 3 female Teenager 0 0 2_quartile N
    23 24 1 1 male Young Adult 0 0 4_quartile A
    24 25 0 3 female Child 3 1 3_quartile N
    25 26 1 3 female Adult 1 5 4_quartile N
    26 27 0 3 male Unknown 0 0 1_quartile N
    27 28 0 1 male Student 3 2 4_quartile C
    28 29 1 3 female Unknown 0 0 1_quartile N
    29 30 0 3 male Unknown 0 0 1_quartile N
    ... ... ... ... ... ... ... ... ... ...
    861 862 0 2 male Student 1 0 2_quartile N
    862 863 1 1 female Adult 0 0 3_quartile D
    863 864 0 3 female Unknown 8 2 4_quartile N
    864 865 0 2 male Student 0 0 2_quartile N
    865 866 1 2 female Adult 0 0 2_quartile N
    866 867 1 2 female Young Adult 1 0 2_quartile N
    867 868 0 1 male Young Adult 0 0 4_quartile A
    868 869 0 3 male Unknown 0 0 2_quartile N
    869 870 1 3 male Baby 1 1 2_quartile N
    870 871 0 3 male Young Adult 0 0 1_quartile N
    871 872 1 1 female Adult 1 1 4_quartile D
    872 873 0 1 male Young Adult 0 0 1_quartile B
    873 874 0 3 male Adult 0 0 2_quartile N
    874 875 1 2 female Young Adult 1 0 3_quartile N
    875 876 1 3 female Teenager 0 0 1_quartile N
    876 877 0 3 male Student 0 0 2_quartile N
    877 878 0 3 male Student 0 0 1_quartile N
    878 879 0 3 male Unknown 0 0 1_quartile N
    879 880 1 1 female Adult 0 1 4_quartile C
    880 881 1 2 female Student 0 1 3_quartile N
    881 882 0 3 male Young Adult 0 0 1_quartile N
    882 883 0 3 female Student 0 0 2_quartile N
    883 884 0 2 male Young Adult 0 0 2_quartile N
    884 885 0 3 male Student 0 0 1_quartile N
    885 886 0 3 female Adult 0 5 3_quartile N
    886 887 0 2 male Young Adult 0 0 2_quartile N
    887 888 1 1 female Student 0 0 3_quartile B
    888 889 0 3 female Unknown 1 2 3_quartile N
    889 890 1 1 male Young Adult 0 0 3_quartile C
    890 891 0 3 male Young Adult 0 0 1_quartile N

    891 rows × 9 columns

    Now let's do some seaborn visualizations with our new dataset:

    (three variables per plot)

    
    
    In [27]:
    sns.barplot(x="Age", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
    
    
    
    
    
    
    In [28]:
    sns.barplot(x="Cabin", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
    
    
    
    
    
    
    In [29]:
    sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
    
    
    
    
    
    
    In [30]:
    sns.barplot(x="Fare", y="Survived", hue="Sex", data=train_df2, palette={"male": "blue", "female": "pink"});
    
    
    
    
    
    
    In [31]:
    from sklearn import preprocessing
    
    def encode_features(df_train, df_test):
        # LabelEncoder turns each category (e.g. 'male'/'female', the age bins, the cabin letters)
        # into an integer code, because scikit-learn models need numeric input.
        # It is fitted on train + test combined so both sets get the same code for the same category.
        features = ['Fare', 'Cabin', 'Age', 'Sex']
        df_combined = pd.concat([df_train[features], df_test[features]])
        
        for feature in features:
            le = preprocessing.LabelEncoder()
            le = le.fit(df_combined[feature])
            df_train[feature] = le.transform(df_train[feature])
            df_test[feature] = le.transform(df_test[feature])
        return df_train, df_test
        
    train_df2, test_df2 = encode_features(train_df2, test_df2)
    train_df2.head()
    
    
    
    
    Out[31]:
    PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin
    0 1 0 3 1 4 1 0 0 7
    1 2 1 1 0 0 1 0 3 2
    2 3 1 3 0 7 0 0 0 7
    3 4 1 1 0 7 1 0 3 2
    4 5 0 3 1 7 0 0 1 7
    
    
    In [32]:
    train_df2.info()
    
    
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null int64
    Age            891 non-null int64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null int64
    Cabin          891 non-null int64
    dtypes: int64(9)
    memory usage: 62.7 KB
    
    
    
    In [33]:
    X_train = train_df2.drop(["Survived", "PassengerId"], axis=1)
    Y_train = train_df2["Survived"]
    X_test  = test_df2.drop("PassengerId", axis=1).copy()
    
    
    
    
    # I initially did not drop PassengerId, keeping 8 variables ('features') in X_train and X_test. However, later on
    # (during the modelling part) this resulted in a training accuracy of 1.00 for the random forest and decision tree.
    # Most likely keeping PassengerId (a unique id per row) let the tree-based models memorize the training data,
    # a form of leakage/overfitting. After dropping it in both sets the accuracy results were more realistic..
    
    X_train.shape, Y_train.shape , X_test.shape
    
    
    
    
    Out[33]:
    ((891, 7), (891,), (418, 7))


    Make sure that X_train and X_test have the same number of variables ('features'; in this example 7)..
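
    A quick programmatic check (my own addition): the model can only predict on X_test if it has exactly the same columns, in the same order, as X_train.

    assert list(X_train.columns) == list(X_test.columns), "X_train and X_test features differ"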

    Almost ready to try some models (scikit-learn).

    Namely: 1) logistic regression, 2) decision tree and 3) random forest.

    First a quick looksy at X_train and Y_train:

    
    
    In [34]:
    X_train.head()
    
    
    
    
    Out[34]:
    Pclass Sex Age SibSp Parch Fare Cabin
    0 3 1 4 1 0 0 7
    1 1 0 0 1 0 3 2
    2 3 0 7 0 0 0 7
    3 1 0 7 1 0 3 2
    4 3 1 7 0 0 1 7
    
    
    In [35]:
    Y_train.head()
    
    
    
    
    Out[35]:
    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64


    4 Modelling

    Ready to try some models:

    • 1) logistic regression
    • 2) decision tree
    • 3) random forests

    
    
    In [36]:
    # Logistic Regression
    
    # Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
    from sklearn.linear_model import LogisticRegression
    
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)
    Y_pred = logreg.predict(X_test)
    acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
    acc_log
    
    
    
    
    Out[36]:
    79.120000000000005
    
    
    In [37]:
    # Decision Tree
    
    # Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
    from sklearn.tree import DecisionTreeClassifier
    
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(X_train, Y_train)
    Y_pred = decision_tree.predict(X_test)
    acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
    acc_decision_tree
    
    
    
    
    Out[37]:
    91.019999999999996
    
    
    In [38]:
    # Random Forest
    
    # Import from the scikit-learn library (sklearn is the abbreviation for scikit-learn)
    from sklearn.ensemble import RandomForestClassifier
    
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)
    random_forest.score(X_train, Y_train)
    acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
    
    acc_random_forest
    
    
    
    
    Out[38]:
    91.019999999999996
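
    Note that these accuracies are computed on the training data itself, so especially the tree-based models look better than they really are. A minimal sketch (my own addition, not part of the original kernel) of how cross-validation gives a more honest estimate without touching Kaggle's test set:

    from sklearn.model_selection import cross_val_score

    cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
    print(cv_scores.mean())  # typically noticeably lower than the 91% training accuracy above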
    
    
    In [39]:
    #Creating a csv with the predicted scores (Y as 0 and 1's for survival)
    submission = pd.DataFrame({
            "PassengerId": test_df["PassengerId"],
            "Survived": Y_pred
        })
    
    # But let's print it first to check that we don't see anything weird:
    
    
    
    In [40]:
    submission.describe()
    
    
    
    
    Out[40]:
    PassengerId Survived
    count 418.000000 418.000000
    mean 1100.500000 0.330144
    std 120.810458 0.470828
    min 892.000000 0.000000
    25% 996.250000 0.000000
    50% 1100.500000 0.000000
    75% 1204.750000 1.000000
    max 1309.000000 1.000000


    5 Submission

    All looks fine. Let's turn it into a CSV, save it in a logical local place, and upload it to Kaggle to find out our real score (i.e. the score on their test set).



    
    
    In [57]:
    submission.to_csv('submission.csv', index=False)
    # note: my first attempt wrote to '../output/submission.csv' and failed with
    # "FileNotFoundError: [Errno 2] No such file or directory" because that folder
    # did not exist; writing to an existing directory (here the current working
    # directory) avoids this.

    Aaaaaaand we have a 0.7512 score. Not too bad for a first upload.
    Not too good either, because the baseline to beat is not .5 but .62: always predicting the majority class ('did not survive', i.e. 'mean as model') already scores about .62.

    But we haven't done any further optimization like feature engineering yet. Also keep in mind that the >.95 scores are most likely overfitted, either by uploading on a trial-and-error basis or by using the test set data to train the model on (it is publicly available data after all..).

    Now it is time to start improving this 'base' model. (The titles in the passenger Name variable would be a good start; a small sketch follows below.)
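
    A minimal sketch (my own addition) of that first improvement: pulling the title out of the Name column with a regular expression. It re-reads train.csv because Name was dropped from train_df2 above.

    raw = pd.read_csv('train.csv')
    titles = raw['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    titles.value_counts().head()   # Mr, Miss, Mrs, Master, ...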

    Please feel free to comment below; I will read the comments and, where possible, incorporate them. Please give tips where you think they are needed, I will try to learn from users' suggestions and tips :)

    
    