First Data Check



In [1]:
# create entry points to spark
try:
    sc.stop()
except:
    pass
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc=SparkContext()
spark = SparkSession(sparkContext=sc)

Import Data

Data source: https://www.kaggle.com/c/titanic/data


In [5]:
titanic = spark.read.csv('../../data/kaggle-titanic-train.csv', header=True, inferSchema=True)
titanic.show(5)


+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 5 rows

Data type

First, we want to check if string and numeric variables are imported as we expect.


In [6]:
titanic.printSchema()


root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

Data summary

Number of variables


In [7]:
len(titanic.columns)


Out[7]:
12

Number of observations


In [8]:
titanic.count()


Out[8]:
891

Sumarise columns


In [9]:
def describe_columns(df):
    for i in df.columns:
        print('Column: ' + i)
        titanic.select(i).describe().show()

In [10]:
describe_columns(titanic)


Column: PassengerId
+-------+-----------------+
|summary|      PassengerId|
+-------+-----------------+
|  count|              891|
|   mean|            446.0|
| stddev|257.3538420152301|
|    min|                1|
|    max|              891|
+-------+-----------------+

Column: Survived
+-------+-------------------+
|summary|           Survived|
+-------+-------------------+
|  count|                891|
|   mean| 0.3838383838383838|
| stddev|0.48659245426485753|
|    min|                  0|
|    max|                  1|
+-------+-------------------+

Column: Pclass
+-------+------------------+
|summary|            Pclass|
+-------+------------------+
|  count|               891|
|   mean| 2.308641975308642|
| stddev|0.8360712409770491|
|    min|                 1|
|    max|                 3|
+-------+------------------+

Column: Name
+-------+--------------------+
|summary|                Name|
+-------+--------------------+
|  count|                 891|
|   mean|                null|
| stddev|                null|
|    min|"Andersson, Mr. A...|
|    max|van Melkebeke, Mr...|
+-------+--------------------+

Column: Sex
+-------+------+
|summary|   Sex|
+-------+------+
|  count|   891|
|   mean|  null|
| stddev|  null|
|    min|female|
|    max|  male|
+-------+------+

Column: Age
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|               714|
|   mean| 29.69911764705882|
| stddev|14.526497332334035|
|    min|              0.42|
|    max|              80.0|
+-------+------------------+

Column: SibSp
+-------+------------------+
|summary|             SibSp|
+-------+------------------+
|  count|               891|
|   mean|0.5230078563411896|
| stddev|1.1027434322934315|
|    min|                 0|
|    max|                 8|
+-------+------------------+

Column: Parch
+-------+-------------------+
|summary|              Parch|
+-------+-------------------+
|  count|                891|
|   mean|0.38159371492704824|
| stddev| 0.8060572211299488|
|    min|                  0|
|    max|                  6|
+-------+-------------------+

Column: Ticket
+-------+------------------+
|summary|            Ticket|
+-------+------------------+
|  count|               891|
|   mean|260318.54916792738|
| stddev|471609.26868834975|
|    min|            110152|
|    max|         WE/P 5735|
+-------+------------------+

Column: Fare
+-------+-----------------+
|summary|             Fare|
+-------+-----------------+
|  count|              891|
|   mean| 32.2042079685746|
| stddev|49.69342859718089|
|    min|              0.0|
|    max|         512.3292|
+-------+-----------------+

Column: Cabin
+-------+-----+
|summary|Cabin|
+-------+-----+
|  count|  204|
|   mean| null|
| stddev| null|
|    min|  A10|
|    max|    T|
+-------+-----+

Column: Embarked
+-------+--------+
|summary|Embarked|
+-------+--------+
|  count|     889|
|   mean|    null|
| stddev|    null|
|    min|       C|
|    max|       S|
+-------+--------+

Find columns with missing values


In [11]:
def find_missing_values_columns(df):
    nrow = df.count()
    for v in df.columns:
        summary_df = df.select(v).describe()
        v_count = int(summary_df.collect()[0][v])
        if v_count < nrow:
            missing_percentage = (1 - v_count/nrow) * 100
            print("Total observations: " + str(nrow) + "\n"
                 "Total observations of " + v + ": " + str(v_count) + "\n"
                 "Percentage of missing values: " + str(missing_percentage) + "%" + "\n"
                 "----------------------------")

In [12]:
find_missing_values_columns(titanic)


Total observations: 891
Total observations of Age: 714
Percentage of missing values: 19.865319865319865%
----------------------------
Total observations: 891
Total observations of Cabin: 204
Percentage of missing values: 77.1043771043771%
----------------------------
Total observations: 891
Total observations of Embarked: 889
Percentage of missing values: 0.22446689113355678%
----------------------------

In [ ]: