Exploration

Exploration of preprocessed DF


In [ ]:
import numpy as np
import pandas as pd
import math

import matplotlib.pyplot as plt
%matplotlib inline

Input

Privacy restriction:

The original (personal) cleaned DF is not in the repo. To reproduce the pickled DF of attended events ("events_df.pkl"), run nb "0_Cleaning" on your own data.

For further steps: the repo contains a pickled DF for modeling (nb "3_Modeling") from which private information has been eliminated.


In [ ]:
file_path = "../data/events_df.pkl"
df = pd.read_pickle(file_path)

print(df.shape)
print(df.dtypes)
df.head()
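
Since "events_df.pkl" is not shipped with the repo, the load can fail on a fresh checkout. A minimal sketch of a guarded load, assuming the relative path used above; `load_events` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path

import pandas as pd


def load_events(path):
    """Load the pickled events DF, or fail with a hint on how to create it.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.exists():
        # The pickle is private; it has to be rebuilt locally first.
        raise FileNotFoundError(
            f"{path} not found; run nb '0_Cleaning' on your own data first."
        )
    return pd.read_pickle(path)
```

Usage would be `df = load_events("../data/events_df.pkl")` in place of the direct `pd.read_pickle` call.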

Exploration


In [ ]:
print("Stats (continuous Vars):")
print(df.describe())
print("")
print("NaN values count:")
print(df.isnull().sum())
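
The raw NaN counts are easier to judge next to the share of rows they represent. A small sketch on invented toy data (the column names are taken from the DF above, the values are made up; `toy` and `nan_summary` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the events DF; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", None, "Hamburg", "Berlin"],
    "distance": [1.5, 300.0, np.nan, 2.0],
    "rating": [4, 3, 5, 4],
})

# isnull() gives a boolean mask; sum() counts NaNs, mean() gives their share.
nan_summary = pd.DataFrame({
    "nan_count": toy.isnull().sum(),
    "nan_share": toy.isnull().mean(),
})
print(nan_summary)
```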

In [ ]:
for col in df:
    print(df[col].value_counts())
    print("")

In [ ]:
df.groupby("main_topic")[["distance", "rating"]].mean()

In [ ]:
df.groupby("city")[["distance", "rating"]].mean()
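
Group means can be misleading when a group holds only one or two events. A sketch that reports the group size next to the mean, on invented toy data (`toy` and `by_city` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the events DF; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg"],
    "distance": [2.0, 4.0, 290.0],
    "rating": [4, 5, 3],
})

# Select the numeric columns first, then aggregate mean and group size.
by_city = toy.groupby("city")[["distance", "rating"]].agg(["mean", "count"])
print(by_city)
```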

Preparation for Modeling

Missing Values


In [ ]:
df_cleaned = df.fillna("missing")  # fill NaNs (they occur in string-valued columns)

print(df_cleaned.isnull().sum())
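
A blanket `fillna("missing")` would also write the string into numeric columns if those contained NaNs. A minimal sketch of restricting the fill to non-numeric columns, on invented toy data (`toy`, `str_cols` and `filled` are illustrative names; whether the real DF needs this is an assumption):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the events DF; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", None],
    "distance": [2.0, np.nan],
})

# Pick every non-numeric column and fill NaNs only there.
str_cols = toy.select_dtypes(exclude="number").columns
filled = toy.copy()
filled[str_cols] = filled[str_cols].fillna("missing")

# Numeric NaNs are left untouched and still show up here.
print(filled.isnull().sum())
```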

DFs for Modeling


In [ ]:
# Minimal Features Model
model01_cols = ["main_topic", "buzzwordy_title", "buzzwordy_organizer", "days", "weekday", "city",
                "country", "distance", "ticket_prize", "rating"]
df_model01 = df_cleaned[model01_cols]

df_model01.head()

Dummy Encoding


In [ ]:
cat_cols = ["main_topic", "weekday", "city", "country"]  # categorical columns to encode
df_model01 = pd.get_dummies(df_model01, columns=cat_cols, prefix=cat_cols)
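
To illustrate what the encoding does: each categorical column is replaced by one indicator column per distinct value, named `<prefix>_<value>`. A small sketch on invented toy data (`toy` and `encoded` are illustrative names):

```python
import pandas as pd

# Toy stand-in for the model DF; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", "Hamburg", "Berlin"],
    "distance": [2.0, 290.0, 4.0],
})

# "city" is replaced by one indicator column per distinct city.
encoded = pd.get_dummies(toy, columns=["city"], prefix=["city"])
print(list(encoded.columns))  # ['distance', 'city_Berlin', 'city_Hamburg']
```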

Output for Modeling


In [ ]:
def pickle_model(df_model, file_path):
    """Pickle the provided model DF for the modeling step."""
    df_model.to_pickle(file_path)

pickle_model(df_model01, "../data/df_model01.pkl") # Model01
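
Before moving on to the modeling nb, it can be worth checking that the pickle reads back unchanged. A minimal round-trip sketch on invented toy data, written to a temporary directory so nothing in "../data" is touched (`toy`, `restored` and `round_trip_ok` are illustrative names):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the model DF; values are invented for illustration.
toy = pd.DataFrame({"distance": [2.0, 290.0], "rating": [4, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares shape, dtypes and values.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```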
