In this toy example,
we want to establish baselines for prediction accuracy under extreme scenarios of input signal.
We study two distinct cases:
a) The input is pure noise
b) The input hides a clear signal (smoking affects men, most of whom die, while it does not affect women)
We want to see how the model behaves under such conditions.
In [1]:
import pandas as pd
import numpy as np
In [157]:
def get_features_values_combinations(list_a, list_b):
    """ Return a list of all combinations of the values of two lists,
    e.g. from the lists
        list_a = ['L', 'M', 'W']
        list_b = ['F', 'I', 'S']
    we get the combinations:
        [('L', 'F'), ('L', 'I'), ('L', 'S'), ('M', 'F'), ('M', 'I'), ...]
    """
    import itertools
    return list(itertools.product(list_a, list_b))
def generate_dataset_with_probabilities(n_datapoints,
                                        x_value,
                                        y_values,
                                        y_probabilities=None):
    """ Return a dataset of (x, y) tuples where y is drawn with the given probabilities

    :n_datapoints: The number of datapoints we want to generate
    :x_value: a single value of the independent variable, e.g. 'Man'
    :y_values: 1-D array e.g. ['meat', 'fish', 'vegetables']
    :y_probabilities: 1-D array-like e.g. [0.5, 0.25, 0.25]
        The probabilities associated with each entry in y_values.
        If not given, the sampling assumes a uniform distribution over all y_values
    """
    import numpy as np
    datapoints = []
    for i in range(n_datapoints):
        # Pair the fixed x value with a y value sampled from the given distribution
        datapoint = (x_value,) + (np.random.choice(y_values, p=y_probabilities),)
        datapoints.append(datapoint)
    return datapoints
def generate_datapoints(n_datapoints, var_x, var_y, cases_probabilities):
    """ Generate ~n_datapoints (x, y) tuples where, for each value of var_x,
    y is drawn from var_y with the probabilities given in cases_probabilities.

    :cases_probabilities: dict mapping each x value to a list of probabilities
        over var_y, e.g. {'Man': [0.9, 0.1], 'Woman': [0.1, 0.9]}
    """
    # Generate the datapoints according to the desired distribution
    datapoints = []
    all_cases = get_features_values_combinations(var_x, var_y)
    n_cases = len(all_cases)
    n_datapoints_per_case = int(1. * n_datapoints / n_cases)
    for case in all_cases:
        sex = case[0]
        case_probabilities = cases_probabilities[sex]
        datapoints.extend(
            generate_dataset_with_probabilities(n_datapoints_per_case,
                                                sex,
                                                var_y,
                                                y_probabilities=case_probabilities))
    n_total_datapoints = len(datapoints)
    print("All the combinations we can have are: ", all_cases)
    print("N total datapoints: ", n_total_datapoints)
    return np.array(datapoints)
def get_transformed_dataframe(df):
    """ Transform the DataFrame accordingly (dummy variables for X, label encoding for y)

    Uses the global column names var_x_name / var_y_name defined in the next cell.
    """
    from sklearn import preprocessing
    # One-hot encode the categorical independent variable
    df_transformed_X = pd.get_dummies(df[var_x_name])
    # Label-encode the dependent variable
    encoder = preprocessing.LabelEncoder()
    encoder.fit(df[var_y_name].values)
    df_transformed_y = pd.DataFrame(data=encoder.transform(df[var_y_name].values),
                                    columns=[var_y_name])
    df_transformed = pd.concat([df_transformed_X, df_transformed_y], axis=1)
    return df_transformed
In [273]:
# Independent Variable
var_x_name = 'sex'
var_x = ['Man', 'Woman']
# Dependent Variable
var_y_name = 'status'
var_y = ['Died', 'Alive']
In [250]:
# We assume that we do not have any strong signal in our input,
# e.g. whether men smoke or not, 50% die,
# and the same holds for women
cases_probabilities = {
    'Man': [0.5, 0.5],
    'Woman': [0.5, 0.5]}
datapoints = generate_datapoints(1000, var_x, var_y, cases_probabilities)
# Cross check that the datapoints we generated follow the frequencies we intended
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    frequency = 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    print("Frequency of ", case, ": ", frequency)
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])
df_transformed = get_transformed_dataframe(df)
# Split our sample into train / test sets
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Apply model
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
% logistic.fit(X_train, y_train).score(X_test, y_test))
In [258]:
# Here we assume that men are strongly affected by smoking (90% die),
# while women are not (only 10% die)
cases_probabilities = {
    'Man': [0.9, 0.1],
    'Woman': [0.1, 0.9]}
datapoints = generate_datapoints(1000, var_x, var_y, cases_probabilities)
# Cross check that the datapoints we generated follow the frequencies we intended
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    frequency = 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    print("Frequency of ", case, ": ", frequency)
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])
df_transformed = get_transformed_dataframe(df)
# Split our sample into train / test sets
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Apply model
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
% logistic.fit(X_train, y_train).score(X_test, y_test))
In [269]:
# Same noise-only setup, but now with three equally likely output classes
var_y = ['Died', 'Alive', 'Almost_dead']
cases_probabilities = {
    'Man': [0.333, 0.333, 1 - 0.333 - 0.333],
    'Woman': [0.333, 0.333, 1 - 0.333 - 0.333]}
datapoints = generate_datapoints(10000, var_x, var_y, cases_probabilities)
# Cross check that the datapoints we generated follow the frequencies we intended
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    frequency = 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    print("Frequency of ", case, ": ", frequency)
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])
df_transformed = get_transformed_dataframe(df)
# Split our sample into train / test sets
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Apply model
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
% logistic.fit(X_train, y_train).score(X_test, y_test))
In [271]:
# Same noise-only setup, but now with four equally likely output classes
var_y = ['Died', 'Alive', 'Almost_dead', 'Almost_alive']
cases_probabilities = {
    'Man': [0.25, 0.25, 0.25, 0.25],
    'Woman': [0.25, 0.25, 0.25, 0.25]
}
datapoints = generate_datapoints(10000, var_x, var_y, cases_probabilities)
# Cross check that the datapoints we generated follow the frequencies we intended
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    frequency = 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    print("Frequency of ", case, ": ", frequency)
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])
df_transformed = get_transformed_dataframe(df)
# Split our sample into train / test sets
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Apply model
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
% logistic.fit(X_train, y_train).score(X_test, y_test))
If the input data is pure noise, the best accuracy the model can reach is just 1/(N classes), where N classes is the number of values of the dependent variable.
So if we have 2 output classes (Died, Alive), the prediction accuracy can be at best 1/2 = 0.5.
So if we have 3 output classes (Died, Alive, Almost_dead), the prediction accuracy can be at best 1/3 ≈ 0.33.
So if we have 4 output classes (Died, Alive, Almost_dead, Almost_alive), the prediction accuracy can be at best 1/4 = 0.25.
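As a quick sanity check of this 1/(N classes) baseline, the cell below (an addition, not part of the original runs) compares the LogisticRegression score with sklearn's DummyClassifier using the 'uniform' strategy; it reuses the X_train / X_test / y_train / y_test arrays from the last noise-only cell above, so both scores should hover around 1/(number of classes).
In [ ]:
# Sanity-check sketch: compare the fitted model against a dumb baseline
# that predicts classes uniformly at random. On pure-noise data both
# should score around 1 / (number of classes).
from sklearn.dummy import DummyClassifier
from sklearn import linear_model

logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

dummy = DummyClassifier(strategy='uniform', random_state=42)
print('DummyClassifier (uniform) score: %f'
      % dummy.fit(X_train, y_train).score(X_test, y_test))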
At the same time, with the very same model, if the data carry some signal (e.g. death because of smoking does depend on the sex), then the model can accordingly reach better accuracy levels.
When we deal with categorical variables, it is important to check whether each individual value of the independent variable carries information. E.g. although we have a single variable Sex, one of its values - Man - may be pure noise (50% of men die whether they smoke or not), while the other value - Woman - may hold a clear signal (90% die, 10% survive). In such a scenario the prediction accuracy will lie between the maximum (90%, when both values carry a clear 90-10 signal) and the minimum (50%, when both values are just 50-50 noise); in fact it works out to roughly 70%, the average of the two, as sketched below.
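To illustrate, a minimal sketch (an addition, not one of the original runs) that reuses the helper functions defined above with this mixed setup: men are pure noise, women carry a clear 90-10 signal. With enough datapoints the LogisticRegression score should land near 0.7.
In [ ]:
# Mixed-signal sketch: one value of Sex is noise, the other carries signal.
var_y = ['Died', 'Alive']
cases_probabilities = {
    'Man': [0.5, 0.5],    # noise: men die 50-50 regardless
    'Woman': [0.9, 0.1]}  # signal: 90% of women die
datapoints = generate_datapoints(10000, var_x, var_y, cases_probabilities)
df = pd.DataFrame(data=datapoints, columns=[var_x_name, var_y_name])
df_transformed = get_transformed_dataframe(df)

from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn import linear_model
logistic = linear_model.LogisticRegression()
# Expected to land near 0.7: the model predicts the majority class per sex,
# which is right ~50% of the time for men and ~90% of the time for women.
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))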
In order to judge how well our model is working based on prediction accuracy, we always have to keep in mind that even for a perfect model (without any bugs or bad tuning), there is a maximum prediction accuracy that can be reached, dictated by the signal in the data themselves.
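That ceiling can be computed directly from the generating probabilities: since each value of the independent variable contributes equally many datapoints here, the best any model can do is predict the most likely class within each group, so the maximum accuracy is the average of the per-group maximum probabilities. A minimal sketch (an addition, assuming the dict layout used throughout this notebook):
In [ ]:
# Theoretical accuracy ceiling for a dataset generated by generate_datapoints:
# every x value contributes equally many datapoints, and the best possible
# strategy is to always predict its most probable y value.
def max_reachable_accuracy(cases_probabilities):
    per_group_best = [max(probs) for probs in cases_probabilities.values()]
    return sum(per_group_best) / len(per_group_best)

print(max_reachable_accuracy({'Man': [0.5, 0.5], 'Woman': [0.5, 0.5]}))  # 0.5 (pure noise)
print(max_reachable_accuracy({'Man': [0.9, 0.1], 'Woman': [0.1, 0.9]}))  # 0.9 (clear signal)
print(max_reachable_accuracy({'Man': [0.5, 0.5], 'Woman': [0.9, 0.1]}))  # 0.7 (mixed)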