In this toy example,
we want to establish baselines for prediction accuracy under extreme scenarios of input signal.

We study two distinct cases:
a) The input is pure noise
b) The input hides a clear signal (smoking is fatal for men while it does not affect women)

We want to see how the model behaves under such conditions.

In [1]:
import pandas as pd
import numpy as np

In [157]:
def get_features_values_combinations(list_a, list_b):
    """ Return a list of all combinations of the values of the two lists,
    e.g. from the lists
    list_a = ['L', 'M', 'W']
    list_b = ['F', 'I', 'S']
    we get the combinations:
    [('L', 'F'), ('L', 'I'), ('L', 'S'), ('M', 'F'), ('M', 'I') ...
    """
    import itertools
    return list(itertools.product(list_a, list_b))

def generate_dataset_with_probabilities(n_datapoints,
                                        x_value,
                                        y_values,
                                        y_probabilities=None):
    """ Return a dataset with the given x value and y values drawn with the given probabilities
    :n_datapoints: The number of datapoints we want to generate
    :x_value: A single value of the independent variable, e.g. 'Man'
    :y_values: 1-D array e.g. ['meat', 'fish', 'vegetables']
    :y_probabilities: 1-D array-like e.g. [0.5, 0.25, 0.25]
        The probabilities associated with each entry in y_values.
        If not given, the sampling assumes a uniform distribution over all y_values
    """
    import numpy as np
    datapoints = []
    for i in range(n_datapoints):
        # Each datapoint is a (x value, randomly drawn y value) tuple
        datapoint = (x_value,) + (np.random.choice(y_values, p=y_probabilities),)
        datapoints.append(datapoint)

    return datapoints


def generate_datapoints(n_datapoints, var_x, var_y, cases_probabilities):
    """ Generate datapoints for all (x, y) combinations.
    :n_datapoints: Total number of datapoints to generate
    :var_x: 1-D array of independent variable values, e.g. ['Man', 'Woman']
    :var_y: 1-D array of dependent variable values, e.g. ['Died', 'Alive']
    :cases_probabilities: dict mapping each x value to its y probabilities,
        e.g. {'Man': [0.9, 0.1], 'Woman': [0.1, 0.9]}
    """
    # Generate the datapoints according to the desired distribution
    datapoints = []
    all_cases = get_features_values_combinations(var_x, var_y)
    n_cases = len(all_cases)
    n_datapoints_per_case = int(1. * n_datapoints / n_cases)
    for case in all_cases:
        sex = case[0]
        case_probabilities = cases_probabilities[sex]
        datapoints.extend(
            generate_dataset_with_probabilities(n_datapoints_per_case,
                                                sex,
                                                var_y,
                                                y_probabilities=case_probabilities))
    N_datapoints = len(datapoints)
    print "All the combinations we can have are: ", all_cases
    print "N total datapoints: ", N_datapoints
    return np.array(datapoints)


def get_transformed_dataframe(df):
    """ Transform the DataFrame accordingly (dummy vars / encoding).
    Relies on the globals var_x_name and var_y_name defined below.
    """
    # Transform the categorical variables
    from sklearn import preprocessing
    df_transformed_X = pd.get_dummies(df[var_x_name])
    encoder = preprocessing.LabelEncoder()
    encoder.fit(df[var_y_name].values)
    df_transformed_y = pd.DataFrame(data=encoder.transform(df[var_y_name].values), columns=[var_y_name])

    df_transformed = pd.concat([df_transformed_X, df_transformed_y], axis=1)
    return df_transformed
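
As a quick sanity check (an illustrative cell, not part of the original runs), the combinations helper can be exercised on the variables used below:

In [ ]:
# Illustrative only: show the combinations the helper produces
print get_features_values_combinations(['Man', 'Woman'], ['Died', 'Alive'])
# expected: [('Man', 'Died'), ('Man', 'Alive'), ('Woman', 'Died'), ('Woman', 'Alive')]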

In [273]:
# Independent Variable
var_x_name = 'sex'
var_x = ['Man', 'Woman']

# Dependent Variable
var_y_name = 'status'
var_y = ['Died', 'Alive']

a) Example with all probabilities 50-50 (input signal = noise)


In [250]:
# We assume that we do not have any strong signal in our input
# e.g. for men, smoking or not, 50% die
# and the same for women
cases_probabilities = {
    'Man' : [0.5, 0.5],
    'Woman' : [0.5, 0.5]}
datapoints = generate_datapoints(1000, var_x, var_y, cases_probabilities)

# Cross check that the datapoints we generated follow the frequency we thought
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    print "Frequency of ", case, ": ", 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])

df_transformed = get_transformed_dataframe(df)

# Split our sample to train / set
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Apply model
from sklearn import  linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))


All the combinations we can have are:  [('Man', 'Died'), ('Man', 'Alive'), ('Woman', 'Died'), ('Woman', 'Alive')]
N total datapoints:  1000
Frequency of  ('Man', 'Died') :  0.238
Frequency of  ('Man', 'Alive') :  0.262
Frequency of  ('Woman', 'Died') :  0.236
Frequency of  ('Woman', 'Alive') :  0.264
LogisticRegression score: 0.518182

b) Example with a clear input signal


In [258]:
# Here we assume that men are strongly affected by smoking (90% die),
# while women are not (only 10% die)
cases_probabilities = {
    'Man' : [0.9, 0.1],
    'Woman' : [0.1, 0.9]}
datapoints = generate_datapoints(1000, var_x, var_y, cases_probabilities)

# Cross check that the datapoints we generated follow the frequency we thought
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    print "Frequency of ", case, ": ", 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])

df_transformed = get_transformed_dataframe(df)

# Split our sample to train / set
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Apply model
from sklearn import  linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))


All the combinations we can have are:  [('Man', 'Died'), ('Man', 'Alive'), ('Woman', 'Died'), ('Woman', 'Alive')]
N total datapoints:  1000
Frequency of  ('Man', 'Died') :  0.452
Frequency of  ('Man', 'Alive') :  0.048
Frequency of  ('Woman', 'Died') :  0.049
Frequency of  ('Woman', 'Alive') :  0.451
LogisticRegression score: 0.912121

Example with a clear input signal for one group and pure noise for the other


In [ ]:
# Here we assume that one group carries no signal while the other does:
# for men, smoking or not, 50% die (pure noise),
# while for women there is a clear signal: 90% die
cases_probabilities = {
    'Man' : [0.5, 0.5],
    'Woman' : [0.9, 0.1]}
datapoints = generate_datapoints(1000, var_x, var_y, cases_probabilities)

# Cross check that the datapoints we generated follow the frequency we thought
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    print "Frequency of ", case, ": ", 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])

df_transformed = get_transformed_dataframe(df)

# Split our sample to train / set
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Apply model
from sklearn import  linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

We assume that there are 3 different values for the dependent variable


In [269]:
var_y = ['Died', 'Alive', 'Almost_dead']

cases_probabilities = {
    'Man' : [0.333, 0.333, 1 - 0.333 - 0.333],
    'Woman' : [0.333, 0.333, 1 - 0.333 - 0.333]}


datapoints = generate_datapoints(10000, var_x, var_y, cases_probabilities)

# Cross check that the datapoints we generated follow the frequency we thought
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    print "Frequency of ", case, ": ", 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])

df_transformed = get_transformed_dataframe(df)

# Split our sample to train / set
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Apply model
from sklearn import  linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))


All the combinations we can have are:  [('Man', 'Died'), ('Man', 'Alive'), ('Man', 'Almost_dead'), ('Woman', 'Died'), ('Woman', 'Alive'), ('Woman', 'Almost_dead')]
N total datapoints:  9996
Frequency of  ('Man', 'Died') :  0.164865946379
Frequency of  ('Man', 'Alive') :  0.168867547019
Frequency of  ('Man', 'Almost_dead') :  0.166266506603
Frequency of  ('Woman', 'Died') :  0.163865546218
Frequency of  ('Woman', 'Alive') :  0.170668267307
Frequency of  ('Woman', 'Almost_dead') :  0.165466186475
LogisticRegression score: 0.330706

We assume 4 categories for the dependent variable


In [271]:
var_y = ['Died', 'Alive', 'Almost_dead', 'Almost_alive']

cases_probabilities = {
    'Man' : [0.25, 0.25, 0.25, 0.25],
    'Woman' : [0.25, 0.25, 0.25, 0.25]
}


datapoints = generate_datapoints(10000, var_x, var_y, cases_probabilities)

# Cross check that the datapoints we generated follow the frequency we thought
all_cases = get_features_values_combinations(var_x, var_y)
for case in all_cases:
    print "Frequency of ", case, ": ", 1. * len([x for x in datapoints if tuple(x.tolist()) == case]) / len(datapoints)
    
df = pd.DataFrame(data = datapoints, columns=[var_x_name, var_y_name])

df_transformed = get_transformed_dataframe(df)

# Split our sample to train / set
from sklearn.model_selection import train_test_split
X = df_transformed[['Man', 'Woman']].values
y = df_transformed['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Apply model
from sklearn import  linear_model
logistic = linear_model.LogisticRegression()
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))


All the combinations we can have are:  [('Man', 'Died'), ('Man', 'Alive'), ('Man', 'Almost_dead'), ('Man', 'Almost_alive'), ('Woman', 'Died'), ('Woman', 'Alive'), ('Woman', 'Almost_dead'), ('Woman', 'Almost_alive')]
N total datapoints:  10000
Frequency of  ('Man', 'Died') :  0.1279
Frequency of  ('Man', 'Alive') :  0.1239
Frequency of  ('Man', 'Almost_dead') :  0.1237
Frequency of  ('Man', 'Almost_alive') :  0.1245
Frequency of  ('Woman', 'Died') :  0.1265
Frequency of  ('Woman', 'Alive') :  0.121
Frequency of  ('Woman', 'Almost_dead') :  0.1263
Frequency of  ('Woman', 'Almost_alive') :  0.1262
LogisticRegression score: 0.245455

Summary

Point 1

If the input data is pure noise, the best accuracy the model can reach is just 1/(N classes), where N classes is the number of distinct values of the dependent variable.

So if we have 2 output classes (Alive, Died), the prediction accuracy can be at most 1/2 = 0.5
So if we have 3 output classes (Alive, Died, Almost_dead), the prediction accuracy can be at most 1/3 ≈ 0.33
So if we have 4 output classes (Alive, Died, Almost_dead, Almost_alive), the prediction accuracy can be at most 1/4 = 0.25

At the same time, with the very same model, if the data carry some signal (e.g. death because of smoking does depend on sex), then the model can accordingly reach better accuracy levels.
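
A minimal sketch of that baseline (an illustrative cell, not part of the original runs), assuming scikit-learn's DummyClassifier is available: a classifier that ignores its input scores close to 1/(N classes) on uniformly distributed labels.

In [ ]:
# Sketch: a classifier that ignores the input cannot beat the chance level
import numpy as np
from sklearn.dummy import DummyClassifier

n_classes = 3
X_rand = np.random.randint(0, 2, size=(1000, 1))     # pure-noise feature
y_rand = np.random.randint(0, n_classes, size=1000)  # uniform labels
dummy = DummyClassifier(strategy='uniform')
print('Chance-level accuracy: %f' % dummy.fit(X_rand, y_rand).score(X_rand, y_rand))
# expected to be close to 1 / n_classes = 0.333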

Point 2

If we deal with categorical variables, it is important to check whether each value of the independent variable carries information. E.g. although we have a single variable Sex, one component - Men - may be pure noise (50% affected by smoking, 50% not), while the other component - Women - holds a clear signal (90% affected, 10% not). In such a scenario the prediction accuracy will lie between the maximum (90%, when both groups carry a clear 90-10 signal) and the minimum (50%, when both are just 50-50 noise). In this setup it works out to about 70%: a perfect model predicts the majority outcome per group, giving 0.5 * 0.9 + 0.5 * 0.5 = 0.70.
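
A back-of-the-envelope check of that 70% figure (illustrative only, assuming the two groups are equally sized): the best a model can do is predict the majority outcome within each group.

In [ ]:
# Sketch: expected best accuracy when Men are pure noise (50-50)
# and Women carry a clear signal (90-10); groups assumed equally sized
cases_probabilities = {
    'Man': [0.5, 0.5],
    'Woman': [0.9, 0.1]}
best_accuracy = sum(max(probs) for probs in cases_probabilities.values()) / len(cases_probabilities)
print('Expected best accuracy: %f' % best_accuracy)
# 0.5 * 0.9 + 0.5 * 0.5 = 0.70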

Point 3

In order to judge how well our model is working based on prediction accuracy, we always have to keep in mind that even for a perfect model (without any bugs or bad tuning), there is a maximum prediction accuracy that can be reached, dictated by the signal in the data themselves.
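
The same idea in a more general form: a rough sketch (accuracy_ceiling is a hypothetical helper, not used elsewhere in this notebook) of the accuracy ceiling implied by the generating probabilities, assuming the group proportions are known.

In [ ]:
# Sketch: the accuracy ceiling implied by the data-generating process.
# For each value of the independent variable a perfect model can at best
# pick the most probable outcome; the ceiling is the weighted average of those maxima.
def accuracy_ceiling(cases_probabilities, group_weights=None):
    groups = sorted(cases_probabilities)
    if group_weights is None:
        group_weights = {g: 1. / len(groups) for g in groups}  # equal-sized groups assumed
    return sum(group_weights[g] * max(cases_probabilities[g]) for g in groups)

print('Ceiling, pure noise      : %f' % accuracy_ceiling({'Man': [0.5, 0.5], 'Woman': [0.5, 0.5]}))
print('Ceiling, clear signal    : %f' % accuracy_ceiling({'Man': [0.9, 0.1], 'Woman': [0.1, 0.9]}))
print('Ceiling, signal and noise: %f' % accuracy_ceiling({'Man': [0.5, 0.5], 'Woman': [0.9, 0.1]}))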

In [ ]: