Analysis of the Pagination Dataset

Preliminaries

This notebook generates every plot and numerical result mentioned or alluded to in Section 4 of Algorithms for the Pagination Problem, a Bin Packing with Overlapping Items.

Main parameters


In [ ]:
from collections import OrderedDict

INPUT_PATH = "gauss/"
(MIN_PREFIX, MAX_PREFIX) = ("C015", "C055") # for instance filenames
OUTPUT_PATH = "plots/"
WINDOW = 150 # size of the subsets of instances used as a moving window

SOLVER_NAMES = OrderedDict([
    ("GeneticGroup", "Grouping GA"),
    ("GeneticStandard", "Standard GA"),
    ("OverloadAndRemove", "Overload-and-Remove"),
    ("OverloadAndRemovePresort", "Overload-and-Remove (with presort)"),
    ("BestFusion", "Best Fusion"),
    ("FirstFit", "First Fit"),
])
EXCLUDED_SOLVER_NAMES = {"OverloadAndRemovePresort"} # excluded from certain plots
solvers = ["solvers" + name for name in SOLVER_NAMES.keys()] # column names: solversGeneticGroup, solversGeneticStandard, ...
times = ["times" + name for name in SOLVER_NAMES.keys()] # column names: timesGeneticGroup, timesGeneticStandard, ...
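
A minimal sketch (toy values, not from the dataset) of the centered rolling mean that WINDOW parameterizes in the plots further below:


In [ ]:
# Toy demonstration of the centered rolling mean used with WINDOW later on:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3, center=True).mean())  # NaN, 2.0, 3.0, 4.0, NaN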

Tools


In [ ]:
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import Locator
# np.warnings was removed in NumPy 1.24+; use the standard warnings module instead
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [ ]:
!pip install seaborn


Collecting seaborn
  Downloading seaborn-0.8.tar.gz (178kB)
Building wheels for collected packages: seaborn
  Running setup.py bdist_wheel for seaborn ... done
  Stored in directory: /Users/aristide/Library/Caches/pip/wheels/59/10/63/af632c81af65a5acca29220158270dc767927a81650acffcb5
Successfully built seaborn
Installing collected packages: seaborn
Successfully installed seaborn-0.8

In [ ]:
import seaborn as sns
sns.set_style("white")
sns.set_context("paper", font_scale=2)
sns.set_palette(sns.color_palette("Set1", 5))

In [ ]:
def plot_linear_regression(x, y):
    """Plot the degree-1 least-squares fit of y on x and return their Pearson correlation."""
    fit = np.polyfit(x, y, deg=1)  # fit[0] is the slope, fit[1] the intercept
    plt.plot(x, fit[0] * x + fit[1])
    correlation = round(x.corr(y), 3)
    print("Pearson:", correlation)
    return correlation

In [ ]:
!pip install pandas --upgrade

Reading the instances

Create a DataFrame from all the JSON files whose names lie lexicographically between MIN_PREFIX and MAX_PREFIX.
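
Before the actual loading, here is a minimal sketch (on a hypothetical toy instance; the keys and values are made up) of the flattening step used in the cell below: nested dictionaries are merged into the top level by key concatenation, which produces column names such as solversFirstFit.


In [ ]:
# Toy illustration of the sub-dict flattening below (hypothetical values):
toy = {"pageCount": 3, "solvers": {"FirstFit": 4, "BestFusion": 3}}
for (k, v) in list(toy.items()):
    if isinstance(v, dict):
        for (sub_key, sub_value) in v.items():
            toy[k + sub_key] = sub_value  # "solvers" + "FirstFit" -> "solversFirstFit"
        del toy[k]
print(toy)  # {'pageCount': 3, 'solversFirstFit': 4, 'solversBestFusion': 3}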


In [ ]:
import os, json

df = []
indexes = []
for filename in sorted(os.listdir(INPUT_PATH)):  # sorted for a deterministic row order
    if not filename.endswith("json") or not MIN_PREFIX <= filename <= MAX_PREFIX:
        continue
    with open(os.path.join(INPUT_PATH, filename)) as f:
        instances = json.load(f)
    indexes.extend([(filename, discriminant) for discriminant in range(len(instances))])
    for instance in instances:
        for (k, v) in list(instance.items()):
            if isinstance(v, dict): # flatten any sub-dict by key concatenation
                for (sub_key, sub_value) in v.items():
                    instance[k + sub_key] = sub_value
                del instance[k]
    df.extend(instances)
df = pd.DataFrame(df, index=pd.MultiIndex.from_tuples(indexes, names=("filename", "i")))
df["best"] = df[["pageCount", "cplexOpt", "cplexUB"]].min(axis=1) # best known pagination size
df["cardinality"] = df["tiles"].apply(lambda tiles: sum(len(tile) for tile in tiles)) # total number of symbol occurrences
df_sorted_by_multiplicity = df.sort_values(by="avgMultiplicity") # for use with a moving window
print(df.info())
df.describe()


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10986 entries, (C015_A020_T020_P00.json, 0) to (C050_A100_T100_P00.json, 5)
Data columns (total 32 columns):
approximation                      10986 non-null object
avgMultiplicity                    10986 non-null float64
capacity                           10986 non-null int64
category                           10986 non-null object
cplexLB                            299 non-null float64
cplexOpt                           43 non-null float64
cplexUB                            8 non-null float64
lowerBound                         10986 non-null int64
pageCount                          10986 non-null int64
pages                              10986 non-null object
paginations                        10986 non-null object
properSymbolCount                  10986 non-null int64
solversBestFusion                  10986 non-null int64
solversFirstFit                    10986 non-null int64
solversGeneticGroup                10986 non-null int64
solversGeneticStandard             10986 non-null int64
solversOverloadAndRemove           10986 non-null int64
solversOverloadAndRemovePresort    10986 non-null int64
symbolCount                        10986 non-null int64
symbols                            10986 non-null object
tileCount                          10986 non-null int64
tileMaxSize                        10986 non-null int64
tileMinSize                        10986 non-null int64
tiles                              10986 non-null object
timesBestFusion                    10986 non-null float64
timesFirstFit                      10986 non-null float64
timesGeneticGroup                  10986 non-null float64
timesGeneticStandard               10986 non-null float64
timesOverloadAndRemove             10986 non-null float64
timesOverloadAndRemovePresort      10986 non-null float64
best                               10986 non-null float64
cardinality                        10986 non-null int64
dtypes: float64(11), int64(15), object(6)
memory usage: 2.7+ MB
None
Out[ ]:
avgMultiplicity capacity cplexLB cplexOpt cplexUB lowerBound pageCount properSymbolCount solversBestFusion solversFirstFit ... tileMaxSize tileMinSize timesBestFusion timesFirstFit timesGeneticGroup timesGeneticStandard timesOverloadAndRemove timesOverloadAndRemovePresort best cardinality
count 10986.000000 10986.000000 299.000000 43.000000 8.00000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 ... 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000 10986.000000
mean 11.737779 30.598034 5.080268 3.697674 6.37500 2.925177 15.423630 1.730475 18.269616 18.896869 ... 20.581831 5.032587 0.080350 0.063607 91.022779 7.129880 1.180678 1.129126 15.423084 724.834699
std 9.348821 11.276800 2.023415 1.456265 1.59799 1.174315 9.015449 3.968962 10.292886 10.475035 ... 10.275986 3.571601 0.050651 0.043648 53.335362 3.949599 1.131327 1.100289 9.016101 504.132756
min 1.840000 15.000000 2.000000 2.000000 4.00000 2.000000 2.000000 0.000000 2.000000 2.000000 ... 4.000000 2.000000 0.010000 0.010000 2.220000 1.320000 0.010000 0.010000 2.000000 56.000000
25% 5.340000 20.000000 4.000000 3.000000 5.50000 2.000000 8.000000 0.000000 10.000000 11.000000 ... 13.000000 2.000000 0.040000 0.030000 51.220000 4.200000 0.360000 0.350000 8.000000 367.000000
50% 8.810000 30.000000 4.000000 3.000000 7.00000 3.000000 13.000000 0.000000 15.000000 16.000000 ... 19.000000 4.000000 0.070000 0.050000 78.265000 6.310000 0.820000 0.770000 13.000000 576.000000
75% 14.720000 40.000000 6.000000 4.000000 7.25000 3.000000 21.000000 1.000000 25.000000 26.000000 ... 27.000000 7.000000 0.100000 0.080000 118.457500 9.080000 1.630000 1.530000 21.000000 939.000000
max 65.640000 50.000000 12.000000 8.000000 8.00000 7.000000 61.000000 42.000000 64.000000 65.000000 ... 49.000000 27.000000 0.500000 0.600000 362.850000 32.780000 9.380000 9.770000 61.000000 3624.000000

8 rows × 26 columns


In [ ]:
print("There are a %s instances." % len(df))


There are 10986 instances.

Statistical difficulty and average multiplicity

[Sec. 4.2] Measuring the statistical difficulty of a given instance

Conjecture 1. The statistical difficulty of a given instance can be approximated by the difference between the average and the minimal number of pages in the paginations calculated by the various solvers.
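
For example (toy sizes, not taken from the dataset), if the six solvers produce paginations of 10, 10, 11, 12, 12 and 13 pages, the statistical difficulty of the instance is the mean minus the minimum:


In [ ]:
# Toy illustration of Conjecture 1 (hypothetical pagination sizes):
sizes = [10, 10, 11, 12, 12, 13]  # pages found by the six solvers
difficulty = sum(sizes) / len(sizes) - min(sizes)  # mean minus best
print("Statistical difficulty: %.2f" % difficulty)  # 1.33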

Correlation between the statistical difficulty and the size of the best pagination

Note that this measure of statistical difficulty is intrinsically correlated with the size of the best pagination:


In [ ]:
x = df[solvers].mean(axis=1) - df["best"]
y = df["best"]
plt.xlabel("Statistical difficulty")
plt.ylabel("Best pagination size")
plt.scatter(x, y, marker="o", s=1)
_ = plot_linear_regression(x, y)


Pearson: 0.792

The dispersion of the pagination sizes could have been measured in several other ways, for instance with the standard deviation (below).


In [ ]:
x = df["avgMultiplicity"]
y = df[solvers].std(axis=1)
axes = plt.gca()
axes.set_xlim([0, 70])
plot_linear_regression(x, y)
plt.scatter(x, y, marker="o", s=1)
plt.xlabel("Average multiplicity")
plt.ylabel("Average standard deviation")
plt.grid()
plt.show()


Pearson: 0.695

Number of instances by statistical difficulty


In [ ]:
result = df.groupby(round(2 * (df[solvers].mean(axis=1) - df["best"])) / 2).size() # bin the difficulties to the nearest 0.5
result.plot(kind="bar")
plt.yscale("symlog")
plt.xlabel("Statistical difficulty")
plt.ylabel("Number of instances (sym-log scale)")
plt.show()
print("Number of instances per statistical difficulty:\n", result)
print("Average statistical difficulty: %.02f" % (df[solvers].mean(axis=1) - df["best"]).mean())
print("Median statistical difficulty: %.02f" % (df[solvers].mean(axis=1) - df["best"]).median())


Number of instances per statistical difficulty:
 0.0     208
0.5     962
1.0    1879
1.5    2027
2.0    1643
2.5    1109
3.0     910
3.5     832
4.0     586
4.5     357
5.0     171
5.5     100
6.0      51
6.5      48
7.0      38
7.5      30
8.0      22
8.5      12
9.0       1
dtype: int64
Average statistical difficulty: 2.17
Median statistical difficulty: 1.83

[Sec. 4.3] Predicting the statistical difficulty of a given instance

Conjecture 2. The statistical difficulty of a given random instance is strongly correlated to the density of its shared symbols, or average multiplicity.
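
The average multiplicity of an instance is the average number of tiles each symbol belongs to. As a sanity check, it can be reconstructed from the raw data; the sketch below rests on our assumption (not stated in this notebook) that avgMultiplicity equals cardinality divided by symbolCount:


In [ ]:
# Hedged sanity check: we ASSUME avgMultiplicity == cardinality / symbolCount,
# i.e. total symbol occurrences over distinct symbols. If this assumption is
# wrong for the dataset, the printed matching rate will be low.
reconstructed = df["cardinality"] / df["symbolCount"]
matching = np.isclose(reconstructed, df["avgMultiplicity"], atol=0.01)
print("Instances matching the assumed definition: %.1f %%" % (100.0 * matching.mean()))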

[Fig. 4] Statistical difficulty by average multiplicity


In [ ]:
plt.figure(figsize=(10,5))
x = df["avgMultiplicity"]
y = df[solvers].mean(axis=1) - df["best"]
axes = plt.gca()
axes.set_xlim([0, 70])
axes.set_ylim([-1, 9.5])
plot_linear_regression(x, y)
plt.scatter(x, y, marker="o", s=1)
plt.xlabel("Average multiplicity")
plt.ylabel("Average range (statistical difficulty)")
plt.grid()
plt.savefig(os.path.join(OUTPUT_PATH, "difficulty_by_multiplicity.pdf"), bbox_inches='tight')


Pearson: 0.784

In [ ]:
plt.figure(figsize=(20, 10))
df["bitSize"] = df["symbolCount"] * df["tileCount"]
for (i, column) in enumerate(["symbolCount", "bitSize", "tileCount", "cardinality"], 1):
    plt.subplot(2, 2, i)
    x = df[column]
    y = df[solvers].mean(axis=1) - df["best"]
    if i in [1, 3]:
        plt.ylabel("Average range (statistical difficulty)")
    plt.scatter(x, y, marker="o", s=1)
    correlation = plot_linear_regression(x, y)
    plt.xlabel("%s (r = %s)" % (column, correlation))
plt.show()


Pearson: -0.082
Pearson: 0.366
Pearson: 0.563
Pearson: 0.77

[Fig. 5] Number of instances by average multiplicity


In [ ]:
plt.figure(figsize=(10,6))

range_width = 2
ranges = np.arange(1, df["avgMultiplicity"].max() + range_width, range_width)

slices = pd.cut(df["avgMultiplicity"], ranges)
instances_per_slice = df.groupby(slices).size()
instances_per_slice.plot(kind="bar", width=0.9, color="#ffffbf")

cplex_instances = df[df["cplexOpt"].notnull() | df["cplexLB"].notnull() | df["cplexUB"].notnull()]
cplex_slices = pd.cut(cplex_instances["avgMultiplicity"], ranges)
cplex_instances.groupby(cplex_slices).size().plot(kind="bar", width=0.7, color='#abdda4')

cplex_solved_instances = df[df["cplexOpt"].notnull()]
cplex_solved_slices = pd.cut(cplex_solved_instances["avgMultiplicity"], ranges)
cplex_solved_instances.groupby(cplex_solved_slices).size().plot(kind="bar", width=0.5, color="#2b83ba")

plt.xlabel("Ranges of average multiplicity")
plt.ylabel("Number of instances (sym-log scale)")
plt.yscale('symlog')
axes = plt.gca()
axes.set_ylim(0, 3000)
plt.tick_params(axis='x', which='both', bottom=False, top=False) # booleans; the 'off' strings are rejected by recent Matplotlib

axes.yaxis.grid(True)

plt.legend(["All instances", "Submitted to CPLEX", "Solved to optimality by CPLEX"])
plt.savefig(os.path.join(OUTPUT_PATH, "count_by_multiplicity.pdf"), bbox_inches='tight')



In [ ]:
range_width = 1
ranges = np.arange(1, df["avgMultiplicity"].max() + range_width, range_width)
slices = pd.cut(df["avgMultiplicity"], ranges)
instances_per_slice = df.groupby(slices).size()
for start in (4, 23, 53):
    n = instances_per_slice.iloc[start - 1]  # positional lookup: bins (1, 2], (2, 3], ... start at multiplicity 1
    print("There are %d instances whose average multiplicity lies between %s and %s." % (n, start, start + range_width))


There are 1119 instances whose average multiplicity lies between 4 and 5.
There are 107 instances whose average multiplicity lies between 23 and 24.
There are 10 instances whose average multiplicity lies between 53 and 54.

In [ ]:
(a, b) = (1, 9)
rate = 100.0 * instances_per_slice.iloc[a-1:b-1].sum() / len(df)
print("%0.2f %% of the instances have an average multiplicity between %s and %s." % (rate, a, b))


51.29 % of the instances have an average multiplicity between 1 and 9.

[Sec. 4.4] Discussion

[Sec. 4.4.1] Behavior of the integer linear program


In [ ]:
cplex_instances = df[df["cplexOpt"].notnull() | df["cplexLB"].notnull() | df["cplexUB"].notnull()]
print("%s instances (%.2f %%) submitted to CPLEX." % (len(cplex_instances), 100.0 * len(cplex_instances)/len(df)))


342 instances (3.11 %) submitted to CPLEX.

In [ ]:
print("CPLEX's success in less than one hour: %s instances (%.1f %%)." % (df["cplexOpt"].count(), 100.0 * df["cplexOpt"].count() / len(cplex_instances)))


CPLEX's success in less than one hour: 43 instances (12.6 %).

In [ ]:
for above in (13, 20):
    cplex_instances_above = cplex_instances[cplex_instances["avgMultiplicity"] > above]
    print("CPLEX's success in less than one hour above an average multiplicity of %s: %.1f %%." % (above, 100.0 * cplex_instances_above["cplexOpt"].count() / len(cplex_instances_above)))


CPLEX's success in less than one hour above an average multiplicity of 13: 3.8 %.
CPLEX's success in less than one hour above an average multiplicity of 20: 0.0 %.

In [ ]:
cplex_results = df[df["cplexOpt"].notnull() | df["cplexUB"].notnull()][["cplexOpt","cplexUB","pageCount"]]
print("All the %s instances for which CPLEX has found either a solution, either an upper bound:" % len(cplex_results))
cplex_results


All the 51 instances for which CPLEX has found either a solution or an upper bound:
Out[ ]:
cplexOpt cplexUB pageCount
filename i
C015_A020_T020_P00.json 0 2.0 NaN 2
1 4.0 NaN 4
2 8.0 NaN 8
3 2.0 NaN 2
4 3.0 NaN 3
5 8.0 NaN 8
C015_A020_T050_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
4 NaN 7.0 6
C015_A020_T080_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C015_A050_T020_P00.json 0 NaN 8.0 8
1 7.0 NaN 7
2 NaN 7.0 7
3 NaN 8.0 8
5 7.0 NaN 7
C015_A050_T050_P00.json 0 NaN 6.0 6
3 5.0 NaN 6
C015_A050_T080_P00.json 2 NaN 7.0 8
C020_A025_T025_P00.json 0 3.0 NaN 3
1 4.0 NaN 5
3 2.0 NaN 2
4 4.0 NaN 4
C020_A025_T055_P00.json 0 4.0 NaN 4
3 3.0 NaN 3
C020_A025_T085_P00.json 3 4.0 NaN 4
C025_A030_T030_P00.json 0 3.0 NaN 3
1 5.0 NaN 5
3 3.0 NaN 3
4 5.0 NaN 5
C025_A030_T060_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C025_A030_T090_P00.json 0 4.0 NaN 4
3 4.0 NaN 4
C025_A060_T030_P00.json 4 5.0 NaN 5
C030_A035_T035_P00.json 0 2.0 NaN 3
3 3.0 NaN 3
C030_A035_T065_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C030_A035_T095_P00.json 0 NaN 4.0 4
C035_A040_T040_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C035_A040_T100_P00.json 3 NaN 4.0 5
C040_A045_T045_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C040_A045_T075_P00.json 0 4.0 NaN 4
C045_A050_T050_P00.json 0 3.0 NaN 3
3 3.0 NaN 3
C045_A050_T080_P00.json 0 3.0 NaN 4
C050_A055_T055_P00.json 0 3.0 NaN 3
3 3.0 NaN 3

[Sec. 4.4.2] Comparison of the heuristic methods

[Fig. 6] Performance of the main heuristics


In [ ]:
x = df_sorted_by_multiplicity["avgMultiplicity"].rolling(WINDOW, center=True).mean()
plt.figure(figsize=(10,5))
axes = plt.gca()
axes.set_xlim([2, 52])
for time in times:
    solver_name = time[len("times"):]
    if solver_name in EXCLUDED_SOLVER_NAMES:
        continue
    y = df_sorted_by_multiplicity[time].rolling(WINDOW, center=True).mean()
    plt.plot(x, y, label=SOLVER_NAMES[solver_name])
plt.yscale('log')
plt.xlabel("Average multiplicity (rolling mean on %s instances)" % WINDOW)
plt.ylabel("Execution time (seconds, log scale)")
plt.grid()
plt.savefig(os.path.join(OUTPUT_PATH, "speed_by_multiplicity.pdf"), bbox_inches='tight')
plt.legend(loc=7) # legend not plotted for the paper version
plt.show()



In [ ]:
contents = [
    df[times].min().map('{:,.2f}'.format),
    df[times].max().map('{:,.2f}'.format),
    df[times].mean().map('{:,.2f}'.format),
    df[times].std().map('{:,.2f}'.format)
]
digest = pd.DataFrame(contents, index = ["min", "max", "mean", "std"])
digest.columns = list(SOLVER_NAMES.values()) # recent pandas requires a list, not a dict view
print("Basic aggregations on execution times (in seconds):")
digest


Basic aggregations on execution times (in seconds):
Out[ ]:
Grouping GA Standard GA Overload-and-Remove Overload-and-Remove (with presort) Best Fusion First Fit
min 2.22 1.32 0.01 0.01 0.01 0.01
max 362.85 32.78 9.38 9.77 0.50 0.60
mean 91.02 7.13 1.18 1.13 0.08 0.06
std 53.34 3.95 1.13 1.10 0.05 0.04

[Fig. 7] Relative quality of the five main heuristics

The outcomes are plotted at $y = \frac{\mathrm{best~size}}{\mathrm{size}}$, with $y=1$ corresponding to the best known solution (the optimal or best feasible solution found by CPLEX, or the smallest pagination calculated by the heuristics for the given instance).
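
For instance (toy numbers), a heuristic returning 12 pages on an instance whose best known pagination has 10 pages is plotted at $y = 10/12 \approx 0.83$.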


In [ ]:
x = df_sorted_by_multiplicity["avgMultiplicity"].rolling(WINDOW, center=True).mean()
plt.figure(figsize=(10,7))
axes = plt.gca()
axes.set_xlim([2, 52])
axes.set_ylim([0.74, 1.01])
axes.spines['right'].set_visible(False)
axes.spines['top'].set_visible(False)
for solver in solvers:
    solver_name = solver[len("solvers"):]
    if solver_name in EXCLUDED_SOLVER_NAMES:
        continue
    ratio = df_sorted_by_multiplicity["best"] / df_sorted_by_multiplicity[solver]
    y = ratio.rolling(WINDOW, center=True).mean()
    plt.plot(x, y, label=SOLVER_NAMES[solver_name])
plt.xlabel("Average multiplicity (rolling mean on %s instances)" % WINDOW)
plt.ylabel("Average pagination size vs. best known result")
plt.grid()

# move the legend to an empty area of the plot
legend = plt.legend(loc=7)
plt.draw()
# Bbox.inverse_transformed was removed in recent Matplotlib; use transformed(inverted())
bb = legend.legendPatch.get_bbox().transformed(axes.transAxes.inverted())
bb.set_points([[bb.x0 - 0.02, bb.y0 + 0.2], [bb.x1 - 0.02, bb.y1 + 0.2]])
legend.set_bbox_to_anchor(bb)

plt.savefig(os.path.join(OUTPUT_PATH, "relative_size_by_multiplicity.pdf"), bbox_inches='tight')


Exact algorithms vs. heuristics

The column pageCount gives the smallest pagination size found by the various heuristics:


In [ ]:
assert len(df[df["pageCount"] != df[solvers].min(axis=1)]) == 0

Hence, the optimal value found by CPLEX may be smaller than this one:


In [ ]:
suboptimal_instances_1 = df[df["cplexOpt"] < df["pageCount"]][["cplexOpt", "pageCount"] + solvers]
suboptimal_instances_1.columns = ["cplexOpt", "pageCount"] + list(SOLVER_NAMES.values())
print("The optimal solution is better than the best approximation for these %s instances:" % len(suboptimal_instances_1))
suboptimal_instances_1


The optimal solution is better than the best approximation for these 4 instances:
Out[ ]:
cplexOpt pageCount Grouping GA Standard GA Overload-and-Remove Overload-and-Remove (with presort) Best Fusion First Fit
filename i
C015_A050_T050_P00.json 3 5.0 6 6 7 6 6 6 8
C020_A025_T025_P00.json 1 4.0 5 5 5 5 5 6 6
C030_A035_T035_P00.json 0 2.0 3 3 3 3 3 3 3
C045_A050_T080_P00.json 0 3.0 4 4 4 4 4 5 5

It may happen that the upper bound found by CPLEX is less than the best page count found by the heuristics. In this case, we know that there exists a better pagination (although CPLEX cannot prove its optimality):
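
For example, for instance 2 of C015_A050_T080_P00.json (first row below), CPLEX found a 7-page pagination, while the best heuristic needed 8 pages.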


In [ ]:
suboptimal_instances_2 = df[df["cplexUB"] < df["pageCount"]][["cplexUB", "pageCount"] + solvers]
suboptimal_instances_2.columns = ["cplexUB", "pageCount"] + list(SOLVER_NAMES.values())
print("For %s more instances, we know that the best approximation is not optimal:" % len(suboptimal_instances_2))
suboptimal_instances_2


For 2 more instances, we know that the best approximation is not optimal:
Out[ ]:
cplexUB pageCount Grouping GA Standard GA Overload-and-Remove Overload-and-Remove (with presort) Best Fusion First Fit
filename i
C015_A050_T080_P00.json 2 7.0 8 8 10 8 8 9 11
C035_A040_T100_P00.json 3 4.0 5 5 5 5 5 6 6

The column best gives the minimum pagination size found by the heuristics and CPLEX (including CPLEX's upper bounds):


In [ ]:
df[df["best"] < df["pageCount"]][["best", "pageCount"]]


Out[ ]:
best pageCount
filename i
C015_A050_T050_P00.json 3 5.0 6
C015_A050_T080_P00.json 2 7.0 8
C020_A025_T025_P00.json 1 4.0 5
C030_A035_T035_P00.json 0 2.0 3
C035_A040_T100_P00.json 3 4.0 5
C045_A050_T080_P00.json 0 3.0 4

In [ ]:
count = len(suboptimal_instances_1) + len(suboptimal_instances_2)
print("All in all, ILP improved on the heuristics in %s cases" % count, end=" ")
print("(%.02f %% of the %s selected instances)." % (100.0 * count / len(cplex_instances), len(cplex_instances)))


All in all, ILP improved on the heuristics in 6 cases (1.75 % of the 342 selected instances).

Grouping GA vs. the other heuristics


In [ ]:
prefix = ["avgMultiplicity", "pageCount"]
columns = [
    "solversGeneticGroup",
    "solversGeneticStandard",
    "solversOverloadAndRemove",
    "solversOverloadAndRemovePresort"
]
bad_gga = df[df["pageCount"] < df["solversGeneticGroup"]][prefix + columns]
for column in columns[1:]:
    # keep only the values that strictly beat Grouping GA; the rest are blanked out
    bad_gga[column] = bad_gga[column][bad_gga[column] < bad_gga["solversGeneticGroup"]]
bad_gga.columns = prefix + [SOLVER_NAMES[column[len("solvers"):]] for column in columns]
print("In %.02f %% of the cases," % (100.0 - 100.0 * len(bad_gga) / len(df)))
print("Grouping GA was the best heuristic, except on these %s instances" % len(bad_gga), end=" ")
print("(greater values erased for clarity, sorted by increasing average multiplicity).")
bad_gga.sort_values(by="avgMultiplicity").fillna("")


In 99.64 % of the cases,
Grouping GA was the best heuristic, except on these 40 instances (greater values erased for clarity, sorted by increasing average multiplicity).
Out[ ]:
avgMultiplicity pageCount Grouping GA Standard GA Overload-and-Remove Overload-and-Remove (with presort)
filename i
C015_A060_T050_P00.json 1 2.33 6 7 6
C015_A095_T090_P00.json 1 2.67 11 12 11
C015_A050_T045_P00.json 0 2.72 5 6 5
C015_A085_T095_P00.json 3 3.09 11 12 11
C015_A080_T100_P00.json 3 3.65 12 13 12 12
C020_A090_T075_P00.json 2 3.89 12 13 12
C020_A100_T085_P00.json 4 4.44 16 17 16
C020_A100_T095_P00.json 2 4.49 16 17 16 16
4 4.66 16 17 16 16
C015_A020_T035_P00.json 0 4.95 2 3 2
C015_A055_T100_P00.json 2 5.15 10 11 10
0 5.18 10 11 10
C015_A100_T080_P00.json 0 5.24 29 30 29
C020_A090_T095_P00.json 4 5.27 16 17 16 16
C015_A040_T080_P00.json 2 5.50 6 7 6
C020_A025_T030_P00.json 0 5.60 2 3 2
C015_A020_T045_P00.json 3 5.70 2 3 2 2
C020_A080_T095_P00.json 4 5.83 15 16 15
C030_A095_T080_P00.json 0 6.02 13 14 13
C025_A095_T095_P00.json 0 6.09 16 17 16 16
C030_A075_T070_P00.json 2 6.21 9 10 9
C030_A035_T040_P00.json 3 6.23 2 3 2
C030_A095_T090_P00.json 2 7.17 15 16 15 15
C030_A080_T095_P00.json 4 7.63 12 13 12
C025_A075_T090_P00.json 2 7.65 15 16 15
C020_A060_T100_P00.json 2 7.78 13 14 13
C035_A100_T100_P00.json 4 8.09 15 16 15
C030_A075_T085_P00.json 2 8.15 12 13 12
C015_A075_T095_P00.json 1 8.49 33 34 33
C025_A070_T095_P00.json 0 8.51 14 15 14
C020_A055_T100_P00.json 4 9.44 14 15 14 14
C015_A060_T090_P00.json 5 9.45 27 28 27
C035_A085_T100_P00.json 2 9.48 14 15 14
C015_A060_T090_P00.json 1 9.77 29 30 29 29
C030_A065_T100_P00.json 4 9.85 11 12 11
C040_A080_T090_P00.json 2 10.71 12 13 12 12
C015_A055_T095_P00.json 1 11.36 30 31 30
C015_A050_T100_P00.json 1 12.80 28 29 28
C045_A050_T060_P00.json 3 13.08 3 4 3
C015_A045_T100_P00.json 5 14.87 30 31 30

In [ ]:
for column in bad_gga.columns[len(prefix) + 1:]:
    count = bad_gga[column].count()
    print("%s produced a better pagination than Grouping GA on %s instances (%.03f %%)." % (column, count, (100.0 * count / len(df))))


Standard GA produced a better pagination than Grouping GA on 4 instances (0.036 %).
Overload-and-Remove produced a better pagination than Grouping GA on 22 instances (0.200 %).
Overload-and-Remove (with presort) produced a better pagination than Grouping GA on 24 instances (0.218 %).