Problem statement:

Input: C csv files, n rows each. Each row in file c encodes the prediction p[c] for class c on a 1-second segment.

Output: k clips, each 10 seconds in duration, such that class c on each clip has aggregate likelihood at least k * p[c]? p[y]?

So what are our entrofy parameters? 16 * 4 * n_classes.

If we only want one example per track, we can make an aux categorical column that's the track index, and set the target number to 1.
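One way to build that aux column, sketched below with pandas. The segment-id format (`"<track>_<offset>"`) and column names are hypothetical, purely for illustration; the real index format may differ.

```python
import pandas as pd

# Hypothetical segments: index encodes "<track>_<offset>".
df = pd.DataFrame(
    {"speech": [0.9, 0.8, 0.1, 0.2]},
    index=["trackA_0", "trackA_10", "trackB_0", "trackB_10"],
)

# Aux categorical column: the track each segment came from.
track = df.index.to_series().str.rsplit("_", n=1).str[0]
df_aux = df.assign(track=track)
```

A categorical mapper over `track` with a per-category target of 1 would then steer the sampler toward one segment per track.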
In [1]:
import numpy as np
import pandas as pd
import entrofy
In [2]:
import matplotlib.pyplot as plt
In [3]:
%matplotlib nbagg
In [50]:
df = pd.read_csv('/home/bmcfee/data/vggish-likelihoods-a226b3-maxagg10.csv.gz', index_col=0)
In [51]:
df.head(5)
Out[51]:
In [54]:
(df >= 0.5).describe().T.sort_values('freq')
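For boolean columns, `describe()` reports `count`/`unique`/`top`/`freq`, where `top` is the modal value and `freq` its count, so sorting by `freq` ascending surfaces the most balanced classes first. A toy illustration (hypothetical class names, not the real data):

```python
import pandas as pd

# Toy likelihoods: one column per class.
toy = pd.DataFrame({"speech": [0.9, 0.8, 0.1], "music": [0.1, 0.2, 0.3]})

# Booleanize at 0.5, then summarize; 'freq' counts the modal value per class.
summary = (toy >= 0.5).describe().T.sort_values("freq")
```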
Out[54]:
In [71]:
df.median()
Out[71]:
In [55]:
N_OUT = 23 * 100
In [56]:
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=2,
                                                 boundaries=[0.0, 0.5, 1.0])
           for col in df}
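These boundaries split each likelihood column into two bins at 0.5. A minimal sketch of that split in plain numpy (assumed behavior, not entrofy internals):

```python
import numpy as np

likelihoods = np.array([0.1, 0.49, 0.5, 0.99])

# boundaries [0.0, 0.5, 1.0] define two bins, [0.0, 0.5) and [0.5, 1.0];
# digitizing against the interior edge reproduces the split.
bin_idx = np.digitize(likelihoods, [0.5])
```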
In [ ]:
idx, score = entrofy.entrofy(df, N_OUT, mappers=mappers,
seed=20180205,
quantile=0.05,
n_trials=10)
In [64]:
df.loc[idx].head(10)
Out[64]:
In [65]:
(df.loc[idx] >= 0.5).describe().T.sort_values('freq')
Out[65]:
In [69]:
!pwd
In [68]:
idx.to_series().to_json('subsample_idx.json')
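The saved index round-trips through `pandas.read_json`. A sketch with a hypothetical index (the real `idx` comes from the entrofy call above):

```python
import pandas as pd

# Hypothetical subsample index.
idx = pd.Index(["seg_001", "seg_042", "seg_107"])
idx.to_series().to_json("subsample_idx.json")

# Round-trip: read the series back and recover the index.
restored = pd.Index(pd.read_json("subsample_idx.json", typ="series"))
```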
In [ ]:
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=4,
                                                 boundaries=[0.0, 0.25, 0.5, 0.75, 1.0])
           for col in df}
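The finer boundaries bin each likelihood into quartiles. Again a plain-numpy sketch of the implied split (assumed behavior, not entrofy's code):

```python
import numpy as np

vals = np.array([0.1, 0.3, 0.6, 0.9])

# Interior edges of [0.0, 0.25, 0.5, 0.75, 1.0] give four bins.
quartile = np.digitize(vals, [0.25, 0.5, 0.75])
```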
In [3]:
idx, score = entrofy.entrofy(df, 1000, mappers=mappers, n_trials=100)