In [1]:
import pandas as pd
import numpy as np
In [2]:
my_data = pd.DataFrame({"thrid": [1, 2, 3]})
Convert an integer into a binary string.
In [8]:
def to_binary(value):
    return "{0:b}".format(value)
In [10]:
to_binary(5)
Out[10]:
'101'
In [12]:
unique_values = my_data.thrid.unique()
To apply this to the data, we loop over the unique values with enumerate, which gives the index position and the value of each one. We then build a dictionary that maps each unique value to its index written as a binary string.
In [14]:
my_dict = {}
for index, val in enumerate(unique_values):
    my_dict[val] = to_binary(index)
In [ ]:
my_data["thrid_binary"] = my_data["thrid"].map(my_dict)
Then you will want to split the thrid_binary column into one column per bit. You can split the string on nothing, character by character, or put a separator between the characters and then split on that separator.
Before encoding, look at the histogram of the column and see whether the top n values account for the majority of the rows. If they do, you can encode just those values, and then you'd have a partition for those columns.
It would still be a better model.
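A rough sketch of those last two steps, not something that was run in this notebook. The sample values, the top_n cutoff, the thrid_bit_ column names, and the choice to give every value outside the top n the shared code 0 are all assumptions made for illustration. The strings are zero-padded so that each one splits into the same number of columns.

```python
import pandas as pd

# Hypothetical sample data; "thrid" is the column name used above.
my_data = pd.DataFrame({"thrid": ["a", "b", "a", "c", "a", "b", "d"]})

# Histogram of the column: how many rows each value accounts for.
counts = my_data["thrid"].value_counts()

top_n = 2  # assumed cutoff, choose it after looking at counts
top_values = list(counts.index[:top_n])
coverage = counts.iloc[:top_n].sum() / counts.sum()  # share of rows in the top n

# Assumption: code 0 is shared by every value outside the top n.
codes = {val: idx + 1 for idx, val in enumerate(top_values)}
width = len(format(len(codes), "b"))  # bits needed for the largest code

# Map each value to its code, then to a zero-padded binary string.
binary = (
    my_data["thrid"]
    .map(codes)
    .fillna(0)
    .astype(int)
    .map(lambda c: format(c, "b").zfill(width))
)

# Split each string character by character into one integer column per bit.
bits = pd.DataFrame(
    [list(s) for s in binary],
    index=my_data.index,
    columns=[f"thrid_bit_{i}" for i in range(width)],
).astype(int)

my_data = my_data.join(bits)
```

Using list(s) splits on every character. Joining the characters with a separator and calling .str.split(sep, expand=True) would give the same columns.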
In [ ]: