Exploring the names database

I'd like to try some sort of frequency-weighting for namegen, but instead of including each name the same number of times it appears in the corpus, I'll look for a transformation that damps the extremes a bit.

I should note that I'm not using any "best practices" or theory-driven decision making here; the goal isn't to make these look like real names, just to "sort of generate some stuff."
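As a toy illustration of the kind of transform I have in mind (the names and counts below are made up, except that 5 and 99680 are the actual extremes of the corpus): a raw weighting would include "Mary" roughly 20,000x more often than the rarest name, while a log transform compresses that ratio dramatically.

```python
import math

# Hypothetical example counts; 5 and 99680 match the observed extremes
counts = {"Mary": 99680, "Zelda": 500, "Xavier": 5}

# Compare raw counts to their natural logs
for name, c in counts.items():
    print(name, c, round(math.log(c), 2))
```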


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

Import the names database:


In [2]:
d = pd.read_csv("NationalNames.csv")

Get a sense of the distribution of occurrences per name:


In [19]:
def print_summ(arr):
    print("Range: {} to {}".format(round(min(arr), 2), round(max(arr), 2)))
    print("Mean: {}".format(round(np.mean(arr), 2)))
    print("Std. dev.: {}".format(round(np.std(arr), 2)))
    print("Median: {}".format(round(np.median(arr), 2)))

print_summ(d.Count)


Range: 5 to 99680
Mean: 184.69
Std. dev.: 1566.71
Median: 12.0

How about some natural logs?


In [20]:
lcount = np.log(d.Count)
print_summ(lcount)


Range: 1.61 to 11.51
Mean: 2.93
Std. dev.: 1.45
Median: 2.48

In [15]:
density = gaussian_kde(np.log(d.Count))
# this will be rough
dens_xs = np.linspace(0, np.log(max(d.Count)), 200)
plt.plot(dens_xs, density(dens_xs))
plt.show()



In [29]:
xs = np.arange(1, len(lcount)+1)
plt.plot(xs, np.sort(lcount))
plt.show()


Just to get a sense of the proportions:


In [31]:
(lcount < 11).mean()


Out[31]:
0.9998849588015556

In [32]:
(lcount < 4).mean()


Out[32]:
0.8239223241828103

In [38]:
sum(lcount.astype(int))/len(lcount)


Out[38]:
2.3869492881962802

So the expanded list isn't going to be too huge: I can add 1 to the log counts and then truncate to integers.
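As a quick sanity check at the extremes, assuming the same natural log as above: the observed Count range of 5 to 99680 maps to 2 and 12 copies respectively.

```python
import math

# int(log(c) + 1): smallest observed count (5) -> 2 copies,
# largest observed count (99680) -> 12 copies
for c in (5, 99680):
    print(c, int(math.log(c) + 1))
```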


In [49]:
transcount = (lcount+1).astype(int)
sum(transcount)


Out[49]:
6182649

Then make the list that way:


In [57]:
transd = {'name': d.Name, 'count': transcount}
td = pd.DataFrame(transd)
newlist = []
# iterate the columns directly rather than positionally indexing
# into itertuples (which silently depends on column order)
for name, count in zip(td['name'], td['count']):
    newlist.extend([name] * count)
print(len(newlist), newlist[:10])


6182649 ['Mary', 'Mary', 'Mary', 'Mary', 'Mary', 'Mary', 'Mary', 'Mary', 'Mary', 'Anna']

Looks good. Saving...


In [58]:
with open("ext_names.txt", "w") as f:
    f.write("\n".join(newlist))
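One way the expanded file might get consumed downstream (a sketch, using a toy stand-in list rather than the real ~6.2M-line file): uniform sampling over the repeated names reproduces the log-damped weights.

```python
import random

random.seed(0)  # deterministic demo

# Toy stand-in: suppose "Mary" ended up with weight 12 and "Xavier" with 2
names = ["Mary"] * 12 + ["Xavier"] * 2

# Uniform draws over the expanded list are weight-proportional,
# so "Mary" should come up near 12/14 of the time
draws = [random.choice(names) for _ in range(1000)]
print(draws.count("Mary") / len(draws))
```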