How to run this notebook:
./generate.sh -t -l. I downloaded 1 million training samples with p = 1.0. I copied the first 10k lines of data into a "sample" directory so I could quickly play around with the data, then switched to the full dataset when done.
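A minimal sketch of how the "sample" copies could be made (the paths are illustrative and assume the full CSVs are already in the src/ directory):
In [ ]:
import itertools
import os

# Hypothetical paths: full data lives in SRC_DIR, the 10k-line copies go in SAMPLE_DIR.
SRC_DIR = '/home/brandon/ubuntu_dialogue_corpus/src/'
SAMPLE_DIR = os.path.join(SRC_DIR, 'sample/')
os.makedirs(SAMPLE_DIR, exist_ok=True)

for name in ['train.csv', 'valid.csv', 'test.csv']:
    with open(os.path.join(SRC_DIR, name)) as src, \
         open(os.path.join(SAMPLE_DIR, name), 'w') as dst:
        # Keep the header line plus the first 10k data lines.
        dst.writelines(itertools.islice(src, 10001))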
Random facts from paper:
In [7]:
import numpy as np
import os.path
import pdb
import pandas as pd
from pprint import pprint
#DATA_DIR = '/home/brandon/terabyte/Datasets/ubuntu_dialogue_corpus/'
DATA_DIR = '/home/brandon/ubuntu_dialogue_corpus/src/'  # append 'sample/' to use the 10k-line sample
TRAIN_PATH = DATA_DIR + 'train.csv'
VALID_PATH = DATA_DIR + 'valid.csv'
TEST_PATH = DATA_DIR + 'test.csv'
def get_training():
    """Return a dataframe of (Context, Utterance) pairs from train.csv."""
    # Load the data directly into a dataframe from the train.csv file.
    df_train = pd.read_csv(TRAIN_PATH)
    # Keep only examples with Label == 1; there's no reason to train a generative
    # model on the false (negative) responses.
    df_train = df_train.loc[df_train['Label'] == 1.0]
    # Don't care about the pandas indices, so reset them.
    df_train = df_train.reset_index(drop=True)
    # Keep just the first two columns (Context, Utterance).
    df_train = df_train[df_train.columns[:2]]
    return df_train
def get_validation():
    """Return a dataframe of (Context, Utterance) pairs from valid.csv."""
    # Load the data directly into a dataframe from the valid.csv file.
    df_valid = pd.read_csv(VALID_PATH)
    # Keep the first two columns and give them the same names as the training data.
    first_two_cols = df_valid.columns[:2]
    df_valid = df_valid[first_two_cols]
    df_valid.columns = ['Context', 'Utterance']
    return df_valid
df_train = get_training()
df_valid = get_validation()
In [8]:
# Get the data into a usable form for building a 'vocabulary' (unique words):
# the helpers below display conversations and split them into per-user turns.
import nltk, re, pprint
from nltk import word_tokenize
import pdb
def print_single_turn(turn: str):
    """Print each utterance of a single turn on its own line."""
    # Split leaves a blank final entry after the last '__eou__', so drop it.
    as_list_of_utters = turn.split('__eou__')[:-1]
    for idx_utter, utter in enumerate(as_list_of_utters):
        print("\t>>>", utter)
def print_conversation(df, index=0):
    """Display the ith conversation in a readable format."""
    # Get the row identified by 'index'.
    context_entry = df['Context'].values[index]
    target = df['Utterance'].values[index]
    # Split returns a blank last entry, so don't store it.
    turns = context_entry.split('__eot__')[:-1]
    print('--------------------- CONTEXT ------------------- ')
    for idx_turn, turn in enumerate(turns):
        print("\nUser {}: ".format(idx_turn % 2))
        print_single_turn(turn)
    print('\n--------------------- RESPONSE ------------------- ')
    print("\nUser {}: ".format(len(turns) % 2))
    print_single_turn(target)
def get_user_arrays(df):
    """Split every conversation into alternating turns for the two speakers.

    Returns two lists of equal length: userOne[i] and userTwo[i] form the i-th
    turn pair, collected across all conversations in df (so the returned lists
    are generally longer than the number of rows in df). Each entry is a plain
    string with the __eou__ markers removed.
    """
    userOne = []
    userTwo = []
    contexts = df['Context'].values
    targets = df['Utterance'].values
    assert(len(contexts) == len(targets))
    for i in range(len(contexts)):
        # A single conversation entry: multiple turns, each with multiple utterances.
        list_of_turns = contexts[i].lower().split('__eot__')[:-1] + [targets[i].lower()]
        # Make sure we have an even number of turns so they pair up cleanly.
        if len(list_of_turns) % 2 != 0:
            list_of_turns = list_of_turns[:-1]
        # Strip out the __eou__ occurrences (split on the leading space too,
        # otherwise removing the marker would leave two spaces behind).
        new_list_of_turns = []
        for turn in list_of_turns:
            utter_list = turn.lower().split(" __eou__")
            #if len(utter_list) > 3:
            #    utter_list = utter_list[:3]
            new_list_of_turns.append("".join(utter_list))
        #list_of_turns = [re.sub(' __eou__', '', t) for t in list_of_turns]
        # Even-indexed turns belong to user one, odd-indexed turns to user two.
        userOneThisConvo = new_list_of_turns[0::2]
        userTwoThisConvo = new_list_of_turns[1::2]
        userOne += userOneThisConvo
        userTwo += userTwoThisConvo
    assert(len(userOne) == len(userTwo))
    return userOne, userTwo
def save_to_file(fname, arr):
    with open(DATA_DIR + fname, "w") as f:
        for line in arr:
            f.write(line + "\n")
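The vocabulary itself isn't built in this cell; a rough sketch of what that step could look like (using nltk.word_tokenize over the per-user turn lists returned by get_user_arrays) is below. The function name and the 40k cutoff are illustrative, not part of the original pipeline.
In [ ]:
from collections import Counter

def build_vocabulary(turn_lists, max_vocab_size=40000):
    """Count word frequencies over all turns and keep the most common words."""
    counts = Counter()
    for turns in turn_lists:
        for turn in turns:
            counts.update(nltk.word_tokenize(turn))
    # Most frequent words first, truncated to the desired vocabulary size.
    return [w for w, _ in counts.most_common(max_vocab_size)]

# Example usage (assumes userOne/userTwo from get_user_arrays):
# vocab = build_vocabulary([userOne, userTwo])
# print(len(vocab), vocab[:10])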
In [4]:
df_train.describe()
Out[4]:
In [5]:
pd.options.display.max_colwidth = 500
df_train.head(2)
Out[5]:
In [6]:
print_conversation(df_train, 3)
In [10]:
#df_merged = pd.DataFrame(df_train['Context'].map(str) + df_train['Utterance'])
userOne, userTwo = get_user_arrays(df_train)
df_turns = pd.DataFrame({'UserOne': userOne, 'UserTwo': userTwo})
df_turns.head(200)
Out[10]:
In [5]:
userOne[0]
Out[5]:
In [6]:
def get_sentences(userOne, userTwo):
    """Pair up sentences from the two users into encoder/decoder lists."""
    encoder = []
    decoder = []
    assert(len(userOne) == len(userTwo))
    for i in range(len(userOne)):
        # Sentence-tokenize each user's turn and drop bare '.' sentences.
        one = nltk.sent_tokenize(userOne[i])
        one = [s for s in one if s != '.']
        two = nltk.sent_tokenize(userTwo[i])
        two = [s for s in two if s != '.']
        combine = one + two
        assert(len(combine) == len(one) + len(two))
        # Keep an even number of sentences so encoder/decoder halves line up.
        if len(combine) % 2 != 0:
            combine = combine[:-1]
        enc = combine[0::2]
        dec = combine[1::2]
        assert(len(enc) == len(dec))
        encoder.append(enc)
        decoder.append(dec)
    return encoder, decoder
encoder, decoder = get_sentences(userOne, userTwo)
print('done')
In [ ]:
# Keep only the first sentence of each encoder/decoder list and word-tokenize it
# (this assumes every conversation produced at least one sentence pair).
encoder = [nltk.word_tokenize(s[0]) for s in encoder]
decoder = [nltk.word_tokenize(s[0]) for s in decoder]
In [ ]:
max_enc_len = max([len(s) for s in encoder])
max_dec_len = max([len(s) for s in decoder])
print(max_enc_len)
print(max_dec_len)
In [ ]:
encoder_lengths = [len(s) for s in encoder]
decoder_lengths = [len(s) for s in decoder]
df_lengths = pd.DataFrame({'EncoderSentLength': encoder_lengths, 'DecoderSentLength': decoder_lengths})
df_lengths.describe()
In [ ]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 9, 5
fig, axes = plt.subplots(nrows=1, ncols=2)
plt.subplot(1, 2, 1)
plt.hist(encoder_lengths)
plt.subplot(1, 2, 2)
plt.hist(decoder_lengths, color='b')
plt.tight_layout()
plt.show()
In [11]:
save_to_file("train_from.txt", userOne)
save_to_file("train_to.txt", userTwo)
In [12]:
print("df_valid has", len(df_valid), "rows.")
df_valid.head()
Out[12]:
In [13]:
userOne, userTwo = get_user_arrays(df_valid)
save_to_file("valid_from.txt", userOne)
save_to_file("valid_to.txt", userTwo)
In [14]:
print('done')
In [18]:
import matplotlib.pyplot as plt
%matplotlib inline
userOne, userTwo = get_user_arrays(df_train)
# Regular expressions used to tokenize.
_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")
_DIGIT_RE = re.compile(br"\d")
lengths = np.array([len(t.strip().split()) for t in userOne])
max_ind = lengths.argmax()
print(max(lengths), "at", max_ind)
print("Sentence:\n", userOne[max_ind])
In [19]:
import matplotlib.pyplot as plt
# Drop the 20 longest turns so the histogram isn't dominated by outliers.
plt.hist(sorted(lengths)[:-20])
Out[19]:
In [20]:
n_under_100 = sum([1 if l < 100 else 0 for l in lengths])
print(n_under_100, "out of", len(lengths), "({:.1%})".format(float(n_under_100) / len(lengths)))
In [21]:
df_lengths = pd.DataFrame(lengths)
In [22]:
df_lengths.describe()
Out[22]:
In [3]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# Vocabulary size used by the model.
vocab_size = 40000
# Cross-entropy loss when guessing uniformly at random over the vocabulary.
loss_random_guess = np.log(float(vocab_size))
print("Loss for uniformly random guessing is", loss_random_guess)
sent_length = [5, 10, 25]
# Model outputs the correct target word p percent of the time
# (start at 1 to avoid dividing by zero).
pred_accuracy = np.arange(1, 100)
plt.plot(pred_accuracy, [1. / p for p in pred_accuracy])
Out[3]:
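For reference, the per-token cross-entropy when the model assigns probability p to the correct word is -log(p). A quick sketch of that curve against the random-guess baseline above (same vocab_size of 40k):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Cross-entropy loss as a function of the probability assigned to the correct word,
# with the uniform-random baseline log(vocab_size) drawn for comparison.
p = np.linspace(0.01, 1.0, 100)
plt.plot(p, -np.log(p), label='-log(p)')
plt.axhline(np.log(40000.), linestyle='--', label='random guess (vocab = 40k)')
plt.xlabel('probability assigned to correct word')
plt.ylabel('per-token loss')
plt.legend()
plt.show()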
In [27]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 8
def _sample(logits, t):
    """Softmax with temperature t: lower t sharpens the distribution."""
    res = logits / t
    res = np.exp(res) / np.sum(np.exp(res))
    return res
N = 100
x = np.arange(N)
before = np.array([1.0+i**2 for i in range(N)])
before /= before.sum()
plt.plot(x, before, 'b--', label='before')
after = _sample(before, 0.1)
plt.plot(x, after, 'g--', label='temp=0.1')
after = _sample(before, 0.2)
print(after.argmax())
plt.plot(x, after, 'r--', label='temp=0.2')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[27]:
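Note that _sample above is handed a probability distribution directly; temperature sampling is more commonly applied to logits (log-probabilities). A small self-contained sketch of that variant, using the same quadratic toy distribution:
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def sample_with_temperature(probs, temperature):
    """Apply temperature to log-probabilities, then renormalize with a softmax."""
    logits = np.log(probs)
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    out = np.exp(scaled)
    return out / out.sum()

N = 100
x = np.arange(N)
before = np.array([1.0 + i**2 for i in range(N)])
before /= before.sum()

# temperature = 1.0 recovers the original distribution; lower values sharpen it.
for t in [1.0, 0.5, 0.1]:
    plt.plot(x, sample_with_temperature(before, t), label='temp={}'.format(t))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)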
In [5]:
np.info(plt.plot)
In [ ]: