Data Exploration



In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

In [2]:
# Load Data
train_df = pd.read_csv("../data/train.csv")
train_df.Label = train_df.Label.astype('category')

test_df = pd.read_csv("../data/test.csv")
validation_df = pd.read_csv("../data/valid.csv")

In [3]:
train_df.describe()


Out[3]:
        Context                                 Utterance      Label
count   1000000                                 1000000        1000000
unique  957097                                  736145         2
top     ! op __eou__ __eot__ ? __eou__ __eot__  thank __eou__  0
freq    15                                      12426          500127
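
The describe() output above reports label 0 as the most frequent value, with 500127 of 1000000 rows, so the two classes look close to balanced. A minimal sketch of how the same proportions could be read off directly. `toy_df` is a hypothetical stand-in for `train_df`, used only so the cell runs without the CSV files:

```python
import pandas as pd

# Hypothetical stand-in for train_df; only the Label column matters here.
toy_df = pd.DataFrame({"Label": [0, 1, 0, 1, 0]})
toy_df["Label"] = toy_df["Label"].astype("category")

# normalize=True returns the fraction of rows per label instead of raw counts.
label_share = toy_df["Label"].value_counts(normalize=True)
print(label_share)
```

On the real data, swapping in `train_df` should give roughly 0.50 for each label, in line with the counts above.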

In [4]:
train_df.Label.hist()
plt.title("Training Label Distribution")


Out[4]:
[Figure: histogram of train_df.Label, titled "Training Label Distribution"]

In [4]:
pd.options.display.max_colwidth = 500
train_df.head()


Out[4]:
Context Utterance Label
0 i think we could import the old comment via rsync , but from there we need to go via email . i think it be easier than cach the status on each bug and than import bite here and there __eou__ __eot__ it would be veri easi to keep a hash db of message-id __eou__ sound good __eou__ __eot__ ok __eou__ perhap we can ship an ad-hoc apt_preferec __eou__ __eot__ version ? __eou__ __eot__ thank __eou__ __eot__ not yet __eou__ it be cover by your insur ? __eou__ __eot__ yes __eou__ but it 's realli no... basic each xfree86 upload will not forc user to upgrad 100mb of font for noth __eou__ no someth i do in my spare time . __eou__ 1
1 i 'm not suggest all - onli the one you modifi . __eou__ __eot__ ok , it sound like you re agre with me , then __eou__ though rather than `` the one we modifi '' , my idea be `` the one we need to merg '' __eou__ __eot__ sorri __eou__ i think it be ubuntu relat . __eou__ 0
2 afternoon all __eou__ not entir relat to warti , but if grub-instal take 5 minut to instal , be this a sign that i should just retri the instal : ) __eou__ __eot__ here __eou__ __eot__ you might want to know that thinic in warti be buggi compar to that in sid __eou__ __eot__ and appar gnome be suddent almost perfect ( out of the thinic problem ) , nobodi report bug : -p __eou__ i do n't get your question , where do you want to past ? __eou__ __eot__ can i file the panel not link to ed ? : ) ... yep . __eou__ oh , okay . i wonder what happen to you __eou__ what distro do you need ? __eou__ yes __eou__ 0
3 interest __eou__ grub-instal work with / be ext3 , fail when it be xfs __eou__ i think d-i instal the relev kernel for your machin . i have a p4 and it instal the 386 kernel __eou__ holi crap a lot of stuff get instal by default : ) __eou__ you be instal vim on a box of mine __eou__ ; ) __eou__ __eot__ more like osx than debian ; ) __eou__ we have a select of python modul avail for great justic ( and python develop ) __eou__ __eot__ 2.8 be fix them iirc __eou__ __eot__ pong __eou__ vino will... that the one __eou__ 1
4 and becaus python give mark a woodi __eou__ __eot__ i 'm not sure if we re mean to talk about that public yet . __eou__ __eot__ and i think we be a `` pant off '' kind of compani ... : p __eou__ you need new glass __eou__ __eot__ mono 1.0 ? dude , that 's go to be a barrel of laugh for total non-releas relat reason dure hoari __eou__ read bryan clark 's entri about networkmanag ? __eou__ __eot__ there be an accompani irc convers to that one < g > __eou__ explain ? __eou__ i guess you could s... ( i think someon be go to make a joke about .au bandwidth ... ) __eou__ especi not if you re use screen ; ) __eou__ 1
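
The rows above are flat strings in which `__eou__` and `__eot__` appear between pieces of the dialogue. Assuming, as the names suggest, that `__eou__` closes an utterance and `__eot__` closes a speaker turn, a context can be broken back into its turns roughly like this. `split_turns` is a hypothetical helper and is not part of the original notebook:

```python
def split_turns(context):
    """Split a context string into turns, each a list of utterances.

    Assumes __eot__ ends a turn and __eou__ ends an utterance.
    """
    turns = []
    for raw_turn in context.split("__eot__"):
        # Drop the empty pieces left behind by trailing markers.
        utterances = [u.strip() for u in raw_turn.split("__eou__") if u.strip()]
        if utterances:
            turns.append(utterances)
    return turns


# Shortened example in the same shape as the rows shown above.
example = "afternoon all __eou__ not entir relat to warti __eou__ __eot__ here __eou__ __eot__"
for i, turn in enumerate(split_turns(example)):
    print(i, turn)
```

Applied to the whole column, for instance with `train_df.Context.apply(split_turns).apply(len)`, this would give the number of turns per context.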

In [5]:
validation_df.head()


Out[5]:
Context Ground Truth Utterance Distractor_0 Distractor_1 Distractor_2 Distractor_3 Distractor_4 Distractor_5 Distractor_6 Distractor_7 Distractor_8
0 ani idea on how lts will be releas ? __eou__ __eot__ alreadi be __eou__ __eot__ we be talk 12.04 not 10.04 __eou__ you rememb my flash issu from yesterday or the day befor ? __eou__ oh , no idea other be probabl ok __eou__ update-manag or even apt __eou__ it will be sort to the right packag by a bug triager , dont worri about it to much __eou__ sinc uniti be a compiz plugin i would say everyon who use uniti doe : ) __eou__ as i say abov , uniti be a compiz plugin ... so it would be heavili notic if compiz wouldnt work __eou__ no , greenit be say his download speed be slow , when connect to a machin on his same lan . i 'm unsur whi you think set it up to go to the internet would be a ) easier , and b ) make it ani faster __eou__ well that be probabl the issu then . he need to be at 1280x1024 instead of 1024x768 __eou__ lsb_releas -sc __eou__ well ... regardless . i believ the solut be go to be to live boot to cd , chroot into the machin and set a password . __eou__ boot to live cd . open a termin and type sudo -s. mount /dev/sdxi /mnt where x be the disk and y be the partit . typic /dev/sda1 . then chroot /mnt . to creat a user : useradd -g admin mynewusernam use an actual user name . __eou__ if we be do more than this we would want to mount more thing ... but for this we shouldnt need to __eou__ then , be... you can buy _anything_ in china __eou__ no __eou__ sudo restart lightdm __eou__ you be still ask for the uniti logout menu right ? __eou__ so i be work as a linux admin intern , and my boss tell me to use `` sudo su - '' __eou__ all rhel or cento box ; be there a reason for that ? __eou__ be it a tradit thing ? do sudo -i not exist at some point __eou__
1 how much hdd use ubuntu default instal ? __eou__ __eot__ https : //help.ubuntu.com/community/installation/systemrequir __eou__ it wont requir 15gb to be honest ... __eou__ __eot__ that whi i ask how much be default instal ? : ) __eou__ all of this possibl in older version of ubuntu __eou__ *was possibl __eou__ : be that a question ? __eou__ yes __eou__ thank __eou__ i would imagin so , the site bonni link earlier explicit state turn hamachi off __eou__ yes i ve investig that alreadi . it seem you ca n't treat both super key differ . __eou__ not realli . i use urxvt myself . __eou__ thank a lot , realli ! __eou__ as someon els suggest , close update-manag , and open from termin update-manag -d but ... it might be wise to wait for a point releas . 1204 be al veri differ than 1004 . you might grab a dvd and load it in vm and see if you be comfort with the chang . __eou__ you re welcom .. sinc 12.04 throw dnsmasq into the mix by default ( see http : //www.stgraber.org/2012/02/24/dns-in-ubuntu-12-04/ ) complex rise a bite .. mayb that page have some info on squid __eou__
2 in my countri it near the 27th __eou__ when will 12.10 be out ? __eou__ __eot__ plan oct 18th accord to this . https : //wiki.ubuntu.com/quantalquetzal/releaseschedul ? action=show & redirect=qreleaseschedul __eou__ __eot__ thanx __eou__ i have no .docx file , so do n't know , whi not tri it yourself __eou__ i ve boot countless distro from usb on my aao __eou__ year of experience.. : ) __eou__ i know , and my experi tell me your result will be negat __eou__ but i 'm sure i can work it out __eou__ the way you put it , that sound like a sever case of pebkac ; ) __eou__ im not familiar with hotspot __eou__ it work fine without set up an ssh tunnel manual . fun thing be , that if one machin run 11.10 and the other 12.04 , the request from 12.04 to 11.10 work , and a connect can be make . if both be 12.04 it doe n't work . __eou__ i ve be think the same thing . i guess i ll have to file one : ( __eou__ just want to check , not that i 'm be stupid . __eou__ do n't test that sinc the other comput be 12.04 ( a relat of mine ) and i run differ ubuntu version on vbox . __eou__ oh , 64mb ? well then .... so it have two be a two-command process ? __eou__ and becaus you onli have 3 gb of ram , be not justifi to run a 64 bite system there as 32 bite would run faster . __eou__ it ok but no error ? then how do you know it a problem with dhcp client ? __eou__
3 it 's not out __eou__ __eot__ they probabali be wait for all the mirror to sync . the releas annoc will be after that . __eou__ __eot__ wait for mani thing to be setup __eou__ final warn - you do n't know when it will be releas , so do n't suggest it will be ani moment __eou__ that 's right , while chat i regrett make a lot of typo 's . __eou__ afaik it 's best to start at 2mb = 2048k __eou__ for the most part , you should be instal python modul through packag avail in our repositori . but pip or easy_instal or manual via distutil would be the next cours of action . __eou__ do you overwrit your win instal or can you brows that drive from ubuntu ? __eou__ your mbr be fine if you be boot ubuntu , you like just need to ask grub to let you choose which os you want befor auto boot ubuntu __eou__ odd , you could manuali add it if need __eou__ for some reason the headphon option doe not chang __eou__ well then i do n't know . can anyth boot on the comput ? __eou__ well then i do n't know . can anyth boot on the comput ? __eou__ ya , but i guess you could do a git of your entir os , and that would be the same xd __eou__ noexec be a mount option . you would have to creat a partit and mount it __eou__
4 be the ext4 driver stabl ? __eou__ __eot__ i be not sure but the last time i check , it be n't __eou__ there have be numer report of data loss or corrupt __eou__ __eot__ you sound like it 's updat to skynet . ; ) __eou__ ok i will tri that , brb __eou__ it complain about export not be an identifi ... never hear of the command myself __eou__ and there be no man entri for export __eou__ ouch __eou__ i do system annalysi and it say everyth pass 100 % __eou__ not to mention way less complex ... you can have a setup in under 10 line __eou__ well , you can , accord to that articl , i also notic the watermark vuner . __eou__ if not , i think you can pretti much grab ani usb analog video convert that compli to the devic class and use that __eou__ not sure which softwar , though __eou__ gpart ? i do n't want do edit partit , just mount at startup __eou__ i ve tri it . not a fan at all __eou__ i have no desir to learn a new way of use the app just becaus peopl be duplic the limit of less power window manag __eou__ ah , okay __eou__

In [13]:
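# Token counts per context. Splitting on single spaces means the
# __eou__ / __eot__ markers are counted as tokens as well.
plt.figure(1)
train_df_context_len = train_df.Context.str.split(" ").apply(len)
train_df_context_len.hist(bins=40)
plt.title("Training Context Length Statistics")
print(train_df_context_len.describe())

# Token counts per candidate utterance, computed the same way.
plt.figure(2)
train_df_utterance_len = train_df.Utterance.str.split(" ").apply(len)
train_df_utterance_len.hist(bins=40)
plt.title("Training Utterance Length Statistics")
print(train_df_utterance_len.describe())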


count    1000000.000000
mean          86.339195
std           74.929713
min            5.000000
25%           37.000000
50%           63.000000
75%          108.000000
max         1879.000000
Name: Context, dtype: float64
count    1000000.000000
mean          17.246392
std           16.422901
min            1.000000
25%            7.000000
50%           13.000000
75%           22.000000
max          653.000000
Name: Utterance, dtype: float64
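
Both distributions have a long right tail: the median context is 63 tokens and the 75th percentile is 108, but the maximum is 1879. If sequences were later truncated to a fixed length, it would help to know what share of rows fits under a candidate cap. A minimal sketch, where `coverage` and `toy_lengths` are hypothetical names and the toy values only stand in for `train_df_context_len`:

```python
import pandas as pd


def coverage(lengths, max_len):
    """Fraction of rows whose token count is at most max_len."""
    return float((lengths <= max_len).mean())


# Hypothetical stand-in for train_df_context_len.
toy_lengths = pd.Series([5, 37, 63, 108, 1879])
for cap in (40, 160):
    print(cap, coverage(toy_lengths, cap))
```

In the notebook, `coverage(train_df_context_len, 160)` would report how many training contexts survive a 160-token cap untruncated. The cap of 160 is only an illustrative value.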

In [14]:
pd.options.display.max_colwidth = 500
test_df.head()


Out[14]:
Context Ground Truth Utterance Distractor_0 Distractor_1 Distractor_2 Distractor_3 Distractor_4 Distractor_5 Distractor_6 Distractor_7 Distractor_8
0 anyon know whi my stock oneir export env var usernam ' ? i mean what be that use for ? i know of $ user but not $ usernam . my precis instal doe n't export usernam __eou__ __eot__ look like it use to be export by lightdm , but the line have the comment `` // fixm : be this requir ? '' so i guess it be n't surpris it be go __eou__ __eot__ thank ! how the heck do you figur that out ? __eou__ __eot__ https : //bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__ nice thank ! __eou__ wrong channel for it , but check efnet.org , unoffici page . __eou__ everi time the kernel chang , you will lose video __eou__ yep __eou__ ok __eou__ ! nomodeset > acer __eou__ i 'm assum it be a driver issu . __eou__ ! pm > acer __eou__ i do n't pm . ; ) __eou__ oop sorri for the cap __eou__ http : //www.ubuntu.com/project/about-ubuntu/deriv ( some call them deriv , other call them flavor , same differ ) __eou__ thx __eou__ unfortun the program be n't instal from the repositori __eou__ how can i check ? by do a recoveri for test ? __eou__ my humbl apolog __eou__ # ubuntu-offtop __eou__
1 i set up my hd such that i have to type a passphras to access it at boot . how can i remov that passwrd , and just boot up normal . i do this at instal , it work fine , just tire of have reboot where i need to be at termin to type passwd in . help ? __eou__ __eot__ backup your data , and re-instal without encrypt `` might '' be the easiest method __eou__ __eot__ so you dont know , ok , anyon els ? __eou__ you be like , yah my mous doesnt work , reinstal your os lolol what a joke __eou__ nmap be nice , but it be n't what i be look for . i final find it again : mtr ( my tracerout ) be what i be look for . i ll be keep nmap handi though . __eou__ ok __eou__ cdrom work fine on window . __eou__ i dont think it have anyth to do with the bure process , cds work fine on my desktop and my other ubuntu lap __eou__ ah yes , i have read return as rerun __eou__ hm ? __eou__ not the case , lts be everi other .04 releas . the .04 be n't alway more stabl __eou__ i would reinstal with precis __eou__ you can restor user data and such from backup __eou__ pretti much __eou__ i use the one i download from amd __eou__ ffmpeg be part of the packag , quixotedon , at least i 'm quit sure it still be __eou__ if not just instal ffmpeg __eou__
2 im tri to use ubuntu on my macbook pro retina __eou__ i read in the forum that ubuntu have a appl version now ? __eou__ __eot__ not that ive ever hear of.. normal ubutnu should work on an intel base mac . there be the ppc version also . __eou__ you want total control ? or what be you want exact ? __eou__ __eot__ just wonder how it run __eou__ yes , that 's what i do , export it to a `` id_dsa '' file , then back to ubuntu copi it into ~/.ssh/ __eou__ noth - i be talk about the question of myhero __eou__ that should fix the font be too larg __eou__ okay , so hcitool echo back hci0 < mac address of control > but the bluetooth devic panel keep disconnect and reconnect the devic ( or so it seem ) ani idea whi that would be ? __eou__ i get to the menu with option such as tri ubuntu ' , instal ubuntu ' , check disc ' __eou__ whi do u need analyz __eou__ it be a toy __eou__ ok msp301 __eou__ but y , i mean it be the same ubunut , onli with older program __eou__ ubuntu 804 or 1204 __eou__ no i dont use 804 __eou__ i be ask hypo qs __eou__ cntrl-c may stop the command but it doe n't fix my hdd problem . __eou__ if you re onli go to run ubuntu , just get a normal pc rather than a mac __eou__ that say , i 'm run it on a macbook , becaus i get one relat cheapli __eou__ the one which be not pick up at the moment be on stderr and not stdout and > be onli cover stdout __eou__
3 no suggest ? __eou__ link ? __eou__ how can i remov luk passphras at boot . i dont want to use featur anymor ... __eou__ __eot__ you may need to creat a new volum __eou__ __eot__ that lead me to the next question lol ... i dont know how to creat new volum exact in cmdline , usual i use a gui . im just tri to access this server via usb load with next os im go to load , the luk pw be stop me __eou__ __eot__ for someth like that i would like use someth like a live gpart disk to avoid the confli... you cant load anyth via usb or cd when luk be run __eou__ it wont allow usb boot , i tri with 2 diff usb drive __eou__ -p sorri ... __eou__ nmap -p22 __eou__ it doe n't say : 22/tcp open ssh ? __eou__ i guess so i ca n't even launch it . __eou__ note __eou__ rxvt-unicod be one __eou__ i tar all of ~ __eou__ i tar all of ~ __eou__ i do n't realli know if i can help , but i be curious . lol __eou__ that 's cool . i ll look into it . now , we better stop talk about this sinc it 's offtop . : p __eou__ that work just fine , thank ! __eou__ thank you __eou__
4 i just ad a second usb printer but not sure what the uri should read - can anyon help with usb printer ? __eou__ __eot__ firefox localhost:631 __eou__ __eot__ firefox ? __eou__ __eot__ yes __eou__ firefox localhost:631 __eou__ firefox http : //localhost:631 __eou__ cup have a web base interfac __eou__ __eot__ i be set it up under the printer configur __eou__ thank ! __eou__ i 'd say the most common venu would be via launchpad . check out the factoid ! bug as well __eou__ the old hardi man page , http : //manpages.ubuntu.com/manpages/hardy/man1/gcalctool.1.html say `` delet '' clear the screen , but it doe n't __eou__ becaus lts be good __eou__ i ll give a tri __eou__ by the way , the url you post for davf be from dapper ... that 's 5.xx iirc __eou__ http : //ubuntuforums.org/showthread.php ? t=1549847 __eou__ so i load up putti gui , then what do i do ? __eou__ you should read error messag , it say be you root ? ' __eou__ wait the colleg semest to close just to make sure i will not need to reconfigur my environ again __eou__ i be call myself a jerk . all i know be that you download a game success . __eou__

In [15]:
test_df.describe()


Out[15]:
Context Ground Truth Utterance Distractor_0 Distractor_1 Distractor_2 Distractor_3 Distractor_4 Distractor_5 Distractor_6 Distractor_7 Distractor_8
count 18920 18920 18920 18920 18920 18920 18920 18920 18920 18920 18920
unique 18920 17914 13982 13902 14077 14041 14101 14072 13969 13975 14123
top hi , when be the new gstreamersdk will be upload to ubuntu repositori ? __eou__ __eot__ ubuntu version most doe not allow the `` new '' softwar in various releas , version be `` freeze '' apart from select applic and import secur fix __eou__ __eot__ from what i understand , the gstreamersdk be go to be the onli possibl way to develop an applic use gstreamer lib . __eou__ __eot__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__ thank __eou__
freq 1 186 176 186 194 195 167 197 190 188 201
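
The same short reply, "thank __eou__", is the most frequent value in the ground-truth column and in every distractor column. A distractor can therefore coincide with the true response in some rows, which would make those rows ambiguous for ranking. A rough check, assuming the column names shown above. `count_ambiguous_rows` and `toy_test_df` are hypothetical and only illustrate the idea:

```python
import pandas as pd


def count_ambiguous_rows(df, truth_col="Ground Truth Utterance"):
    """Count rows where at least one distractor equals the ground truth."""
    distractor_cols = [c for c in df.columns if c.startswith("Distractor_")]
    # Compare every distractor column with the ground truth, row by row.
    clashes = df[distractor_cols].eq(df[truth_col], axis=0).any(axis=1)
    return int(clashes.sum())


# Hypothetical miniature of test_df with two distractor columns.
toy_test_df = pd.DataFrame({
    "Context": ["a __eou__ __eot__", "b __eou__ __eot__", "c __eou__ __eot__"],
    "Ground Truth Utterance": ["thank __eou__", "yes __eou__", "ok __eou__"],
    "Distractor_0": ["no __eou__", "yes __eou__", "thank __eou__"],
    "Distractor_1": ["thank __eou__", "hm ? __eou__", "yep __eou__"],
})
print(count_ambiguous_rows(toy_test_df))
```

Running `count_ambiguous_rows(test_df)` and `count_ambiguous_rows(validation_df)` would show whether this actually occurs in the real splits.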