In [1]:
import pandas as pd
import numpy as np

In [2]:
regular_tweets = pd.read_csv('processed_tweets/regular_df_eng.csv')
sarcastic_tweets = pd.read_csv('processed_tweets/sarcastic_df_eng.csv')

In [3]:
# 'Unnamed: 0' is most likely the row index saved with the CSV; it is not a feature
regular_tweets = regular_tweets.drop(columns=['Unnamed: 0'])
sarcastic_tweets = sarcastic_tweets.drop(columns=['Unnamed: 0'])

More Feature Engineering

AllCapsCount: the number of words in a tweet written entirely in capital letters. Single-letter words such as "I" also count.


In [4]:
# ALL CAPS COUNT: number of space-separated tokens per tweet for which str.isupper() is True
sarcastic_tweets['AllCapsCount'] = [sum(word.isupper() for word in tweet.split(" ")) for tweet in sarcastic_tweets['0']]
regular_tweets['AllCapsCount'] = [sum(word.isupper() for word in tweet.split(" ")) for tweet in regular_tweets['0']]
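
As a quick check of what this feature measures, here is a minimal sketch on a few toy strings. The helper name `all_caps_count` and the example strings are made up for illustration. It uses `split()` rather than `split(" ")`, so tokens separated by tabs or newlines are also counted individually, which can differ slightly from the cell above.

```python
def all_caps_count(text):
    """Count whitespace-separated tokens whose cased characters are all uppercase."""
    return sum(token.isupper() for token in text.split())

# Hypothetical examples, not rows from the dataset
examples = [
    "wow today is going just absolutely SPECTACULAR",
    "I LOVE Mondays",
    "no caps here 123",
]
counts = [all_caps_count(t) for t in examples]
print(counts)
```

The second example shows that the pronoun "I" is counted as an all-caps word, which may explain many of the rows with a count of 1.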

In [5]:
sarcastic_tweets


Out[5]:
0 type English ToUser Hashtags AllCapsCount
0 Thanks sarcastic 1 0 0 0
1 Top tip. To illicit a \"thank you\" from some... sarcastic 1 0 1 0
2 Thanks to whoever just threw the bag of waterm... sarcastic 1 0 1 0
3 yes let's #EndFathersDay because the mother i... sarcastic 1 0 1 0
4 Well it's just gonna turn into a lovely day sarcastic 1 0 0 0
5 Nothing to see here, move along Lerner's L... sarcastic 1 0 0 1
6 So who does Campbell play for? sarcastic 1 0 0 0
7 @JamesBraginton @STEM08 @thinkprogress \nJames... sarcastic 1 1 0 1
8 Does this make me fancy? #imsofancyyoualreadyk... sarcastic 1 0 1 0
9 I love that Arequipa just shuts the water off ... sarcastic 1 0 0 1
10 Everyone's at Notre Dame and I'm just sitting ... sarcastic 1 0 1 0
11 Tweet of the day!!!! \ud83d\ude1c Holy shit... sarcastic 1 0 0 0
12 @LUTZLOVER43 sarcastic 1 1 0 1
13 #orgasm sarcastic 1 0 1 0
14 I hate to see Luis Suarez get injured, returni... sarcastic 1 0 0 1
15 @jaycutlersux right bc Bush\/Cheney were total... sarcastic 1 1 0 0
16 @TPoloking don't wrry sarcastic 1 1 0 0
17 wow today is going just absolutely SPECTACULAR sarcastic 1 0 0 1
18 Looking forward to playing Costa Rica what wit... sarcastic 1 0 0 0
19 After the last friendly, I can only be happy a... sarcastic 1 0 0 1
20 Tithes paid. Bills paid. Now to go clock anoth... sarcastic 1 0 0 1
21 Clive thought that was in, great commentary ... sarcastic 1 0 1 0
22 @New0rleans_Lady @Aaron_RS Lol Thats they fair... sarcastic 1 1 0 0
23 \"Nothing says 'come to me baby' like a sexy p... sarcastic 1 0 0 0
24 @ScottCubs36 why is white so positive, you rac... sarcastic 1 1 1 0
25 I just love that when a celebrity or a youtube... sarcastic 1 0 1 1
26 Loving the football tonight \ud83d\udc9c\ud83d... sarcastic 1 0 1 0
27 I just pulled off 3 ticks from my hip yay! #... sarcastic 1 0 1 1
28 @francescaacox apparently people have been men... sarcastic 1 1 0 0
29 The only downside is that it wouldn't destroy ... sarcastic 1 0 0 0
... ... ... ... ... ... ...
130021 Like seriously! :D sarcastic 1 0 0 1
130022 works sometimes. sarcastic 1 0 0 0
130023 In other news, all MQM bhayya's have decided t... sarcastic 1 0 1 2
130024 @WesleyLowery Yeah, just a notch above Andrea ... sarcastic 1 1 0 0
130025 In other news, all MQM bhayya's have decided t... sarcastic 1 0 1 2
130026 REGAIN YOUR POISE HYSTERICAL WOMAN-Careful, bo... sarcastic 1 0 1 4
130027 Ya that was me! sarcastic 1 0 0 0
130028 She always cooks me the best gourmet meals...t... sarcastic 1 0 1 1
130029 Did I miss the week long build up of @KingJame... sarcastic 1 1 1 1
130030 Bite an opposing player in the World Cup, get ... sarcastic 1 1 1 0
130031 Who is this Lebron James that everybody is spe... sarcastic 1 0 0 1
130032 I love how certain circumstances can be both c... sarcastic 1 0 0 1
130033 It's real mature to talk smack behind someone'... sarcastic 1 0 1 1
130034 @Tolstoved @IDFSpokesperson I changed my mind,... sarcastic 1 1 0 1
130035 @TU_Keem sarcastic 1 1 0 0
130036 So wait. Is LeBron James going back to Clevela... sarcastic 1 0 1 1
130037 $BBRY Better listen to TMF BlackBerry is dead ... sarcastic 1 0 0 2
130038 @KindredShins REGAIN YOUR POISE HYSTERICAL WOM... sarcastic 1 1 0 4
130039 I'm really stoked that @espn came back with th... sarcastic 1 1 0 0
130040 In case you didn't know, LeBron is signing wit... sarcastic 1 0 0 0
130041 Hey #Cleveland looks like you are going to get... sarcastic 1 0 1 0
130042 @Wac4 GREAT! Jeremy Lin might becoming to the ... sarcastic 1 1 0 1
130043 #bitchvice studios is moving! So. Much. Fun. ... sarcastic 1 0 1 0
130044 Microsoft updates are so important that I don'... sarcastic 1 0 0 1
130045 @erikld Let me try to reproduce sarcastic 1 1 0 0
130046 And the pursuit of kevin love has picked up. G... sarcastic 1 0 0 1
130047 Forgot how delightful an experience travelling... sarcastic 1 0 1 0
130048 Not sure if everyone has heard, but Labron Jam... sarcastic 1 0 1 0
130049 That heat fan wasn't mad at all sarcastic 1 0 0 0
130050 I always roll my eyes when ppl say \"Your face... sarcastic 1 0 0 2

130051 rows × 6 columns

Unigram counts (top 5000 unigrams)


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
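# Fits on the raw CSV lines, so the index, type and flag columns are tokenized together with the tweet text
vector = CountVectorizer(max_features=5000).fit(open('processed_tweets/sarcastic_df_eng.csv'))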
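
An alternative, sketched here on toy data, is to fit the vectorizer on the tweet text only, for example on the `'0'` column of the DataFrame instead of the file handle. The documents below are hypothetical; `max_features` keeps the terms with the highest corpus frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for sarcastic_tweets['0']
docs = ["great great day", "great day", "oh great"]

# Keep only the 2 most frequent terms across the corpus
toy_vector = CountVectorizer(max_features=2).fit(docs)
kept_terms = sorted(toy_vector.vocabulary_)
print(kept_terms)
```

On the real data this would look like `CountVectorizer(max_features=5000).fit(sarcastic_tweets['0'])`, assuming that column holds only strings.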

In [9]:
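# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use vector.get_feature_names()
vector.get_feature_names_out().tolist()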


Out[9]:
['00',
 '000',
 '10',
 '100',
 '1000',
 '10k',
 '10pm',
 '11',
 '11pm',
 '12',
 '120',
 '13',
 '13th',
 '14',
 '140',
 '15',
 '16',
 '17',
 '18',
 '19',
 '1am',
 '1n1t',
 '1st',
 '20',
 '200',
 '2010',
 '2013',
 '2014',
 '2014th',
 '2015',
 '21',
 '21st',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '2am',
 '2nd',
 '30',
 '300',
 '30am',
 '30pm',
 '31',
 '32',
 '35',
 '38',
 '39',
 '3am',
 '3rd',
 '40',
 '400',
 '44',
 '45',
 '48',
 '4am',
 '4th',
 '4thofjuly',
 '50',
 '500',
 '5am',
 '5sos',
 '5th',
 '60',
 '6th',
 '70',
 '75',
 '76ers',
 '7th',
 '80',
 '81',
 '85',
 '8am',
 '90',
 '911',
 '95',
 '99',
 '9th',
 '__',
 '___',
 '____',
 'aa',
 'aaron',
 'ab',
 'abandoned',
 'abc',
 'ability',
 'able',
 'abortion',
 'about',
 'above',
 'absolute',
 'absolutely',
 'abt',
 'abuse',
 'ac',
 'accent',
 'accept',
 'acceptable',
 'access',
 'accident',
 'accidentally',
 'accomplished',
 'accomplishment',
 'according',
 'account',
 'accounting',
 'accounts',
 'accurate',
 'accused',
 'ace',
 'achievement',
 'across',
 'act',
 'acting',
 'action',
 'actions',
 'activity',
 'actor',
 'acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adams',
 'add',
 'added',
 'addicted',
 'addiction',
 'adding',
 'addition',
 'address',
 'adds',
 'adelaide',
 'administration',
 'admire',
 'admit',
 'adorable',
 'adrian',
 'ads',
 'adult',
 'advance',
 'advanced',
 'advantage',
 'adventure',
 'adventures',
 'advice',
 'afc',
 'afford',
 'afl',
 'afraid',
 'africa',
 'african',
 'after',
 'afternoon',
 'again',
 'against',
 'age',
 'agency',
 'agenda',
 'agent',
 'agents',
 'ages',
 'aggressive',
 'ago',
 'agree',
 'agreed',
 'ah',
 'aha',
 'ahead',
 'ahh',
 'ahhh',
 'aid',
 'ain',
 'aint',
 'air',
 'aircanada',
 'airlines',
 'airport',
 'aj',
 'ajkreisberg',
 'aka',
 'al',
 'alan',
 'alanashby',
 'alarm',
 'album',
 'alcohol',
 'alert',
 'alex',
 'alexis',
 'algeria',
 'alive',
 'all',
 'allegations',
 'allegiance',
 'allergic',
 'allergies',
 'alliance',
 'allow',
 'allowed',
 'allowing',
 'almost',
 'alone',
 'along',
 'alot',
 'already',
 'alright',
 'also',
 'although',
 'always',
 'am',
 'amaze',
 'amazes',
 'amazing',
 'amazon',
 'america',
 'american',
 'americanair',
 'americans',
 'amirite',
 'among',
 'amount',
 'amp',
 'amtrak',
 'amusement',
 'an',
 'analysis',
 'ananvii',
 'and',
 'andrew',
 'android',
 'andy',
 'angel',
 'angels',
 'anger',
 'angry',
 'animal',
 'animals',
 'ankle',
 'ann',
 'anncoulter',
 'anniversary',
 'announced',
 'announcement',
 'announcer',
 'announcers',
 'annoy',
 'annoyed',
 'annoying',
 'annual',
 'another',
 'answer',
 'answered',
 'answering',
 'answers',
 'ant',
 'anthem',
 'anthony',
 'anthonycumia',
 'anti',
 'antonio',
 'anxiety',
 'any',
 'anybody',
 'anymore',
 'anyone',
 'anything',
 'anytime',
 'anyway',
 'anyways',
 'anywhere',
 'ap',
 'apart',
 'apartment',
 'apologies',
 'apologise',
 'apologize',
 'apology',
 'app',
 'apparently',
 'appear',
 'appears',
 'applause',
 'apple',
 'apply',
 'appointment',
 'appointments',
 'appreciate',
 'appreciated',
 'april',
 'apt',
 'arab',
 'are',
 'area',
 'aren',
 'arena',
 'arent',
 'arg',
 'argentina',
 'argue',
 'arguing',
 'argument',
 'arguments',
 'argvsbih',
 'arizona',
 'arm',
 'armed',
 'arms',
 'army',
 'around',
 'arrested',
 'arrival',
 'arrive',
 'arrived',
 'arsenal',
 'art',
 'article',
 'articles',
 'artist',
 'artistic',
 'as',
 'asap',
 'asg',
 'ashamed',
 'aside',
 'ask',
 'asked',
 'asking',
 'asks',
 'asleep',
 'ass',
 'assad',
 'asshole',
 'assholes',
 'assume',
 'astros',
 'asylum',
 'at',
 'ate',
 'atheist',
 'athlete',
 'athletes',
 'atl',
 'atlanta',
 'atleast',
 'atm',
 'atmosphere',
 'att',
 'attack',
 'attacks',
 'attcustomercare',
 'attempt',
 'attend',
 'attendance',
 'attention',
 'attitude',
 'attractive',
 'audience',
 'august',
 'aunt',
 'auspol',
 'aussie',
 'austin',
 'australia',
 'authority',
 'autistic',
 'auto',
 'autocorrect',
 'automatically',
 'available',
 'average',
 'avi',
 'avoid',
 'aw',
 'awake',
 'award',
 'awards',
 'aware',
 'awareness',
 'away',
 'awe',
 'awesome',
 'awful',
 'awh',
 'awhile',
 'awkward',
 'aww',
 'awww',
 'aye',
 'b4',
 'babe',
 'babies',
 'baby',
 'babysit',
 'babysitting',
 'back',
 'backed',
 'background',
 'backs',
 'backwards',
 'bad',
 'badass',
 'badly',
 'bae',
 'bag',
 'bags',
 'baker',
 'balanced',
 'ball',
 'balls',
 'balotelli',
 'baltimore',
 'ban',
 'banana',
 'band',
 'bands',
 'bandwagon',
 'bang',
 'bank',
 'banks',
 'banned',
 'banter',
 'bar',
 'barackobama',
 'barca',
 'barely',
 'barrel',
 'base',
 'baseball',
 'based',
 'basement',
 'bases',
 'basic',
 'basically',
 'basis',
 'basketball',
 'bastard',
 'bat',
 'bathroom',
 'battery',
 'batting',
 'battle',
 'bay',
 'bb15',
 'bb16',
 'bbad',
 'bbc',
 'bbcnews',
 'bbcsport',
 'bbcworldcup',
 'bbq',
 'bbuk',
 'bc',
 'bcuz',
 'bday',
 'be',
 'beach',
 'bear',
 'beard',
 'beast',
 'beat',
 'beating',
 'beats',
 'beautiful',
 'beauty',
 'became',
 'because',
 'become',
 'becoming',
 'bed',
 'bedroom',
 'bedtime',
 'been',
 'beer',
 'before',
 'begin',
 'beginning',
 'behavior',
 'behind',
 'being',
 'bel',
 'belgian',
 'belgium',
 'beliefs',
 'believe',
 'believed',
 'bell',
 'belong',
 'belongs',
 'below',
 'bench',
 'benefit',
 'benefits',
 'benefitsbritain',
 'benghazi',
 'beroyalkc',
 'besides',
 'best',
 'bet',
 'betawards',
 'better',
 'between',
 'beware',
 'beyonce',
 'beyond',
 'bf',
 'bff',
 'bias',
 'biased',
 'bible',
 'big',
 'bigbrother',
 'bigger',
 'biggest',
 'bike',
 'bikes',
 'bill',
 'billion',
 'bills',
 'billy',
 'bing',
 'bio',
 'bird',
 'birds',
 'birth',
 'birthday',
 'bit',
 'bitch',
 'bitches',
 'bite',
 'bites',
 'biting',
 'bitter',
 'biz',
 'bj',
 'bjp',
 'black',
 'blackhawks',
 'blackrepublican',
 'blacks',
 'blah',
 'blair',
 'blame',
 'blaming',
 'blast',
 'bleeding',
 'bless',
 'blessed',
 'blew',
 'blind',
 'blinker',
 'block',
 'blocked',
 'blog',
 'blonde',
 'blood',
 'bloody',
 'blow',
 'blowing',
 'blown',
 'blue',
 'bluejays',
 'board',
 'boat',
 'bob',
 'body',
 'bold',
 'bomb',
 'bombs',
 'bond',
 'bonus',
 'boo',
 'boob',
 'boobs',
 'book',
 'bookbuzzr',
 'books',
 'boom',
 'boomboom',
 'boost',
 'boots',
 'booze',
 'border',
 'bored',
 'boredom',
 'boring',
 'born',
 'bosh',
 'bosnia',
 'boss',
 'boston',
 'both',
 'bother',
 'bothered',
 'bottle',
 'bottles',
 'bottom',
 'bought',
 'bound',
 'bout',
 'bowl',
 'box',
 'boy',
 'boyfriend',
 'boys',
 'bra',
 'braces',
 'brachi',
 'bradley',
 'brager',
 'brain',
 'brains',
 'brand',
 'brandon',
 'brasil',
 'brave',
 'braves',
 'bravo',
 'bravschi',
 'bravscol',
 'bravsger',
 'brazil',
 'brazilian',
 'brazilians',
 'brazilvsgermany',
 'bread',
 'break',
 'breakfast',
 'breaking',
 'breaks',
 'breath',
 'breathe',
 'brewers',
 'brian',
 'bright',
 'brilliance',
 'brilliant',
 'bring',
 'bringing',
 'brings',
 'british',
 'bro',
 'broadband',
 'broadcast',
 'broke',
 'broken',
 'brookeg105',
 'brooks',
 'brother',
 'brothers',
 'brought',
 'brown',
 'browns',
 'bruh',
 'brutal',
 'bryan',
 'bs',
 'bt',
 'btw',
 'buck',
 'bucket',
 'bucks',
 'bud',
 'buddy',
 'budget',
 'buffalo',
 'bug',
 'build',
 'building',
 'built',
 'bull',
 'bullpen',
 'bullshit',
 'bully',
 'bum',
 'bummed',
 'bunch',
 'burn',
 'burned',
 'burning',
 'burns',
 'burnt',
 'burst',
 'bus',
 'buses',
 'bush',
 'busiest',
 'business',
 'busy',
 'but',
 'butreally',
 'butt',
 'button',
 'buy',
 'buying',
 'buys',
 'buzzfeed',
 'buzzing',
 'by',
 'bye',
 'cable',
 'cake',
 'cal',
 'caleb',
 'california',
 'call',
 'called',
 'calling',
 'callon',
 'calls',
 'calm',
 'calum5sos',
 'cam',
 'came',
 'camera',
 'camerondallas',
 'camp',
 'campaign',
 'campbell',
 'camping',
 'can',
 'canada',
 'canadian',
 'cancel',
 'canceled',
 'cancelled',
 'cancer',
 'candidate',
 'candy',
 'cannot',
 'cant',
 'cantsleep',
 'cantwait',
 'canucks',
 'cap',
 'capable',
 'capitalist',
 'caps',
 'car',
 'card',
 'cardinals',
 'cards',
 'care',
 'career',
 'careful',
 'cares',
 'caring',
 'carlisle',
 'carlos',
 'carmelo',
 'carras16',
 'carry',
 'carrying',
 'cars',
 'carter',
 'case',
 'cash',
 'cast',
 'cat',
 'catch',
 'catching',
 'cats',
 'caught',
 'cause',
 'caused',
 'causes',
 'causing',
 'cavs',
 'cbc',
 'cc',
 'ccot',
 'cd',
 'cdnpoli',
 'cease',
 'ceiling',
 'celeb',
 'celebrate',
 'celebrating',
 'celebration',
 'celebrity',
 'cell',
 'celtics',
 'cena',
 'center',
 'central',
 'centre',
 'century',
 'ceo',
 'ceos',
 'ceremony',
 'certain',
 'certainly',
 'cfl',
 'challenge',
 'champ',
 'champion',
 'champions',
 'championship',
 'championships',
 'chance',
 'chances',
 'change',
 'changed',
 'changes',
 'changing',
 'channel',
 'channels',
 'chaos',
 'character',
 'characters',
 'charge',
 'charged',
 'charger',
 'charges',
 'charlie',
 'charlotte',
 'charm',
 'charming',
 'chasing',
 'chat',
 'cheap',
 'cheat',
 'check',
 'checked',
 'checking',
 'cheer',
 'cheering',
 'cheers',
 'cheery',
 'cheese',
 'chemistry',
 'cheney',
 'chess',
 'chest',
 'chew',
 'chi',
 'chicago',
 'chick',
 'chicken',
 'chiellini',
 'child',
 'children',
 'chile',
 'chiles',
 'chill',
 'chillin',
 'chilling',
 'china',
 'chipotle',
 'chips',
 'chocolate',
 'choice',
 'choices',
 'choose',
 'chose',
 'chosen',
 'chris',
 'chris_broussard',
 'christ',
 'christian',
 'christians',
 'christmas',
 'chuck',
 'church',
 'cia',
 'cigarette',
 'cinema',
 'circle',
 'citizens',
 'city',
 'civil',
 'claim',
 'claiming',
 'claims',
 'clap',
 'class',
 'classes',
 'classic',
 'classy',
 'cle',
 'clean',
 'cleaned',
 'cleaning',
 'clear',
 'clearly',
 'cleveland',
 'clever',
 'click',
 'client',
 'clients',
 'cliff',
 'climate',
 'climatenow',
 'clinton',
 'clock',
 'close',
 'closed',
 'closer',
 'closest',
 'closing',
 'clothes',
 'cloud',
 'clown',
 'clowns',
 'club',
 'clubs',
 'clue',
 'clueless',
 'clutch',
 'cm',
 'cmon',
 'cn',
 'cnblog',
 'cnn',
 'co',
 'coach',
 'coaches',
 'coaching',
 'coast',
 'cocacola',
 'cochran',
 'cock',
 'cod',
 'code',
 'cody',
 'coffee',
 'coincidence',
 'coke',
 'col',
 'cold',
 'collected',
 'collection',
 'college',
 'colombia',
 'color',
 'combination',
 'comcast',
 'come',
 'comeback',
 'comedian',
 'comedy',
 'comes',
 'comfortable',
 'comforting',
 'comfy',
 'comic',
 'comics',
 'comin',
 'coming',
 'comment',
 'commentary',
 'commentating',
 'commentator',
 'commentators',
 'comments',
 'commercial',
 'commercials',
 'common',
 'communication',
 'community',
 'commute',
 'companies',
 'company',
 'compete',
 'competition',
 'competitive',
 'complain',
 'complaining',
 'complete',
 'completely',
 'complicated',
 'compliment',
 'computer',
 'computers',
 'concept',
 'concerned',
 'concert',
 'conclusion',
 'concussions',
 'condescending',
 'conditioning',
 'conditions',
 'conference',
 'confidence',
 'confident',
 'confirm',
 'confirmed',
 'confused',
 'congrats',
 'congratulations',
 'congress',
 'connection',
 'conservative',
 'conservatives',
 'consider',
 'considerate',
 ...]

In [10]:
import matplotlib
%matplotlib inline
# vocabulary_ maps each term to its column index (0 to 4999), not to its frequency
pd.Series(list(vector.vocabulary_.values()))


Out[10]:
0       4330
1       1531
2       2967
3       3689
4       4644
5       1240
6       1175
7        954
8       2212
9       1982
10      3023
11      4438
12      1714
13      1261
14      4876
15      4067
16      2170
17       602
18      1768
19      3909
20      2141
21      1557
22       791
23      4820
24      1520
25      1239
26       441
27      4576
28      2654
29      2038
        ... 
4970    4737
4971    4006
4972    3326
4973    1697
4974    3511
4975    2690
4976    1502
4977     161
4978    4417
4979    4993
4980    4349
4981    3829
4982    2325
4983    3399
4984    2371
4985       3
4986    4518
4987    4266
4988    1918
4989    2833
4990    1583
4991    1659
4992    2532
4993    4719
4994     593
4995    1724
4996    4783
4997     388
4998    3984
4999    2382
dtype: int64

In [11]:
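# These values are column indices, so describe() summarizes the integers 0..4999 rather than term frequencies
pd.Series(list(vector.vocabulary_.values())).describe()
# matplotlib.pyplot.plot(pd.Series(list(vector.vocabulary_.values())))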


Out[11]:
count    5000.000000
mean     2499.500000
std      1443.520003
min         0.000000
25%      1249.750000
50%      2499.500000
75%      3749.250000
max      4999.000000
dtype: float64
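
The summary above describes column indices, which is why the mean is exactly 2499.5. To get actual term counts, one option is to sum the columns of the document-term matrix. The following is a minimal sketch on toy documents (hypothetical data), not the notebook's own result.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for the tweets
docs = ["great great day", "great day", "oh great"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: rows are documents, columns are terms

# Order the terms by their column index so they line up with the column sums
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
term_counts = pd.Series(np.asarray(X.sum(axis=0)).ravel(), index=terms)
print(term_counts.sort_values(ascending=False))
```

Calling `describe()` on a Series like `term_counts` would then summarize frequencies rather than positions.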

Combine tweets and output to CSV


In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked result
master = pd.concat([sarcastic_tweets, regular_tweets])
master.to_csv('all_tweets_df.csv')
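
Two details of this step are worth a small sketch: stacking the frames keeps each frame's original row labels unless `ignore_index=True` is passed, and `to_csv` writes the index as an extra column (read back as `Unnamed: 0`) unless `index=False` is passed. The frames below are toy data, and the label `'regular'` is an assumption, since the regular file's `type` values are not shown here.

```python
import io
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
sarcastic_toy = pd.DataFrame({"0": ["Thanks", "works sometimes."], "type": ["sarcastic"] * 2})
regular_toy = pd.DataFrame({"0": ["Good morning"], "type": ["regular"]})  # label assumed

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
combined = pd.concat([sarcastic_toy, regular_toy], ignore_index=True)

# Write to an in-memory buffer without the index, then read it back
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```

Whether to drop the index on export depends on how later notebooks read `all_tweets_df.csv`, so treat this as an option rather than a required change.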

Bigram exploratory analysis


In [12]:
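from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(1,2) extracts unigrams as well as bigrams, so both appear in the table below
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(sarcastic_tweets['0'].values)
# Column sums of the document-term matrix give the corpus frequency of each n-gram
frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
# On scikit-learn versions before 1.0, use get_feature_names() instead
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])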


Out[12]:
frequency
00 110
00 am 33
00 brilliant 1
00 but 16
00 est 1
00 for 6
00 houston 8
00 if 1
00 in 1
00 it 6
00 pm 5
00 sarcasm 4
00 they 2
00 this 3
00 to 6
00 yay 6
00 youre 6
000 152
000 00 2
000 000 4
000 759 5
000 americans 1
000 and 6
000 back 3
000 bb16 2
000 before 2
000 bucks 3
000 can 1
000 ch14 1
000 copies 3
... ...
zuma 8
zuma beach 1
zuma handling 1
zuma what 6
zumarek 4
zumarek erratarob 4
zumba 5
zuniga 12
zuniga gets 5
zuniga tackle 7
zup 3
zup guys 3
zusi 15
zusi he 7
zusi to 1
zusi walked 7
zwebackhd 2
zynga 5
zyrtec 1
zyrtec med 1
zzxx 4
zzxx make 3
zzz 2
zzz dammit 2
zzzz 7
zzzz haha 2
zzzz wackypeople 5
zzzzz 5
zzzzz is 1
zzzzzzz 7

291505 rows × 1 columns
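
Because the table mixes unigrams and bigrams, it can help to separate the two. A minimal sketch on toy documents (hypothetical data): bigram terms produced by the word analyzer contain a space, so they can be filtered on that.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for the sarcastic tweets
docs = ["so much fun", "so much work"]

vec = CountVectorizer(ngram_range=(1, 2), analyzer="word")
X = vec.fit_transform(docs)

terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
ngram_counts = pd.Series(np.asarray(X.sum(axis=0)).ravel(), index=terms)

# Keep only the two-word terms and rank them by frequency
bigrams = ngram_counts[ngram_counts.index.str.contains(" ")]
print(bigrams.sort_values(ascending=False))
```

Setting `ngram_range=(2, 2)` instead would skip the unigrams at extraction time, which also reduces memory use on a corpus of this size.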

Composite unigram counts


In [27]:
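word_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(master['0'].values)
# Column sums of the document-term matrix give the corpus frequency of each unigram
frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
# On scikit-learn versions before 1.0, use get_feature_names() instead
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])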


Out[27]:
frequency
00 423
000 358
0000 3
00000 1
0000000 1
00000kt 15
00001 1
000052736 1
000094 1
0000ff 1
0002 1
00094 1
000h 1
000s 3
000th 12
000tps 1
000x 3
001 1
0012 1
00136 1
0014priya 1
002 1
003 4
004 1
0049 1
007 3
007_og 1
007hertzrumble 1
0083 1
00979 1
... ...
zwgman 1
zwiebel 1
zwirner 1
zwood93 1
zxabys 1
zxvaness 3
zyadkahtani 1
zymaster 1
zynga 6
zyrtec 1
zysecurity 1
zywievc 1
zz 1
zzaaammnnnn 1
zzquil 1
zzxx 4
zzyzx 1
zzyzzx 2
zzz 6
zzzentropy 2
zzzquil 1
zzzrgrizz 1
zzzxoz 1
zzzz 13
zzzzbruh 1
zzzzz 8
zzzzzz 3
zzzzzzz 8
zzzzzzzz 3
zzzzzzzzz 2

224940 rows × 1 columns

Top 250 Most Frequent Unigrams


In [35]:
# On scikit-learn versions before 1.0, use get_feature_names() instead
freqs = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
freqs.sort_values('frequency', ascending=False).head(250)


Out[35]:
frequency
the 140144
ud83d 126990
to 115725
you 83257
and 67892
my 66012
is 59652
it 58650
of 55084
in 54773
ud83c 52834
that 50654
for 46944
me 42852
on 38786
so 36209
this 36096
at 31918
be 29981
with 29534
just 29497
like 26915
ude02 25695
have 24618
can 23659
all 22831
are 21366
not 21319
was 21065
love 21011
... ...
days 3100
stop 3096
watching 3094
lmao 3093
miss 3092
follow 3086
other 3077
mean 3070
long 3061
little 3051
having 3037
talk 3017
show 2999
makes 2982
birthday 2973
hope 2965
any 2956
glad 2956
nyc 2956
same 2929
udc95 2913
doing 2907
big 2891
4th 2877
play 2860
person 2857
house 2849
does 2849
haha 2844
funny 2838

250 rows × 1 columns
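
Tokens such as `ud83d`, `ud83c`, `ude02` and `udc95` in this list are halves of escaped emoji: the tweet text contains literal sequences like `\ud83d\ude1c`, and the tokenizer splits them at the backslash. If the emoji should be treated as characters instead, one possible preprocessing step is to join each escaped surrogate pair into a single code point. This is a sketch under that assumption; the helper name `decode_surrogates` is hypothetical.

```python
import re

# Matches a literal "\uD8xx\uDCxx"-style pair: high surrogate followed by low surrogate
SURROGATE_PAIR = re.compile(
    r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})", re.IGNORECASE
)

def decode_surrogates(text):
    """Replace each escaped UTF-16 surrogate pair with the character it encodes."""
    def _join(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return chr(code_point)
    return SURROGATE_PAIR.sub(_join, text)

raw = r"Tweet of the day!!!! \ud83d\ude1c"
decoded = decode_surrogates(raw)
print(decoded)
```

Note that the default `CountVectorizer` token pattern only matches word characters, so decoded emoji would be dropped rather than counted unless a custom `token_pattern` or tokenizer is supplied.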