In [1]:
import pandas as pd
import sklearn as sk
import seaborn as sb
import nltk
from nltk.corpus import gutenberg

In [2]:
ds = pd.read_csv("B:/Data_Mining_1/train_data.csv", nrows = 50000)

In [3]:
ds


Out[3]:
asin date downvotes review_comments review_id review_text review_type reviewer star_rating upvotes ... in_app_purchase mas_rating min_os_version num_permissions num_screenshots price release_date size_MB title update_date
0 B00OPRXVIC 2014-11-06 NaN 0 RAC9Q5IDVW1TC this is cool. parents can write an email from ... Verified Purchase A2NGN1TFABVU8Q 5 NaN ... 0 Guidance Suggested 2.3.3 1 10 199.0 2014-10-21 0.32041 Email From Santa 2014 2014-11-06
1 B008JQPY2G 2012-07-21 NaN 0 R295U1EYQDU4CN its no good I didnt like it at all Verified Purchase A3SHL50ZU1K96M 1 NaN ... 0 All Ages 2.1 2 4 99.0 2012-07-11 12.80000 Monsters - Difference Games - Game App 2012-07-11
2 B008JQPY2G 2013-01-04 NaN 0 R2Z8UUF4V7ERKX absolutely terrible. I mean that I did not lik... NaN A9MOKV997YYIN 1 NaN ... 0 All Ages 2.1 2 4 99.0 2012-07-11 12.80000 Monsters - Difference Games - Game App 2012-07-11
3 B008JQPY2G 2015-10-16 NaN 0 R35VWHNI4UXNIP Fun game Verified Purchase A3VYN5P1IRUL3H 5 NaN ... 0 All Ages 2.1 2 4 99.0 2012-07-11 12.80000 Monsters - Difference Games - Game App 2012-07-11
4 B008JQPY2G 2013-03-05 NaN 0 R7LQK4NGD2ATF This game downloaded for me, but when I try to... Verified Purchase AIX9ODIJLREIM 1 NaN ... 0 All Ages 2.1 2 4 99.0 2012-07-11 12.80000 Monsters - Difference Games - Game App 2012-07-11
5 B00I9LO6FC 2014-05-04 NaN 0 R3IXBWPIQ2JVR3 Daddy truly leads us to a permanently open hea... Verified Purchase A3BJMDTYLL91NG 5 NaN ... 0 All Ages 2.1 2 5 299.0 2014-02-07 4.00000 Open Heavens 2014 2014-02-07
6 B00I9LO6FC 2014-06-02 NaN 0 R3TD58YQOVZYVX So much blessing in the words of God. Love thi... Verified Purchase A1I2EHDV2ARMD 5 NaN ... 0 All Ages 2.1 2 5 299.0 2014-02-07 4.00000 Open Heavens 2014 2014-02-07
7 B00CJNGUB4 2014-03-10 NaN 0 R12VY2CWQSDXDC was no good, not helpful.. information limited... Verified Purchase A3KCMTBJBIGK3Y 2 NaN ... 0 Guidance Suggested 2.2 6 7 299.0 2013-04-29 9.30000 Turks and Caicos Offline Map Travel Guide (Kin... 2013-04-29
8 B00CJNGUB4 2013-08-02 NaN 0 R2XW4TWKA509FU I learned a lot about the islands, but it does... Verified Purchase A1LXU8R3PEUZSP 4 NaN ... 0 Guidance Suggested 2.2 6 7 299.0 2013-04-29 9.30000 Turks and Caicos Offline Map Travel Guide (Kin... 2013-04-29
9 B004VMVFIY 2011-11-19 NaN 0 R11AANFJC6VN4U A friend raved about this app on her IPhone. ... Verified Purchase A1BNJNEN9MWVJ3 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
10 B004VMVFIY 2014-08-21 NaN 0 R149GOOATHZAFS Ive used this app for several years on an iPho... NaN A8O00KPAJP40B 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
11 B004VMVFIY 2015-10-07 NaN 0 R1944JPVR3652W I have used this app for a few years on my Gal... Verified Purchase AGA6K1Y2UO9F5 2 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
12 B004VMVFIY 2011-08-29 NaN 0 R1IM2WO8GGCR I have this app on iPod Touch. I love it, it ... NaN A1M9HAGD2D9ED2 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
13 B004VMVFIY 2014-05-29 NaN 0 R1KQEBGD3BIIWI I love this program. My whole family had adop... Verified Purchase A1QZBI80TVR2F3 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
14 B004VMVFIY 2016-04-09 NaN 0 R258NCONKMKEGG I bought this app years ago for $2.99 and love... NaN A1FO3REV9O05E0 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
15 B004VMVFIY 2015-04-22 NaN 0 R25AO36F9E8MNJ Rev. Review: 5/23/15 Org. Review: 4/22/2015 ... Verified Purchase A9S9RXZFJ0WF 4 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
16 B004VMVFIY 2016-02-12 NaN 0 R2AE9YJ1YI804L DOES NOT WORK ON SAMSUNG "NOTE 4" Verified Purchase A9OF56RUW8NN 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
17 B004VMVFIY 2013-02-02 NaN 0 R2BR52J753030H I love this app. It makes shopping so very mu... Verified Purchase A3INH2GLJJHZZA 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
18 B004VMVFIY 2013-06-03 NaN 0 R2HH4E0PZD3370 All of my roommates have it on their devices. ... Verified Purchase A1GU4E3UXPNRPM 4 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
19 B004VMVFIY 2013-03-12 NaN 0 R2PHA93SDT2YIG I bought this app because of how good it worke... Verified Purchase A1NXEOQOLGBQEW 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
20 B004VMVFIY 2014-05-09 NaN 0 R2SUI0FFFGXJN0 I have been using Grocery Gadget on the web an... Verified Purchase A2WMLNOZXLRTL4 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
21 B004VMVFIY 2016-03-30 NaN 0 R2T199EB4XPBZL Extremely disapointed had used program with sa... Verified Purchase A2NOCDANIYGTEN 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
22 B004VMVFIY 2015-07-24 NaN 0 R2T9AIKBD3USJM I would love to give this 5 stars but I cant. ... Verified Purchase A20P873IS2OTX1 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
23 B004VMVFIY 2014-01-20 NaN 0 R2TP0HIZRH1X4F I used this for years on my iPhone and loved i... Verified Purchase A1KQT27NOUPF1G 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
24 B004VMVFIY 2014-07-13 NaN 0 R2UNVXAVS4F0K6 Love it! Verified Purchase A3S3K1VHA9Q0BB 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
25 B004VMVFIY 2015-08-20 NaN 0 R2WFRQE5O1H1YX I used this app for several years on my iPhone... Verified Purchase A1QM7BI8JKPEDA 4 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
26 B004VMVFIY 2015-02-18 NaN 0 R308R397W79DRU Was a great app. but last update was very bad ... NaN A1CP3STG4Q0ACR 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
27 B004VMVFIY 2011-05-23 NaN 0 R31GBD51O3GOXI I had this app on my IPhone and when I got an ... NaN A1J65EXRZ14U0U 5 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
28 B004VMVFIY 2015-12-03 NaN 0 R35C1YE6BVV8T7 I like others purchased this app originally on... Verified Purchase A3V34U4GHFUBHS 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
29 B004VMVFIY 2011-05-21 NaN 0 R3A976H79RW5YO Downloaded app and tried to set up lists but e... Verified Purchase A2IVUC9DAVAMV9 1 NaN ... 0 All Ages 2.2 12 6 299.0 2011-04-07 1.60000 Grocery Gadget - Shopping List 2013-03-22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 B0058VCGXU 2012-07-22 NaN 0 R352NYYB9740FX So painful, x-ray doom. Flesh burns like surfa... NaN A1SXS3TWQGWS35 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49971 B0058VCGXU 2013-07-01 NaN 0 R367JPOW6PQ2N Of course it doesnt turn your Kindle into an X... Verified Purchase A2XBD9SA90XQL6 4 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49972 B0058VCGXU 2013-01-21 NaN 0 R398XFMRFQBXEG if you own a kindle fire, please consider that... NaN A2LIYCNH74UG1R 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49973 B0058VCGXU 2012-12-26 NaN 1 R399YOVNAKX5FM It is stupid and spent 99 cents and I cant und... Verified Purchase A1ECUGYGK4ZUI8 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49974 B0058VCGXU 2012-07-23 NaN 0 R3AQ3OXXQDT6E5 this app is so awesome I tried it and it worke... NaN A20LKRNWSE9N4I 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49975 B0058VCGXU 2013-01-05 NaN 1 R3ASIB5NYAVYBQ Read the app description b4 purchase!!! Cant r... NaN A1B2LZJ0HX1HXK 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49976 B0058VCGXU 2012-08-30 NaN 0 R3CXPEE2ECLJ4G Of course its not real. Kindle Fire doesnt eve... NaN A133TEF2DPOCKL 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49977 B0058VCGXU 2012-08-16 NaN 0 R3DU6T3ST85JIQ (il-lu-sion ~ something that deceives by produ... NaN A16PHEG1TZK41O 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49978 B0058VCGXU 2012-09-10 NaN 0 R3DX2FQIXT9J7U The reviews here are worth the .99c! NaN A23743SKQ1WIOD 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49979 B0058VCGXU 2012-09-18 NaN 0 R3EAWZ3JA0B0OV I didnt buy it becaus I have a kindle fire and... NaN A1AVRMK4U1VANR 3 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49980 B0058VCGXU 2012-09-15 NaN 2 R3GXGGQMTOZZ1A we bought the game all it shows you bones its ... Verified Purchase AS99FDH4FOVPB 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49981 B0058VCGXU 2012-09-01 NaN 0 R3HDVU5BLMJZEB it did not even download. do not even try to g... Verified Purchase A1JO8RVE7DU1XL 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49982 B0058VCGXU 2012-12-24 NaN 0 R3IZRQTHECUK28 heyy u guys dont need to say that its December... NaN A254PIOX54SQWH 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49983 B0058VCGXU 2012-09-03 NaN 1 R3J0SP3NQNT268 Stupid, stupid, stupid. This cannot x-ray! Don... NaN A1HUSKYX3HUSSY 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49984 B0058VCGXU 2012-12-02 NaN 0 R3LFR0ON63HBFX this app is so stupid. it is just a freaked sc... NaN A1M0QBKK16PHBP 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49985 B0058VCGXU 2013-03-01 NaN 0 R3LOIMYVSN9P51 This app. Sucks it serves no purpose. Its supp... Verified Purchase A12S683INNTBTF 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49986 B0058VCGXU 2011-12-29 NaN 0 R3M6Y3Y57E4QMM Do not buy this app because it does not work a... NaN A1PDGQVSU6WAJL 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49987 B0058VCGXU 2012-08-25 NaN 0 R3NE6OYYTSDS49 how does it even work. it does not work at all... Verified Purchase AIQ5BZMV83X6V 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49988 B0058VCGXU 2013-02-11 NaN 0 R3OUVH71VCIGPY Pointless. In know it is not possible to do an... Verified Purchase AVWHT6JSXEF38 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49989 B0058VCGXU 2013-05-28 NaN 0 R3QD0S1MESUSM6 This application isnt worth a cent, let alone ... NaN A1ZM7BNTSBGDI 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49990 B0058VCGXU 2012-08-29 NaN 0 R3SBTFJ5GUMFEG I hope most of you are smart enough to figure ... NaN A2LSW8X5RGG51C 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49991 B0058VCGXU 2012-06-18 NaN 0 R3T36QE0LXIY1G This app isnt a real x ray!! Its a prank... yo... NaN A3GC9IQX594Q6A 3 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49992 B0058VCGXU 2012-06-25 NaN 0 R3UASRZOJYQ8B6 it doesnt work on any device and the only time... NaN ASZAO4HQEXOUU 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49993 B0058VCGXU 2012-09-27 NaN 0 R3UH8UXF36G8FY I downloaded to an android device and an iphon... Verified Purchase A2X3TOL5HAU6OA 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49994 B0058VCGXU 2012-09-23 NaN 1 R3UY0ESGX50WLJ this game is amazing and you nnneeeeeddddd to ... NaN A2N4VERJZ8ZX9H 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49995 B0058VCGXU 2012-01-15 NaN 0 R3VJ6UWHQZU094 do not get this app it does not even work on k... NaN A2WCPI30754KNG 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49996 B0058VCGXU 2012-09-28 NaN 0 R3YXYAZVFV42N obviously the kindle fire doesnt have a camera... NaN A30T13ANOMDGTL 5 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49997 B0058VCGXU 2012-09-11 NaN 0 R4BX7P1XEHWWO I tried it on my kindle and it didnt work at a... Verified Purchase A23ZHSIA38EG8N 2 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49998 B0058VCGXU 2012-12-07 NaN 1 R5YVCK1BLWVMU Its horrible I hate it it sucks but I hate it ... Verified Purchase A1YK4G3U0DVXS0 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15
49999 B0058VCGXU 2012-07-23 NaN 0 R6MZBP07RRMZW I harried I because It wools not let me buy it... Verified Purchase A120OF0988OR1L 1 NaN ... 1 All Ages 1.6 2 4 99.0 2010-09-01 3.50000 X-Ray Scanner (Ad-Free) 2014-12-15

50000 rows × 23 columns


In [4]:
ds.upvotes.unique()


Out[4]:
array([ nan])

In [5]:
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')
more_stopwords = """. ; , ... ( ) ! ? app game thi play would"""
my_stopwords += more_stopwords.split()
my_stopwords.remove('not')
my_stopwords.remove('no')
my_stopwords.remove('very')
my_stopwords.remove('don')
my_stopwords.remove('ain')
my_stopwords.remove('aren')
my_stopwords.remove('couldn')
my_stopwords.remove('didn')
my_stopwords.remove('doesn')
my_stopwords.remove('hadn')
my_stopwords.remove('hasn')
my_stopwords.remove('haven')
my_stopwords.remove('isn')
my_stopwords.remove('shouldn')
my_stopwords.remove('won')
my_stopwords.remove('wouldn')

In [6]:
import nltk
from nltk import word_tokenize, sent_tokenize

In [8]:
text_analisys = {}
text_analisys['1']=" ".join(str(x) for x in ds.review_text[ds.star_rating == 1])
text_analisys['2']=" ".join(str(x) for x in ds.review_text[ds.star_rating == 2])
text_analisys['3']=" ".join(str(x) for x in ds.review_text[ds.star_rating == 3])
text_analisys['4']=" ".join(str(x) for x in ds.review_text[ds.star_rating == 4])
text_analisys['5']=" ".join(str(x) for x in ds.review_text[ds.star_rating == 5])

In [9]:
from nltk.stem.porter import *
stemmer = PorterStemmer()
def stem_tokens(tokens):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text.lower())
    stems = stem_tokens(tokens)
    return stems

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(stop_words=my_stopwords,tokenizer=tokenize, ngram_range=(1,2))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(text_analisys.values()) #fit the vectorizer to synopses

print(tfidf_matrix.shape)


Wall time: 56.1 s
(5, 383179)

In [12]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters,init='k-means++')

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()


Wall time: 9.66 s

In [13]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names()
for i in range(num_clusters):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :20]:
        print ' %s' % terms[ind],
    print


Top terms per cluster:
Cluster 0:  love  fun  great  like  veri  get  time  one  use  not  good  wa  realli  make  easi  work  challeng  enjoy  kindl  ha
Cluster 1:  not  wa  like  get  time  veri  one  work  no  use  dont  fun  good  kindl  free  realli  onli  cant  becaus  tri
Cluster 2:  not  wa  get  work  no  dont  kindl  like  time  money  tri  even  free  download  wast  one  use  veri  cant  fire
Cluster 3:  like  fun  not  good  get  great  time  wa  use  veri  love  one  realli  onli  enjoy  work  make  level  easi  littl
Cluster 4:  not  like  wa  get  time  fun  good  use  one  veri  realli  work  dont  free  no  onli  make  kindl  need  level

In [14]:
string1 = [str(x) for x in ds.review_text[ds.star_rating==1].tolist()]
string2 = [str(x) for x in ds.review_text[ds.star_rating==2].tolist()]
string3 = [str(x) for x in ds.review_text[ds.star_rating==3].tolist()]
string4 = [str(x) for x in ds.review_text[ds.star_rating==4].tolist()]
string5 = [str(x) for x in ds.review_text[ds.star_rating==5].tolist()]


#define vectorizer parameters
#fit the vectorizer to synopses

In [15]:
pred1 = km.predict( tfidf_vectorizer.transform(string1))
pred2 = km.predict( tfidf_vectorizer.transform(string2))
pred3 = km.predict( tfidf_vectorizer.transform(string3))
pred4 = km.predict( tfidf_vectorizer.transform(string4))
pred5 = km.predict( tfidf_vectorizer.transform(string5))
from collections import Counter
m1 = Counter(pred1)
m2 = Counter(pred2)
m3 = Counter(pred3)
m4 = Counter(pred4)
m5 = Counter(pred5)

In [16]:
print ("1-star : ", m1)
print ("2-star : ", m2)
print ("3-star : ", m3)
print ("4-star : ", m4)
print ("5-star : ", m5)


('1-star : ', Counter({2: 4820, 1: 1161, 4: 623, 3: 467, 0: 341}))
('2-star : ', Counter({1: 1176, 2: 752, 3: 430, 4: 409, 0: 242}))
('3-star : ', Counter({4: 1629, 3: 1228, 2: 870, 1: 711, 0: 652}))
('4-star : ', Counter({3: 4026, 0: 2778, 4: 991, 1: 772, 2: 765}))
('5-star : ', Counter({0: 14586, 3: 5839, 2: 1767, 4: 1566, 1: 1399}))

In [17]:
m1_total = float(len(pred1))
prob_1_1_star = float(m1[2]/m1_total)
prob_1_2_star = float(m1[1]/m1_total)
prob_1_3_star = float(m1[4]/m1_total)
prob_1_4_star = float(m1[3]/m1_total)
prob_1_5_star = float(m1[0]/m1_total)
m2_total = float(len(pred2))
prob_2_1_star = float(m2[2]/m2_total)
prob_2_2_star = float(m2[1]/m2_total)
prob_2_3_star = float(m2[4]/m2_total)
prob_2_4_star = float(m2[3]/m2_total)
prob_2_5_star = float(m2[0]/m2_total)
m3_total = float(len(pred3))
prob_3_1_star = float(m3[2]/m3_total)
prob_3_2_star = float(m3[1]/m3_total)
prob_3_3_star = float(m3[4]/m3_total)
prob_3_4_star = float(m3[3]/m3_total)
prob_3_5_star = float(m3[0]/m3_total)
m4_total = float(len(pred4))
prob_4_1_star = float(m4[2]/m4_total)
prob_4_2_star = float(m4[1]/m4_total)
prob_4_3_star = float(m4[4]/m4_total)
prob_4_4_star = float(m4[3]/m4_total)
prob_4_5_star = float(m4[0]/m4_total)
m5_total = float(len(pred5))
prob_5_1_star = float(m5[2]/m5_total)
prob_5_2_star = float(m5[1]/m5_total)
prob_5_3_star = float(m5[4]/m5_total)
prob_5_4_star = float(m5[3]/m5_total)
prob_5_5_star = float(m5[0]/m5_total)
print(prob_1_1_star, " ", prob_1_2_star ," " , prob_1_3_star," ", prob_1_4_star," " ,prob_1_5_star)
print(prob_2_1_star, " ", prob_2_2_star ," " , prob_2_3_star," ", prob_2_4_star," " ,prob_2_5_star)
print(prob_3_1_star, " ", prob_3_2_star ," " , prob_3_3_star," ", prob_3_4_star," " ,prob_3_5_star)
print(prob_4_1_star, " ", prob_4_2_star ," " , prob_4_3_star," ", prob_4_4_star," " ,prob_4_5_star)
print(prob_5_1_star, " ", prob_5_2_star ," " , prob_5_3_star," ", prob_5_4_star," " ,prob_5_5_star)


(0.650296815974096, ' ', 0.1566378845116028, ' ', 0.08405288720992984, ' ', 0.06300593631948193, ' ', 0.04600647598488937)
(0.24991691591890994, ' ', 0.39082751744765704, ' ', 0.1359255566633433, ' ', 0.1429046194749086, ' ', 0.08042539049518112)
(0.17092337917485265, ' ', 0.13968565815324166, ' ', 0.3200392927308448, ' ', 0.2412573673870334, ' ', 0.1280943025540275)
(0.08197599657093871, ' ', 0.08272610372910416, ' ', 0.10619374196313759, ' ', 0.4314187741105872, ' ', 0.2976853836262323)
(0.07023889970982232, ' ', 0.055610764399570696, ' ', 0.062249075803951184, ' ', 0.23210239694717177, ' ', 0.579798863139484)

In [23]:
ods = pd.read_csv("B:/Data_Mining_1/test_data.csv")

In [164]:
#ods = ods.drop(ods.columns[[0,1,2,3,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]], axis=1)

In [24]:
ods["1-star"]=""
ods["2-star"]=""
ods["3-star"]=""
ods["4-star"]=""
ods["5-star"]=""
ostring1 = [str(x) for x in ods.review_text.tolist()]
pred = km.predict( tfidf_vectorizer.transform(ostring1)) 
ods = ods.drop(ods.columns[[0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]], axis=1)
for index,row in ods.iterrows() :  
    if pred[index] == 0 :
        row['1-star'] = prob_5_1_star
        row['2-star'] = prob_5_2_star
        row['3-star'] = prob_5_3_star
        row['4-star'] = prob_5_4_star
        row['5-star'] = prob_5_5_star
        ods.loc[index] = row 
    if pred[index] == 3 : 
        row['1-star'] = prob_4_1_star
        row['2-star'] = prob_4_2_star
        row['3-star'] = prob_4_3_star
        row['4-star'] = prob_4_4_star
        row['5-star'] = prob_4_5_star
        ods.loc[index] = row
    if pred[index] == 1 :
        row['1-star'] = prob_3_1_star
        row['2-star'] = prob_3_2_star
        row['3-star'] = prob_3_3_star
        row['4-star'] = prob_3_4_star
        row['5-star'] = prob_3_5_star
        ods.loc[index] = row
    if pred[index] == 4 :
        row['1-star'] = prob_2_1_star
        row['2-star'] = prob_2_2_star
        row['3-star'] = prob_2_3_star
        row['4-star'] = prob_2_4_star
        row['5-star'] = prob_2_5_star
        ods.loc[index] = row
    if pred[index] == 2 :
        row['1-star'] = prob_1_1_star 
        row['2-star'] = prob_1_2_star
        row['3-star'] = prob_1_3_star
        row['4-star'] = prob_1_4_star
        row['5-star'] = prob_1_5_star
        ods.loc[index] = row

In [25]:
#ods = ods.drop(ods.columns[[1]], axis=1)
ods.to_csv('submission.csv',index = False)

In [26]:
ods


Out[26]:
review_id 1-star 2-star 3-star 4-star 5-star
0 R1EZKJDJ5UAF5O 0.0702389 0.0556108 0.0622491 0.232102 0.579799
1 RNVH13QG7XY8Q 0.0702389 0.0556108 0.0622491 0.232102 0.579799
2 R2R90UNBEGV0MP 0.081976 0.0827261 0.106194 0.431419 0.297685
3 R11XZ8I2SEK3PC 0.650297 0.156638 0.0840529 0.0630059 0.0460065
4 RLUVG914XCDG8 0.081976 0.0827261 0.106194 0.431419 0.297685
5 R2884NHGQCOHF3 0.650297 0.156638 0.0840529 0.0630059 0.0460065
6 R2JMYK0QZPCI9H 0.249917 0.390828 0.135926 0.142905 0.0804254
7 R3H5JIN6M9OA1J 0.170923 0.139686 0.320039 0.241257 0.128094
8 R38LDR54VV3AW9 0.0702389 0.0556108 0.0622491 0.232102 0.579799
9 R2Y9Z7VCO2M055 0.081976 0.0827261 0.106194 0.431419 0.297685
10 R166HID37B68T1 0.0702389 0.0556108 0.0622491 0.232102 0.579799
11 RG7NISBDDAHP 0.650297 0.156638 0.0840529 0.0630059 0.0460065
12 R23JKSOJ0MZE09 0.0702389 0.0556108 0.0622491 0.232102 0.579799
13 RCHE1HFH9H907 0.0702389 0.0556108 0.0622491 0.232102 0.579799
14 RSRZ1R3H4UPEI 0.081976 0.0827261 0.106194 0.431419 0.297685
15 RTX2Z6P55NUBB 0.249917 0.390828 0.135926 0.142905 0.0804254
16 R2Q9EE8229NZ0N 0.0702389 0.0556108 0.0622491 0.232102 0.579799
17 R3BDLVV2TB9L1E 0.650297 0.156638 0.0840529 0.0630059 0.0460065
18 R3NWIGBCEPRRKB 0.170923 0.139686 0.320039 0.241257 0.128094
19 R3U37CL57O5CMS 0.081976 0.0827261 0.106194 0.431419 0.297685
20 R18OVQZWURHOQL 0.170923 0.139686 0.320039 0.241257 0.128094
21 R1E63O8W8AYFKP 0.249917 0.390828 0.135926 0.142905 0.0804254
22 RVOXLLAFMAKYX 0.081976 0.0827261 0.106194 0.431419 0.297685
23 RSZMOM7REO6HA 0.0702389 0.0556108 0.0622491 0.232102 0.579799
24 R3PZRYBP8DEP4Q 0.0702389 0.0556108 0.0622491 0.232102 0.579799
25 R1WLFUTNNFFAIF 0.0702389 0.0556108 0.0622491 0.232102 0.579799
26 RSBQY65ZBKBE4 0.249917 0.390828 0.135926 0.142905 0.0804254
27 R1RAX2J1Z55ZMY 0.170923 0.139686 0.320039 0.241257 0.128094
28 RLM7C1RUVYJYX 0.170923 0.139686 0.320039 0.241257 0.128094
29 R3TRROU6SUL5FK 0.0702389 0.0556108 0.0622491 0.232102 0.579799
... ... ... ... ... ... ...
99970 R16KLI79DKTNG5 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99971 R2KSVUNHJH72RW 0.170923 0.139686 0.320039 0.241257 0.128094
99972 RXHH92DYIIJED 0.249917 0.390828 0.135926 0.142905 0.0804254
99973 R2WFGK6ETE85GU 0.249917 0.390828 0.135926 0.142905 0.0804254
99974 R3GOVX11KEDGAT 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99975 R28H8ZW1IBYH4X 0.081976 0.0827261 0.106194 0.431419 0.297685
99976 RPVHOG7BLGB79 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99977 R2K36WJMQAXZ6Z 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99978 R325VGO1WQEEKA 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99979 R1AE61E52RE8CR 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99980 R2HVG6M1W2IPFC 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99981 R22HXUZYCWBJ1T 0.081976 0.0827261 0.106194 0.431419 0.297685
99982 R1I8PL33WBDNR9 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99983 R1JBKMHJYHUR35 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99984 R2B8BMBCU7K2J0 0.650297 0.156638 0.0840529 0.0630059 0.0460065
99985 R10QH0TOF1HB5F 0.650297 0.156638 0.0840529 0.0630059 0.0460065
99986 RSZ3R8ZE6821R 0.249917 0.390828 0.135926 0.142905 0.0804254
99987 R3OH3DL54G87JL 0.081976 0.0827261 0.106194 0.431419 0.297685
99988 RFP16517VBSDF 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99989 R3NQ0WFNOR2IQQ 0.081976 0.0827261 0.106194 0.431419 0.297685
99990 R5F5MI1BTZ0QG 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99991 R1XKB8D4J939E 0.081976 0.0827261 0.106194 0.431419 0.297685
99992 R24DZX527321LX 0.170923 0.139686 0.320039 0.241257 0.128094
99993 RZIOB1NKUTXEC 0.0702389 0.0556108 0.0622491 0.232102 0.579799
99994 R1RB2P85GJPSA2 0.249917 0.390828 0.135926 0.142905 0.0804254
99995 R1BISOVSFZ91PX 0.081976 0.0827261 0.106194 0.431419 0.297685
99996 R1F3JN4IX68WR7 0.081976 0.0827261 0.106194 0.431419 0.297685
99997 R25FAN8BJ4N1S7 0.081976 0.0827261 0.106194 0.431419 0.297685
99998 R2QCP2YWSDRHPE 0.081976 0.0827261 0.106194 0.431419 0.297685
99999 R2QB2OHDBBJRT6 0.650297 0.156638 0.0840529 0.0630059 0.0460065

100000 rows × 6 columns