In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
app = pd.read_pickle('/Users/krystal/Desktop/app_clean.p')
app = app.drop_duplicates()
app.head()


Out[2]:
app category comment 1 comment 2 comment 3 current rating current reviews description link multiple devices ... overall rating overall reviews price rate 1 rate 2 rate 3 seller user 1 user 2 user 3
0 Kindle – Read eBooks, Magazines & Textbooks Books Please focus on reading and listening experien... I can't believe they added a clock in the last... You guys did an amazing job updating it and co... 2.65152 66 Turn your iPhone or iPad into a Kindle with th... https://itunes.apple.com/us/app/kindle-read-eb... Y ... 3.5 245670 Free 2 1 5 AMZN Mobile LLC LambdaExpression Lorring Prellvis
1 Audible – audio books, original series & podcasts Books It's very enjoyable to be able to listen to a ... Works great 99% of the time. When it didn't, t... Audiobooks are a little bit expensive, but I l... 4.65113 1995 Welcome to Audible. We’re an Amazon company, a... https://itunes.apple.com/us/app/audible-audio-... Y ... 4 100917 Free 3 5 5 Audible, Inc. Smuckitelly Scorchedterra vtalbot
2 Wattpad - Free Books and eBook Reader Books I am absolutely *in love* with Wattpad. Not on... I love this app sooo much I won't be able to l... I love wattpad! I really do! It's my favorite ... 4.59425 313 Discover Wattpad:At Wattpad, we’re connecting ... https://itunes.apple.com/us/app/wattpad-free-b... Y ... 4.5 236764 Free 3 4 5 Wattpad Corp dg2017xx RedPandaWorld Kajdisksn
3 NOOK - Read Books, Magazines, Newspapers & Comics Books I would give it 5 stars if the App would put t... Glad to have Nook on my iPad but wish maneuver... I love reading books electronically. Having un... 4.33955 1072 Get the FREE NOOK Reading App for your iPad, i... https://itunes.apple.com/us/app/nook-read-book... Y ... 4 55593 Free 4 3 3 Barnes&Noble Morgan737364737 Too old to tap Rstocky7
4 HOOKED - Chat Stories Books Keeps you on the edge of your seat. If you lov... It's a great app and I love the chills (even i... I just got through the first three parter and ... 4.54263 129 HOOKED lets you read amazing chat stories FREE... https://itunes.apple.com/us/app/hooked-chat-st... Y ... 4.5 28004 Free 4 2 4 Telepathic Inc. 100% Honest Feedback Abbidon Gangrel_Bloodfang

5 rows × 22 columns


In [97]:
len(app)


Out[97]:
5894

Categorical Variables

This part builds a frequency table for each categorical variable.

Category


In [31]:
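def frequecy_table(var_name):
    # Count how often each level occurs; the levels end up in the index.
    table = pd.DataFrame(app[var_name].value_counts())
    # Move the levels out of the index into a regular 'index' column.
    table.reset_index(level = 0, inplace = True)
    # Share of each level among all apps.
    table['percentage'] = table[var_name]/table[var_name].sum()
    return table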

In [32]:
frequecy_table('category')


Out[32]:
index category percentage
0 Games 380 0.064472
1 Lifestyle 327 0.055480
2 Entertainment 271 0.045979
3 News 264 0.044791
4 Health & Fitness 258 0.043773
5 Photo & Video 252 0.042755
6 Sports 252 0.042755
7 Food & Drink 250 0.042416
8 Education 248 0.042077
9 Business 248 0.042077
10 Music 246 0.041737
11 Travel 245 0.041568
12 Books 244 0.041398
13 Social Networking 242 0.041059
14 Shopping 242 0.041059
15 Medical 241 0.040889
16 Productivity 241 0.040889
17 Reference 241 0.040889
18 Catalogs 240 0.040719
19 Finance 240 0.040719
20 Weather 240 0.040719
21 Utilities 240 0.040719
22 Navigation 240 0.040719
23 Magazines & Newspapers 2 0.000339
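
The frequency table above relies on `value_counts` naming its count column after the variable, which is what the pandas version used here does. Newer pandas releases (2.0 and later) name that column `count` instead, so the same function may fail there. A minimal sketch of a variant that does not depend on that naming is given below; the function name `frequency_table`, its `df` argument, and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def frequency_table(df, var_name):
    """Counts and shares of each level of a categorical column (illustrative helper)."""
    counts = df[var_name].value_counts()  # sorted by count, descending
    return pd.DataFrame({
        var_name: counts.index,                          # the levels
        'count': counts.to_numpy(),                      # occurrences per level
        'percentage': counts.to_numpy() / counts.sum(),  # share per level
    })

# Toy data, not the app data set.
toy = pd.DataFrame({'price': ['Free', 'Free', 'Free', '$0.99']})
result = frequency_table(toy, 'price')
print(result)
```

Building the columns explicitly keeps the output layout the same regardless of how `value_counts` labels its result.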

Multiple Languages


In [33]:
frequecy_table('multiple languages')


Out[33]:
index multiple languages percentage
0 N 3087 0.523753
1 Y 2807 0.476247

Price


In [34]:
frequecy_table('price')


Out[34]:
index price percentage
0 Free 5730 0.972175
1 $0.99 36 0.006108
2 $2.99 33 0.005599
3 $3.99 27 0.004581
4 $1.99 27 0.004581
5 $4.99 21 0.003563
6 $9.99 8 0.001357
7 $6.99 3 0.000509
8 $5.99 3 0.000509
9 $24.99 1 0.000170
10 $13.99 1 0.000170
11 $19.99 1 0.000170
12 $29.99 1 0.000170
13 $16.99 1 0.000170
14 $14.99 1 0.000170

Most apps (97.2%) are free, so the variable 'price' carries little information and may be discarded in the following analysis.

Multiple Devices


In [35]:
frequecy_table('multiple devices')


Out[35]:
index multiple devices percentage
0 Y 5881 0.997794
1 N 13 0.002206

Almost all apps (99.8%) support multiple devices, so the variable 'multiple devices' may also be discarded in the following analysis.
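
The decision to discard a variable dominated by a single level can be made systematic. The sketch below flags categorical columns whose most frequent level covers at least a given share of the rows. The helper names `dominant_share` and `near_constant_columns`, the 0.95 threshold, and the toy data are assumptions chosen for illustration; the notebook itself makes this call by inspecting the tables.

```python
import pandas as pd

def dominant_share(series):
    """Return the most frequent level and the fraction of rows it covers."""
    shares = series.value_counts(normalize=True)  # sorted descending
    return shares.index[0], float(shares.iloc[0])

def near_constant_columns(df, columns, threshold=0.95):
    """List columns whose most frequent level covers at least `threshold` of rows.

    The 0.95 default is an arbitrary illustrative cut-off, not a value from the notebook.
    """
    flagged = []
    for col in columns:
        _, share = dominant_share(df[col])
        if share >= threshold:
            flagged.append(col)
    return flagged

# Toy data, not the app data set: 'price' is dominated by one level, 'lang' is balanced.
toy = pd.DataFrame({
    'price': ['Free'] * 97 + ['$0.99'] * 3,
    'lang': ['Y'] * 48 + ['N'] * 52,
})
flagged = near_constant_columns(toy, ['price', 'lang'])
print(flagged)
```

A single threshold is a blunt rule: a rare level can still be informative, so the flagged columns are candidates for removal rather than automatic drops.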

Continuous Variables

For each continuous variable, this part computes the mean, variance, median, range, minimum, and maximum, and draws a density plot.

Current Rating


In [71]:
def statistics(var_name):
    # Keep only non-empty, strictly positive values, converted to float.
    table = []
    for each in app[var_name]:
        if each != '' and float(each) > 0:
            table.append(float(each))
    mean = np.mean(table)
    # np.var returns the population variance (ddof=0) by default.
    var = np.var(table)
    median = np.median(table)
    min_1 = np.min(table)
    max_1 = np.max(table)
    range_1 = max_1 - min_1
    dict_1 = {'mean':mean, 'var':var, 'median':median, 'range':range_1, 'min':min_1, 'max':max_1}
    # One-row DataFrame with one column per statistic.
    summary_table = pd.DataFrame.from_dict(dict_1, orient='index').T
    return summary_table

In [72]:
statistics('current rating')


Out[72]:
min max median range var mean
0 1.0 5.0 4.18182 4.0 0.946402 3.838456
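
The same summary can be computed without an explicit Python loop. The sketch below is one possible vectorized version, assuming that empty strings mark missing values and that non-positive values should be dropped, as in the function above. The name `summary_statistics` and the toy series are illustrative and not part of the notebook. One difference to note: `pd.to_numeric(..., errors='coerce')` silently turns any non-numeric string into NaN, whereas `float(each)` would raise an error on anything other than an empty string.

```python
import numpy as np
import pandas as pd

def summary_statistics(series):
    """One-row summary of the positive numeric values in `series` (illustrative helper)."""
    values = pd.to_numeric(series, errors='coerce')  # '' and other non-numeric -> NaN
    values = values[values > 0]                      # drops NaN and non-positive values
    return pd.DataFrame([{
        'min': values.min(),
        'max': values.max(),
        'median': values.median(),
        'range': values.max() - values.min(),
        'var': values.var(ddof=0),  # population variance, matching np.var's default
        'mean': values.mean(),
    }])

# Toy data, not the app data set: mixed strings, an empty string, and a zero.
toy = pd.Series(['4.5', '', '0', '3.0', 5])
summary = summary_statistics(toy)
print(summary)
```

Passing `ddof=0` matters because pandas defaults to the sample variance (`ddof=1`), which would give slightly different numbers from `np.var`.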

In [92]:
def plot_density(var_name):
    # Same filtering as in statistics(): non-empty, strictly positive values.
    table = []
    for each in app[var_name]:
        if each != '' and float(each) > 0:
            table.append(float(each))
    table = pd.DataFrame(table)
    # Kernel density estimate of the filtered values.
    table.plot(kind = "density")
    plt.legend(labels = [var_name], loc='upper left')
    plt.title('Distribution of %s'%(var_name))
    plt.show()

In [93]:
plot_density('current rating')


Current Reviews


In [73]:
statistics('current reviews')


Out[73]:
min max median range var mean
0 5.0 111910.0 61.0 111905.0 1.275234e+07 593.843338

In [94]:
plot_density('current reviews')


Overall Rating


In [74]:
statistics('overall rating')


Out[74]:
min max median range var mean
0 1.0 5.0 4.0 4.0 0.656913 3.782541

In [95]:
plot_density('overall rating')


Overall Reviews


In [75]:
statistics('overall reviews')


Out[75]:
min max median range var mean
0 5.0 2959259.0 1776.5 2959254.0 7.773057e+09 19384.681849

In [96]:
plot_density('overall reviews')


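The summary statistics above suggest that none of the four variables is close to a normal distribution. Both review counts are strongly right-skewed (for current reviews the mean is 593.8 against a median of 61; for overall reviews, 19384.7 against 1776.5), and both ratings are bounded between 1 and 5 with the mean below the median, which points to a left skew.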
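
A numerical check can complement the density plots when judging how close a variable is to a normal distribution. The sketch below computes the sample skewness and the mean-to-median ratio before and after a log transform. It uses synthetic log-normal data as a stand-in for a review-count column, so the numbers it prints say nothing about the app data itself; the variable names and distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data standing in for a review-count column.
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=4.0, sigma=1.5, size=5000))

# Skewness near 0 is consistent with a symmetric distribution;
# a large positive value indicates a long right tail.
raw_skew = counts.skew()
log_skew = np.log10(counts).skew()

# For a symmetric distribution the mean and median roughly coincide.
mean_to_median = counts.mean() / counts.median()

print('skewness (raw):   %.2f' % raw_skew)
print('skewness (log10): %.2f' % log_skew)
print('mean / median:    %.2f' % mean_to_median)
```

If a count variable turns out to be strongly right-skewed, plotting or modelling it on a log scale is a common remedy.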

