The explosive growth of the mobile application (app) market has made it difficult for users to find the most interesting and relevant apps among the hundreds of thousands that exist today. We want to analyze the quality of apps and understand users' experiences with mobile apps on the iTunes App Store. We think these insights would help developers design better apps and help Apple manage the App Store. For example, a good app recommendation system would rely on understanding app quality by predicting ratings. In this project, we go through the data science cycle: problem and data curation, data management, data analytics, and result-oriented presentation through data visualization.
We are particularly interested in the following questions:
In our workflow, we first used scrapy, a high-level web scraping framework, to collect our data from the iTunes App Store, and stored the data in MongoDB, a NoSQL database. We explored the data with pandas and used statsmodels for hypothesis testing. Then we used nltk for NLP, scikit-learn for machine learning and gensim for topic modeling. For visualization, we used plotly and pyLDAvis, and we share our insights through a Jupyter notebook. We collaborated on and managed the project in our shared GitHub repository, which is based on the cookiecutter data science project template.
In [34]:
from IPython.display import Image
Image(filename='workflow.PNG',width=800, height=800)
Out[34]:
To collect our data, we found that Apple's App Store web pages provide a lot of useful information and share a consistent design pattern, which makes web scraping easier. Note that we only collected the first index page of each category, since these pages list the most popular apps. From those pages we scraped the links to the individual apps. An example, Spotify Music, is shown below. Our goal is to collect and store the information on the web page of each app. We also derived some interesting characteristics, such as Is_Multilingual, Is_Multiplatform and Has_InAppPurchased. In the end, we collected about 5600 unique apps from the iTunes App Store and stored them in MongoDB.
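As a rough illustration of how flags such as Is_Multilingual, Is_Multiplatform and Has_InAppPurchased could be derived from one scraped record, here is a minimal sketch. The input keys (`languages`, `devices`, `in_app_purchases`), the helper name `derive_flags` and the example record are all hypothetical; the actual scraper may store these fields differently.

```python
def derive_flags(item):
    """Add boolean characteristics to one scraped app record (a plain dict).

    The keys 'languages', 'devices' and 'in_app_purchases' are hypothetical
    placeholders for whatever the scraper actually stores.
    """
    flags = dict(item)  # shallow copy so the scraped record is left untouched
    flags['Is_Multilingual'] = int(len(item.get('languages', [])) > 1)
    flags['Is_Multiplatform'] = int(len(item.get('devices', [])) > 1)
    flags['Has_InAppPurchased'] = int(bool(item.get('in_app_purchases')))
    return flags


# Made-up record, for illustration only
example = {
    'name': 'Example App',
    'languages': ['English', 'French'],
    'devices': ['iPhone'],
    'in_app_purchases': [],
}
flagged = derive_flags(example)
```

Storing the flags as 0/1 integers keeps them easy to filter on later, for example when splitting apps into groups for a hypothesis test.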
In [37]:
Image(filename='webscrape1.PNG',width=850, height=850)
Out[37]:
In [36]:
Image(filename='webscrape2.PNG',width=850, height=850)
Out[36]:
In [5]:
# for data manipulation
import numpy as np
import pandas as pd
# for MongoDB connection
import pymongo
import matplotlib.pyplot as plt
# for statistical hypothesis testing
import scipy.stats
%matplotlib inline
In [6]:
# for interactive plotting
import plotly.plotly as py
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.set_config_file(offline=True, theme='ggplot')
#print __version__ # requires version >= 1.9.0
In [7]:
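def read_mongo(collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB and query the given collection of the 'appstore' database
    # (username and password are accepted but not used by this connection)
    with pymongo.MongoClient(host, port) as client:
        table = client.appstore[collection]
        # query=None matches every document and avoids a mutable default argument
        df = pd.DataFrame(list(table.find(query or {})))
    # Delete the _id column if it is present
    if no_id and '_id' in df.columns:
        del df['_id']
    return df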
In [9]:
apps_df = read_mongo('appitems')
rating_cleaned = {'1 star':1, "1 and a half stars": 1.5, '2 stars': 2, '2 and a half stars':2.5, "3 stars":3, "3 and a half stars":3.5, "4 stars": 4,
'4 and a half stars': 4.5, "5 stars": 5}
apps_df.overall_rating = apps_df.overall_rating.replace(rating_cleaned)
cate_cnt = apps_df.groupby(['category', 'overall_rating'])['id'].count().reset_index()
rate_cate_cnt = cate_cnt.pivot_table(index = 'category', columns = 'overall_rating', values = 'id', fill_value= 0)
rate_cate_cnt.iplot(kind = 'bar', barmode = 'stack', yTitle='Number of Apps', title='Distribution of Apps by Category and Rating',
colorscale = 'Paired', theme='white', labels = 'Rating')
We can see that Entertainment, Lifestyle and Photo are the largest categories in our collection by number of apps. In addition, the Health & Fitness and Photo categories have the largest proportions of highly rated apps (overall rating > 4).
We found that the dataset has both a current rating (numerical) and an overall rating (categorical). The overall rating is often not updated until a new version is released, so it can be biased. We therefore define a more balanced metric of app quality to support our further analysis: the current rating is weighted by the share of ratings that belong to the current version, and the overall rating by the remaining share. Although the metric is not perfect, it helps correct the bias for apps whose overall rating has not yet absorbed the current ratings.
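One way to check a statement about the proportion of highly rated apps per category is to compute that share directly. The sketch below does so in plain Python on a few made-up records; the helper name `high_rated_share` and the toy values are illustrative assumptions, not figures from our dataset.

```python
from collections import defaultdict


def high_rated_share(apps, threshold=4.0):
    """Return {category: fraction of rated apps with overall_rating > threshold}."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for app in apps:
        rating = app.get('overall_rating')
        if rating is None:  # skip apps that have no overall rating
            continue
        totals[app['category']] += 1
        if rating > threshold:
            high[app['category']] += 1
    return {category: high[category] / totals[category] for category in totals}


# Toy records, for illustration only
toy_apps = [
    {'category': 'Photo', 'overall_rating': 4.5},
    {'category': 'Photo', 'overall_rating': 5.0},
    {'category': 'Photo', 'overall_rating': 3.0},
    {'category': 'Lifestyle', 'overall_rating': 4.0},
    {'category': 'Lifestyle', 'overall_rating': 4.5},
    {'category': 'Lifestyle', 'overall_rating': None},
]
shares = high_rated_share(toy_apps)
```

Note that the comparison is strict, so an app rated exactly 4 does not count as highly rated.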
In [12]:
# rating_df is assumed to be defined in an earlier cell that is not shown here
# Weight the current rating by its share of all ratings, and the overall rating by the rest
w = rating_df['num_current_rating'] / rating_df['num_overall_rating']
rating_df['weighted_rating'] = w * rating_df['current_rating'] + (1 - w) * rating_df['overall_rating']
rating_df[['weighted_rating', 'current_rating','overall_rating']].iplot(kind='histogram', barmode='stack', theme='white', title = 'Distribution of Rating Metrics')
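The weighted rating in the cell above blends the two ratings by the share of ratings that belong to the current version: weighted = (n_current / n_overall) * current + (1 - n_current / n_overall) * overall. A minimal scalar sketch with one worked example follows. The function name and the choice to return `None` when an app has no ratings are assumptions made for illustration.

```python
def weighted_rating(num_current, num_overall, current, overall):
    """Blend the current-version and all-version ratings by rating share."""
    if not num_overall:
        # Assumption: leave the metric undefined when there are no ratings at all
        return None
    w = num_current / num_overall  # share of ratings given to the current version
    return w * current + (1 - w) * overall


# 50 of 200 ratings are for the current version, so w = 0.25
example = weighted_rating(50, 200, 4.0, 3.0)  # 0.25 * 4.0 + 0.75 * 3.0 = 3.25
```

When every rating belongs to the current version, the metric reduces to the current rating; when none do, it reduces to the overall rating.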
When looking at the reviews, there seem to be two schools of thought about paid apps among users. The first is the folk wisdom that "you get what you pay for", which usually shows up as positive reviews. The second is that expensive apps are not worth buying, so those users complain about the price. We want to run a statistical test to see whether in-app purchases significantly affect the user experience of apps. Here, we use the weighted rating derived in the previous step as a proxy for the user experience of mobile apps.
We can answer this question with hypothesis testing. Since the distribution of ratings is not normal, the t-test and one-way ANOVA are not appropriate. We therefore use the Kruskal-Wallis H-test, a non-parametric test that does not assume normality; its main requirement is that the observations are independent.
$H_0$: The medians of the two groups are the same. $H_1$: The medians of the two groups are different.
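To make the test statistic concrete, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic for two groups, including the usual tie correction. It is an illustration rather than a replacement for `scipy.stats.kruskal`; the helper names are hypothetical, and the p-value shortcut relies on the fact that with two groups H is compared against a chi-square distribution with one degree of freedom.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_two_groups(x, y):
    """Return (H, p) for two samples; fails if all pooled values are identical."""
    pooled = list(x) + list(y)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_x = sum(ranks[:len(x)])
    rank_sum_y = sum(ranks[len(x):])
    h = 12 / (n * (n + 1)) * (rank_sum_x ** 2 / len(x) + rank_sum_y ** 2 / len(y)) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied groups
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    h = max(h / correction, 0.0)  # guard against tiny negative rounding errors
    # Survival function of a chi-square with 1 degree of freedom
    p = math.erfc(math.sqrt(h / 2))
    return h, p


h_stat, p_value = kruskal_two_groups([1, 2, 3], [4, 5, 6])
```

A small p-value leads us to reject $H_0$. The statistic depends only on ranks, which is why the test does not need the ratings to be normally distributed.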
In [13]:
rating_cols = ['name', 'overall_rating', 'current_rating', 'num_current_rating', 'num_overall_rating']
has_rating = pd.notnull(apps_df['overall_rating'])
# Split the rated apps into those without and with in-app purchases
free_df = apps_df.loc[(apps_df['is_InAppPurcased'] == 0) & has_rating, rating_cols].copy()
paid_df = apps_df.loc[(apps_df['is_InAppPurcased'] == 1) & has_rating, rating_cols].copy()
# Same weighted rating as before: current rating weighted by its share of all ratings
for df in (free_df, paid_df):
    w = df['num_current_rating'] / df['num_overall_rating']
    df['weighted_rating'] = w * df['current_rating'] + (1 - w) * df['overall_rating']
free = list(free_df['weighted_rating'])
paid = list(paid_df['weighted_rating'])
scipy.stats.kruskal(free, paid)
Out[13]:
In [30]:
Image('ht1.PNG', width=500, height=500)
Out[30]: