In [32]:
import pandas as pd
import json
json_data = open('../views/sample/input00.in') # Edit this to where you have put the input00.in file
data = []
for line in json_data:
    data.append(json.loads(line))
# the file contains two bare integer lines (record counts) among the JSON blobs; drop them
data.remove(9000)
data.remove(1000)
df = pd.DataFrame(data)
df['anonymous'] = df['anonymous'].map({False: 0, True: 1}).astype(int)
cleaned_df = df[:9000]
# to make reading the question_text cells easier, remove the maximum column width
pd.set_option('display.max_colwidth', -1)  # -1 means unlimited here; newer pandas versions use None
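By the way, relying on remove() with the literal count values (9000 and 1000) is a little brittle. A more defensive sketch, assuming the bare count lines are the only non-dict entries in the file:
# keep only dict records, dropping the integer count lines whatever their values
with open('../views/sample/input00.in') as f:
    data = [rec for rec in (json.loads(line) for line in f) if isinstance(rec, dict)]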
In [33]:
from plotnine import *
At Quora, context_topic has been deprecated since 2015. It used to be the primary topic assigned to every new question, and the tag was not visible to viewers, so they had no way to filter on it (for this reason, I expect context_topic_followers not to contribute to view count).
Let's look at the rows with a missing context_topic, the primary topic of the question. Although, as per the prompt, every question's context_topic JSON blob is said to be included in the topics array of each question, it is not. Let's investigate those with missing primary topics and try to derive some insight.
In [100]:
df[['question_text', 'topics', 'context_topic']].head()
Out[100]:
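Before digging in, a quick count of how many rows are missing a primary topic (a minimal check; json.loads turns the missing blobs into None, which pandas treats as null):
# how many of the 9000 questions have no primary topic?
cleaned_df['context_topic'].isnull().sum()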
But we want to see the mean of cleaned_df.__ans__ with the ctm_w_ans rows removed. So, let's make a boolean column indicating whether context_topic is missing and build an np.mean pivot table.
In [35]:
# make a copy of cleaned_df
data_df = cleaned_df.copy()
In [36]:
# create the column
data_df['context_present'] = data_df['context_topic'].apply(lambda x: 0 if x is None else 1)
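Here is a minimal sketch of the np.mean pivot table mentioned above, using the context_present column we just created:
import numpy as np

# mean __ans__ for rows with and without a primary topic
pd.pivot_table(data_df, values='__ans__', index='context_present', aggfunc=np.mean)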
Let's see what happens if we create a column out of the product of num_answers and context_present.
In [37]:
test = data_df['context_present'] * data_df['num_answers']
In [38]:
context_xnum = pd.DataFrame({'context_xnum': test})
In [39]:
test.corr(data_df.__ans__)
Out[39]:
We see a moderate correlation, with a coefficient of 0.362, between the product context_xnum and __ans__.
In [40]:
(ggplot(pd.concat([data_df, context_xnum], axis=1), aes(x='context_xnum', y='__ans__')) + geom_point() + geom_smooth(method='lm') + theme_bw())
Out[40]:
In [41]:
(ggplot(pd.concat([data_df, context_xnum], axis=1), aes(x='context_xnum', y='__ans__', color='factor(anonymous)')) +
geom_point() +
stat_smooth(method='lm') + facet_wrap('~anonymous'))
Out[41]:
Surprisingly, questions missing a context topic have a higher __ans__ than those that have one. You would expect the opposite, assuming questions missing a primary topic are the ones that were ignored or didn't garner enough attention. But quite the opposite seems to be true.
In [99]:
data_df[data_df.context_present == 0][13:16]
Out[99]:
Another insight from the missing-context_topic dataframe is that there are rows with missing topics. These went undetected last time because they are empty arrays rather than NaN or missing values.
In [43]:
topics_present = data_df['topics'].apply(lambda x: 0 if len(x) == 0 else 1)
In [44]:
topics_present.value_counts()
Out[44]:
In [45]:
topics_present.corr(data_df['__ans__'])
Out[45]:
In [46]:
test_1 = topics_present * test # 0 if topics aren't present or num_answers = 0 or context_topic not present.
In [47]:
test_1.corr(data_df['__ans__'])
Out[47]:
The correlation hasn't changed, so at this point there is nothing more to do with the empty topics rows.
In [48]:
# Short questions
data_df['len_question_text'] = data_df.question_text.apply(len)
# plot
(ggplot(data_df, aes(x='len_question_text')) + geom_density() + theme_bw())
Out[48]:
Let's look at questions with fewer than 15 characters.
In [49]:
data_df[(data_df.len_question_text < 15)][['__ans__', 'question_text', 'len_question_text', 'num_answers']].sort_values(by="__ans__", ascending=False)
Out[49]:
In [149]:
# correlation of `num_answers` with a boolean column marking questions of at most int_ characters
def NumAnsCorrNumChar(int_):
    question_shorter_than_int_ = data_df['len_question_text'].apply(lambda x: 1 if x <= int_ else 0)
    return question_shorter_than_int_.corr(data_df['num_answers'])

# correlation of `__ans__` with the product of num_answers and the ...
# ... boolean column marking questions of at most int_ characters
def TargetCorrNumChar(int_):
    question_shorter_than_int_ = data_df['len_question_text'].apply(lambda x: 1 if x <= int_ else 0)
    prod = question_shorter_than_int_ * data_df['num_answers']
    return prod.corr(data_df['__ans__'])
In [159]:
some_list = []
for i in data_df['len_question_text'].values:
    some_list.append(NumAnsCorrNumChar(i))
In [160]:
min(some_list), max(some_list) #7 and 15 characters
Out[160]:
This is not useful; I did it out of curiosity.
How about the correlation between the number of characters and __ans__?
In [151]:
some_list_1 = []
for i in data_df['len_question_text'].values:
    some_list_1.append(TargetCorrNumChar(i))
In [152]:
min(some_list_1), max(some_list_1) #7 and 15 characters
Out[152]:
In [153]:
some_list_1.index(max(some_list_1)) # the index of the max, without hardcoding the value
Out[153]:
In [155]:
data_df.iloc[141]['len_question_text'] # the number of characters
Out[155]:
So this question length maximizes the correlation with __ans__. Observe that most of the questions at the bottom of the list have no question marks; so, let's explore that.
In [56]:
data_df[(~data_df.question_text.str.contains("[?]"))][['__ans__', 'question_text', 'num_answers']].sort_values(by="__ans__", ascending=False).head()
Out[56]:
In [57]:
q_df = data_df.question_text.apply(lambda x: 0 if '?' in x else 1)  # 1 if the question has no question mark
In [58]:
q_df.corr(data_df.__ans__)
Out[58]:
Nothing significant!
Let me start by clarifying the title of this section. I want to investigate whether the appearance of certain words in the question_text affects views (__ans__, which is technically the ratio of views to age, but I will just call it views). Also, when I write "the correlation of some word," I mean the correlation of __ans__ with a boolean column indicating whether the question text in that row contains that word.
To make our lives easier, let's automate the process.
In [59]:
# correlation helpers: CorrOR checks for any of the comma-separated words, CorrAND for all of them
def CorrOR(str_):
    # build a regex that matches any of the words, e.g. 'cat, dog' -> 'cat|dog'
    joined = '|'.join(str_.split(', '))
    # boolean column: 1 if the question text contains any of the words
    combined_df = data_df.question_text.str.contains(joined).astype(int)
    return combined_df.corr(data_df.__ans__)

def CorrAND(str_):
    split = str_.split(', ')
    # boolean column: 1 if the question text contains all of the words
    combined_df = data_df.question_text.apply(lambda x: 1 if all(word in x for word in split) else 0)
    return combined_df.corr(data_df.__ans__)
#Since Quora has a huge userbase in India and Pakistan, let's start there
CorrOR('Pakistan, Pakistani, India, Indian, IIT, Delhi, Modi'), CorrAND('India, Pakistan')
Out[59]:
In [60]:
combination_test_1 = data_df.question_text.apply(lambda x: 1 if any(pd.Series(x).str.contains('Pakistan|Pakistani|India|Indian|IIT|Delhi|Modi')) else 0)
combination_test_2 = data_df.question_text.apply(lambda x: 1 if all(words in x for words in ['India', 'Pakistan']) else 0)
combination_test_1.corr(data_df.__ans__), combination_test_2.corr(data_df.__ans__)
Out[60]:
Ok, the functions work!
Now let's create another question_text filter.
In [61]:
# Superlatives
# note: 'furthest/farthest' and 'oldest/eldest' are split into separate entries so the substring check can match them
sup_list = ['angriest', 'worst', 'biggest', 'bitterest', 'blackest', 'blandest', 'bloodiest', 'bluest', 'boldest', 'bossiest', 'bravest', 'briefest', 'brightest', 'broadest', 'busiest', 'calmest', 'cheapest', 'chewiest', 'chubbiest', 'classiest', 'cleanest', 'clearest', 'cleverest', 'closest', 'cloudiest', 'clumsiest', 'coarsest', 'coldest', 'coolest', 'craziest', 'creamiest', 'creepiest', 'crispiest', 'cruellest', 'crunchiest', 'curliest', 'curviest', 'cutest', 'dampest', 'darkest', 'deadliest', 'deepest', 'densest', 'dirtiest', 'driest', 'dullest', 'dumbest', 'dustiest', 'earliest', 'easiest', 'faintest', 'fairest', 'fanciest', 'furthest', 'farthest', 'fastest', 'fattest', 'fewest', 'fiercest', 'filthiest', 'finest', 'firmest', 'fittest', 'flakiest', 'flattest', 'freshest', 'friendliest', 'fullest', 'funniest', 'gentlest', 'gloomiest', 'best', 'grandest', 'gravest', 'greasiest', 'greatest', 'greediest', 'grossest', 'guiltiest', 'hairiest', 'handiest', 'happiest', 'hardest', 'harshest', 'healthiest', 'heaviest', 'highest', 'hippest', 'hottest', 'humblest', 'hungriest', 'iciest', 'itchiest', 'juiciest', 'kindest', 'largest', 'latest', 'laziest', 'lightest', 'likeliest', 'littlest', 'liveliest', 'loneliest', 'longest', 'loudest', 'loveliest', 'lowest', 'maddest', 'meanest', 'messiest', 'mildest', 'moistest', 'narrowest', 'nastiest', 'naughtiest', 'nearest', 'neatest', 'neediest', 'newest', 'nicest', 'noisiest', 'oddest', 'oiliest', 'oldest', 'eldest', 'plainest', 'politest', 'poorest', 'prettiest', 'proudest', 'purest', 'quickest', 'quietest', 'rarest', 'rawest', 'richest', 'ripest', 'riskiest', 'roomiest', 'roughest', 'rudest', 'rustiest', 'saddest', 'safest', 'saltiest', 'sanest', 'scariest', 'shallowest', 'sharpest', 'shiniest', 'shortest', 'shyest', 'silliest', 'simplest', 'sincerest', 'skinniest', 'sleepiest', 'slimmest', 'slimiest', 'slowest', 'smallest', 'smartest', 'smelliest', 'smokiest', 'smoothest', 'softest', 'soonest', 'sorest', 'sorriest', 'sourest', 'spiciest', 'steepest', 'stingiest', 'strangest', 'strictest', 'strongest', 'sunniest', 'sweatiest', 'sweetest', 'tallest', 'tannest', 'tastiest', 'thickest', 'thinnest', 'thirstiest', 'tiniest', 'toughest', 'truest', 'ugliest', 'warmest', 'weakest', 'wealthiest', 'weirdest', 'wettest', 'widest', 'wildest', 'windiest', 'wisest', 'worldliest', 'worthiest', 'youngest']
data_df['qcontains_superlatives'] = data_df.question_text.apply(lambda x: 1 if any(st in x for st in sup_list) else 0)
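Note that a bare substring test also fires on words that merely contain a superlative ('best' matches 'bestow', for instance). A stricter variant, just a sketch reusing sup_list, anchors each word with regex boundaries (the qcontains_superlatives_strict name is only for illustration):
import re

# one compiled pattern with word boundaries, so 'best' will not match 'bestow'
sup_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, sup_list)) + r')\b')
data_df['qcontains_superlatives_strict'] = data_df.question_text.apply(
    lambda x: 1 if sup_pattern.search(x) else 0)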
In [62]:
data_df['qcontains_superlatives'].corr(data_df['__ans__'])
Out[62]:
Ok, not as promising as I thought! Let's go through the list and look for the best one.
In [66]:
# questions containing best, most, or epic
CorrOR('best, most, epic')
Out[66]:
Not as I expected. What if I multiply it by the number of answers?
In [67]:
# correlation between [a boolean column of question texts containing any of the words, multiplied by num_answers] and __ans__
# e.g. CorrORxNumAns('cat, dog, book, animals')
def CorrORxNumAns(str_):
    joined = '|'.join(str_.split(', '))
    combined_df = data_df.question_text.str.contains(joined).astype(int)
    xnum_combined_df = combined_df * data_df['num_answers']
    return xnum_combined_df.corr(data_df.__ans__)
In [68]:
CorrORxNumAns('best, most, epic, university, stories')
Out[68]:
Let's check whether the magnitude of num_answers matters, as opposed to a mere binary indicator of having at least one answer. So let's do the following:
In [69]:
# make a binary indicator of questions with at least one answer
a = pd.DataFrame()
a['num_answers_g0'] = data_df['num_answers'].apply(lambda x: 1 if x != 0 else 0)
In [70]:
# multiply `qcontains_best` by the at-least-one-answer indicator
# (note: the trailing '|' in the pattern matches the empty string, so every row gets a 1 here)
a['qcontains_best'] = data_df.question_text.apply(lambda x: 1 if any(pd.Series(x).str.contains(str('best|most|epic|India|'))) else 0)
b = a['qcontains_best'] * a['num_answers_g0']
b.corr(data_df['__ans__'])
Out[70]:
So it does matter that some questions have more answers than others.
Let's begin with a simple histogram.
In [71]:
(ggplot(data_df, aes(x="num_answers")) +\
geom_histogram(binwidth = 5)
+ theme_bw())
Out[71]:
How about questions with more than 6 answers?
In [75]:
(ggplot(data_df[data_df.num_answers > 6], aes(x="num_answers"))
 + geom_histogram(binwidth=5) + theme_bw())
Out[75]:
Ok, let's see the scatter plot of num_answers against __ans__ (the dependent variable).
In [76]:
(ggplot(data_df, aes(x="num_answers", y="__ans__"))
+ geom_point()
+ geom_smooth(method='lm')
+ theme_bw())
Out[76]:
In [77]:
print('The correlation above is {}.'.format(data_df['num_answers'].corr(data_df['__ans__'])))
In [78]:
(ggplot(data_df, aes('num_answers', '__ans__', color='factor(anonymous)'))
+ theme_bw() #black and white theme
+ geom_point(size=0.2)
+ geom_smooth(aes(color='factor(anonymous)'), method='lm')
+ facet_wrap('~anonymous', nrow=1, scales='free')) # divide the plot by the 'anonymous' column
Out[78]:
In [79]:
corr_anon0 = data_df[data_df['anonymous'] == 0]['num_answers'].corr(data_df[data_df['anonymous'] == 0]['__ans__'])
corr_anon1 = data_df[data_df['anonymous'] == 1]['num_answers'].corr(data_df[data_df['anonymous'] == 1]['__ans__'])
print('For the 0 plot, the coefficient of correlation is {0}, whereas for the 1 plot, it is {1}.'.format(corr_anon0, corr_anon1))
In [80]:
(ggplot(data_df[data_df.num_answers > 20], aes('num_answers', '__ans__', color='factor(anonymous)'))
 + theme_bw()  # complete theme first, so the legend tweak below isn't overridden
 + theme(legend_position="left")
 + geom_point()
 + geom_smooth(aes(color='factor(anonymous)'), method='lm')
 + facet_wrap('~anonymous', nrow=1, scales='free'))
Out[80]:
This raises the question: which num_answers threshold maximizes the correlation?
In [81]:
def Corr_gNumAns(int_):
    # boolean column: 1 if the question has more than int_ answers
    combined_df = data_df.num_answers.apply(lambda x: 1 if x > int_ else 0)
    return combined_df.corr(data_df.__ans__)
In [82]:
Corr_gNumAns(0)
Out[82]:
In [83]:
s_df = pd.DataFrame()
s_df['>=num_ans'] = data_df.num_answers
s = data_df.num_answers.values
In [84]:
# calculate the correlation for thresholds 0..len(s)-1 (superseded by the next cell)
cor = []
for i in range(len(s)):
    cor.append(Corr_gNumAns(i))
In [85]:
# recalculate using each row's actual num_answers value, so cor lines up with s_df
cor = []
for i in s:
    cor.append(Corr_gNumAns(i))
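Each call to Corr_gNumAns recomputes a full correlation over the whole frame, and s contains many repeated values, so this loop does a lot of redundant work. A sketch of a cached version that produces the same row-aligned list:
# compute each distinct threshold once, then map the results back onto the rows
unique_cor = {i: Corr_gNumAns(i) for i in pd.unique(s)}
cor = [unique_cor[i] for i in s]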
In [86]:
cor_df = pd.DataFrame(cor, columns=['cor_coef'])
cor_df.tail()
Out[86]:
Since we have quite a few NaN values, we will only plot the non-NaN ones.
In [87]:
num_corcoef_df = pd.concat([s_df, cor_df, data_df['anonymous'], data_df['__ans__']], axis=1)
In [98]:
num_corcoef_df.head()
Out[98]:
In [89]:
num_corcoef_df.iloc[2451]
Out[89]:
In [90]:
num_corcoef_df.iloc[2451] = num_corcoef_df.iloc[2452]  # replace the NaN row with its neighbour so it doesn't break the plot
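Patching rows one by one won't scale if there are many NaNs; a sketch of the bulk alternative using pandas' dropna:
# drop every row whose correlation is NaN in one pass
plot_df = num_corcoef_df.dropna(subset=['cor_coef'])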
In [91]:
(ggplot(num_corcoef_df, aes('>=num_ans', 'cor_coef', size='__ans__'))
+ geom_point(size=0.2)
+ geom_line(size=0.3)
+ theme_bw())
Out[91]:
In [92]:
(ggplot(num_corcoef_df, aes('>=num_ans', 'cor_coef', size='__ans__', color='factor(anonymous)'))
+ geom_point(size=0.2)
+ geom_line(size=0.3)
+ theme_bw()
+ facet_wrap('~anonymous', scales='free'))
Out[92]:
So where is the absolute max of cor_coef achieved?
In [93]:
num_corcoef_df.loc[num_corcoef_df['cor_coef'].idxmax()]  # idxmax is the non-gory way of doing this
Out[93]:
This tells us that a boolean column marking whether the number of answers is more than 29 gives us a feature with a correlation coefficient of 0.39 with our target label __ans__.
In [105]:
data_df['num_ans>= 29'] = data_df['num_answers'].apply(lambda x: 1 if x >= 29 else 0)  # note: >= 29 here, while the search above used a strict > 29 cutoff
In [107]:
data_df['num_ans>= 29'].describe()
Out[107]:
In [108]:
data_df['num_ans>= 29'].corr(data_df['__ans__'])
Out[108]:
What about stories? One common theme I see among popular questions on Quora is questions asking for the best stories, one-liners, etc. Let's see if questions with these terms have a more significant correlation than the features we built earlier.
In [110]:
data_df[data_df.question_text.str.contains(' story|Story|stories|Stories')][['question_text', '__ans__']].describe() # the leading space before 'story' keeps it from matching 'history'
Out[110]:
There are 54 of these.
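As an aside, a word-boundary regex would catch a sentence-initial 'Story' without the leading-space trick and still exclude 'history'. A sketch of that check:
# \bstor(?:y|ies)\b matches story/stories as whole words, case-insensitively, but not 'history'
data_df.question_text.str.contains(r'\bstor(?:y|ies)\b', case=False).sum()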
In [111]:
# Best stories
data_df['qcontains_best_story'] = data_df.question_text.str.contains(' story|Story|stories|Stories').astype(int)
In [112]:
# do a correlation
data_df['qcontains_best_story'].corr(data_df['__ans__'])
Out[112]:
But it's not promising.
In [113]:
# Best one-liners
# first let's get a hint of the keywords associated with "liners"
data_df['qcontains_best_liner'] = data_df.question_text.str.contains('liners').astype(int)
print(data_df['qcontains_best_liner'].corr(data_df['__ans__']))
data_df[data_df.question_text.str.contains('liners')][['question_text', '__ans__']]
Out[113]:
It wasn't a bad observation after all, even though I am surprised by how few questions mention one-liners. That probably explains why the correlation is quite low.
We will continue exploring keywords in the next notebook!