Reddit Data with BigQuery

by Max Woolf

This notebook is the companion to my blog post How to Analyze Every Reddit Submission and Comment, in Seconds, for Free.


In [38]:
options(warn=-1)

# IMPORTANT: This assumes that all packages in "Rstart.R" are installed,
# and the fonts "Source Sans Pro" and "Open Sans Condensed Bold" are installed
# via extrafont. If ggplot2 charts fail to render, you may need to change/remove the theme call.

source("Rstart.R")
library(tidyr)
library(bigrquery)
library(methods) # needed for query_exec in Jupyter: https://github.com/hadley/bigrquery/issues/32
library(wordcloud)
library(digest)

options(repr.plot.mimetypes = 'image/png', repr.plot.width=4, repr.plot.height=3, repr.plot.res=300)

bigrquery

This uses the bigrquery R package to query the data. Ensure that it is set up correctly, with your own project ID from Google BigQuery.


In [4]:
project_name <- <FILL IN>   # DO NOT SHARE!
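
On the first query, bigrquery opens a browser window to authenticate with Google via OAuth and caches the resulting token for later sessions. A minimal sketch to verify that authentication and the project ID work (reusing the submissions table queried below):

# Row counts read table metadata, so this query should process little to no data.
test_sql <- "SELECT COUNT(*) FROM [fh-bigquery:reddit_posts.full_corpus_201509]"
query_exec(test_sql, project=project_name)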

Hello World!

A simple query to test things out: count the number of Reddit submissions per day.


In [10]:
sql <- "SELECT DATE(SEC_TO_TIMESTAMP(created)) date_submission,
COUNT(*) as num_submissions
FROM [fh-bigquery:reddit_posts.full_corpus_201509]
GROUP BY date_submission
ORDER BY date_submission"

df <- tbl_df(query_exec(sql, project=project_name, max_pages=Inf))
df %>% tail(10)


Out[10]:
   date_submission num_submissions
1       2015-08-23          170999
2       2015-08-24          163107
3       2015-08-25          264787
4       2015-08-26          235858
5       2015-08-27          212472
6       2015-08-28          206100
7       2015-08-29          180039
8       2015-08-30          183686
9       2015-08-31          214685
10      2015-09-01           10299

Now we can plot it with ggplot2. (The sharp drop on the final day, 2015-09-01, is likely because the corpus contains only part of that day.)


In [19]:
plot <- ggplot(df, aes(x=as.Date(date_submission), y=num_submissions)) +
            geom_area(fill="#2980b9", alpha=0.85, size=0) +
            fte_theme() +
            scale_x_date(breaks=date_breaks("1 year"), labels=date_format("%Y")) +
            scale_y_continuous(breaks=pretty_breaks(8), labels=comma) +
            labs(x="Date of Submission", y="# of Submissions", title="Daily # of Reddit Submissions from 2006 - 2015")

max_save(plot, "reddit-bigquery-1", "Reddit")
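
(max_save is a helper defined in Rstart.R; it saves the ggplot2 chart as a PNG, with the second and third arguments supplying the filename and the data-source attribution.)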

When is the best time to submit to reddit for virality?

Create a heatmap of highly-scored submissions by day of week and hour. The query subtracts 5 hours (60*60*5 seconds) from the UTC creation timestamps so that DAYOFWEEK() and HOUR() report Eastern Standard Time.
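
A quick sanity check of the offset arithmetic in plain R (a sketch; note that a fixed UTC-5 shift ignores daylight saving time):

utc <- as.POSIXct("2015-08-19 03:00:00", tz="UTC")
format(utc - 60*60*5, "%H %A")   # "22 Tuesday": 3 AM Wednesday UTC is 10 PM Tuesday EST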


In [21]:
sql <- "SELECT
  DAYOFWEEK(SEC_TO_TIMESTAMP(created - 60*60*5)) as sub_dayofweek,
  HOUR(SEC_TO_TIMESTAMP(created - 60*60*5)) as sub_hour,
  SUM(IF(score >= 3000, 1, 0)) as num_gte_3000
FROM [fh-bigquery:reddit_posts.full_corpus_201509]
GROUP BY sub_dayofweek, sub_hour
ORDER BY sub_dayofweek, sub_hour"

df <- tbl_df(query_exec(sql, project=project_name, max_pages=Inf))
df %>% tail(10)


Out[21]:
   sub_dayofweek sub_hour num_gte_3000
1              7       14         1001
2              7       15          893
3              7       16          890
4              7       17          806
5              7       18          807
6              7       19          763
7              7       20          769
8              7       21          705
9              7       22          620
10             7       23          505

A few tweaks to format the numeric day-of-week and hour values into readable labels:


In [25]:
dow_format <- data_frame(sub_dayofweek = 1:7, dow_format = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

hour_format <- data_frame(sub_hour = 0:23, hour_format = c(paste(c(12,1:11),"AM"), paste(c(12,1:11),"PM")))

df_time <- df %>% left_join(dow_format) %>% left_join(hour_format)

df_time %>% tail(10)


Joining by: "sub_dayofweek"
Joining by: "sub_hour"
Out[25]:
   sub_dayofweek sub_hour num_gte_3000 dow_format hour_format
1              7       14         1001   Saturday        2 PM
2              7       15          893   Saturday        3 PM
3              7       16          890   Saturday        4 PM
4              7       17          806   Saturday        5 PM
5              7       18          807   Saturday        6 PM
6              7       19          763   Saturday        7 PM
7              7       20          769   Saturday        8 PM
8              7       21          705   Saturday        9 PM
9              7       22          620   Saturday       10 PM
10             7       23          505   Saturday       11 PM

In [28]:
# Necessary for correct order when plotting.
df_time$dow_format <- factor(df_time$dow_format, levels = rev(dow_format$dow_format))
df_time$hour_format <- factor(df_time$hour_format, levels = hour_format$hour_format)

plot <- ggplot(df_time, aes(x=hour_format, y=dow_format, fill=num_gte_3000)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6),
          legend.title = element_blank(),
          legend.position = "top",
          legend.direction = "horizontal",
          legend.key.width = unit(1, "cm"),
          legend.key.height = unit(0.25, "cm"),
          legend.margin = unit(-0.5, "cm"),
          panel.margin = element_blank()) +
    labs(x = "Hour of Reddit Submission (Eastern Standard Time)",
         y = "Day of Week of Reddit Submission",
         title = "# of Reddit Submissions Which Received 3000+ Points, by Time of Original Submission") +
    scale_fill_gradient(low = "white", high = "#27ae60", labels=comma, breaks=pretty_breaks(6))

max_save(plot, "reddit-bigquery-2", "Reddit", w=6)

Which words in comments lead to the most upvotes?


In [153]:
# Note that in R, the backslashes and double quotes inside the regex must themselves be escaped.

sql <- "SELECT word, COUNT(*) as num_words, AVG(score) as avg_score
FROM(FLATTEN((
  SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\\.\\\",*:()\\[\\]/|\\n]', ' ')), ' ') word, score
  FROM [fh-bigquery:reddit_comments.2015_08] 
  WHERE author NOT IN (SELECT author FROM [fh-bigquery:reddit_comments.bots_201505])
    AND subreddit=\"news\"
  ), word))
GROUP EACH BY word
HAVING num_words >= 10000
ORDER BY num_words DESC"

df <- tbl_df(query_exec(sql, project=project_name, max_pages=Inf))
df %>% head(10)


Running query:   RUNNING  3.3s
11.3 gigabytes processed
Out[153]:
   word num_words avg_score
1   the    860688  10.56031
2    to    566054  9.885569
3     a    510322  9.933583
4   and    419449  10.13845
5    of    387376   9.68622
6  that    319336  8.988705
7    is    310461  8.917468
8     i    291533  8.348729
9   you    283140   6.48695
10   in    277130  9.831895
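
To see the pattern that BigQuery actually receives once R unescapes the string, you can print it by itself (a quick check, not part of the analysis):

# R turns \\ into \ and \" into ", so the regex sent to BigQuery is a
# character class of punctuation to be replaced with spaces:
cat("[\\.\\\",*:()\\[\\]/|\\n]")   # prints [\.\",*:()\[\]/|\n]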

Create a word cloud using the wordcloud package. (I may do a separate post on how to make word clouds.)


In [161]:
stop_words <- unlist(strsplit(paste0(
    "a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,",
    "be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,",
    "ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,",
    "however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,",
    "most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,",
    "rather,said,say,says,she,should,since,so,some,than,that,the,their,them,",
    "then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,",
    "when,where,which,while,who,whom,why,will,with,would,yet,you,your,id,",
    "item,it's,don't"), ","))

pal <- brewer.pal(9, "Purples")
pal <- pal[-c(1:3)]   # Remove light colors

df_nostop <- df %>% filter(!(word %in% stop_words))

png(filename = "reddit-bigquery-3.png", width = 1000, height = 1000, res= 300)

wordcloud(toupper(df_nostop$word),
          df_nostop$num_words,
          scale=c(5,.1),
          random.order=F,
          rot.per=.10,
          max.words=5000,
          colors=pal,
          family="Avenir Next Condensed Bold",
          random.color=T)

dev.off()


Out[161]:
pdf: 2
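
(The pdf: 2 is simply the return value of dev.off(): the number and name of the graphics device that becomes active after the PNG device closes.)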

Subreddit Comment Monthly Active Users


In [149]:
# Query is about 53GB; use with caution!
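# The TABLE_QUERY filter keeps the per-year (e.g. 2007) and per-month
# (e.g. 2015_08) comment tables while excluding helper tables such as
# bots_201505, whose table_id is 8+ characters long.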

sql <- "SELECT subreddit, date, unique_authors FROM
(SELECT subreddit, date, unique_authors, ROW_NUMBER() OVER (PARTITION BY date ORDER BY unique_authors DESC) rank FROM
(SELECT subreddit, LEFT(DATE(SEC_TO_TIMESTAMP(created_utc)), 7) as date, COUNT(UNIQUE(author)) as unique_authors
FROM TABLE_QUERY([fh-bigquery:reddit_comments], \"table_id CONTAINS \'20\' AND LENGTH(table_id)<8\")
GROUP EACH BY subreddit, date
))
WHERE rank <= 20
ORDER BY date ASC, unique_authors DESC"

df <- tbl_df(query_exec(sql, project=project_name, max_pages=Inf))
df %>% tail(10)


Running query:   RUNNING 11.9s
53.3 gigabytes processed
Out[149]:
           subreddit    date unique_authors
1               news 2015-08         107419
2               gifs 2015-08         106822
3             movies 2015-08         101296
4      AdviceAnimals 2015-08          99190
5     Showerthoughts 2015-08          76849
6                aww 2015-08          71682
7               IAmA 2015-08          67675
8  explainlikeimfive 2015-08          60421
9  mildlyinteresting 2015-08          60346
10             Music 2015-08          59769

In [151]:
df_subreddit <- df %>% mutate(date_format=paste(date,"-01",sep=''))

system("mkdir -p subreddit-ranks")

# Assign colors to subreddits at random using a hash of subreddit name

colorHash <- function(string) {
    colors <- color_palette

    # Map the first 6 hex digits of the name's hash onto the palette;
    # an index of 0 (no valid palette position) falls back to gray.
    index <- strtoi(substr(digest(string), 1, 6), base=36) %% length(colors)

    if (index == 0) { return("#999999") }
    return(colors[index])
}

subredditPlot <- function(month) {
    df_subset <- df_subreddit %>% filter(date_format==month)
    
    subreddit_colors <- unlist(lapply(df_subset$subreddit, colorHash))

    df_subset$subreddit <- factor(df_subset$subreddit, levels=rev(df_subset$subreddit))
        
    left_labels <- ifelse(df_subset$unique_authors > max(df_subset$unique_authors) * 0.90,
                             format(df_subset$unique_authors, big.mark=","), '')
    right_labels <- ifelse(df_subset$unique_authors < max(df_subset$unique_authors) * 0.90,
                             format(df_subset$unique_authors, big.mark=","), '')
    
    plot <- ggplot(df_subset, aes(x=subreddit, y=unique_authors, fill=subreddit)) +
                geom_bar(stat="identity") +
                geom_text(label=left_labels, size=2, hjust=1.25, color="white", family="Open Sans Condensed Bold") +
                geom_text(label=right_labels, size=2, hjust=-0.25, color=subreddit_colors, family="Open Sans Condensed Bold") +
                fte_theme() +
                coord_flip() +
                scale_y_continuous(labels=comma, breaks=pretty_breaks(6)) +
                scale_fill_manual(values=rev(subreddit_colors)) +
                theme(axis.text.y = element_text(color=rev(subreddit_colors)), plot.title=element_text(hjust=1), axis.title.y=element_blank()) +
                labs(y="Monthly Unique Commenters in Subreddit", title=sprintf("Subreddits with Greatest # of Distinct Comment Authors in %s", format(as.Date(month), "%B %Y")))
    
                         
    max_save(plot, sprintf("subreddit-ranks/%s", month), "Reddit")
    
}

subredditPlot("2015-08-01")

Loop subredditPlot over every month from August 2010 through August 2015 to create each frame of the GIF.


In [152]:
start_date <- "2010-08-01"

months <- as.character(seq(as.Date(start_date), as.Date("2015-08-01"), "months"))

x <- lapply(months, subredditPlot)
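
The frames in subreddit-ranks/ can then be assembled into an animated GIF outside of R. One option, assuming ImageMagick's convert is available on the system PATH (a sketch; this step is not part of the notebook itself):

system("convert -delay 50 -loop 0 subreddit-ranks/*.png reddit-subreddit-ranks.gif")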