MISSION PART ONE: GETTING DATA

  • You are going to scrape the front page of reddit every 4 hours, saving a CSV file that includes:
  1. The title of the post
  2. The number of votes it has (the number between the up and down arrows)
  3. The number of comments it has
  4. What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
  5. When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago")
  6. The URL to the post itself
  7. The URL of the thumbnail image associated with the post

For the purposes of this exercise I shall be scraping https://www.reddit.com/r/funny/, because who doesn't want funny e-mails coming in at 8am every morning? I am going to regret this, aren't I?


In [11]:
import requests
from bs4 import BeautifulSoup

In [12]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

In [13]:
#Grab the reddit site
response = requests.get("https://www.reddit.com/r/funny/", headers=headers)

In [14]:
doc = BeautifulSoup(response.text, 'html.parser')

In [78]:
#doc

In [21]:
#Grab the posts

posts = doc.find_all('div', {'class': 'thing'})
len(posts)


Out[21]:
27

In [83]:
#posts

In [107]:
all_posts = []

#Pull each field out of every post
for post in posts:
    #The title is the text of the <a class="title"> link
    title = post.find('a', {'class': 'title'})
    title_text = title.text.strip()
    #The vote count is in <div class="score">
    vote = post.find('div', {'class': 'score'})
    vote_text = vote.text.strip()
    #The comment count is the text of <a class="bylink">, e.g. "126 comments"
    comment = post.find('a', {'class': 'bylink'})
    comment_text = comment.text.strip()
    #The posting time is the datetime attribute of <time class="live-timestamp">
    #(called timestamp so it does not clash with the time module imported later)
    timestamp = post.find('time', {'class': 'live-timestamp'})['datetime']
    #The post URL is the href of the title link found above
    link = title['href']
    #The thumbnail URL is the src of the first <img>, if the post has one
    thumbnail = post.find('img')
    if thumbnail:
        thumbnail = "http:" + thumbnail['src']
    #The subreddit (item 4) is not collected here: every post on this page is in /r/funny
    funny_posts = {'title': title_text, 'votes': vote_text, 'comments': comment_text, 'timestamp': timestamp, 'link': link, 'thumbnail': thumbnail}
    all_posts.append(funny_posts)
all_posts

#QUESTION: How do we turn the relative /r/... links into full URLs? Do we need regular expressions?


Out[107]:
[{'comments': '126 comments',
  'link': 'https://www.reddit.com/r/unfortunateplacement',
  'thumbnail': 'http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJpkFNcC0viWh-2iQcEc9HrocEZcxw8.jpg',
  'timestamp': '2016-06-01T16:08:54+00:00',
  'title': 'Subreddit Of The Month [June 2016]: /r/unfortunateplacement, "(Un)fortunate Ad Placement". Know of a small (under 10,000 subscribers) humor-based subreddit that deserves a month in the spotlight? Link it inside!',
  'votes': '390'},
 {'comments': '372 comments',
  'link': '/r/funny/comments/4j1nln/irs_phone_scams_and_similar_posts_tldr_dont_post/',
  'thumbnail': None,
  'timestamp': '2016-05-12T16:57:43+00:00',
  'title': "IRS phone scams, and similar posts. tldr - DON'T POST PHONE NUMBERS ON REDDIT",
  'votes': '1283'},
 {'comments': '713 comments',
  'link': 'https://imgur.com/gallery/gJgxe',
  'thumbnail': 'http://b.thumbs.redditmedia.com/JyqfanLHxAJf3fAapXOhSupHU3JcOVwo8AOhWYRMv3c.jpg',
  'timestamp': '2016-06-22T12:12:57+00:00',
  'title': 'Ill bet this was a drunk idea gone right.',
  'votes': '5854'},
 {'comments': '100 comments',
  'link': 'http://imgur.com/1qacZGi',
  'thumbnail': 'http://a.thumbs.redditmedia.com/Rzz12w_DE8QbQmNX9tk9Na6YQz20DryZooNocFa5li8.jpg',
  'timestamp': '2016-06-22T15:58:00+00:00',
  'title': 'Snek',
  'votes': '2737'},
 {'comments': '74 comments',
  'link': '/r/funny/comments/4paw64/the_tooth_fairy/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/fjdQ0-IQYm5RhA25z3rwznNoAFw-YVr0jsJNaWwB1_c.jpg',
  'timestamp': '2016-06-22T14:23:06+00:00',
  'title': 'The Tooth Fairy',
  'votes': '3021'},
 {'comments': '84 comments',
  'link': '/r/funny/comments/4pb83n/i_thought_we_were_past_this/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/G1lEtjgvEddcuF9oyZ4KMIeUyhGMPH9Ke0CuiJVH-Xw.jpg',
  'timestamp': '2016-06-22T15:28:09+00:00',
  'title': 'I thought we were past this..',
  'votes': '2415'},
 {'comments': '60 comments',
  'link': 'http://imgur.com/2JojN3l',
  'thumbnail': 'http://b.thumbs.redditmedia.com/JT_ilJYJUCi_m0tfSfj0h-MIXvzb_2Xab5dJNTgUXHw.jpg',
  'timestamp': '2016-06-22T13:20:53+00:00',
  'title': "The heat's getting ridiculous lately",
  'votes': '3454'},
 {'comments': '54 comments',
  'link': 'http://i.imgur.com/JMWtoSO.jpg',
  'thumbnail': 'http://a.thumbs.redditmedia.com/f3G7Iib-kRx3_qCjzSDfUELzw1aqp-nSQwtEWKqyGY0.jpg',
  'timestamp': '2016-06-22T15:59:42+00:00',
  'title': 'Prom Queen',
  'votes': '1959'},
 {'comments': '51 comments',
  'link': 'http://imgur.com/UDAYf2M',
  'thumbnail': 'http://b.thumbs.redditmedia.com/C_ow6gcDhuSq1vsThDOxndfrc2rxUAE1SB9Xt752Nvs.jpg',
  'timestamp': '2016-06-22T16:54:13+00:00',
  'title': 'Still need teepee for my bunghole.',
  'votes': '1663'},
 {'comments': '60 comments',
  'link': 'http://i.imgur.com/lOFAqLV.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/n8UTPZEap1pbNZJKu0FGcZD28ej-dRCoJ4UWUr84DyU.jpg',
  'timestamp': '2016-06-22T16:48:22+00:00',
  'title': 'How to Adult..',
  'votes': '1521'},
 {'comments': '178 comments',
  'link': '/r/funny/comments/4pa7xz/its_been_a_week_and_no_one_has_noticed_a_problem/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/fDUCJODDWdJ15o2aBzxRdOuqB602_q3N_UHrdwF_Nfk.jpg',
  'timestamp': '2016-06-22T11:44:18+00:00',
  'title': "It's been a week and no one has noticed a problem in the office. It's almost worrying.",
  'votes': '3827'},
 {'comments': '34 comments',
  'link': 'http://i.imgur.com/nuRZv1O.gifv',
  'thumbnail': 'http://a.thumbs.redditmedia.com/qT7PkIVnsieQ39kCxbnzf2NucgBHDExsa6bcJTR2oz8.jpg',
  'timestamp': '2016-06-22T18:31:02+00:00',
  'title': '"I\'m stuck...I\'m stuck! Never mind...I got it!"',
  'votes': '1033'},
 {'comments': '318 comments',
  'link': 'http://imgur.com/gallery/EvuHeeu',
  'thumbnail': 'http://a.thumbs.redditmedia.com/Uh-xFeGsRD3BnMa6maXinluw5pv-p2xuw5Q6bzNaLk0.jpg',
  'timestamp': '2016-06-22T09:49:03+00:00',
  'title': 'Really is the same.',
  'votes': '4622'},
 {'comments': '99 comments',
  'link': 'http://imgur.com/d5c3eVp',
  'thumbnail': 'http://a.thumbs.redditmedia.com/ft0PtajHyi-Jrlh8brmtvcYBkX6MF5EPqBXe4tngv28.jpg',
  'timestamp': '2016-06-22T13:16:47+00:00',
  'title': 'Best button ever',
  'votes': '2062'},
 {'comments': '281 comments',
  'link': 'http://imgur.com/m4Pip8G',
  'thumbnail': 'http://b.thumbs.redditmedia.com/KHoVRBea1MHmrswy9WS8i-5UDD_MxUz0k2IzrinIDqw.jpg',
  'timestamp': '2016-06-22T12:20:21+00:00',
  'title': 'Over Protective Dad',
  'votes': '2198'},
 {'comments': '83 comments',
  'link': 'https://i.reddituploads.com/c1beb2da907f4f15b5ca4df1fd7fa714?fit=max&h=1536&w=1536&s=c6d72d95256e55a1a84dd402f38bbac1',
  'thumbnail': 'http://b.thumbs.redditmedia.com/kgH4GvboOZSBZVe9Cf5Ju_O4myPyQfssmio8hcwOpvY.jpg',
  'timestamp': '2016-06-22T15:52:35+00:00',
  'title': 'Life is though',
  'votes': '1097'},
 {'comments': '48 comments',
  'link': 'http://i.imgur.com/oFFIDTv.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/E-4Zgxs7IN1uBmiARVmr13mlYkgVMrNCMx8t8TDRmSk.jpg',
  'timestamp': '2016-06-22T14:49:52+00:00',
  'title': 'The Lord has returned',
  'votes': '1297'},
 {'comments': '965 comments',
  'link': 'http://i.imgur.com/b4Ooqum.gifv',
  'thumbnail': 'http://b.thumbs.redditmedia.com/NufbdJdT7ElD8oIYz_njwM8jznoO3CFL5-yMqvUh0AY.jpg',
  'timestamp': '2016-06-22T07:47:03+00:00',
  'title': 'The lord of the memes',
  'votes': '4754'},
 {'comments': '37 comments',
  'link': 'http://i.imgur.com/E3xp4jR.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/bTxp_JjGpl1SQ5ODuWQehvRTK6HK3HoEIjnj70DFxdw.jpg',
  'timestamp': '2016-06-22T17:51:17+00:00',
  'title': "My girlfriend recently started working nights. I've been waking up to a lot of worrying texts.",
  'votes': '724'},
 {'comments': '38 comments',
  'link': 'https://4.bp.blogspot.com/-0kvkFCjt17E/V18upnCqMEI/AAAAAAAA8XI/KfgmxOYnHrYjB3B3DX2PkArs5zc7LMnOACLcB/s1600/flip_pleatedjeans.jpg',
  'thumbnail': 'http://a.thumbs.redditmedia.com/h1cMDUiG3F2bMC3lHtaNZDsKJsY-1-dLmvp3NWjhTa8.jpg',
  'timestamp': '2016-06-22T16:13:20+00:00',
  'title': 'Define prophetic',
  'votes': '837'},
 {'comments': '58 comments',
  'link': 'http://i.imgur.com/IApJLYU.png',
  'thumbnail': 'http://b.thumbs.redditmedia.com/jRaQ2tCDNZf3r71uGLD8PMUBVdRu0a0OA6Mhhav0eQw.jpg',
  'timestamp': '2016-06-22T17:40:28+00:00',
  'title': 'She Said We Need To Run Our House More Like A Business. . .',
  'votes': '483'},
 {'comments': '183 comments',
  'link': 'http://imgur.com/gallery/DsfHo',
  'thumbnail': 'http://b.thumbs.redditmedia.com/zRQG0RMEiIJCv-9fKeNbgqIRlyJbDpJ6knogRV_6bFw.jpg',
  'timestamp': '2016-06-22T09:41:42+00:00',
  'title': 'I vape!',
  'votes': '1677'},
 {'comments': '285 comments',
  'link': 'http://i.imgur.com/k0JbGUL.jpg',
  'thumbnail': 'http://a.thumbs.redditmedia.com/Ympe6O1i7wIbWWuaU7QRCqlo5CkabVHB26Wm1N0v064.jpg',
  'timestamp': '2016-06-22T06:45:59+00:00',
  'title': 'Dumbest phone case ever...',
  'votes': '2386'},
 {'comments': '29 comments',
  'link': 'http://imgur.com/GUso0OB',
  'thumbnail': 'http://b.thumbs.redditmedia.com/Ic4RraavhfZFaM8jO1qI7lUCILMiAscvNiU5DonCAaM.jpg',
  'timestamp': '2016-06-22T12:39:47+00:00',
  'title': 'Awkward sign placement',
  'votes': '785'},
 {'comments': '37 comments',
  'link': 'https://i.imgur.com/I7Jry8d.png',
  'thumbnail': 'http://b.thumbs.redditmedia.com/PurZEaKNLEI8IdSWEDiC7KBbzkCF7tiUnFHo7QvEdkQ.jpg',
  'timestamp': '2016-06-22T17:30:20+00:00',
  'title': 'Lindsay Lohan wore the same jacket in her mugshot as she did in a scene from the Parent Trap.',
  'votes': '334'},
 {'comments': '1087 comments',
  'link': 'http://imgur.com/q8LaNvG',
  'thumbnail': 'http://b.thumbs.redditmedia.com/bsXv0zjBkn0uc3nZGMjjBvYBR9oOLG4L7SSpXMHroVY.jpg',
  'timestamp': '2016-06-22T02:15:41+00:00',
  'title': 'Sounds about right',
  'votes': '5209'},
 {'comments': '5 comments',
  'link': 'https://i.reddituploads.com/d07bd700dec94e6496156017f8fbaea1?fit=max&h=1536&w=1536&s=417daee8ba002968f63e296f960ebca7',
  'thumbnail': 'http://b.thumbs.redditmedia.com/vG3EQAiM5bxVXJn3ljAT7_90q192QTx0AW5d13R-vRA.jpg',
  'timestamp': '2016-06-22T17:20:45+00:00',
  'title': "It's a little late but... Happy Father's Day!",
  'votes': '280'}]
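
On the QUESTION in the cell above: regular expressions are probably not needed. One possible approach, sketched here on the assumption that every relative link should resolve against https://www.reddit.com, is urljoin from Python's standard library, which leaves links that are already absolute untouched. The names BASE_URL and absolute_link are made up for this illustration and are not part of the notebook.

```python
from urllib.parse import urljoin

# Assumption: relative links on the page resolve against the reddit domain.
BASE_URL = "https://www.reddit.com"

def absolute_link(link):
    """Return link as a full URL; links that are already absolute come back unchanged."""
    return urljoin(BASE_URL, link)

# A relative link and an absolute link, both taken from Out[107]
print(absolute_link("/r/funny/comments/4paw64/the_tooth_fairy/"))
print(absolute_link("http://imgur.com/1qacZGi"))
```

Inside the loop this would amount to wrapping the href lookup in absolute_link(...) before storing it in the dictionary.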

In [108]:
len(all_posts)


Out[108]:
27

In [109]:
import pandas as pd

In [111]:
posts_df = pd.DataFrame(all_posts)
posts_df.head()


Out[111]:
   comments      link                                               thumbnail                                          timestamp                  title                                              votes
0  126 comments  https://www.reddit.com/r/unfortunateplacement      http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJ...  2016-06-01T16:08:54+00:00  Subreddit Of The Month [June 2016]: /r/unfortu...  390
1  372 comments  /r/funny/comments/4j1nln/irs_phone_scams_and_s...  None                                               2016-05-12T16:57:43+00:00  IRS phone scams, and similar posts. tldr - DON...  1283
2  713 comments  https://imgur.com/gallery/gJgxe                    http://b.thumbs.redditmedia.com/JyqfanLHxAJf3f...  2016-06-22T12:12:57+00:00  Ill bet this was a drunk idea gone right.          5854
3  100 comments  http://imgur.com/1qacZGi                           http://a.thumbs.redditmedia.com/Rzz12w_DE8QbQm...  2016-06-22T15:58:00+00:00  Snek                                               2737
4  74 comments   /r/funny/comments/4paw64/the_tooth_fairy/          http://b.thumbs.redditmedia.com/fjdQ0-IQYm5RhA...  2016-06-22T14:23:06+00:00  The Tooth Fairy                                    3021
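
The comments and votes columns shown above are still strings ("126 comments", "390"), and the timestamp is text too. If real numbers and dates are wanted, something like the following sketch could convert them. The helper names to_int and to_datetime are hypothetical, and returning None when a value contains no digits is an assumption about how to treat cases such as a hidden score.

```python
import re
from datetime import datetime

def to_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def to_datetime(stamp):
    """Parse an ISO 8601 timestamp that carries a UTC offset (needs Python 3.7+)."""
    return datetime.fromisoformat(stamp)

# Values taken from the first rows of the table above
print(to_int("126 comments"))
print(to_int("390"))
print(to_datetime("2016-06-22T14:23:06+00:00"))
```

On the DataFrame, posts_df['comments'].apply(to_int) would apply the same conversion to a whole column.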

In [112]:
import time

In [113]:
datestring = time.strftime("%Y-%m-%d-%H-%M")
datestring


Out[113]:
'2016-06-22-18-55'

In [115]:
filename = "reddit-data" + datestring + ".csv"
posts_df.to_csv(filename, index=False)
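
For comparison, here is a rough sketch of the same export using only the standard library's csv module, which makes the column order explicit. FIELDS and posts_to_csv are illustrative names that are not part of the notebook, the column order simply follows the mission list, and the sample record is copied from Out[107].

```python
import csv
import io

# Assumption: columns in the order of the mission list (the subreddit is not collected).
FIELDS = ["title", "votes", "comments", "timestamp", "link", "thumbnail"]

# One record copied from Out[107]
sample = [{
    'comments': '100 comments',
    'link': 'http://imgur.com/1qacZGi',
    'thumbnail': 'http://a.thumbs.redditmedia.com/Rzz12w_DE8QbQmNX9tk9Na6YQz20DryZooNocFa5li8.jpg',
    'timestamp': '2016-06-22T15:58:00+00:00',
    'title': 'Snek',
    'votes': '2737',
}]

def posts_to_csv(posts, fields=FIELDS):
    """Return the posts as CSV text with a header row, columns in the given order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(posts)
    return buffer.getvalue()

print(posts_to_csv(sample))
```

With pandas, passing columns=FIELDS to to_csv should give the same ordering without leaving the DataFrame.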
