MISSION PART ONE: GETTING DATA

  • You are going to scrape the front page of reddit every 4 hours, saving a CSV file that includes:
  1. The title of the post
  2. The number of votes it has (the number between the up and down arrows)
  3. The number of comments it has
  4. What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
  5. When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago")
  6. The URL to the post itself
  7. The URL of the thumbnail image associated with the post

For the purposes of this exercise I shall be scraping https://www.reddit.com/r/funny/ because who doesn't want funny e-mails coming in at 8am every morning? I am going to regret this, aren't I?


In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

In [3]:
#Grab the reddit site
response = requests.get("https://www.reddit.com/r/funny/", headers=headers)

In [4]:
doc = BeautifulSoup(response.text, 'html.parser')

In [5]:
#doc

In [6]:
#Grab the posts

posts = doc.find_all('div', {'class': 'thing'})
#len(posts)

In [7]:
#posts

In [8]:
all_posts = []

#Grab each field from every post
for post in posts:
    #Title: <a class="title">
    title = post.find('a', {'class': 'title'})
    title_text = title.text.strip()
    #Votes: <div class="score"> (can be '•' when no number is shown)
    vote = post.find('div', {'class': 'score'})
    vote_text = vote.text.strip()
    #Comments: <a class="bylink">
    comment = post.find('a', {'class': 'bylink'})
    comment_text = comment.text.strip()
    #When it was posted: datetime attribute of <time class="live-timestamp">
    #(named timestamp so it does not clash with the time module imported later)
    timestamp = post.find('time', {'class': 'live-timestamp'})['datetime']
    #URL the title points to (relative for self posts)
    link = title['href']
    #Thumbnail URL: src has no scheme, so prepend one
    thumbnail = post.find('img')
    if thumbnail:
        thumbnail = "http:" + (thumbnail['src'])
    funny_posts = {'title': title_text, 'votes': vote_text, 'comments': comment_text, 'timestamp': timestamp, 'link': link, 'thumbnail': thumbnail}
    all_posts.append(funny_posts)
all_posts
    
#QUESTION: How do we turn the relative /r/ links into full URLs? Do we use regular expressions?


Out[8]:
[{'comments': '129 comments',
  'link': 'https://www.reddit.com/r/unfortunateplacement',
  'thumbnail': 'http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJpkFNcC0viWh-2iQcEc9HrocEZcxw8.jpg',
  'timestamp': '2016-06-01T16:08:54+00:00',
  'title': 'Subreddit Of The Month [June 2016]: /r/unfortunateplacement, "(Un)fortunate Ad Placement". Know of a small (under 10,000 subscribers) humor-based subreddit that deserves a month in the spotlight? Link it inside!',
  'votes': '406'},
 {'comments': '378 comments',
  'link': '/r/funny/comments/4j1nln/irs_phone_scams_and_similar_posts_tldr_dont_post/',
  'thumbnail': None,
  'timestamp': '2016-05-12T16:57:43+00:00',
  'title': "IRS phone scams, and similar posts. tldr - DON'T POST PHONE NUMBERS ON REDDIT",
  'votes': '1304'},
 {'comments': '2694 comments',
  'link': 'http://imgur.com/WQ9f3g0',
  'thumbnail': 'http://b.thumbs.redditmedia.com/1ZRp2v23jOSqkzRmfcqjR5fthye4Iz1dRPwea7Bz9YQ.jpg',
  'timestamp': '2016-06-23T12:41:33+00:00',
  'title': 'Army Specialist was denied leave to go to a baby shower because his CO said "Men don\'t go to baby showers", so he changed his reason',
  'votes': '9292'},
 {'comments': '834 comments',
  'link': 'http://i.imgur.com/m0xCyiI.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/sXji7NpPLobngCcxOQUs0z6SvB_z-sH-629dgDQn3qk.jpg',
  'timestamp': '2016-06-23T03:43:15+00:00',
  'title': 'A girl at the fire station after getting stuck in a Barney head',
  'votes': '5853'},
 {'comments': '145 comments',
  'link': 'http://i.imgur.com/tw02cpV.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/ZuCNEoY9po1AeTNf9D3F2ZhfG3gU_RwqY0O4i4fwOVA.jpg',
  'timestamp': '2016-06-23T10:50:10+00:00',
  'title': 'This is it. No more arguing.',
  'votes': '1304'},
 {'comments': '261 comments',
  'link': '/r/funny/comments/4pep0t/turns_out_michael_scott_is_the_original/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/QRzzvyYU6Fa8b-cHOpd9ifHiBr6u-7kSD0CBQLEvjvE.jpg',
  'timestamp': '2016-06-23T03:20:45+00:00',
  'title': 'Turns out Michael Scott is the original r/explainlikeimfive',
  'votes': '3781'},
 {'comments': '84 comments',
  'link': 'http://i.imgur.com/02oarlg.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/BXeCXhbWikrOOMN1Yj-2C3xA_GnCNb6pMxp-o5VUp0Q.jpg',
  'timestamp': '2016-06-23T14:19:23+00:00',
  'title': 'How American people shower.',
  'votes': '484'},
 {'comments': '30 comments',
  'link': 'https://s-media-cache-ak0.pinimg.com/originals/ce/a6/a3/cea6a36743adc33bc438ae97a47e665e.gif',
  'thumbnail': 'http://b.thumbs.redditmedia.com/4A3qbHni8uMYgsFQ6REI2Af10o6rzwnKEoAgP2iekcY.jpg',
  'timestamp': '2016-06-23T09:08:09+00:00',
  'title': 'Oh sorry I booped you',
  'votes': '1046'},
 {'comments': '443 comments',
  'link': 'http://i.imgur.com/cOYywLL.png',
  'thumbnail': 'http://b.thumbs.redditmedia.com/IULOVQ_EUnPJswMuMtv-yGY6tnokrekYhORH9VFT0pU.jpg',
  'timestamp': '2016-06-23T01:08:43+00:00',
  'title': 'Swing and a piss',
  'votes': '4064'},
 {'comments': '88 comments',
  'link': '/r/funny/comments/4peg03/sometimes_its_just_way_too_hot_for_pants/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/jl37vldlXaeTWZfXphMfMflai-EgM_P4PTu0Sja8AAo.jpg',
  'timestamp': '2016-06-23T02:20:29+00:00',
  'title': 'Sometimes its just way too hot for pants',
  'votes': '2809'},
 {'comments': '27 comments',
  'link': 'http://i.imgur.com/fKCs7Ub.gifv',
  'thumbnail': 'http://b.thumbs.redditmedia.com/Hw6vMqip29DOeo7VRP87OWST1f7zXVwd8TuqkKb1UzI.jpg',
  'timestamp': '2016-06-23T13:42:51+00:00',
  'title': '"I have no regrets."',
  'votes': '313'},
 {'comments': '16 comments',
  'link': 'http://i.imgur.com/DRqBzIw.png',
  'thumbnail': 'http://b.thumbs.redditmedia.com/t2EW0e5yLtySiUFzVAaIgTwmTlxBo8yLzNbtLtNwssU.jpg',
  'timestamp': '2016-06-23T07:29:44+00:00',
  'title': 'Lara Croft cosplay',
  'votes': '794'},
 {'comments': '39 comments',
  'link': '/r/funny/comments/4phjkg/what_math_we_should_teach/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/tMZn_vA8eoCrMFSplue8q5mGosNL36Pg4urpgBqWxbI.jpg',
  'timestamp': '2016-06-23T16:19:01+00:00',
  'title': 'What Math We Should Teach',
  'votes': '•'},
 {'comments': '38 comments',
  'link': 'http://imgur.com/qPWL49l',
  'thumbnail': 'http://b.thumbs.redditmedia.com/o6lpw4gGkzHE7M5bP-Tc_NXcZPJPRRVpCZmIS4ol76g.jpg',
  'timestamp': '2016-06-23T03:16:03+00:00',
  'title': 'Dog Rules-A sign at our local animal hospital',
  'votes': '1415'},
 {'comments': '508 comments',
  'link': 'http://i.imgur.com/sStCbCE.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/Ta38y4xziatD6l-mBFfHerFrZpP4wgRGnkyv9zSF4go.jpg',
  'timestamp': '2016-06-22T20:18:34+00:00',
  'title': 'T-Rex Arms',
  'votes': '4947'},
 {'comments': '356 comments',
  'link': 'http://i.imgur.com/c1R6xUM.gifv',
  'thumbnail': 'http://b.thumbs.redditmedia.com/wXo_mv40rGfq5u__pFsOA5XIi2A-o_ZgMBh2bTEoDpo.jpg',
  'timestamp': '2016-06-22T20:59:24+00:00',
  'title': 'Animals reacting to themselves in a mirror.',
  'votes': '4324'},
 {'comments': '7 comments',
  'link': 'http://i.imgur.com/2mrlTUf.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/pd-mCpdNQ9DL6ba5piU4vg_RKZEoTK1G55jSxFjIXRE.jpg',
  'timestamp': '2016-06-23T13:53:31+00:00',
  'title': 'Instructions unclear',
  'votes': '195'},
 {'comments': '152 comments',
  'link': '/r/funny/comments/4pcmpc/cat_sitter_recieves_accurate_descriptions/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/5D3u5d2DrYpOhyfJrviiquafXk76Rb6nhg6nhsTAyWk.jpg',
  'timestamp': '2016-06-22T19:46:19+00:00',
  'title': 'Cat sitter recieves accurate descriptions',
  'votes': '4944'},
 {'comments': '15 comments',
  'link': 'http://funnyasduck.net/wp-content/uploads/2013/02/funny-simpsons-tv-scene-recycled-paper-zero-is-a-percent-pics.jpg',
  'thumbnail': 'http://a.thumbs.redditmedia.com/LkV5JfCyZ7FX4PhBAGjXPW2_sa80K8jIc3RwlvaSoW8.jpg',
  'timestamp': '2016-06-23T01:12:53+00:00',
  'title': 'Simpsons Episode shows how to recycle and profit!!',
  'votes': '1670'},
 {'comments': '559 comments',
  'link': 'http://i.imgur.com/nuRZv1O.gifv',
  'thumbnail': 'http://a.thumbs.redditmedia.com/qT7PkIVnsieQ39kCxbnzf2NucgBHDExsa6bcJTR2oz8.jpg',
  'timestamp': '2016-06-22T18:31:02+00:00',
  'title': '"I\'m stuck...I\'m stuck! Never mind...I got it!"',
  'votes': '5419'},
 {'comments': '17 comments',
  'link': 'http://i.imgur.com/4HN0O12.jpg',
  'thumbnail': 'http://a.thumbs.redditmedia.com/3X5XKMvOKHzavfEk4YIzX-vCZ9wBvEc8MDlVSXJb7Z0.jpg',
  'timestamp': '2016-06-23T12:15:10+00:00',
  'title': "Disgusting! This once proud animal was killed by a teenage girl, and now she's posing with its head as a trophy.",
  'votes': '206'},
 {'comments': '121 comments',
  'link': 'https://i.reddituploads.com/534a3bbc66d04ea7a088de5b840dbb34?fit=max&h=1536&w=1536&s=a2a0be88cc6f6c0440fdaaad65cea89d',
  'thumbnail': 'http://b.thumbs.redditmedia.com/WNowB3YREiAn73XpHGfZj74ZCWIaaXrOlT12VeRnhTU.jpg',
  'timestamp': '2016-06-22T20:55:48+00:00',
  'title': 'This happened when Ireland scored.',
  'votes': '3001'},
 {'comments': '14 comments',
  'link': 'http://imgur.com/qbGPsTh.gifv',
  'thumbnail': 'http://b.thumbs.redditmedia.com/eZnOUb_CY2GnCBsKvH4F2FHdSsjf4MHqzcPO3qjB2yI.jpg',
  'timestamp': '2016-06-23T12:15:39+00:00',
  'title': 'Cannonball!',
  'votes': '177'},
 {'comments': '10 comments',
  'link': 'http://www.pidjin.net/2012/11/27/vampires-pirates/',
  'thumbnail': 'http://b.thumbs.redditmedia.com/g20tDXb4uYRfakqwHFT9fQIS8dlyuETW0Vv2O5jaX8Y.jpg',
  'timestamp': '2016-06-23T12:48:02+00:00',
  'title': 'Vampires and pirates',
  'votes': '156'},
 {'comments': '458 comments',
  'link': 'http://i.imgur.com/E3xp4jR.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/bTxp_JjGpl1SQ5ODuWQehvRTK6HK3HoEIjnj70DFxdw.jpg',
  'timestamp': '2016-06-22T17:51:17+00:00',
  'title': "My girlfriend recently started working nights. I've been waking up to a lot of worrying texts.",
  'votes': '4983'},
 {'comments': '20 comments',
  'link': 'http://imgur.com/IEnq1HT',
  'thumbnail': 'http://b.thumbs.redditmedia.com/qJEfNjo76_-X2BeFucL9Nr_wreDpFh2svkklUUSH49k.jpg',
  'timestamp': '2016-06-23T03:57:15+00:00',
  'title': 'Slimey yet satisfying',
  'votes': '752'},
 {'comments': '15 comments',
  'link': 'http://i.imgur.com/C6CQ9iX.jpg',
  'thumbnail': 'http://b.thumbs.redditmedia.com/H7MgulvDhBAKEvAPY6r4X-rGyY3ne8xD-A7dVlHQDuY.jpg',
  'timestamp': '2016-06-23T12:47:57+00:00',
  'title': 'Week 1 of Fatherhood (2016)',
  'votes': '148'}]
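
On the QUESTION in the cell above: regular expressions are probably not needed. Here is a minimal sketch, assuming the goal is to turn relative links such as /r/funny/comments/... into full URLs. The standard library's urllib.parse.urljoin resolves a link against a base URL and leaves links that are already absolute untouched. BASE_URL and absolute_url are names made up for this illustration, not part of the notebook.

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from
BASE_URL = "https://www.reddit.com/r/funny/"

def absolute_url(link, base=BASE_URL):
    """Return link as a full URL; None (e.g. a missing thumbnail) stays None."""
    if link is None:
        return None
    return urljoin(base, link)

# Relative self-post link taken from the output above
print(absolute_url("/r/funny/comments/4phjkg/what_math_we_should_teach/"))
# External link that is already absolute, so it comes back unchanged
print(absolute_url("http://imgur.com/WQ9f3g0"))
```

The same helper should also cope with scheme-less thumbnail addresses (those starting with //), since urljoin borrows the scheme from the base URL.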

In [9]:
#len(all_posts)
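
Item 4 of the mission (the subreddit) is not collected by the loop above. A rough sketch of one way to recover it, assuming the comments link of each post points at a permalink shaped like /r/<subreddit>/comments/... (the self-post links in Out[8] have this shape; that every comments link does is an assumption). subreddit_from_permalink is a hypothetical helper.

```python
import re

# Matches the subreddit name in paths like /r/funny/comments/...
SUBREDDIT_RE = re.compile(r"/r/([^/]+)/comments/")

def subreddit_from_permalink(permalink):
    """Return '/r/<name>' for a comments permalink, or None if nothing matches."""
    match = SUBREDDIT_RE.search(permalink or "")
    return "/r/" + match.group(1) if match else None

# A self-post link from Out[8]
print(subreddit_from_permalink("/r/funny/comments/4pep0t/turns_out_michael_scott_is_the_original/"))
# An external image link carries no subreddit, so this gives None
print(subreddit_from_permalink("http://i.imgur.com/m0xCyiI.jpg"))
```

Inside the loop this might be called as subreddit_from_permalink(comment['href']), which has not been checked against the live page.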

In [10]:
import pandas as pd

In [11]:
posts_df = pd.DataFrame(all_posts)
posts_df.head()


Out[11]:
   comments       link                                               thumbnail                                          timestamp                  title                                              votes
0  129 comments   https://www.reddit.com/r/unfortunateplacement      http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJ...  2016-06-01T16:08:54+00:00  Subreddit Of The Month [June 2016]: /r/unfortu...  406
1  378 comments   /r/funny/comments/4j1nln/irs_phone_scams_and_s...  None                                               2016-05-12T16:57:43+00:00  IRS phone scams, and similar posts. tldr - DON...  1304
2  2694 comments  http://imgur.com/WQ9f3g0                           http://b.thumbs.redditmedia.com/1ZRp2v23jOSqkz...  2016-06-23T12:41:33+00:00  Army Specialist was denied leave to go to a ba...  9292
3  834 comments   http://i.imgur.com/m0xCyiI.jpg                     http://b.thumbs.redditmedia.com/sXji7NpPLobngC...  2016-06-23T03:43:15+00:00  A girl at the fire station after getting stuck...  5853
4  145 comments   http://i.imgur.com/tw02cpV.jpg                     http://b.thumbs.redditmedia.com/ZuCNEoY9po1AeT...  2016-06-23T10:50:10+00:00  This is it. No more arguing.                       1304
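
The votes and comments columns are still text at this point ('129 comments', and '•' for one post in Out[8]). A small sketch of turning them into numbers before any analysis, assuming a value with no digits in it should be treated as missing. to_number is a hypothetical helper, not something the notebook defines.

```python
import re

def to_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None

print(to_number("2694 comments"))  # comment count with its label stripped
print(to_number("9292"))           # plain vote count
print(to_number("•"))              # no number shown, so None
```

With the DataFrame above it could be applied per column, for example posts_df['votes'].apply(to_number).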

In [12]:
import time

In [13]:
datestring = time.strftime("%Y-%m-%d-%H-%M")
datestring


Out[13]:
'2016-06-23-13-16'
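
Note that time.strftime with no time argument formats the machine's local time, while the post timestamps end in +00:00 (UTC). If the file names should follow the same clock as the posts, a sketch along these lines might do. utc_datestring is a hypothetical name.

```python
from datetime import datetime, timezone

def utc_datestring(now=None):
    """Format a moment as YYYY-MM-DD-HH-MM in UTC (defaults to the current time)."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d-%H-%M")

print(utc_datestring())
```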

In [14]:
filename = "reddit-data" + datestring + ".csv"
posts_df.to_csv(filename, index=False)
