This notebook contains code for pulling and aggregating the bi-daily scrapes from 4chan into one dataframe that can easily be used for further analysis.

If you don't have access to S3, here's an excellent guide on obtaining access: https://github.com/Data4Democracy/tutorials/blob/master/aws/AWS_Boto3_s3_intro.ipynb. Keys can be obtained by asking the moderators in the far-right Slack channel.

This notebook should give anyone a starting point for further analysis, and the text of the messages in these scrapes looks very clean.

If you only want to pull a certain set of dates, just adjust the regex in match_string, or add some more conditionals to the loop which grabs the list of files to be read.
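
As a rough sketch of the date restriction described above, the pattern can be tightened so that only keys whose numeric folder starts with a given prefix are kept. The key layout follows match_string; the sample keys, the reading of the numeric folder as a YYYYMMDD date, and the helper name keys_for_date_prefix are assumptions made for illustration, not something confirmed by the bucket itself.

```python
import re


def keys_for_date_prefix(keys, date_prefix):
    """Return the keys whose numeric date folder starts with date_prefix."""
    # Same layout as match_string, except the numeric folder must begin
    # with date_prefix (assumed here to be a YYYYMMDD-style date).
    pattern = re.compile(
        "info-source/daily/" + re.escape(date_prefix) + "[0-9]*/fourchan/fourchan"
    )
    return [key for key in keys if pattern.search(key)]


# Hypothetical keys, used only to show the filtering behaviour.
sample_keys = [
    "info-source/daily/20170401/fourchan/fourchan.json",
    "info-source/daily/20170402/fourchan/fourchan.json",
    "info-source/daily/20170301/fourchan/fourchan.json",
    "info-source/daily/20170402/twitter/twitter.json",
]

april_keys = keys_for_date_prefix(sample_keys, "201704")
print(april_keys)
```

The same effect can be had by editing match_string directly; a helper like this only makes the date prefix a parameter.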

You can also read fewer files (if you have a slow connection or limited memory) by shortening the files list before the second for loop.
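
One possible way to shorten the list is sketched below. The helper name limit_files is hypothetical, and keeping the last keys after sorting only selects the most recent scrapes if the numeric folders are zero-padded dates, which is an assumption about the key layout.

```python
def limit_files(files, max_files=None):
    """Return at most max_files keys, taken from the end of the sorted list."""
    # Assumption: keys sort chronologically because the date folder is a
    # zero-padded number, so the last entries are the latest scrapes.
    ordered = sorted(files)
    if max_files is None:
        return ordered
    if max_files <= 0:
        return []
    return ordered[-max_files:]


# Hypothetical keys, used only to show the behaviour.
example = [
    "info-source/daily/20170402/fourchan/fourchan.json",
    "info-source/daily/20170331/fourchan/fourchan.json",
    "info-source/daily/20170401/fourchan/fourchan.json",
]
print(limit_files(example, max_files=2))
```

In the notebook this would be called as files = limit_files(files, max_files=10) just before the loop that reads the files.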


In [10]:
import boto3
import pandas as pd
import re
from IPython.display import clear_output

In [5]:
session = boto3.Session(profile_name='default')
s3 = session.resource('s3')
bucket = s3.Bucket("far-right")
session.available_profiles


Out[5]:
['default']

In [15]:
# pandas expects S3 paths in the form s3://<bucket>/<key>
base_url = 's3://far-right/'
match_string = "info-source/daily/[0-9]+/fourchan/fourchan"

# Collect the keys of every 4chan scrape file in the bucket
files = []
print("Getting bucket and files info")
for obj in bucket.objects.all():
    if re.search(match_string, obj.key):
        files.append(obj.key)

# Read each file into its own dataframe, then concatenate them all once
frames = []
for i, file in enumerate(files):
    clear_output()
    print("Loading file: " + str(i + 1) + " out of " + str(len(files)))
    frames.append(pd.read_json(base_url + file))
df = pd.concat(frames) if frames else pd.DataFrame()

clear_output()
print("Completed Loading Files")


Completed Loading Files
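
Listing every object in the bucket can be slow when only one folder is needed. A minimal sketch, assuming all scrape keys share the info-source/daily/ prefix seen in match_string: boto3 bucket collections accept a Prefix argument to filter, so S3 only returns keys under that prefix and the regex is applied to far fewer objects. The helper name list_matching_keys is hypothetical.

```python
import re


def list_matching_keys(bucket, prefix, pattern):
    """List keys under prefix whose full key matches the regex pattern."""
    compiled = re.compile(pattern)
    # bucket is expected to be a boto3 s3.Bucket resource; filter(Prefix=...)
    # restricts the listing on the S3 side before the regex is applied.
    return [
        obj.key
        for obj in bucket.objects.filter(Prefix=prefix)
        if compiled.search(obj.key)
    ]
```

With the bucket defined earlier, this would be called as files = list_matching_keys(bucket, "info-source/daily/", match_string).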

In [12]:
df.shape


Out[12]:
(968, 8)

In [13]:
df.head()


Out[13]:
authors language pub_date pub_time source text_blob title url
0 [Iv9eTFJC] en 2015-01-08 2017-04-02 17:40:44 4chan This board is for the discussion of news, worl... Welcome to /pol/ - Politically Incorrect http://boards.4chan.org/pol/thread/40489590#p4...
1 [UihOc4nM] en 2017-02-03 2017-04-02 20:16:14 4chan To further divide, splinter, and fracture the ... Operation SPLINTER // General http://boards.4chan.org/pol/thread/110922069#p...
2 [beB77TBi] en 2017-02-03 2017-04-02 19:15:03 4chan http://www.express.co.uk/news/uk/76 Brit/pol/ - Papers edition http://boards.4chan.org/pol/thread/110912341#p...
3 [c1Xx6qvr] en 2017-02-03 2017-04-02 19:28:04 4chan 5^7D CHESS Get the popcorn ready http://boards.4chan.org/pol/thread/110914524#p...
4 [80+crxO/] en 2017-02-03 2017-04-02 20:42:10 4chan wtf I hate america now None http://boards.4chan.org/pol/thread/110926048#p...
