A full-fledged scraper

Import the modules and packages we will need to scrape a website: requests, bs4, and csv


In [ ]:
import requests
from bs4 import BeautifulSoup
import csv

Make a request to the URL of the webpage we are scraping. The URL is: https://s3-us-west-2.amazonaws.com/nicar-2015/Weekly+Rankings+-+Weekend+Box+Office+Results+++Rentrak.html


In [ ]:
r = requests.get('https://s3-us-west-2.amazonaws.com/nicar-2015/Weekly+Rankings+-+Weekend+Box+Office+Results+++Rentrak.html')
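
Before going any further, it can help to confirm the request actually succeeded. This optional check is a small sketch: requests' raise_for_status() raises an error for a 4xx or 5xx response, and status_code should print 200 when everything worked.


In [ ]:
# optional sanity check: stop here if the download failed
r.raise_for_status()
print(r.status_code)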

Assign the HTML from that response to a variable


In [ ]:
html = r.text
print(html)

Alternatively, to read the page from the local copy in the html/ directory, uncomment and run the next two lines instead


In [ ]:
# r = open('../project2/html/movies.html', 'r')
# html = r.read()
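
If you do use the local copy, one optional variation is to read it inside a with block so Python closes the file for you; this sketch assumes the same ../project2/html/movies.html path.


In [ ]:
# read the saved page with a context manager so the file closes automatically
with open('../project2/html/movies.html', 'r') as f:
    html = f.read()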

Parse the HTML


In [ ]:
soup = BeautifulSoup(html, "html.parser")
print(soup)
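
Printing the whole soup can be a lot to scroll through. One lighter-weight way to confirm the parse worked is to preview just the start of the prettified document; the 500-character cutoff below is arbitrary.


In [ ]:
# preview only the first part of the parsed document
print(soup.prettify()[:500])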

Isolate the table


In [ ]:
table = soup.find('table',{'class':'entChartTable'})
print(table)
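
Note that if the class name passed to find() does not match exactly, table will be None and the next step will fail with an AttributeError. A quick optional check like this sketch makes that easier to spot:


In [ ]:
# make sure we actually found the table before moving on
if table is None:
    print("No table with class 'entChartTable' found -- check the page source")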

Find the rows and, at the same time, use slicing to skip the first two header rows.


In [ ]:
rows = table.find_all('tr')
# print(rows)
# skip the first two header rows
rows = rows[2:]
# print(rows)
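
To confirm the slice did what we expect, it can help to count the remaining rows and peek at the text of the first data row; this optional check is just a sketch before we write the full loop.


In [ ]:
# how many data rows are left after skipping the headers?
print(len(rows))
# peek at the cell text in the first data row
print([cell.text.strip() for cell in rows[0].find_all('td')])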

We are going to use the csv module's DictWriter to write out our results. The DictWriter requires two things when we create it: the file and the fieldnames. First, open our output file:


In [ ]:
csvfile = open("../project2/data/movies.csv","w", newline="")

Next, specify the fieldnames. These will become the header row of the csv and the dictionary keys we use in the loop below.


In [ ]:
fieldnames = [
    "title", 
    "world_box_office", 
    "international_box_office", 
    "domestic_box_office", 
    "world_cume", 
    "international_cume", 
    "domestic_cume", 
    "international_distributor", 
    "number_territories", 
    "domestic_distributor"
]

Point our csv.DictWriter at the output file and specify the fieldnames along with other necessary parameters.


In [ ]:
output = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
output.writeheader()

In [ ]:
#loop through the rows
for row in rows:
    #grab the table cells from each row
    cells = row.find_all('td')
    #create a dictionary and assign the cell values to keys in our dictionary
    result = {
        "title" : cells[0].text.strip(),
        "world_box_office" : cells[1].text.strip(),
        "international_box_office" : cells[2].text.strip(),
        "domestic_box_office" : cells[3].text.strip(),
        "world_cume" : cells[4].text.strip(),
        "international_cume" : cells[5].text.strip(),
        "domestic_cume" : cells[6].text.strip(),
        "international_distributor" : cells[7].text.strip(),
        "number_territories" : cells[8].text.strip(),
        "domestic_distributor" : cells[9].text.strip()
    }
    #write the variables out to a csv file
    output.writerow(result)
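
One caveat about the loop above: the hard-coded indexes assume every row has exactly ten cells, so a blank or short row would raise an IndexError. As a sketch of a more defensive alternative (meant to replace the loop above, not to run in addition to it), you could skip short rows and pair the fieldnames with the cells using zip:


In [ ]:
# defensive alternative to the loop above: skip rows without the expected number of cells
for row in rows:
    cells = row.find_all('td')
    if len(cells) < len(fieldnames):
        continue
    result = {name: cell.text.strip() for name, cell in zip(fieldnames, cells)}
    output.writerow(result)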

Close the csv file to finish writing to it


In [ ]:
csvfile.close()
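
As an optional last step, you can spot-check the output by reading the finished file back in with the csv module's DictReader; the path and column names below are the same ones we used when writing.


In [ ]:
# read the finished csv back in and print the first few titles
with open('../project2/data/movies.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        print(row['title'], row['world_box_office'])
        if i >= 4:
            break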

#win