Import the modules we need to scrape a website: requests, bs4, and csv.
In [ ]:
import requests
from bs4 import BeautifulSoup
import csv
Make a request to the webpage url that we are scraping. The url is: https://s3-us-west-2.amazonaws.com/nicar-2015/Weekly+Rankings+-+Weekend+Box+Office+Results+++Rentrak.html
In [ ]:
r = requests.get('https://s3-us-west-2.amazonaws.com/nicar-2015/Weekly+Rankings+-+Weekend+Box+Office+Results+++Rentrak.html')
Assign the html code from that site to a variable
In [ ]:
html = r.text
print(html)
# Alternatively, to read the page from a local file in the html/ dir, uncomment the next two lines
# r = open('../project2/html/movies.html', 'r')
# html = r.read()
Parse the html
In [ ]:
soup = BeautifulSoup(html, "html.parser")
print(soup)
Isolate the table
In [ ]:
table = soup.find('table',{'class':'entChartTable'})
print(table)
Find the rows. At the same time, we use slicing to skip the first two header rows.
In [ ]:
rows = table.find_all('tr')
# print(rows)
# skip the first two rows, which are not data
rows = rows[2:]
# print(rows)
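To see what the slice does, here is a minimal sketch on a made-up table. The HTML string and its cell values are invented for illustration and are not taken from the Rentrak page; it assumes, as the step above does, that the first two rows are headers.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two header rows, then two data rows.
sample_html = """
<table class="entChartTable">
  <tr><th>Weekend</th></tr>
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td>Movie A</td><td>$10</td></tr>
  <tr><td>Movie B</td><td>$5</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "entChartTable"})

# find() returns None when nothing matches, so check before going further.
if sample_table is None:
    raise ValueError("no table with class entChartTable found")

all_rows = sample_table.find_all("tr")
data_rows = all_rows[2:]  # drop the two header rows

first_cells = [row.find("td").text.strip() for row in data_rows]
print(len(all_rows), len(data_rows), first_cells)
```

Checking for `None` before calling `find_all` turns a confusing `AttributeError` into a clear message if the page layout ever changes.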
We are going to use the csv module's DictWriter to write out our results. The DictWriter requires two things when we create it: the file and the fieldnames. First, open our output file:
In [ ]:
csvfile = open("../project2/data/movies.csv","w", newline="")
Next specify the fieldnames.
In [ ]:
fieldnames = [
    "title",
    "world_box_office",
    "international_box_office",
    "domestic_box_office",
    "world_cume",
    "international_cume",
    "domestic_cume",
    "international_distributor",
    "number_territories",
    "domestic_distributor"
]
Point our csv.DictWriter at the output file and specify the fieldnames along with other necessary parameters.
In [ ]:
output = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=',',quotechar='"',quoting=csv.QUOTE_MINIMAL)
output.writeheader()
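As a quick check of what `writeheader()` and `writerow()` produce, this sketch writes to an in-memory `io.StringIO` buffer instead of a file. The two field names and the sample row are placeholders, not data from the page.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for the output file
demo_fields = ["title", "world_box_office"]

demo_writer = csv.DictWriter(
    buffer,
    fieldnames=demo_fields,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
)
demo_writer.writeheader()
# The comma inside the value makes QUOTE_MINIMAL wrap that field in quotes.
demo_writer.writerow({"title": "Movie A", "world_box_office": "$1,000"})

csv_text = buffer.getvalue()
print(csv_text)
```

With `QUOTE_MINIMAL`, only fields that contain the delimiter, the quote character, or a line break are quoted, which matters here because box office figures usually contain commas.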
In [ ]:
# loop through the rows
for row in rows:
    # grab the table cells from each row
    cells = row.find_all('td')
    # create a dictionary and assign the cell values to keys in our dictionary
    result = {
        "title" : cells[0].text.strip(),
        "world_box_office" : cells[1].text.strip(),
        "international_box_office" : cells[2].text.strip(),
        "domestic_box_office" : cells[3].text.strip(),
        "world_cume" : cells[4].text.strip(),
        "international_cume" : cells[5].text.strip(),
        "domestic_cume" : cells[6].text.strip(),
        "international_distributor" : cells[7].text.strip(),
        "number_territories" : cells[8].text.strip(),
        "domestic_distributor" : cells[9].text.strip()
    }
    # write the variables out to a csv file
    output.writerow(result)
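The loop assumes every row has at least ten cells; a row with fewer would raise an `IndexError`. One possible guard, sketched here with plain lists of strings standing in for the `td` cells, pairs the field names with the cell text using `zip` and skips rows of the wrong length. The helper name `row_to_dict` and the sample values are hypothetical.

```python
def row_to_dict(fieldnames, cell_texts):
    """Pair each field name with its cell text, or return None if the counts differ."""
    if len(cell_texts) != len(fieldnames):
        return None  # wrong number of cells: probably a spacer or summary row
    return {name: text.strip() for name, text in zip(fieldnames, cell_texts)}


demo_fields = ["title", "world_box_office", "domestic_distributor"]

good = row_to_dict(demo_fields, [" Movie A ", "$10", "Studio X"])
short = row_to_dict(demo_fields, ["Movie B"])
print(good)
print(short)
```

In the notebook's loop, `cell_texts` would be something like `[cell.text for cell in cells]`, and a `None` result would mean the row is skipped rather than written.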
Close the csv file to officially finish writing to it.
In [ ]:
csvfile.close()
#win
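An alternative to calling `close()` by hand is a `with` block, which closes the file even if an error interrupts the loop. This is a minimal sketch that writes to a temporary directory; the file name `demo.csv` and the single row are placeholders.

```python
import csv
import os
import tempfile

demo_fields = ["title", "world_box_office"]

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.csv")

    # The file is closed automatically when the with block ends.
    with open(demo_path, "w", newline="") as demo_file:
        writer = csv.DictWriter(demo_file, fieldnames=demo_fields)
        writer.writeheader()
        writer.writerow({"title": "Movie A", "world_box_office": "$10"})

    file_is_closed = demo_file.closed

    # Read the file back to confirm the header and the row were written.
    with open(demo_path, newline="") as demo_file:
        rows_read = list(csv.DictReader(demo_file))

print(file_is_closed, rows_read)
```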