In [1]:
import requests, pandas
# BeautifulSoup 4 is distributed as the bs4 package
from bs4 import BeautifulSoup
In [2]:
url = "http://craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text"
In [3]:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
If you open the website and use the Inspect Element feature of Google Chrome, you can see that this table has no class or ID, but it does have a style attribute with the value width:100%;margin-top:10px;. We can use that to identify the correct table on the page.
In [4]:
table = soup.find("table",attrs={"style":"width:100%;margin-top:10px;"})
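To see why matching on the style attribute works, here is a minimal sketch on a made-up HTML snippet. The sample_html string and its contents are assumptions for illustration only, not the real page, and the sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables, only the second carries the style we want.
sample_html = """
<table><tr><td>navigation</td></tr></table>
<table style="width:100%;margin-top:10px;"><tr><td>1</td><td><b>Sample Ale</b></td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A string given in attrs is compared against the whole attribute value,
# so it has to be identical to what appears in the page source.
sample_table = sample_soup.find("table", attrs={"style": "width:100%;margin-top:10px;"})

# Text of the first cell of the matched table
first_cell = sample_table.find("td").text
print(first_cell)
```

Because the match is on the exact style string, even a small change in the page's markup (an added space, for example) would make find return None, so it is worth checking the result before using it.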
Now that we have found the table, we need to go row by row, read all the columns in each row and save the text inside. Let's save each row as a dictionary and append all the dictionaries to a list (which gives us a JSON-like structure). Please note that the BEER column is a bit different: the value inside the table cell is in bold (i.e. wrapped in a <b> tag). Thus we should first find the <b> tag, and only then take its text content.
In [5]:
# find all the rows of the table and save them into the rows variable
rows = table.findAll("tr")
# create an empty list to be filled in with dictionaries
data_list = []
# for each row in the list of rows:
for row in rows:
    columns = row.findAll("td")  # find all columns in that row
    # and create a dictionary, where we give the key and get the text content as value
    beer = {
        "id": columns[0].text,
        "beer": columns[1].find('b').text,
        "brewery": columns[2].text,
        "location": columns[3].text,
        "style": columns[4].text,
        "size": columns[5].text,
        "abv": columns[6].text,
        "ibu": columns[7].text
    }
    # append the dictionary to the list
    data_list.append(beer)
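The row-by-row idea can also be tried on a small, made-up table. The sketch below is not the notebook's own cell: the sample HTML, the three-column layout and the guard that skips rows without a <b> tag are assumptions for illustration. Whether the real page contains such rows is not stated here, but a guard like this avoids an IndexError or AttributeError if it does. It uses find_all, the BeautifulSoup 4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row built from <th> cells
sample_html = """
<table>
<tr><th>ENTRY</th><th>BEER</th><th>BREWERY</th></tr>
<tr><td>2.</td><td><b>Sample Ale</b></td><td>Example Brewing</td></tr>
<tr><td>1.</td><td><b>Demo Lager</b></td><td>Placeholder Brewery</td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

sample_list = []
for row in sample_rows:
    columns = row.find_all("td")
    # Skip rows that do not look like data rows (too few cells or no <b> tag)
    if len(columns) < 3 or columns[1].find("b") is None:
        continue
    sample_list.append({
        "id": columns[0].text.strip(),
        "beer": columns[1].find("b").text.strip(),
        "brewery": columns[2].text.strip(),
    })

print(sample_list)
```

The strip() calls are optional; they only remove surrounding whitespace that sometimes ends up inside table cells.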
Let's see the result. The first 5 dictionaries should be enough.
In [7]:
data_list[:5]
Out[7]:
If you are more comfortable working with DataFrames, then the conversion can easily be done.
In [8]:
data = pandas.DataFrame(data_list)
In [9]:
data.head()
Out[9]:
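As a small illustration of what the conversion does, the sketch below builds a DataFrame from a hand-written list of dictionaries. The values are made up, and passing columns= is an optional extra (not used in the cell above) that fixes the column order explicitly.

```python
import pandas

# Hypothetical records with the same shape as the scraped dictionaries
sample_list = [
    {"id": "2", "beer": "Sample Ale", "abv": "5.0%"},
    {"id": "1", "beer": "Demo Lager", "abv": "4.5%"},
]

# Each dictionary becomes one row; the keys become the column names.
sample_frame = pandas.DataFrame(sample_list, columns=["id", "beer", "abv"])

print(sample_frame.shape)
print(list(sample_frame.columns))
```

Note that every value is still a string at this point, because the scraper stored the text content of each cell.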
This time, let's save the resulting data to a JSON file.
In [10]:
import json
with open("craftcans.json", "w") as f:
    json.dump(data_list, f, sort_keys=True, indent=4)
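A quick way to check a file written like this is to load it back and compare it with the list in memory. The sketch below does that with a one-item sample list and a temporary path, both of which are assumptions for illustration, so it does not touch craftcans.json.

```python
import json
import os
import tempfile

# Hypothetical data and a throwaway file location
sample_list = [{"id": "1", "beer": "Demo Lager"}]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")

# Write the list the same way as above: sorted keys, 4-space indentation
with open(sample_path, "w") as f:
    json.dump(sample_list, f, sort_keys=True, indent=4)

# Read it back and compare with the original list
with open(sample_path) as f:
    loaded = json.load(f)

print(loaded == sample_list)
```

sort_keys and indent only affect how the file looks on disk; the data that json.load returns is the same either way.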