Lets try to get a list of all the years of all of Amitabh Bachchan movies! If you don't know, he's kind of the Sean Connery of India.
BeautifulSoup lets you download webpages and search them for specific HTML entities. You can use this ability to scrape data out of the webpage, or a series of webpages. It is fast and works well. Their documentation is a handy reference.
First you gotta grab the content (I like to use requests for this)
In [1]:
import requests
r = requests.get('http://www.imdb.com/name/nm0000821') # lets look at Amitabh Bachchan's list of movies
How you can make your "beautiful soup"! This turns the HTML into a DOM tree that you can navigate with code.
In [2]:
from bs4 import BeautifulSoup
webpage = BeautifulSoup(r.text, "html.parser")
In [3]:
webpage.title.text
Out[3]:
Or you can search for specific tags. This would get all the links (as DOM elements):
In [4]:
len(webpage.find_all('a'))
Out[4]:
Or you can use good old CSS selectors, to actually find all the years his movies were made in:
In [5]:
len(webpage.select('div.filmo-row span.year_column'))
Out[5]:
Of course, we really want to turn this into a list of years... not DOM elements
In [6]:
raw_year_list = [e.text.strip() for e in webpage.select('div.filmo-row span.year_column')]
In [7]:
'1972' in raw_year_list
Out[7]:
And we can look for messy data:
In [8]:
[year for year in raw_year_list if not year.isnumeric()]
Out[8]:
And we can remove these messy entries (even though that isn't the best thing to do):
In [9]:
year_list = [year for year in raw_year_list if year.isnumeric()]
','.join(year_list)
Out[9]:
In [10]:
import collections
year_freq = collections.Counter(year_list)
for year in sorted(year_freq.keys()):
print str(year)+': '+('+'*year_freq[year])
In [ ]: