Scraping Webpages with BeautifulSoup

Let's try to get a list of the release years of all of Amitabh Bachchan's movies! If you don't know, he's kind of the Sean Connery of India.

BeautifulSoup lets you parse webpages you have downloaded and search them for specific HTML elements. You can use this ability to scrape data out of a webpage, or a series of webpages. Their documentation is a handy reference.

Getting the Content

First you gotta grab the content (I like to use requests for this):



In [1]:

    
import requests
r = requests.get('http://www.imdb.com/name/nm0000821') # let's look at Amitabh Bachchan's list of movies
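A bare requests.get call will happily hand back an error page, or hang if the server never answers. Here is a minimal sketch of a slightly more defensive fetch. The helper name fetch_html, its session parameter, and the 10-second timeout are my own assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html` is a hypothetical helper; `session` lets you pass in a
    requests.Session (or a stand-in for testing) instead of the module.
    """
    http = session if session is not None else requests
    # timeout stops the call from waiting forever on a silent server
    response = http.get(url, timeout=timeout)
    # raise_for_status turns 4xx/5xx responses into an exception
    response.raise_for_status()
    return response.text
```

You would then call fetch_html('http://www.imdb.com/name/nm0000821') and pass the result to BeautifulSoup instead of r.text.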

Now you can make your "beautiful soup"! This turns the HTML into a DOM tree that you can navigate with code.



In [2]:

    
from bs4 import BeautifulSoup
webpage = BeautifulSoup(r.text, "html.parser")

Scraping the Info You Want

Now there are a few ways to get content out. For instance, to get the title you could treat it like an object:



In [3]:

    
webpage.title.text









    Out[3]:





u'Amitabh Bachchan - IMDb'

Or you can search for specific tags. This finds all the links (as DOM elements) and counts them:



In [4]:

    
len(webpage.find_all('a'))









    Out[4]:





671

Or you can use good old CSS selectors to find the elements that hold the years his movies were made in:



In [5]:

    
len(webpage.select('div.filmo-row span.year_column'))









    Out[5]:





341
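To see the difference between find_all and select without hitting the network, here is a small sketch on a made-up snippet of HTML. The markup below is hypothetical and only shaped to match the selector used above; the live page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to match the selector used above.
SAMPLE_HTML = """
<div class="filmo-row">
  <span class="year_column"> 2016 </span>
  <b><a href="/title/example-1/">Example Film One</a></b>
</div>
<div class="filmo-row">
  <span class="year_column"> 2014/I </span>
  <b><a href="/title/example-2/">Example Film Two</a></b>
</div>
<div class="sidebar">
  <span class="year_column"> 1999 </span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all matches on tag name and attributes, anywhere in the tree.
all_year_spans = soup.find_all("span", class_="year_column")

# select takes a CSS selector, so it can also require a particular ancestor.
filmo_year_spans = soup.select("div.filmo-row span.year_column")

print(len(all_year_spans))                           # 3
print([e.text.strip() for e in filmo_year_spans])    # ['2016', '2014/I']
```

The selector version skips the span in the sidebar because it is not inside a div.filmo-row.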

Of course, we really want to turn this into a list of years, not DOM elements:



In [6]:

    
raw_year_list = [e.text.strip() for e in webpage.select('div.filmo-row span.year_column')]

Cleaning and Analyzing the Data

Now we can check if he made any films in a particular year:



In [7]:

    
'1972' in raw_year_list









    Out[7]:





True

And we can look for messy data:



In [8]:

    
[year for year in raw_year_list if not year.isnumeric()]









    Out[8]:





[u'',
 u'',
 u'2014/I',
 u'2013/I',
 u'2013/I',
 u'2003/I',
 u'2003/I',
 u'1983/I',
 u'1980/I',
 u'',
 u'2014/I',
 u'2015-2016',
 u'2000-2012']

And we can remove these messy entries (even though that isn't the best thing to do, since entries like '2014/I' are real titles that get thrown away):



In [9]:

    
year_list = [year for year in raw_year_list if year.isnumeric()]
','.join(year_list)









    Out[9]:





u'2016,2016,2016,2016,2015,2015,2014,2014,2014,2014,2013,2013,2013,2012,2012,2012,2012,2011,2011,2010,2010,2010,2009,2009,2009,2009,2008,2008,2008,2008,2008,2007,2007,2007,2007,2007,2007,2007,2007,2007,2007,2006,2006,2006,2006,2006,2005,2005,2005,2005,2005,2005,2005,2005,2005,2004,2004,2004,2004,2004,2004,2004,2004,2004,2004,2004,2003,2003,2003,2003,2003,2002,2002,2002,2002,2002,2001,2001,2001,2001,2000,1999,1999,1999,1999,1999,1998,1998,1998,1997,1997,1997,1996,1994,1994,1993,1992,1991,1991,1991,1991,1990,1990,1989,1989,1989,1989,1988,1988,1988,1986,1985,1985,1985,1984,1984,1984,1984,1984,1983,1983,1983,1983,1982,1982,1982,1982,1982,1982,1981,1981,1981,1981,1981,1981,1981,1981,1980,1980,1980,1979,1979,1979,1979,1979,1979,1979,1978,1978,1978,1978,1978,1977,1977,1977,1977,1977,1977,1977,1977,1976,1976,1976,1976,1976,1975,1975,1975,1975,1975,1975,1974,1974,1974,1974,1974,1974,1974,1973,1973,1973,1973,1973,1973,1972,1972,1972,1972,1972,1972,1972,1972,1972,1971,1971,1971,1971,1970,2012,2012,2011,2009,2009,2008,2007,2007,2006,2004,2003,2003,2002,2001,2001,1999,1999,1989,1989,1989,1984,1983,1983,1981,2016,2015,2012,2009,2009,2009,2008,2007,2006,2004,2003,2001,1999,1999,1991,1989,1981,1981,1981,1979,1979,1978,1977,1976,2011,2005,2001,1998,1997,1997,1996,1998,1998,2015,2012,2010,2009,2009,2008,2007,2005,2004,2001,1999,1993,1989,2014,2013,2012,2012,2012,2011,2011,2011,2010,2010,2008,2008,2007,2007,2007,2006,2005,2005,2005,2005,2004,2004,2004,2004,2004,2003,2003,2002,2001,2000,1999,1996,1993,1992,1991,1990,1990,1988,1988,1988,1988,1988,1987,1986,1985,1985,1984,1984,1981,1981,1979,1979,1977,1975,1975,1971,2013,2008,2004,1983'
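If you would rather keep the messy entries than drop them, one option is to pull the first four-digit number out of each string. This is just a sketch: the helper first_year is hypothetical, and treating a range like '2015-2016' as its starting year is an assumption you may not want.

```python
import re

# Matches the first run of four digits, e.g. '2014' in '2014/I'.
YEAR_PATTERN = re.compile(r'\d{4}')


def first_year(raw):
    """Return the first four-digit year in `raw`, or None if there is none."""
    match = YEAR_PATTERN.search(raw)
    return match.group(0) if match else None


# A few of the messy values seen above, plus one clean one.
messy = ['', '2014/I', '2015-2016', '1983']
cleaned = [year for year in (first_year(raw) for raw in messy) if year is not None]
print(cleaned)  # ['2014', '2015', '1983']
```

Empty strings still get dropped, since there is no year to recover from them.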



In [10]:

    
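import collections
# count how many entries fall in each year
year_freq = collections.Counter(year_list)
# print a text histogram, one '+' per entry, oldest year first
for year in sorted(year_freq):
    print(str(year) + ': ' + ('+' * year_freq[year]))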









    



1970: +
1971: +++++
1972: +++++++++
1973: ++++++
1974: +++++++
1975: ++++++++
1976: ++++++
1977: ++++++++++
1978: ++++++
1979: +++++++++++
1980: +++
1981: ++++++++++++++
1982: ++++++
1983: +++++++
1984: ++++++++
1985: +++++
1986: ++
1987: +
1988: ++++++++
1989: +++++++++
1990: ++++
1991: ++++++
1992: ++
1993: +++
1994: ++
1996: +++
1997: +++++
1998: ++++++
1999: +++++++++++
2000: ++
2001: ++++++++++
2002: +++++++
2003: ++++++++++
2004: ++++++++++++++++++++
2005: +++++++++++++++
2006: ++++++++
2007: +++++++++++++++++
2008: +++++++++++
2009: +++++++++++
2010: ++++++
2011: +++++++
2012: +++++++++++
2013: +++++
2014: +++++
2015: ++++
2016: +++++
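The per-year histogram is long, so you might also want a coarser view. Here is a sketch that groups by decade using the same Counter idea. The function name films_per_decade and the sample list are made up for illustration; in the notebook you would pass in year_list.

```python
import collections


def films_per_decade(years):
    """Count year strings by decade, returned as sorted (decade, count) pairs."""
    # Integer-divide by 10 and multiply back to round each year down to its decade.
    counts = collections.Counter((int(year) // 10) * 10 for year in years)
    return sorted(counts.items())


# Made-up sample; in the notebook this would be year_list.
sample_years = ['1972', '1972', '1979', '1981', '2004', '2004', '2007']
for decade, count in films_per_decade(sample_years):
    print('{}s: {}'.format(decade, '+' * count))
```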


