04 Grabbing HTML tables with Pandas

What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/World_Happiness_Report. Can Python help us download those data?

Why yes. Yes it can.

Specifically, we use pandas' read_html function, which identifies the tables in an HTML page and returns them as a list of dataframe objects, one per table.
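
Before pointing read_html at a live page, it may help to see what it does on a tiny page we control. The sketch below is only an illustration: the HTML string and its values are made up, and it assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas

# A tiny hypothetical page with one table (made-up values, for illustration only)
html = """
<html><body>
<table>
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>A</td><td>1.5</td></tr>
  <tr><td>B</td><td>2.5</td></tr>
</table>
</body></html>
"""

# read_html returns a list of dataframes, one per table it finds
tables = pandas.read_html(StringIO(html))
print("{} tables were found".format(len(tables)))
print(tables[0])
```

Wrapping the string in StringIO lets read_html treat it like a file, so nothing is downloaded.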


In [ ]:
#Import pandas
import pandas

In [ ]:
#Here, the read_html function pulls every table it finds at the URL we provide into a list of dataframes.
tableList = pandas.read_html('https://en.wikipedia.org/wiki/World_Happiness_Report')
print("{} tables were found".format(len(tableList)))

In [ ]:
#Let's grab the first one and display its first five rows
df = tableList[0]
df.head()
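
Grabbing a table by position works, but the position can shift if the page is edited. As a possible alternative, read_html accepts a match argument that keeps only the tables containing the given text. The sketch below uses a made-up two-table page; the table contents and the search text are assumptions for illustration, not taken from the Wikipedia article.

```python
from io import StringIO

import pandas

# Hypothetical page with two tables; all values are made up
html = """
<html><body>
<table>
  <tr><th>Year</th><th>Edition</th></tr>
  <tr><td>2012</td><td>First</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Happiness score</th></tr>
  <tr><td>A</td><td>7.5</td></tr>
</table>
</body></html>
"""

# Keep only the tables whose text matches the given string
matched = pandas.read_html(StringIO(html), match="Happiness score")
print("{} tables matched".format(len(matched)))
print(matched[0].columns.tolist())
```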

In [ ]:
#Looks like the first row should be a header; we can fix this by adding the 'header=0' option when reading in the tables
df = pandas.read_html('https://en.wikipedia.org/wiki/World_Happiness_Report',header=0)[0]
df.head()
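
To see what header=0 changes without depending on the live page, here is a small sketch on a made-up table whose first row is written as ordinary cells. The HTML and its values are hypothetical.

```python
from io import StringIO

import pandas

# Hypothetical table whose header row uses plain <td> cells (made-up values)
html = """
<html><body>
<table>
  <tr><td>Country</td><td>Score</td></tr>
  <tr><td>A</td><td>1.5</td></tr>
  <tr><td>B</td><td>2.5</td></tr>
</table>
</body></html>
"""

# Without header=0, the column labels are just integer positions
raw = pandas.read_html(StringIO(html))[0]
print(raw.columns.tolist())

# With header=0, the first row is promoted to column names
fixed = pandas.read_html(StringIO(html), header=0)[0]
print(fixed.columns.tolist())
```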

In [ ]:
#Now we can save it to a local file using df.to_csv()
df.to_csv("Happiness.csv", # The output filename
          index=False,     # We opt not to write out the index
          encoding='utf8') # UTF-8 preserves non-ASCII characters in country names
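
One way to check that the encoding choice does its job is to write a small dataframe with accented names and read it back. This is only a sketch: the dataframe, its scores, and the temporary file name are made up for illustration.

```python
import os
import tempfile

import pandas

# Hypothetical data with non-ASCII country names (scores are made up)
df_demo = pandas.DataFrame({
    "Country": ["Côte d'Ivoire", "São Tomé and Príncipe"],
    "Score": [1.0, 2.0],
})

# Write to a temporary folder so we don't clutter the working directory
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
df_demo.to_csv(path, index=False, encoding="utf8")

# Read the file back using the same encoding and look at the names
df_back = pandas.read_csv(path, encoding="utf8")
print(df_back["Country"].tolist())
```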

In [ ]:
#...or we can examine it
#Here is a quick preview of pandas' plotting capability
%matplotlib inline
df.plot.scatter(x='Social support',y='Healthy life expectancy');