Reading HTML Tables into a DataFrame


In [1]:
import urllib2, argparse

In [31]:
from bs4 import BeautifulSoup

In [32]:
import pandas as pd

In [33]:
link = "https://www.globalpolicy.org/component/content/article/109/27519.html"
  • I had previously written the helper functions in a Python file (week_7_code.py) so that errors would be easier to find.

In [34]:
from week_7_code import *

I defined a function, url_download, that can be used on any URL to load the page into Python.


In [35]:
download_link = url_download(link)
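
The body of url_download lives in week_7_code and is not shown in this notebook. As a rough idea of what such a helper might look like, here is a minimal sketch written for Python 3 with urllib.request (the notebook itself imports the Python 2 module urllib2). The name url_download_sketch, the timeout, and the User-Agent header are assumptions made for illustration, not the original implementation.

```python
import urllib.request


def url_download_sketch(url, timeout=10):
    """Fetch a URL and return its body as text (hypothetical stand-in for url_download)."""
    # Assumption: a browser-like User-Agent is sent, since some servers
    # refuse requests that carry the default Python one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when no charset is declared in the response headers.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling url_download_sketch(link) would then return the page source as a single string, ready to hand to a parser.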

After the request, a second function, parse_site, parses the page, finds the table, and returns its title, index, and values to Python as a tuple.


In [36]:
table_data = parse_site(download_link)
  • After parsing, what is returned is a tuple of lists: the column titles (table_data[0]), the index (table_data[1]), and the cell data (table_data[2]).
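
parse_site is also defined in week_7_code and not shown here. The sketch below illustrates one way a parser with the same return shape could be written with BeautifulSoup. The function name parse_site_sketch, the sample markup, and the assumptions that the first row holds the headers and the first cell of each row holds the year are all hypothetical; the real page may be structured differently.

```python
from bs4 import BeautifulSoup


def parse_site_sketch(html):
    """Return (column titles, row index, row data) from the first <table> in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]   # skip rows without any cells
    titles = rows[0]                      # assumed: first row holds the headers
    index = [row[0] for row in rows[1:]]  # assumed: first cell of each row is the year
    data = [row[1:] for row in rows[1:]]  # remaining cells are the values
    return titles, index, data


# Hypothetical markup that mimics the first two rows of the table.
sample = """
<table>
  <tr><th>Year</th><th>Internet Users (millions)</th>
      <th>Internet Users as Percentage of World Population</th></tr>
  <tr><td>1995</td><td>16</td><td>0.4</td></tr>
  <tr><td>1996</td><td>36</td><td>0.9</td></tr>
</table>
"""
titles, index, data = parse_site_sketch(sample)
```

Every value comes back as a string, which is why the comma in a figure such as 1,018 survives into the DataFrame.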

In [37]:
table_inf = list(zip(table_data[1], [n[0] for n in table_data[2]], [n[1] for n in table_data[2]]))

To create a DataFrame, I zipped the year index with the two data cells of each row, so that every row becomes a (year, users, percentage) tuple. The result is used in the next step to instantiate the DataFrame in the variable table.


In [43]:
table = pd.DataFrame(data=table_inf, columns=list(table_data[0]), index=table_data[1])
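
To make the zip step easier to follow, here is a small self-contained version of the same idea. The lists titles, index, and data are stand-ins for table_data[0], table_data[1], and table_data[2], filled with the first three rows of the table; their exact shape is an assumption based on how they are used above.

```python
import pandas as pd

# Stand-ins for table_data[0], table_data[1], and table_data[2] (assumed shapes).
titles = ["Year", "Internet Users (millions)",
          "Internet Users as Percentage of World Population"]
index = ["1995", "1996", "1997"]
data = [["16", "0.4"], ["36", "0.9"], ["70", "1.7"]]

# zip pairs each year with the two cells of the same row. list() materializes
# the result, which matters on Python 3, where zip returns an iterator.
rows = list(zip(index, [r[0] for r in data], [r[1] for r in data]))

# Each tuple becomes one row; the years are reused as the row labels.
table = pd.DataFrame(data=rows, columns=titles, index=index)
```

Because the year appears both as the first element of every tuple and as the index, it shows up twice in the printed frame: once as the row label and once as the Year column.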

In [44]:
print(table)


      Year Internet Users (millions)  \
1995  1995                        16   
1996  1996                        36   
1997  1997                        70   
1998  1998                       147   
1999  1999                       248   
2000  2000                       361   
2001  2001                       513   
2002  2002                       587   
2003  2003                       719   
2004  2004                       817   
2005  2005                     1,018   
2006  2006                     1,093   
2007  2007                     1,319   
2008  2008                     1,565   

     Internet Users as Percentage of World Population  
1995                                              0.4  
1996                                              0.9  
1997                                              1.7  
1998                                              3.6  
1999                                              4.1  
2000                                              5.8  
2001                                              8.6  
2002                                              9.4  
2003                                             11.1  
2004                                             12.7  
2005                                             15.7  
2006                                             16.7  
2007                                             20.0  
2008                                             23.3  
  • Because the display is too narrow, pandas wrapped the table into two blocks. Apart from that, the DataFrame is identical to the data in the HTML table and can now be analyzed further.
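
Before any further analysis, the values probably need converting: the comma in 1,018 suggests the cells are still strings. A minimal sketch of that conversion follows, using two rows from the table and assuming every cell is text; it also uses to_string() to print the frame in one block rather than wrapped.

```python
import pandas as pd

# Two rows from the table, kept as strings (an assumption about the scraped data).
table = pd.DataFrame(
    {
        "Year": ["2004", "2005"],
        "Internet Users (millions)": ["817", "1,018"],
        "Internet Users as Percentage of World Population": ["12.7", "15.7"],
    },
    index=["2004", "2005"],
)

users = "Internet Users (millions)"
share = "Internet Users as Percentage of World Population"

# Strip the thousands separator first, otherwise "1,018" cannot be parsed.
table[users] = pd.to_numeric(table[users].str.replace(",", "", regex=False))
table[share] = pd.to_numeric(table[share])

# to_string() renders the whole frame at once instead of wrapping the columns.
print(table.to_string())
```

After the conversion, numeric operations such as table[users].diff() or plotting work on the columns directly.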