In [1]:
%matplotlib inline
import pandas as pd

Ingest Wikipedia tables

Read a whole Wikipedia page in as a list of DataFrames


In [2]:
wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)

pandas.read_html returns every table on the web page as a list of DataFrames; header=0 tells it to use the first row of each table as the column names.


In [3]:
type(wiki_df)


Out[3]:
list
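Because the result is a plain list, it can help to look at each table's shape and columns before picking one by position. The sketch below uses a small hand-built list as a stand-in for the read_html result so that it runs offline; the table contents are placeholders, and find_table is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Stand-in for the list pd.read_html returns: one DataFrame per HTML table.
# The contents are placeholders, not the real page.
tables = [
    pd.DataFrame({"Notice": ["(revision message placeholder)"]}),
    pd.DataFrame({"Title": ["Dr. No", "Goldfinger"],
                  "Box office.1": [448.8, 820.4]}),
]

def find_table(tables, required_column):
    """Hypothetical helper: position of the first table with the given column."""
    for position, table in enumerate(tables):
        if required_column in table.columns:
            return position
    raise ValueError(f"no table has a column named {required_column!r}")

# Print a one-line summary of every table to see which one holds the films.
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))

bond_position = find_table(tables, "Title")
```

Selecting by column name rather than by a hard-coded index is a little more robust if the page layout changes between revisions.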

The table we want is the second one (the first is a revision message). A Python slice then keeps only the rows we want, here rows 1 through 23.


In [4]:
df = wiki_df[1][1:24]
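An integer slice applied directly to a DataFrame selects rows by position, which is the same as using .iloc. A minimal sketch on a toy table (the first and last rows are placeholders for rows we do not want):

```python
import pandas as pd

# Toy table: unwanted rows at the top and bottom, films in between.
table = pd.DataFrame({
    "Title": ["(unwanted row)", "Dr. No", "From Russia with Love",
              "Goldfinger", "(unwanted row)"],
})

# table[1:4] and table.iloc[1:4] both keep positions 1, 2 and 3;
# the end of the slice is exclusive.
films = table.iloc[1:4]
same_as_plain_slice = table[1:4].equals(films)

# The original index labels are kept, so the first film is labelled 1.
print(films)
```

The explicit .iloc form makes it clearer to a reader that rows, not columns, are being selected.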

In [5]:
df[['Title','Box office.1']]


Out[5]:
Title Box office.1
1 Dr. No 448.8
2 From Russia with Love 543.8
3 Goldfinger 820.4
4 Thunderball 848.1
5 You Only Live Twice 514.2
6 On Her Majesty's Secret Service 291.5
7 Diamonds Are Forever 442.5
8 Live and Let Die 460.3
9 man with !The Man with the Golden Gun 334.0
10 spy who !The Spy Who Loved Me 533.0
11 Moonraker 535.0
12 For Your Eyes Only 449.4
13 Octopussy 373.8
14 view !A View to a Kill 275.2
15 living !The Living Daylights 313.5
16 Licence to Kill 250.9
17 GoldenEye 518.5
18 Tomorrow Never Dies 463.2
19 world !The World Is Not Enough 439.5
20 Die Another Day 465.4
21 Casino Royale 581.5
22 Quantum of Solace 514.2
23 Skyfall 879.8
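Several titles in the output above carry a prefix ending in "!", for example "man with !The Man with the Golden Gun". This looks like a hidden sort key from the Wikipedia table leaking into the cell text, which is an assumption about the page markup. A minimal sketch of stripping it, assuming no real title contains an exclamation mark:

```python
import pandas as pd

# A few titles copied from the output above.
titles = pd.DataFrame({
    "Title": [
        "Dr. No",
        "man with !The Man with the Golden Gun",
        "view !A View to a Kill",
    ],
})

# Drop everything up to and including the first "!" plus any following spaces.
# Titles without "!" do not match the pattern and are left unchanged.
titles["Title"] = titles["Title"].str.replace(r"^[^!]*!\s*", "", regex=True)

print(titles)
```

When applying this to a sliced frame such as df, take a .copy() first so that pandas does not warn about assigning to a view.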

It is hard to see the trend quickly in a table. How about a graph? Pandas plotting might be all you need. Usually a bare DataFrame.plot() is enough, but here we add a title, a data table below the chart, and dashed lines marking the averages.


In [6]:
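ax = df.plot(table=True, xticks=[], title="Bond movies in 2005 dollars (million)", figsize=(17,11))
# Average each numeric column once; numeric_only=True skips text columns such as Title
means = df.mean(numeric_only=True)
# Dashed horizontal lines at the first two column averages, selected by position
ax.hlines(y=means.iloc[0], xmin=0, xmax=23, color='b', alpha=0.5, linestyle='dashed', label='Box office average')
ax.hlines(y=means.iloc[1], xmin=0, xmax=23, color='g', alpha=0.5, linestyle='dashed', label='Budget average')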


Out[6]:
<matplotlib.collections.LineCollection at 0x10a5302b0>
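The same idea can be written so that each average line is looked up by column name and reuses the colour of the series it belongs to. This is a sketch on a toy frame, not the notebook's data: the box office figures are the first four from the table above, the budget figures are made-up placeholders, and the Agg backend is chosen only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd

# Toy data: box office values from the table above, budget values are placeholders.
films = pd.DataFrame({
    "Box office": [448.8, 543.8, 820.4, 848.1],
    "Budget": [10.0, 20.0, 30.0, 40.0],
})

means = films.mean(numeric_only=True)
ax = films.plot(title="Toy example with column averages")

# Copy the plotted lines first, because axhline adds more lines to the axes.
plotted = list(ax.get_lines())
for column, line in zip(films.columns, plotted):
    ax.axhline(means[column], color=line.get_color(), alpha=0.5,
               linestyle="dashed", label=f"{column} average")

# Rebuild the legend so the average lines are listed too.
ax.legend()
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

Unlike hlines, axhline spans the full width of the axes, so no xmin and xmax in data coordinates are needed.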
