FirstName_LastName_airports.ipynb
and submit it via Moodle.
In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from IPython.display import SVG, display_svg
This is where we are going:
In [2]:
# run this cell if there's no image
resp = requests.get("https://rawgit.com/INFO490/spring2015/master/week09/images/top20.svg")
SVG(resp.content)
Out[2]:
The circles are the top 20 airports in the U.S. The size of each circle is proportional to the total number of arrivals and departures in 2001. The redder the color, the higher the temperature; the bluer the color, the lower the temperature. Thus, you will visualize three pieces of information in one plot: the location of major airports, the size of the airports, and the current temperature.
In the first part of this week's assignment, you will add the number of departures and the number of arrivals in 2001 to find which 20 airports had the most number of flights.
You should use the Dest
and Origin
columns of the 2001 flight data 2001.csv
.
Note that each airport is identified by IATA codes.
In [3]:
dest_origin = pd.read_csv('/data/airline/2001.csv', encoding='latin-1', usecols=('Dest', 'Origin'))
print(dest_origin.head())
Count the total number of departures from and arrivals to each airport.
In other words, first count the number of times each airport appears in the Origin
column
to get
Dest
ABE 5262
ABI 2567
ABQ 36229
ACT 2682
ADQ 726
(only the first 5 columns are shown).
Then, count the number of times each airport apears in the Dest
column to get
Origin
ABE 5262
ABI 2567
ABQ 36248
ACT 2686
ACY 1
Add them up get the total number,
and store the result in a column named flights
,
flights
ABE 10524
ABI 5134
ABQ 72477
ACT 5368
ACY 1
Note that some airports have no departures or arrivals (e.g. ACY
).
Hint: I would use groupby(...).size()
on Dest
and Origin
columns to get
the number of departures and arrivals, respectively.
I would then use
pandas.DataFrame.add()
(with fill_value
and columns
parameters).
In [ ]:
def get_flights(df):
'''Takes a dataframe that has two columns Dest and Origin
and returns a dataframe that has a column named flights
and is indexed by IATA codes.
Parameters
----------
df: pandas.DataFrame
Returns
-------
pandas.DataFrame
'''
result = pd.DataFrame()
# your code goes here
return result
In the end, when you execute
flights = get_flights(dest_origin)
print(flights)
you should get
flights
ABE 10524
ABI 5134
ABQ 72477
ACT 5368
ACY 1
ADQ 1452
AKN 568
ALB 32713
AMA 12267
ANC 42381
APF 725
ATL 503163
AUS 85809
AVL 3172
AVP 2893
AZO 5290
BDL 71983
BET 2306
BFL 3338
BGM 751
BGR 7417
BHM 37566
BIL 6249
BIS 2779
BMI 2869
BNA 112603
BOI 24152
BOS 266032
BPT 3481
BQN 518
.. ...
SHV 12011
SIT 2758
SJC 144653
SJT 4505
SJU 52957
SLC 152859
SMF 80394
SNA 86871
SPS 3985
SRQ 9044
STL 324477
STT 6723
STX 1817
SUX 546
SWF 2386
SYR 22281
TLH 2957
TOL 4483
TPA 137286
TRI 1095
TUL 45562
TUS 39101
TVC 5067
TXK 3475
TYR 6361
TYS 11131
VPS 3455
WRG 1452
XNA 11749
YAK 1450
[231 rows x 1 columns]
In [ ]:
flights = get_flights(dest_origin)
print(flights)
To keep the problem simple, we will use only the top 20 airports.
flights
ORD 682636
DFW 624361
ATL 503163
LAX 450019
PHX 368631
STL 324477
DTW 297522
MSP 284955
LAS 272293
BOS 266032
DEN 265184
IAH 257193
CLT 256626
SFO 243473
EWR 241016
PHL 239390
LGA 232964
PIT 212738
SEA 205486
BWI 199674
In [ ]:
top20 = flights.sort('flights', ascending=False)[:20]
print(top20)
As mentioned earlier, our airports are identified by IATA codes in the dataframe.
However, the circles in the SVG file are identified by the city names
(e.g. Chicago or San Francisco).
For example, the circle that corresponds to O'Hare Airport is represented
by the following code in
airports.svg
:
<circle cx="367.68198" cy="118.095" stroke="black" stroke-width="1" r="5" fill="red"
id="Chicago"/>
So we need to match the IATA codes with the city names. You could use a supplementary data set such as airports.csv that contains all this information, but let's pretend that no such file exists and we have to gather this information ourselves.
FAA provides an XML service, which lets us get the airport status, including known delays and weather data. You should read http://services.faa.gov/docs/services/airport/ and try a few sample XML request to make sure you understand how it works.
From the XML response, we will extract the city name of the airport as well as the current temperature.
xml_city
that takes a url of an XML request (str)
and returns the city name (str).xml_temp
that takes a url of an XML request (str)
and returns the current temperature (float).
In [ ]:
def xml_city(url):
'''
Takes a url (str) with IATA code and returns the city name (str).
Parameter
---------
url: A str, e.g. "http://services.faa.gov/airport/status/SFO?format=application/xml"
Returns
-------
A str.
Example
-------
>>> print(xml_city("http://services.faa.gov/airport/status/SFO?format=application/xml"))
San Francisco
'''
result = ""
# your code goes here
return result
def xml_temp(url):
'''
Takes a url (str) with IATA code and returns the current temperature (float).
Parameter
---------
url: A str, e.g. "http://services.faa.gov/airport/status/SFO?format=application/xml"
Returns
-------
A float.
Example
-------
>>> print(xml_temp("http://services.faa.gov/airport/status/SFO?format=application/xml"))
71.0
'''
result = ""
# your code goes here
return result
When you run the following cell, you should get
The current temperature at San Francisco (<class 'str'>) is 71.0 (<class 'float'>) degrees Fahrenheit.
(The temperature may be different.)
In [ ]:
sfo_city = xml_city("http://services.faa.gov/airport/status/SFO?format=application/xml")
sfo_temp = xml_temp("http://services.faa.gov/airport/status/SFO?format=application/xml")
print("The current temperature at {0} ({1}) is {2} ({3}) degrees Fahrenheit."
"".format(sfo_city, type(sfo_city), sfo_temp, type(sfo_temp)))
city
and temp
.If you had a dataframe that looks like
print(top20.head(3))
flights
ORD 682636
DFW 624361
ATL 503163
then add_city_temp()
should return
top20 = add_city_temp(top20)
print(top20.head(3))
flights city temp
ORD 682636 Chicago 54
DFW 624361 Dallas-Ft Worth 68
ATL 503163 Atlanta 63
(The temperatures will be different in your case.)
Hint: Create a new column (e.g. url
) and use the indices of the DataFrame
to create appropriate links http://services.faa.gov/airport/status/XXX?format=application/xml
for each airport.
Use pandas.DataFrame.apply()
and the xml_city()
function to create a column named city
.
Similary, use pandas.DataFrame.apply()
with xml_temp()
to create a column named temp
.
Finally, use
pandas.DataFrame.drop()
In [ ]:
def add_city_temp(df):
'''Takes a Pandas DataFrame that has a column named flights
and is indexed by IATA codes.
Returns a DataFrame with two additional columns:
city (str) and temp (float).
Parameter
---------
df: A pandas.DataFrame
Returns
-------
A pandas.DataFrame
'''
result = pd.DataFrame()
# your code goes here
return result
When you run the following cell, you should get
flights city temp
ORD 682636 Chicago 54
DFW 624361 Dallas-Ft Worth 68
ATL 503163 Atlanta 63
(The temperatures may be different.)
In [ ]:
top20 = add_city_temp(top20)
print(top20.head(3))
In Problem 3, it will be easier if the dataframe is indexed by
city
instead of IATA codes.
When you run the following cell, you should get
index flights temp
city
Chicago ORD 682636 54
Dallas-Ft Worth DFW 624361 68
Atlanta ATL 503163 63
In [ ]:
top20 = top20.reset_index().set_index('city')
print(top20.head(3))
top20
dataframe.Run the following cell to display the SVG template.
In [4]:
resp = requests.get("https://rawgit.com/INFO490/spring2015/master/week09/images/airports.svg")
usairports = resp.content
display_svg(usairports, raw=True)
You should use BeautifulSoup
to adjust the size and color of the circles
to create a figure similar to the following.
Again, the radius of a circle is proportional to the total traffic
of that airport.
The color of a circle depends on the current temperature at that airport.
In [5]:
resp = requests.get("https://rawgit.com/INFO490/spring2015/master/week09/images/top20.svg")
SVG(resp.content)
Out[5]:
(Your figure does not have to look exactly the same. Choose a different set of sizes and colors if you wish.)
Lesson 3 covers how to use BeautifulSoup
to parse an SVG file (which is in XML format),
search for the FIPS code of each county, and replace the colors using a dataframe (unemployment data).
Although we are searching for different objects (circle
) and
using a different data set (top20
), the same principle applies.
Remember that each circle is represented by the following XML code:
<circle cx="367.68198" cy="118.095" stroke="black" stroke-width="1" r="5" fill="red"
id="Chicago"/>
So you should use BeautifulSoup
to parse the usairports
(the contents of airports.svg
file)
and search for all instances of circle
.
Then, iterate over each instance of circle
and replace the r
variable with the number of flights flights
in the top20
dataframe
(multiplied/divided by some proportionaly constant
since a circle with a radius r="682636"
would not work).
You should also replace the fill
variable with a color the current temperature temp
.
After replacing all instances of circle
,
display the result with display_svg(soup.prettify(), raw=True)
.
In [ ]:
# your code goes here
In [ ]: