In [ ]:
import sqlite3 as sl
import pandas as pd
import numpy as np
In this problem, we will use the databse we have created in week 10 and use SQLite and Pandas to extract some summary information for flights to and from Chicago (ORD).
The database we have created in week 10 has a table named flights
,
which comes from importing 2001.csv
.
We will use only the flights
table, but not the iata
table.
So, even if you couldn't complete week 10 assignment,
you should be able to do this assignment,
if you know how to use the provided schema
flights.sql
to import 2001.csv
.
In [ ]:
# edit this path to point to your database
db_path = "/data/week10.db"
Write a function named get_from_ord()
that takes a string (the path to the database)
and returns a Pandas DataFrame.
This function calculates the average departure delays at each destination airport
when the origin airport was the O'Hare airport.
You should
read_sql()
function to read the SQL database,originCode
is ORD
),destinationCode
,departureDelay
of all flights that departed from ORD
,avgDepartureDelay
, anddestinationCode
and avgDepartureDelay
.Maybe that was confusing. When you run the following code,
from_ord = get_from_ord(db_path)
print(from_ord)
you should get
destinationCode avgDepartureDelay
0 JAC 26.968254
1 SBN 24.555556
2 MQT 24.419355
3 SRQ 23.250000
4 ONT 22.614828
5 JFK 20.711111
6 BOI 17.600484
7 SMF 17.458277
8 ABE 17.077313
9 COS 16.761834
10 BTV 16.755924
11 GEG 16.532067
12 BTR 16.362620
13 OAK 15.542587
14 GSP 15.537936
15 SFO 15.449473
16 ABQ 15.197390
17 ORF 15.150380
18 SJU 14.944480
19 MHT 14.940486
20 GSO 14.768665
21 TOL 14.632429
22 PHX 14.581101
23 PDX 14.497236
24 CMH 14.453364
25 LNK 14.367847
26 TVC 14.254834
27 GRR 14.194545
28 XNA 14.182957
29 ORH 14.168605
30 DLH 14.136410
31 LAS 14.089014
32 JAX 13.933687
33 SEA 13.852752
34 RIC 13.832230
35 LAX 13.756401
36 IND 13.728835
37 MDT 13.722720
38 CMI 13.686726
39 GRB 13.633869
40 SLC 13.613793
41 TPA 13.598235
42 DAY 13.595689
43 ICT 13.580737
44 DSM 13.502440
45 OMA 13.456912
46 SAN 13.384536
47 AZO 13.291948
48 CHA 13.255556
49 BMI 13.223060
50 EVV 13.162393
51 HSV 13.074033
52 MSN 13.033333
53 PVD 12.931463
54 LGA 12.888844
55 ANC 12.862963
56 CID 12.851332
57 CLE 12.710920
58 PIA 12.564820
59 PWM 12.562319
... ...
[117 rows x 2 columns]
In [ ]:
def get_from_ord(database):
'''
Takes a string and returns a pandas dataframe.
Parameters
----------
database: A str. The path and/or the file name to the SQL database.
Returns
-------
A pandas dataframe.
'''
#### your code goes here
return from_ord
In [ ]:
from_ord = get_from_ord(db_path)
print(from_ord)
The get_to_ord()
function is similar to the get_from_ord()
function
but it now calculates the average arrival delays at each airport
when the destination airport was ORD
.
You should get
to_ord = get_to_ord(db_path)
print(to_ord)
originCode avgArrivalDelay
0 JAC 34.629032
1 JFK 23.977778
2 SBN 20.000000
3 LNK 14.125341
4 DEN 13.412446
5 PHX 12.892532
6 EGE 12.062069
7 OAK 11.700980
8 SLC 11.562592
9 ICT 11.351094
10 LAS 11.010512
11 RNO 11.007338
12 IAH 10.784456
13 TVC 10.425784
14 XNA 10.255620
15 CLT 10.220210
16 ATL 10.058884
17 ORF 9.996966
18 PHL 9.953525
19 DAY 9.922723
20 DTW 9.869015
21 STL 9.864484
22 ONT 9.830571
23 IAD 9.793168
24 SMF 9.626153
25 BOS 9.513652
26 RIC 9.407848
27 CMI 9.387269
28 GSO 9.329825
29 BWI 9.291394
30 DCA 9.228687
31 MEM 9.163850
32 CVG 9.130196
33 MQT 9.062044
34 BTV 8.887435
35 JAX 8.874667
36 MIA 8.677982
37 ABE 8.588571
38 RDU 8.554770
39 SJC 8.495950
40 CMH 8.449122
41 IND 8.357289
42 PDX 8.339041
43 MSP 8.101055
44 SNA 8.015702
45 FLL 7.996124
46 HPN 7.878896
47 PWM 7.792250
48 GSP 7.523444
49 EVV 7.363748
50 LGA 7.357276
51 PIT 7.306919
52 TYS 7.072993
53 FWA 7.049932
54 DFW 7.024954
55 SAN 6.936231
56 GRR 6.932374
57 BDL 6.850943
58 LAX 6.803611
59 MCI 6.714703
... ...
[116 rows x 2 columns]
In [ ]:
def get_to_ord(database):
'''
Takes a string and returns a pandas dataframe.
Parameters
----------
database: A str. The path and/or the file name to the SQL database.
Returns
-------
A pandas dataframe.
'''
#### your code goes here
return to_ord
In [ ]:
to_ord = get_to_ord(db_path)
print(to_ord)
The pd_to_sql()
function takes a Pandas DataFrame (to_ord
or from_ord
),
a string (path to the database),
and another string (name of the SQL table).
You should use Pandas to_sql()
function to create in the SQL database
a table that matches df
, the DataFrame that is passed as the first
argument to the function.
In [ ]:
def pd_to_sql(df, database, table):
'''
Converts a dataframe to a table in an SQL database.
Parameters
----------
df: A pandas.DataFrame, e.g. to_ord or from_ord.
database: A str. The path and/or the file name to the SQL database
where the table is to be created.
table: A str. The name of SQL table where df will be stored.
'''
#### your code goes here
return None
When you run the following code cells,
pd_to_sql(from_ord[:10], db_path, 'topFromORD')
!sqlite3 $db_path "SELECT * FROM topFromORD"
you should get
JAC|26.968253968254
SBN|24.5555555555556
MQT|24.4193548387097
SRQ|23.25
ONT|22.6148282097649
JFK|20.7111111111111
BOI|17.6004842615012
SMF|17.4582772543742
ABE|17.0773130544994
COS|16.7618343195266
In [ ]:
pd_to_sql(from_ord[:10], db_path, 'topFromORD')
In [ ]:
%%bash
sqlite3 /data/sql/test "SELECT * FROM topFromORD"
We use the same funtion, pd_to_sql()
, to create a table for to_ord
.
pd_to_sql(database, to_ord[:10], "topToORD")
!sqlite3 $db_path "SELECT * FROM topToORD"
JAC|34.6290322580645
JFK|23.9777777777778
SBN|20.0
LNK|14.125340599455
DEN|13.4124455507156
PHX|12.8925318761384
EGE|12.0620689655172
OAK|11.7009803921569
SLC|11.5625915080527
ICT|11.3510941960038
In [ ]:
pd_to_sql(to_ord[:10], db_path, "topToORD")
In [ ]:
%%bash
sqlite3 /data/sql/test "SELECT * FROM topToORD"
Now we do a simple JOIN in our database to return
the airports that exist in both the topFromORD
and topToORD
tables.
The join_the_worst()
function takes three strings:
the path to the SQL database,
the name of the first table to be joined,
and the name of the second table to be joined.
It should return a Pandas DataFrame
that is indexed by the destinationCode
column in table1
(topFromORD
)
and returns two columns:
avgDepartureDelay
from table1
(topFromORD
) and
avgArrivalDelay
from table2
(topToORD
).
And two tables should of course be joined on
the IATA codes (destinationCode
and originCode
).
In [ ]:
def join_the_worst(database, table1, table2):
'''
Joins two tables, table1 and table2, in database on IATA codes.
Returns the result in a Pandas DataFrame.
Parameters
----------
database: A str. The path and/or the file name of the SQL database.
table1: A str. The name of the first table to be joined.
table2: A str. The name of the second table to be joined.
Returns
-------
A pandas Dataframe.
'''
#### your code goes here
return df
When I run,
join_the_worst(db_path, 'topFromORD', 'topToORD')
print(the_worst)
I get
avgDepartureDelay avgArrivalDelay
destinationCode
JAC 26.968254 34.629032
SBN 24.555556 20.000000
JFK 20.711111 23.977778
[3 rows x 2 columns]
In [ ]:
the_worst = join_the_worst(db_path, 'topFromORD', 'topToORD')
print(the_worst)
In [ ]: