Scientific programming with the SciPy stack

Pandas

Import libraries and check versions.


In [1]:
import pandas as pd
import numpy as np
import sys
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)


Python version 3.4.4 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:54:04) [MSC v.1600 64 bit (AMD64)]
Pandas version 0.17.1
Numpy version 1.9.3

Read the data and count the non-null values in each column.

Data source: U.S. Department of Transportation, TranStats database, Air Carrier Statistics Table T-100 Domestic Market (All Carriers), 2015: "This table contains domestic market data reported by both U.S. and foreign air carriers, including carrier, origin, destination, and service class for enplaned passengers, freight and mail when both origin and destination airports are located within the boundaries of the United States and its territories."


In [2]:
file_path = r'data\T100_2015.csv.gz'
df = pd.read_csv(file_path, header=0)
df.count()


Out[2]:
PASSENGERS             168313
UNIQUE_CARRIER         168155
UNIQUE_CARRIER_NAME    168155
ORIGIN_AIRPORT_ID      168313
ORIGIN                 168313
DEST_AIRPORT_ID        168313
DEST                   168313
YEAR                   168313
MONTH                  168313
Unnamed: 9                  0
dtype: int64
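
The output above lists non-null values per column, which is why `UNIQUE_CARRIER` shows fewer entries than `PASSENGERS`. As a minimal sketch of the difference between `count()` and a plain row count, the toy frame below uses made-up values (it is not an extract of the T-100 file):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing carrier code.
toy = pd.DataFrame({
    'PASSENGERS': [0, 12, 7],
    'UNIQUE_CARRIER': ['1SQ', np.nan, '2O'],
})

n_rows = len(toy)                                  # total rows, missing values included
non_null = toy.count()                             # non-null values per column
n_missing = toy['UNIQUE_CARRIER'].isnull().sum()   # missing carrier codes

print('Rows:', n_rows)
print(non_null)
print('Missing carriers:', n_missing)
```

Here `len(toy)` counts every row, while `count()` skips the missing carrier code.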

In [3]:
df.head(n=10)


Out[3]:
PASSENGERS UNIQUE_CARRIER UNIQUE_CARRIER_NAME ORIGIN_AIRPORT_ID ORIGIN DEST_AIRPORT_ID DEST YEAR MONTH Unnamed: 9
0 0 1SQ Star Marianas Air Inc. 12016 GUM 14582 ROP 2015 1 NaN
1 0 1SQ Star Marianas Air Inc. 14582 ROP 12016 GUM 2015 1 NaN
2 0 2E Smokey Bay Air Inc. 12649 KEB 14942 SOV 2015 1 NaN
3 0 2O Island Air Service 10170 ADQ 10278 ALZ 2015 1 NaN
4 0 2O Island Air Service 10170 ADQ 12785 KPY 2015 1 NaN
5 0 2O Island Air Service 12785 KPY 10170 ADQ 2015 1 NaN
6 0 2O Island Air Service 12866 KYK 13934 ORI 2015 1 NaN
7 0 2O Island Air Service 15091 SYB 10170 ADQ 2015 1 NaN
8 0 4EQ Tanana Air Service 13196 MCG 12087 HCR 2015 1 NaN
9 0 4EQ Tanana Air Service 13196 MCG 12676 KGX 2015 1 NaN

In [4]:
df = pd.read_csv(file_path, header=0, usecols=["PASSENGERS", "ORIGIN", "DEST"])

In [5]:
df.head(n=10)


Out[5]:
PASSENGERS ORIGIN DEST
0 0 GUM ROP
1 0 ROP GUM
2 0 KEB SOV
3 0 ADQ ALZ
4 0 ADQ KPY
5 0 KPY ADQ
6 0 KYK ORI
7 0 SYB ADQ
8 0 MCG HCR
9 0 MCG KGX
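
The empty `Unnamed: 9` column in the first read is probably caused by a trailing delimiter at the end of each line; that is an assumption, since the raw file is not shown here. The sketch below reproduces the effect on a hypothetical two-row extract and shows how `usecols` leaves the empty column out:

```python
import io
import pandas as pd

# Hypothetical extract: every line, including the header, ends with a comma.
csv_text = (
    'PASSENGERS,ORIGIN,DEST,\n'
    '0,GUM,ROP,\n'
    '0,ROP,GUM,\n'
)

# Reading everything yields an extra, entirely empty "Unnamed" column.
full = pd.read_csv(io.StringIO(csv_text))

# Restricting the read to named columns skips it.
trimmed = pd.read_csv(io.StringIO(csv_text),
                      usecols=['PASSENGERS', 'ORIGIN', 'DEST'])

print(list(full.columns))
print(list(trimmed.columns))
```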

In [6]:
print('Min: ', df['PASSENGERS'].min())
print('Max: ', df['PASSENGERS'].max())
print('Mean: ', df['PASSENGERS'].mean())


Min:  0.0
Max:  90955.0
Mean:  2765.0627759

In [7]:
df = df.query('PASSENGERS > 10000')

In [8]:
print('Min: ', df['PASSENGERS'].min())
print('Max: ', df['PASSENGERS'].max())
print('Mean: ', df['PASSENGERS'].mean())


Min:  10004.0
Max:  90955.0
Mean:  21241.0201054
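
For comparison, the `query` call can also be written as a boolean mask. A minimal sketch on made-up passenger counts (only the threshold of 10000 comes from this notebook):

```python
import pandas as pd

# Toy passenger counts (hypothetical values around the threshold).
toy = pd.DataFrame({'PASSENGERS': [0, 10004, 90955, 10000]})

by_query = toy.query('PASSENGERS > 10000')   # expression given as a string
by_mask = toy[toy['PASSENGERS'] > 10000]     # equivalent boolean indexing

print(by_query.equals(by_mask))
```

Both forms keep only rows strictly above the threshold, so the row with exactly 10000 passengers is dropped.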

In [9]:
# Total passengers per (origin, destination) pair.
OriginToDestination = df.groupby(['ORIGIN', 'DEST'], as_index=False).agg({'PASSENGERS': 'sum'})
OriginToDestination.head(n=10)


Out[9]:
ORIGIN DEST PASSENGERS
0 ABQ ATL 62372
1 ABQ DAL 105677
2 ABQ DFW 155093
3 ABQ LAS 80928
4 ABQ PHX 139732
5 ACY FLL 70595
6 ACY MCO 51451
7 AGS ATL 10475
8 ALB ATL 34836
9 ALB BWI 139475
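
One consequence of the cell order is that the threshold is applied to individual records before they are summed, so a route total only includes records that individually exceed 10000 passengers. The sketch below, on made-up numbers, contrasts that order with summing first and filtering on the route total:

```python
import pandas as pd

# Toy records (hypothetical values); ABQ-ATL appears twice.
toy = pd.DataFrame({
    'ORIGIN': ['ABQ', 'ABQ', 'ABQ', 'ACY'],
    'DEST': ['ATL', 'ATL', 'DFW', 'FLL'],
    'PASSENGERS': [12000, 9000, 15000, 11000],
})

# Filter records first, then sum (the order used in this notebook).
filter_then_sum = (
    toy[toy['PASSENGERS'] > 10000]
    .groupby(['ORIGIN', 'DEST'], as_index=False)['PASSENGERS']
    .sum()
)

# Sum per route first, then filter on the route total.
route_totals = toy.groupby(['ORIGIN', 'DEST'], as_index=False)['PASSENGERS'].sum()
sum_then_filter = route_totals[route_totals['PASSENGERS'] > 10000]

print(filter_then_sum)
print(sum_then_filter)
```

In the toy data the 9000-passenger record is dropped in the first variant, so ABQ-ATL totals 12000 there and 21000 in the second.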

In [10]:
OriginToDestination = pd.pivot_table(OriginToDestination, values='PASSENGERS',
                                     index='ORIGIN', columns='DEST', aggfunc='sum')
OriginToDestination.head()


Out[10]:
DEST ABQ ACY AGS ALB AMA ANC ATL AUS BDL BHM ... STL STT SYR TLH TPA TUL TUS TYS VPS XNA
ORIGIN
ABQ NaN NaN NaN NaN NaN NaN 62372 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ACY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AGS NaN NaN NaN NaN NaN NaN 10475 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ALB NaN NaN NaN NaN NaN NaN 34836 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AMA NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 138 columns


In [11]:
OriginToDestination.fillna(0)


Out[11]:
DEST ABQ ACY AGS ALB AMA ANC ATL AUS BDL BHM ... STL STT SYR TLH TPA TUL TUS TYS VPS XNA
ORIGIN
ABQ 0 0 0 0 0 0 62372 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ACY 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AGS 0 0 0 0 0 0 10475 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ALB 0 0 0 0 0 0 34836 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AMA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ANC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ASE 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ATL 51467 0 10043 45742 0 0 0 240178 212391 248491 ... 265919 44196 68519 39320 664247 10816 0 0 117362 0
AUS 0 0 0 0 0 0 235979 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BDL 0 0 0 0 0 0 212599 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BHM 0 0 0 0 0 0 250153 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BIL 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BLI 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BNA 0 0 0 0 0 0 281863 0 0 0 ... 0 0 0 0 60630 0 0 0 0 0
BOI 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BOS 0 0 0 0 0 0 436831 0 0 0 ... 0 0 0 0 105330 0 0 0 0 0
BTR 0 0 0 0 0 0 21624 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BUF 0 0 0 0 0 0 138912 0 0 0 ... 0 0 0 0 10322 0 0 0 0 0
BUR 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BWI 0 0 0 141410 0 0 479963 0 174733 0 ... 79780 0 0 0 193696 0 0 0 0 0
BZN 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CAE 0 0 0 0 0 0 57308 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CAK 0 0 0 0 0 0 104286 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CHA 0 0 0 0 0 0 20648 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CHS 0 0 0 0 0 0 273549 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CLE 0 0 0 0 0 0 172874 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CLT 0 0 0 0 0 0 602242 0 142909 10133 ... 0 0 0 0 290482 0 0 41949 0 0
CMH 0 0 0 0 0 0 239172 0 0 0 ... 0 0 0 0 10432 0 0 0 0 0
COS 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CVG 0 0 0 0 0 0 203260 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PVD 0 0 0 0 0 0 71688 0 0 0 ... 0 0 0 0 10329 0 0 0 0 0
PWM 0 0 0 0 0 0 36657 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
RDU 0 0 0 0 0 0 402583 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
RIC 0 0 0 0 0 0 298046 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
RNO 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ROC 0 0 0 0 0 0 98762 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
RSW 0 0 0 0 0 0 371773 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SAN 0 0 0 0 0 0 245588 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SAT 0 0 0 0 0 0 236948 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SAV 0 0 0 0 0 0 238005 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SDF 0 0 0 0 0 0 214902 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SEA 0 0 0 0 0 584217 284703 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SFO 0 0 0 0 0 0 303647 34764 0 0 ... 0 0 0 0 0 0 0 0 0 0
SGF 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SJC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SJU 0 0 0 0 0 0 172901 0 0 0 ... 0 0 0 0 23694 0 0 0 0 0
SLC 0 0 0 0 0 0 333425 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SMF 0 0 0 0 0 0 10828 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SNA 0 0 0 0 0 0 119344 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SRQ 0 0 0 0 0 0 199802 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
STL 0 0 0 0 0 0 265550 0 0 0 ... 0 0 0 0 10273 0 0 0 0 0
STT 0 0 0 0 0 0 66111 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SYR 0 0 0 0 0 0 47614 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
TLH 0 0 0 0 0 0 40624 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
TPA 0 0 0 0 0 0 674715 0 0 0 ... 11115 0 0 0 0 0 0 0 0 0
TUL 0 0 0 0 0 0 10885 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
TUS 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
TYS 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
VPS 0 0 0 0 0 0 107540 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
XNA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

139 rows × 138 columns
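
Note that `fillna(0)` returns a new frame, so the zeros are only kept if the result is assigned. As an alternative sketch, `pivot_table` can fill missing routes itself through its `fill_value` parameter. The three routes below reuse passenger totals shown in the outputs above:

```python
import pandas as pd

# Three routes with totals taken from the outputs above.
routes = pd.DataFrame({
    'ORIGIN': ['ABQ', 'AGS', 'ATL'],
    'DEST': ['ATL', 'ATL', 'ABQ'],
    'PASSENGERS': [62372, 10475, 51467],
})

# Build the origin-by-destination matrix and fill missing routes with 0.
matrix = pd.pivot_table(routes, values='PASSENGERS', index='ORIGIN',
                        columns='DEST', aggfunc='sum', fill_value=0)

print(matrix)
```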

SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible.


In [13]:
import sympy
from sympy import Eq, N
from sympy.stats import Normal, P, given
from sympy.interactive import printing
printing.init_printing(use_latex=True)
print('Sympy version ' + sympy.__version__)


Sympy version 1.0

This example was gleaned from: Rocklin, Matthew, and Andy R. Terrel. "Symbolic Statistics with SymPy." Computing in Science & Engineering 14.3 (2012): 88-93.

Problem: data assimilation. We want to assimilate new measurements into a set of old measurements, and both sets of measurements have uncertainty. For example, ACS estimates updated with local data.

Assume we've estimated that the temperature outside is 30 degrees. However, there is certainly uncertainty in our estimate. Let's say ± 3 degrees, which we treat as the standard deviation. In SymPy, we can model this with a normal random variable.


In [14]:
T = Normal('T', 30, 3)

What is the probability that the temperature is actually greater than 33 degrees?

We can use SymPy's integration engine to calculate an exact answer.


In [16]:
P(T > 33)


Out[16]:
$$\frac{\sqrt{2}}{4 \sqrt{\pi}} \left(- \sqrt{2} \sqrt{\pi} \operatorname{erf}{\left (\frac{\sqrt{2}}{2} \right )} + \sqrt{2} \sqrt{\pi}\right)$$

In [17]:
N(P(T > 33))


Out[17]:
$$0.158655253931457$$
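
As a sanity check, the same tail probability follows from the standard normal distribution: 33 degrees lies one standard deviation above the mean, so P(T > 33) = 1 - Φ(1). A minimal sketch using only the standard library (the variable names are illustrative):

```python
import math

mu, sigma = 30.0, 3.0      # prior mean and standard deviation from above
threshold = 33.0

# Standardize the threshold, then take the upper tail of the standard
# normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
z = (threshold - mu) / sigma
p_tail = 0.5 * math.erfc(z / math.sqrt(2.0))

print(p_tail)
```

The result should agree with the SymPy value above to floating-point precision.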

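Assume we now have a thermometer and can measure the temperature. However, there is still uncertainty involved, which we model as zero-mean normal noise with a standard deviation of 1.5 degrees added to the true temperature.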


In [18]:
noise = Normal('noise', 0, 1.5)
observation = T + noise

We now have two pieces of information: a prior estimate of 30 ± 3 degrees and a thermometer reading of 26 ± 1.5 degrees. How do we combine them? We want to calculate a better estimate of the temperature (the posterior) given an observation of 26 degrees.


In [19]:
T_posterior = given(T, Eq(observation, 26))
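
For a normal prior and normal measurement noise, the posterior is again normal and has a closed form: the precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the prior mean and the observation. The sketch below computes it by hand as a cross-check. It does not call SymPy, and the variable names are illustrative:

```python
import math

prior_mean, prior_std = 30.0, 3.0    # prior estimate from above
obs_value, obs_std = 26.0, 1.5       # thermometer reading and noise level

# Precision is the inverse of the variance.
prior_prec = 1.0 / prior_std ** 2
obs_prec = 1.0 / obs_std ** 2

# Precisions add; the mean is the precision-weighted average.
post_var = 1.0 / (prior_prec + obs_prec)
post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_value)
post_std = math.sqrt(post_var)

print(post_mean, post_std)
```

Under these assumptions the posterior mean sits closer to the thermometer reading than to the prior, because the thermometer is the more precise of the two sources.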
