Python数据科学101

1. 配置系统

  • Python
  • JDK
  • 创建C:\Hadoop\bin
  • 在这里下载windows版的hadoop https://github.com/steveloughran/winutils 拷贝winutils到C:\Hadoop\bin下面
  • 创建HADOOP_HOME环境变量,指向C:\Hadoop
  • 创建C:\temp\hive文件夹
  • 运行c:\hadoop\bin\winutils chmod 777 \temp\hive
  • 下载Spark: https://spark.apache.org/downloads.html
  • 解压下载的Spark的文件到C:\SPARK目录下,其它操作系统的放到home目录
  • 创建SPARK_HOME,指向C:\SPARK
  • 运行c:\spark\bin\spark-shell看看是否安装成功

2. 使用Python

  • 安装Anaconda
  • 检查conda: conda --version
  • 检查安装的包: conda list
  • 升级: conda update conda

3. 实验环境

输入jupyter notebook


In [1]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 3*np.pi, 500)
plt.plot(x, np.sin(x**2))
plt.title('Sine wave')


Out[1]:
Text(0.5,1,'Sine wave')

4. Pandas简介

最重要的是DataFrameSeries


In [2]:
import numpy as np
import pandas as pd

4.1 Series

创建一个series,包含空值NaN


In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s[4] # 6.0


Out[3]:
6.0

4.2 Dataframes


In [4]:
df = pd.DataFrame({'data': ['2016-01-01', '2016-01-02', '2016-01-03'], 'qty': [20, 30, 40]})
df


Out[4]:
data qty
0 2016-01-01 20
1 2016-01-02 30
2 2016-01-03 40

更大的数据应当从文件里获取


In [5]:
rain = pd.read_csv('data/rainfall/rainfall.csv')
rain


Out[5]:
City Month Rainfall
0 San Francisco Jan 10.0
1 Seattle Jan 30.0
2 Los Angeles Jan 2.0
3 Seattle Feb 20.0
4 San Francisco Feb 4.0
5 Los Angeles Feb 0.0
6 Seattle Mar 22.0
7 San Francisco Mar 4.0
8 Los Angeles Mar NaN
9 Seattle Apr NaN
10 San Francisco Apr 5.0
11 Los Angeles Apr 4.0

In [6]:
# 加载一列
rain['City']


Out[6]:
0     San Francisco
1           Seattle
2       Los Angeles
3           Seattle
4     San Francisco
5       Los Angeles
6           Seattle
7     San Francisco
8       Los Angeles
9           Seattle
10    San Francisco
11      Los Angeles
Name: City, dtype: object

In [7]:
# 加载一行(第二行)
rain.loc[[1]]


Out[7]:
City Month Rainfall
1 Seattle Jan 30.0

In [8]:
# 第一行和第二行
rain.loc[0:1]


Out[8]:
City Month Rainfall
0 San Francisco Jan 10.0
1 Seattle Jan 30.0

4.3 过滤


In [10]:
# 查找所有降雨量小于10的数据
rain[rain['Rainfall'] < 10]


Out[10]:
City Month Rainfall
2 Los Angeles Jan 2.0
4 San Francisco Feb 4.0
5 Los Angeles Feb 0.0
7 San Francisco Mar 4.0
10 San Francisco Apr 5.0
11 Los Angeles Apr 4.0

查找4月份的降雨


In [11]:
rain[rain['Month'] == 'Apr']


Out[11]:
City Month Rainfall
9 Seattle Apr NaN
10 San Francisco Apr 5.0
11 Los Angeles Apr 4.0

查找Los Angeles的数据


In [12]:
rain[rain['City'] == 'Los Angeles']


Out[12]:
City Month Rainfall
2 Los Angeles Jan 2.0
5 Los Angeles Feb 0.0
8 Los Angeles Mar NaN
11 Los Angeles Apr 4.0

4.4 给行起名(Naming Rows)


In [13]:
rain = rain.set_index(rain['City'] + rain['Month'])

注意,当我们修改dataframe时,其实是在创建一个副本,因此要把这个值再赋值给原有的dataframe


In [14]:
rain.loc['San FranciscoApr']


Out[14]:
City        San Francisco
Month                 Apr
Rainfall                5
Name: San FranciscoApr, dtype: object

5. Pandas 例子


In [15]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [16]:
df = pd.read_csv('data/nycflights13/flights.csv.gz')

In [17]:
df


Out[17]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01 05:00:00
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01 05:00:00
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01 05:00:00
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01 05:00:00
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01 06:00:00
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01 05:00:00
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01 06:00:00
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01 06:00:00
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01 06:00:00
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01 06:00:00
10 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 49 N793JB JFK PBI 149.0 1028 6 0 2013-01-01 06:00:00
11 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 71 N657JB JFK TPA 158.0 1005 6 0 2013-01-01 06:00:00
12 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA 194 N29129 JFK LAX 345.0 2475 6 0 2013-01-01 06:00:00
13 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA 1124 N53441 EWR SFO 361.0 2565 6 0 2013-01-01 06:00:00
14 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA 707 N3DUAA LGA DFW 257.0 1389 6 0 2013-01-01 06:00:00
15 2013 1 1 559.0 559 0.0 702.0 706 -4.0 B6 1806 N708JB JFK BOS 44.0 187 5 59 2013-01-01 05:00:00
16 2013 1 1 559.0 600 -1.0 854.0 902 -8.0 UA 1187 N76515 EWR LAS 337.0 2227 6 0 2013-01-01 06:00:00
17 2013 1 1 600.0 600 0.0 851.0 858 -7.0 B6 371 N595JB LGA FLL 152.0 1076 6 0 2013-01-01 06:00:00
18 2013 1 1 600.0 600 0.0 837.0 825 12.0 MQ 4650 N542MQ LGA ATL 134.0 762 6 0 2013-01-01 06:00:00
19 2013 1 1 601.0 600 1.0 844.0 850 -6.0 B6 343 N644JB EWR PBI 147.0 1023 6 0 2013-01-01 06:00:00
20 2013 1 1 602.0 610 -8.0 812.0 820 -8.0 DL 1919 N971DL LGA MSP 170.0 1020 6 10 2013-01-01 06:00:00
21 2013 1 1 602.0 605 -3.0 821.0 805 16.0 MQ 4401 N730MQ LGA DTW 105.0 502 6 5 2013-01-01 06:00:00
22 2013 1 1 606.0 610 -4.0 858.0 910 -12.0 AA 1895 N633AA EWR MIA 152.0 1085 6 10 2013-01-01 06:00:00
23 2013 1 1 606.0 610 -4.0 837.0 845 -8.0 DL 1743 N3739P JFK ATL 128.0 760 6 10 2013-01-01 06:00:00
24 2013 1 1 607.0 607 0.0 858.0 915 -17.0 UA 1077 N53442 EWR MIA 157.0 1085 6 7 2013-01-01 06:00:00
25 2013 1 1 608.0 600 8.0 807.0 735 32.0 MQ 3768 N9EAMQ EWR ORD 139.0 719 6 0 2013-01-01 06:00:00
26 2013 1 1 611.0 600 11.0 945.0 931 14.0 UA 303 N532UA JFK SFO 366.0 2586 6 0 2013-01-01 06:00:00
27 2013 1 1 613.0 610 3.0 925.0 921 4.0 B6 135 N635JB JFK RSW 175.0 1074 6 10 2013-01-01 06:00:00
28 2013 1 1 615.0 615 0.0 1039.0 1100 -21.0 B6 709 N794JB JFK SJU 182.0 1598 6 15 2013-01-01 06:00:00
29 2013 1 1 615.0 615 0.0 833.0 842 -9.0 DL 575 N326NB EWR ATL 120.0 746 6 15 2013-01-01 06:00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336746 2013 9 30 2123.0 2125 -2.0 2223.0 2247 -24.0 EV 5489 N712EV LGA CHO 45.0 305 21 25 2013-09-30 21:00:00
336747 2013 9 30 2127.0 2129 -2.0 2314.0 2323 -9.0 EV 3833 N16546 EWR CLT 72.0 529 21 29 2013-09-30 21:00:00
336748 2013 9 30 2128.0 2130 -2.0 2328.0 2359 -31.0 B6 97 N807JB JFK DEN 213.0 1626 21 30 2013-09-30 21:00:00
336749 2013 9 30 2129.0 2059 30.0 2230.0 2232 -2.0 EV 5048 N751EV LGA RIC 45.0 292 20 59 2013-09-30 20:00:00
336750 2013 9 30 2131.0 2140 -9.0 2225.0 2255 -30.0 MQ 3621 N807MQ JFK DCA 36.0 213 21 40 2013-09-30 21:00:00
336751 2013 9 30 2140.0 2140 0.0 10.0 40 -30.0 AA 185 N335AA JFK LAX 298.0 2475 21 40 2013-09-30 21:00:00
336752 2013 9 30 2142.0 2129 13.0 2250.0 2239 11.0 EV 4509 N12957 EWR PWM 47.0 284 21 29 2013-09-30 21:00:00
336753 2013 9 30 2145.0 2145 0.0 115.0 140 -25.0 B6 1103 N633JB JFK SJU 192.0 1598 21 45 2013-09-30 21:00:00
336754 2013 9 30 2147.0 2137 10.0 30.0 27 3.0 B6 1371 N627JB LGA FLL 139.0 1076 21 37 2013-09-30 21:00:00
336755 2013 9 30 2149.0 2156 -7.0 2245.0 2308 -23.0 UA 523 N813UA EWR BOS 37.0 200 21 56 2013-09-30 21:00:00
336756 2013 9 30 2150.0 2159 -9.0 2250.0 2306 -16.0 EV 3842 N10575 EWR MHT 39.0 209 21 59 2013-09-30 21:00:00
336757 2013 9 30 2159.0 1845 194.0 2344.0 2030 194.0 9E 3320 N906XJ JFK BUF 50.0 301 18 45 2013-09-30 18:00:00
336758 2013 9 30 2203.0 2205 -2.0 2339.0 2331 8.0 EV 5311 N722EV LGA BGR 61.0 378 22 5 2013-09-30 22:00:00
336759 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ 3660 N532MQ LGA BNA 97.0 764 21 40 2013-09-30 21:00:00
336760 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV 4672 N12145 EWR STL 120.0 872 20 59 2013-09-30 20:00:00
336761 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 108 N193JB JFK PWM 48.0 273 22 45 2013-09-30 22:00:00
336762 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA 471 N578UA EWR SFO 318.0 2565 21 13 2013-09-30 21:00:00
336763 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 1083 N804JB JFK MCO 123.0 944 20 1 2013-09-30 20:00:00
336764 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 234 N318JB JFK BTV 43.0 266 22 45 2013-09-30 22:00:00
336765 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 1816 N354JB JFK SYR 41.0 209 22 45 2013-09-30 22:00:00
336766 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 2002 N281JB JFK BUF 52.0 301 22 50 2013-09-30 22:00:00
336767 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 486 N346JB JFK ROC 47.0 264 22 46 2013-09-30 22:00:00
336768 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 718 N565JB JFK BOS 33.0 187 22 55 2013-09-30 22:00:00
336769 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 745 N516JB JFK PSE 196.0 1617 23 59 2013-09-30 23:00:00
336770 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV 5274 N740EV LGA BNA NaN 764 18 42 2013-09-30 18:00:00
336771 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN JFK DCA NaN 213 14 55 2013-09-30 14:00:00
336772 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E 3525 NaN LGA SYR NaN 198 22 0 2013-09-30 22:00:00
336773 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA BNA NaN 764 12 10 2013-09-30 12:00:00
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ LGA CLE NaN 419 11 59 2013-09-30 11:00:00
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ LGA RDU NaN 431 8 40 2013-09-30 08:00:00

336776 rows × 19 columns

这里我们主要关注统计数据和可视化。我们来看一下按月统计的晚点时间的均值。


In [18]:
mean_delay_by_month = df.groupby(['month'])['arr_delay'].mean()
mean_delay_by_month


Out[18]:
month
1      6.129972
2      5.613019
3      5.807577
4     11.176063
5      3.521509
6     16.481330
7     16.711307
8      6.040652
9     -4.018364
10    -0.167063
11     0.461347
12    14.870355
Name: arr_delay, dtype: float64

In [19]:
mean_month_plt = mean_delay_by_month.plot(kind='bar', title='Mean Delay By Month')
mean_month_plt


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b18d6ac8>

注意,这里9、10月均值会有负值。


In [20]:
mean_delay_by_month_ord = df[(df.dest == 'ORD')].groupby(['month'])['arr_delay'].mean()
print("Flights to Chicago (ORD)")
print(mean_delay_by_month_ord)

mean_month_plt_ord = mean_delay_by_month_ord.plot(kind='bar', title="Mean Delay By Month (Chicago)")
mean_month_plt_ord


Flights to Chicago (ORD)
month
1      7.287694
2      3.680794
3     -2.702473
4     19.179352
5      7.938280
6     13.299376
7      8.405514
8      4.256851
9     -4.745370
10    -1.597090
11     2.071058
12    16.462817
Name: arr_delay, dtype: float64
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b19e7f98>

In [21]:
# 再看看Los Angeles进行比较一下
mean_delay_by_month_lax = df[(df.dest == 'LAX')].groupby(['month'])['arr_delay'].mean()
print("Flights to Chicago (LAX)")
print(mean_delay_by_month_lax)

mean_month_plt_lax = mean_delay_by_month_lax.plot(kind='bar', title="Mean Delay By Month (Los Angeles)")
mean_month_plt_lax


Flights to Chicago (LAX)
month
1     -4.160312
2     -7.601824
3     -5.280928
4      3.085153
5     -7.150657
6     13.007027
7      8.191432
8      1.028667
9     -8.719044
10    -1.205694
11    -0.103290
12    10.724460
Name: arr_delay, dtype: float64
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b19d8550>

从上面的图表中我们可以直观的看到一些特征。现在我们再来看看每个航空公司晚点的情况,并进行一些可视化。


In [22]:
# 看看是否不同的航空公司对晚点会有不同的影响
df[['carrier', 'arr_delay']].groupby('carrier').mean().plot(kind='bar', figsize=(12, 8))
plt.xticks(rotation=0)
plt.xlabel('Carrier')
plt.ylabel('Average Delay in Min')
plt.title('Average Arrival Delay by Carrier in 2008, All airports')

df[['carrier', 'dep_delay']].groupby('carrier').mean().plot(kind='bar', figsize=(12, 8))
plt.xticks(rotation=0)
plt.xlabel('Carrier')
plt.ylabel('Average Delay in Min')
plt.title('Average Departure Delay by Carrier in 2008, All airports')


Out[22]:
Text(0.5,1,'Average Departure Delay by Carrier in 2008, All airports')

从上面的图表里我们可以看到F9(Front Airlines)几乎是最经常晚点的,而夏威夷(HA)在这方面表现最好。

5.3 Joins

我们有多个数据集,天气、机场的。现在我们来看一下如何把两个表连接在一起


In [25]:
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
weather


Out[25]:
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
0 EWR 2013 1 1 0 37.04 21.92 53.97 230.0 10.35702 11.918651 0.00 1013.9 10.00 2012-12-31 19:00:00
1 EWR 2013 1 1 1 37.04 21.92 53.97 230.0 13.80936 15.891535 0.00 1013.0 10.00 2012-12-31 20:00:00
2 EWR 2013 1 1 2 37.94 21.92 52.09 230.0 12.65858 14.567241 0.00 1012.6 10.00 2012-12-31 21:00:00
3 EWR 2013 1 1 3 37.94 23.00 54.51 230.0 13.80936 15.891535 0.00 1012.7 10.00 2012-12-31 22:00:00
4 EWR 2013 1 1 4 37.94 24.08 57.04 240.0 14.96014 17.215830 0.00 1012.8 10.00 2012-12-31 23:00:00
5 EWR 2013 1 1 6 39.02 26.06 59.37 270.0 10.35702 11.918651 0.00 1012.0 10.00 2013-01-01 01:00:00
6 EWR 2013 1 1 7 39.02 26.96 61.63 250.0 8.05546 9.270062 0.00 1012.3 10.00 2013-01-01 02:00:00
7 EWR 2013 1 1 8 39.02 28.04 64.43 240.0 11.50780 13.242946 0.00 1012.5 10.00 2013-01-01 03:00:00
8 EWR 2013 1 1 9 39.92 28.04 62.21 250.0 12.65858 14.567241 0.00 1012.2 10.00 2013-01-01 04:00:00
9 EWR 2013 1 1 10 39.02 28.04 64.43 260.0 12.65858 14.567241 0.00 1011.9 10.00 2013-01-01 05:00:00
10 EWR 2013 1 1 11 37.94 28.04 67.21 240.0 11.50780 13.242946 0.00 1012.4 10.00 2013-01-01 06:00:00
11 EWR 2013 1 1 12 39.02 28.04 64.43 240.0 14.96014 17.215830 0.00 1012.2 10.00 2013-01-01 07:00:00
12 EWR 2013 1 1 13 39.92 28.04 62.21 250.0 10.35702 11.918651 0.00 1012.2 10.00 2013-01-01 08:00:00
13 EWR 2013 1 1 14 39.92 28.04 62.21 260.0 14.96014 17.215830 0.00 1012.7 10.00 2013-01-01 09:00:00
14 EWR 2013 1 1 15 41.00 28.04 59.65 260.0 13.80936 15.891535 0.00 1012.4 10.00 2013-01-01 10:00:00
15 EWR 2013 1 1 16 41.00 26.96 57.06 260.0 14.96014 17.215830 0.00 1011.4 10.00 2013-01-01 11:00:00
16 EWR 2013 1 1 17 39.20 28.40 64.93 270.0 16.11092 18.540125 0.00 NaN 10.00 2013-01-01 12:00:00
17 EWR 2013 1 1 18 39.20 28.40 64.93 330.0 14.96014 17.215830 0.00 NaN 10.00 2013-01-01 13:00:00
18 EWR 2013 1 1 19 39.02 24.08 54.68 280.0 13.80936 15.891535 0.00 1010.8 10.00 2013-01-01 14:00:00
19 EWR 2013 1 1 20 37.94 24.08 57.04 290.0 9.20624 10.594357 0.00 1011.9 10.00 2013-01-01 15:00:00
20 EWR 2013 1 1 21 37.04 19.94 49.62 300.0 13.80936 15.891535 0.00 1012.1 10.00 2013-01-01 16:00:00
21 EWR 2013 1 1 22 35.96 19.04 49.83 330.0 11.50780 13.242946 0.00 1013.2 10.00 2013-01-01 17:00:00
22 EWR 2013 1 1 23 33.98 15.08 45.43 310.0 12.65858 14.567241 0.00 1014.1 10.00 2013-01-01 18:00:00
23 EWR 2013 1 2 0 33.08 12.92 42.84 320.0 10.35702 11.918651 0.00 1014.4 10.00 2013-01-01 19:00:00
24 EWR 2013 1 2 1 32.00 15.08 49.19 310.0 14.96014 17.215830 0.00 1015.2 10.00 2013-01-01 20:00:00
25 EWR 2013 1 2 2 30.02 12.92 48.48 320.0 18.41248 21.188714 0.00 1016.0 10.00 2013-01-01 21:00:00
26 EWR 2013 1 2 3 28.94 12.02 48.69 320.0 18.41248 21.188714 0.00 1016.5 10.00 2013-01-01 22:00:00
27 EWR 2013 1 2 4 28.04 10.94 48.15 310.0 16.11092 18.540125 0.00 1016.4 10.00 2013-01-01 23:00:00
28 EWR 2013 1 2 5 26.96 10.94 50.34 310.0 14.96014 17.215830 0.00 1016.3 10.00 2013-01-02 00:00:00
29 EWR 2013 1 2 6 26.06 10.94 52.25 330.0 12.65858 14.567241 0.00 1016.3 10.00 2013-01-02 01:00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26100 LGA 2013 12 29 18 42.80 37.40 81.07 70.0 9.20624 10.594357 0.06 NaN 2.50 2013-12-29 13:00:00
26101 LGA 2013 12 29 19 39.20 37.40 93.19 40.0 16.11092 18.540125 0.06 NaN 1.75 2013-12-29 14:00:00
26102 LGA 2013 12 29 20 41.00 39.02 92.59 40.0 13.80936 15.891535 0.37 999.9 1.50 2013-12-29 15:00:00
26103 LGA 2013 12 29 21 41.00 37.94 88.76 350.0 8.05546 9.270062 0.28 998.7 1.50 2013-12-29 16:00:00
26104 LGA 2013 12 29 22 41.00 39.20 93.24 50.0 20.71404 23.837303 0.03 NaN 2.00 2013-12-29 17:00:00
26105 LGA 2013 12 29 23 42.08 39.02 88.81 330.0 14.96014 17.215830 0.00 997.2 3.00 2013-12-29 18:00:00
26106 LGA 2013 12 30 0 42.80 37.40 81.07 320.0 17.26170 19.864419 0.00 NaN 8.00 2013-12-29 19:00:00
26107 LGA 2013 12 30 1 41.00 37.40 86.89 310.0 19.56326 22.513008 0.00 NaN 6.00 2013-12-29 20:00:00
26108 LGA 2013 12 30 2 42.80 37.40 81.07 320.0 16.11092 18.540125 0.00 NaN 10.00 2013-12-29 21:00:00
26109 LGA 2013 12 30 3 42.98 37.04 79.38 300.0 9.20624 10.594357 0.00 1003.8 10.00 2013-12-29 22:00:00
26110 LGA 2013 12 30 4 42.98 35.06 73.39 310.0 17.26170 19.864419 0.00 1005.1 10.00 2013-12-29 23:00:00
26111 LGA 2013 12 30 5 42.08 33.98 72.78 320.0 11.50780 13.242946 0.00 1005.9 10.00 2013-12-30 00:00:00
26112 LGA 2013 12 30 6 42.08 33.98 72.78 250.0 9.20624 10.594357 0.00 1007.6 10.00 2013-12-30 01:00:00
26113 LGA 2013 12 30 7 41.00 33.98 75.88 240.0 8.05546 9.270062 0.00 1008.3 10.00 2013-12-30 02:00:00
26114 LGA 2013 12 30 8 42.98 33.98 70.30 270.0 9.20624 10.594357 0.00 1008.2 10.00 2013-12-30 03:00:00
26115 LGA 2013 12 30 9 41.00 33.08 73.19 0.0 0.00000 0.000000 0.00 1008.9 10.00 2013-12-30 04:00:00
26116 LGA 2013 12 30 10 42.98 33.08 67.81 250.0 10.35702 11.918651 0.00 1009.2 10.00 2013-12-30 05:00:00
26117 LGA 2013 12 30 11 42.98 33.98 70.30 230.0 6.90468 7.945768 0.00 1010.8 10.00 2013-12-30 06:00:00
26118 LGA 2013 12 30 12 44.06 35.06 70.42 240.0 11.50780 13.242946 0.00 1011.9 10.00 2013-12-30 07:00:00
26119 LGA 2013 12 30 13 44.06 33.98 67.45 260.0 11.50780 13.242946 0.00 1012.9 10.00 2013-12-30 08:00:00
26120 LGA 2013 12 30 14 44.06 33.08 65.07 260.0 13.80936 15.891535 0.00 1013.7 10.00 2013-12-30 09:00:00
26121 LGA 2013 12 30 15 42.80 33.80 70.28 330.0 16.11092 18.540125 0.00 NaN 10.00 2013-12-30 10:00:00
26122 LGA 2013 12 30 16 41.00 28.40 60.54 340.0 13.80936 15.891535 0.00 NaN 10.00 2013-12-30 11:00:00
26123 LGA 2013 12 30 17 37.94 23.00 54.51 330.0 21.86482 25.161598 0.00 1015.7 10.00 2013-12-30 12:00:00
26124 LGA 2013 12 30 18 37.04 21.92 53.97 340.0 17.26170 19.864419 0.00 1016.5 10.00 2013-12-30 13:00:00
26125 LGA 2013 12 30 19 35.96 19.94 51.78 340.0 13.80936 15.891535 0.00 1017.1 10.00 2013-12-30 14:00:00
26126 LGA 2013 12 30 20 33.98 17.06 49.51 330.0 17.26170 19.864419 0.00 1018.8 10.00 2013-12-30 15:00:00
26127 LGA 2013 12 30 21 32.00 15.08 49.19 340.0 14.96014 17.215830 0.00 1019.5 10.00 2013-12-30 16:00:00
26128 LGA 2013 12 30 22 30.92 12.92 46.74 320.0 17.26170 19.864419 0.00 1019.9 10.00 2013-12-30 17:00:00
26129 LGA 2013 12 30 23 28.94 10.94 46.41 330.0 18.41248 21.188714 0.00 1020.9 10.00 2013-12-30 18:00:00

26130 rows × 15 columns


In [26]:
df_withweather = pd.merge(df, weather, how='left', on=['year', 'month', 'day', 'hour'])
df_withweather


Out[26]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour_y
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
5 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
6 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
7 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
9 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
10 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
11 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
12 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
13 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
14 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
15 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
16 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
17 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
18 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
19 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
20 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
21 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
22 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
23 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
24 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
25 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
26 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
27 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 39.02 26.06 59.37 260.0 12.65858 14.567241 0.0 1012.6 10.0 2013-01-01 01:00:00
28 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 39.92 26.06 57.33 260.0 13.80936 15.891535 0.0 1011.9 10.0 2013-01-01 01:00:00
29 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 39.02 26.06 59.37 270.0 10.35702 11.918651 0.0 1012.0 10.0 2013-01-01 01:00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1007561 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 68.00 53.06 58.80 150.0 5.75390 6.621473 0.0 1015.4 10.0 2013-09-30 17:00:00
1007562 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 64.04 55.94 74.94 180.0 6.90468 7.945768 0.0 1016.0 10.0 2013-09-30 17:00:00
1007563 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 66.92 51.98 58.65 160.0 9.20624 10.594357 0.0 1015.4 10.0 2013-09-30 17:00:00
1007564 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 68.00 53.06 58.80 150.0 5.75390 6.621473 0.0 1015.4 10.0 2013-09-30 17:00:00
1007565 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 64.04 55.94 74.94 180.0 6.90468 7.945768 0.0 1016.0 10.0 2013-09-30 17:00:00
1007566 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 66.92 51.98 58.65 160.0 9.20624 10.594357 0.0 1015.4 10.0 2013-09-30 17:00:00
1007567 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 68.00 53.06 58.80 150.0 5.75390 6.621473 0.0 1015.4 10.0 2013-09-30 17:00:00
1007568 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 64.04 55.94 74.94 180.0 6.90468 7.945768 0.0 1016.0 10.0 2013-09-30 17:00:00
1007569 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 66.92 51.98 58.65 160.0 9.20624 10.594357 0.0 1015.4 10.0 2013-09-30 17:00:00
1007570 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 66.02 53.96 65.07 160.0 4.60312 5.297178 0.0 1015.6 10.0 2013-09-30 18:00:00
1007571 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 62.06 55.94 80.34 190.0 5.75390 6.621473 0.0 1016.0 10.0 2013-09-30 18:00:00
1007572 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 64.94 53.96 67.57 200.0 9.20624 10.594357 0.0 1015.5 10.0 2013-09-30 18:00:00
1007573 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 71.96 46.94 40.90 90.0 5.75390 6.621473 0.0 1015.7 10.0 2013-09-30 13:00:00
1007574 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 68.00 55.04 63.21 190.0 11.50780 13.242946 0.0 1016.6 10.0 2013-09-30 13:00:00
1007575 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 71.96 46.94 40.90 NaN 3.45234 3.972884 0.0 1015.8 10.0 2013-09-30 13:00:00
1007576 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 66.92 50.00 54.51 50.0 3.45234 3.972884 0.0 1018.3 10.0 2013-09-30 09:00:00
1007577 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 68.00 48.92 50.44 20.0 5.75390 6.621473 0.0 1018.4 10.0 2013-09-30 09:00:00
1007578 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 64.94 50.00 58.39 NaN 3.45234 3.972884 0.0 1018.2 10.0 2013-09-30 09:00:00
1007579 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 68.00 53.06 58.80 150.0 5.75390 6.621473 0.0 1015.4 10.0 2013-09-30 17:00:00
1007580 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 64.04 55.94 74.94 180.0 6.90468 7.945768 0.0 1016.0 10.0 2013-09-30 17:00:00
1007581 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 66.92 51.98 58.65 160.0 9.20624 10.594357 0.0 1015.4 10.0 2013-09-30 17:00:00
1007582 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 59.00 53.96 83.34 10.0 6.90468 7.945768 0.0 1018.8 10.0 2013-09-30 07:00:00
1007583 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 62.06 53.96 74.75 40.0 5.75390 6.621473 0.0 1018.8 10.0 2013-09-30 07:00:00
1007584 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 60.98 51.08 69.86 NaN 5.75390 6.621473 0.0 1018.6 10.0 2013-09-30 07:00:00
1007585 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 55.94 53.96 93.05 20.0 5.75390 6.621473 0.0 1018.7 10.0 2013-09-30 06:00:00
1007586 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 57.92 53.96 86.63 360.0 6.90468 7.945768 0.0 1018.9 10.0 2013-09-30 06:00:00
1007587 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 60.08 51.98 74.56 360.0 4.60312 5.297178 0.0 1018.5 10.0 2013-09-30 06:00:00
1007588 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 53.96 51.08 89.95 350.0 4.60312 5.297178 0.0 1018.1 10.0 2013-09-30 03:00:00
1007589 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 53.06 50.00 89.31 30.0 3.45234 3.972884 0.0 1018.1 10.0 2013-09-30 03:00:00
1007590 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 59.00 53.06 80.64 10.0 3.45234 3.972884 0.0 1017.8 10.0 2013-09-30 03:00:00

1007591 rows × 30 columns


In [27]:
airports = pd.read_csv('data/nycflights13/airports.csv.gz')
airports


Out[27]:
faa name lat lon alt tz dst tzone
0 04G Lansdowne Airport 41.130472 -80.619583 1044 -5 A America/New_York
1 06A Moton Field Municipal Airport 32.460572 -85.680028 264 -6 A America/Chicago
2 06C Schaumburg Regional 41.989341 -88.101243 801 -6 A America/Chicago
3 06N Randall Airport 41.431912 -74.391561 523 -5 A America/New_York
4 09J Jekyll Island Airport 31.074472 -81.427778 11 -5 A America/New_York
5 0A9 Elizabethton Municipal Airport 36.371222 -82.173417 1593 -5 A America/New_York
6 0G6 Williams County Airport 41.467306 -84.506778 730 -5 A America/New_York
7 0G7 Finger Lakes Regional Airport 42.883565 -76.781232 492 -5 A America/New_York
8 0P2 Shoestring Aviation Airfield 39.794824 -76.647191 1000 -5 U America/New_York
9 0S9 Jefferson County Intl 48.053809 -122.810644 108 -8 A America/Los_Angeles
10 0W3 Harford County Airport 39.566838 -76.202403 409 -5 A America/New_York
11 10C Galt Field Airport 42.402889 -88.375111 875 -6 U America/Chicago
12 17G Port Bucyrus-Crawford County Airport 40.781556 -82.974806 1003 -5 A America/New_York
13 19A Jackson County Airport 34.175864 -83.561597 951 -5 U America/New_York
14 1A3 Martin Campbell Field Airport 35.015806 -84.346833 1789 -5 A America/New_York
15 1B9 Mansfield Municipal 42.000133 -71.196771 122 -5 A America/New_York
16 1C9 Frazier Lake Airpark 54.013333 -124.768333 152 -8 A America/Vancouver
17 1CS Clow International Airport 41.695974 -88.129231 670 -6 U America/Chicago
18 1G3 Kent State Airport 41.151389 -81.415111 1134 -5 A America/New_York
19 1G4 Grand Canyon West Airport 35.899904 -113.815674 4813 -7 A America/Phoenix
20 1H2 Effingham Memorial Airport 39.070000 -88.534000 585 -6 A America/Chicago
21 1OH Fortman Airport 40.555325 -84.386619 885 -5 U America/New_York
22 1RL Point Roberts Airpark 48.979722 -123.078889 10 -8 A America/Los_Angeles
23 23M Clarke CO 32.051700 -88.443400 320 -6 A America/Chicago
24 24C Lowell City Airport 42.953920 -85.343906 681 -5 A America/New_York
25 24J Suwannee County Airport 30.300125 -83.024694 104 -5 A America/New_York
26 25D Forest Lake Airport 45.247746 -92.994385 925 -6 A America/Chicago
27 29D Grove City Airport 41.146028 -80.167750 1371 -5 A America/New_York
28 2A0 Mark Anton Airport 35.486250 -84.931083 718 -5 A America/New_York
29 2B2 Plum Island Airport 42.795361 -70.839444 11 -5 A America/New_York
... ... ... ... ... ... ... ... ...
1428 X59 Valkaria Municipal 27.960861 -80.558333 26 -5 A America/New_York
1429 XFL Flagler County Airport 29.282100 -81.121200 33 -5 A America/New_York
1430 XNA NW Arkansas Regional 36.281869 -94.306811 1287 -6 A America/Chicago
1431 XZK Amherst Amtrak Station AMM 42.375000 -72.511389 258 -5 A America/New_York
1432 Y51 Municipal Airport 43.579360 -90.896474 1292 -6 A America/Chicago
1433 Y72 Bloyer Field 43.976222 -90.480611 966 -6 A America/Chicago
1434 YAK Yakutat 59.301200 -139.393700 33 -9 A \N
1435 YIP Willow Run 42.237928 -83.530408 716 -5 A America/New_York
1436 YKM Yakima Air Terminal McAllister Field 46.568200 -120.544000 1095 -8 A America/Los_Angeles
1437 YKN Chan Gurney 42.871100 -97.396900 1200 -6 A America/Chicago
1438 YNG Youngstown Warren Rgnl 41.260736 -80.679097 1196 -5 A America/New_York
1439 YUM Yuma Mcas Yuma Intl 32.656578 -114.605980 216 -7 N America/Phoenix
1440 Z84 Clear 64.301203 -149.120144 552 -9 A America/Anchorage
1441 ZBP Penn Station 39.307222 -76.615556 66 -5 A America/New_York
1442 ZFV Philadelphia 30th St Station 39.955700 -75.182000 0 -5 A America/New_York
1443 ZPH Municipal Airport 28.228056 -82.155833 90 -5 A America/New_York
1444 ZRA Atlantic City Rail Terminal 39.366500 -74.442000 8 -5 A America/New_York
1445 ZRD Train Station 37.534300 -77.429450 26 -5 A America/New_York
1446 ZRP Newark Penn Station 40.734722 -74.164167 0 -5 A America/New_York
1447 ZRT Hartford Union Station 41.768880 -72.681500 0 -5 A America/New_York
1448 ZRZ New Carrollton Rail Station 38.948000 -76.871900 39 -5 A America/New_York
1449 ZSF Springfield Amtrak Station 42.106000 -72.593054 65 -5 A America/New_York
1450 ZSY Scottsdale Airport 33.622889 -111.910528 1519 -7 A America/Phoenix
1451 ZTF Stamford Amtrak Station 41.046937 -73.541493 0 -5 A America/New_York
1452 ZTY Boston Back Bay Station 42.347800 -71.075000 20 -5 A America/New_York
1453 ZUN Black Rock 35.083228 -108.791778 6454 -7 A America/Denver
1454 ZVE New Haven Rail Station 41.298669 -72.925992 7 -5 A America/New_York
1455 ZWI Wilmington Amtrak Station 39.736667 -75.551667 0 -5 A America/New_York
1456 ZWU Washington Union Station 38.897460 -77.006430 76 -5 A America/New_York
1457 ZYP Penn Station 40.750500 -73.993500 35 -5 A America/New_York

1458 rows × 8 columns


In [28]:
df_withairport = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')
df_withairport


Out[28]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... visib time_hour_y faa name lat lon alt tz dst tzone
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA ... NaN NaN IAH George Bush Intercontinental 29.984433 -95.341442 97.0 -6.0 A America/Chicago
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA ... NaN NaN IAH George Bush Intercontinental 29.984433 -95.341442 97.0 -6.0 A America/Chicago
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA ... NaN NaN MIA Miami Intl 25.793250 -80.290556 8.0 -5.0 A America/New_York
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
5 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
6 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
7 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA ... NaN NaN ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
8 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
9 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
10 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
11 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
12 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
13 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
14 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
15 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
16 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
17 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
18 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
19 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
20 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
21 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
22 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
23 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
24 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
25 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
26 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
27 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
28 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
29 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1007561 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007562 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007563 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007564 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007565 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007566 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007567 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007568 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007569 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007570 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 10.0 2013-09-30 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
1007571 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 10.0 2013-09-30 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
1007572 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 ... 10.0 2013-09-30 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
1007573 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 10.0 2013-09-30 13:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007574 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 10.0 2013-09-30 13:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007575 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV ... 10.0 2013-09-30 13:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007576 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 10.0 2013-09-30 09:00:00 DCA Ronald Reagan Washington Natl 38.852083 -77.037722 15.0 -5.0 A America/New_York
1007577 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 10.0 2013-09-30 09:00:00 DCA Ronald Reagan Washington Natl 38.852083 -77.037722 15.0 -5.0 A America/New_York
1007578 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E ... 10.0 2013-09-30 09:00:00 DCA Ronald Reagan Washington Natl 38.852083 -77.037722 15.0 -5.0 A America/New_York
1007579 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007580 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007581 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007582 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 10.0 2013-09-30 07:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007583 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 10.0 2013-09-30 07:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007584 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ ... 10.0 2013-09-30 07:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007585 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 10.0 2013-09-30 06:00:00 CLE Cleveland Hopkins Intl 41.411689 -81.849794 791.0 -5.0 A America/New_York
1007586 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 10.0 2013-09-30 06:00:00 CLE Cleveland Hopkins Intl 41.411689 -81.849794 791.0 -5.0 A America/New_York
1007587 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... 10.0 2013-09-30 06:00:00 CLE Cleveland Hopkins Intl 41.411689 -81.849794 791.0 -5.0 A America/New_York
1007588 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 10.0 2013-09-30 03:00:00 RDU Raleigh Durham Intl 35.877639 -78.787472 435.0 -5.0 A America/New_York
1007589 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 10.0 2013-09-30 03:00:00 RDU Raleigh Durham Intl 35.877639 -78.787472 435.0 -5.0 A America/New_York
1007590 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... 10.0 2013-09-30 03:00:00 RDU Raleigh Durham Intl 35.877639 -78.787472 435.0 -5.0 A America/New_York

1007591 rows × 38 columns

6 Numpy和SciPy

Numpy和SciPy是Python数据科学的CP。早期Python的list比较慢,并且对于处理矩阵和向量运算不太好,因此有了Numpy来解决这个问题。它引入了array-type的数据类型。

创建数组:


In [29]:
import numpy as np
a = np.array([1, 2, 3])
a


Out[29]:
array([1, 2, 3])

注意这里我们传的是列表,而不是np.array(1, 2, 3)

现在我们创建一个arange


In [30]:
np.arange(10)


Out[30]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
# 给序列乘以一个系数
np.arange(10) * np.pi


Out[31]:
array([  0.        ,   3.14159265,   6.28318531,   9.42477796,
        12.56637061,  15.70796327,  18.84955592,  21.99114858,
        25.13274123,  28.27433388])

我们也可以使用shape方法从一维数组创建多维数组


In [32]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape = (2, 3)
a


Out[32]:
array([[1, 2, 3],
       [4, 5, 6]])

6.1 矩阵Matrix


In [35]:
np.matrix('1 2; 3 4')


Out[35]:
matrix([[1, 2],
        [3, 4]])

In [40]:
#矩阵乘
a1 = np.matrix('1 2; 3 4')
a2 = np.matrix('3 4; 5 7')
a1 * a2


Out[40]:
matrix([[13, 18],
        [29, 40]])

In [41]:
#array转换为矩阵
mat_a = np.mat(a1)
mat_a


Out[41]:
matrix([[1, 2],
        [3, 4]])

6.2 稀疏矩阵(Sparse Matrices)


In [42]:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) #50%稀疏矩阵
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
x_dok


Out[42]:
<1x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 50077 stored elements in Dictionary Of Keys format>

6.3 从CSV文件中加载数据


In [43]:
import csv
with open('data/array/array.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = []
    for row in csvreader:
        row = [float(x) for x in row]
        data.append(row)
data


Out[43]:
[[2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0], [7.0, 9.0, 9.0, 10.0]]

6.4 求解矩阵方程(Solving a matrix)


In [44]:
import numpy as np
import scipy as sp
a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])
x = np.linalg.solve(a, b)
x


Out[44]:
array([ 2., -2.,  9.])

In [45]:
#检查结果是否正确
np.dot(a, x) == b


Out[45]:
array([ True,  True,  True], dtype=bool)

7 Scikit-learn 简介

前面我们介绍了pandas和numpy、scipy。现在我们来介绍python机器库Scikit。首先需要先知道机器学习的两种:

  • 监督学习(Supervised Learning): 从训练集建立模型进行预测
  • 非监督学习(Unsupervised Learning): 从数据中推测模型,比如从文本中找出主题

Scikit-learn有一下特性:

  • 预处理(Preprocessing):为机器学习reshape数据
  • 降维处理(Dimensionality reduction):减少变量的重复
  • 分类(Classification): 预测分类
  • 回归(regression):预测连续变量
  • 聚类(Clustering):从数据中发现自然的模式
  • 模型选取(Model Selection):为数据找到最优模型

这里我们还是看nycflights13的数据集。


In [47]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')

df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')

df = df.dropna()
df


Out[47]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... visib time_hour_y faa name lat lon alt tz dst tzone
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
5 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
6 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
8 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
9 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
10 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
11 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
12 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
13 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
14 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
15 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
16 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
17 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
18 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
19 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
20 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
21 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
22 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
23 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
24 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
25 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
26 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
27 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
28 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
29 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
30 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
31 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
32 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
33 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
34 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1007540 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007541 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007542 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007543 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007544 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007545 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007546 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007547 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007548 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007549 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007550 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007551 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007552 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007553 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007554 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007555 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007556 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007557 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007558 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007559 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007560 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007561 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007562 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007563 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007564 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007565 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007566 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007567 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007568 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007569 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York

838182 rows × 38 columns

7.1 特征向量


In [48]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
           'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
           'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]

pd.options.mode.chained_assignment = None #default='warn'

# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]

# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]

In [49]:
features_v


Out[49]:
month day dep_time arr_time carrier dest air_time distance lat lon alt dewp humid wind_speed wind_gust precip pressure visib
4 1 1 554.0 812.0 0 0 116.0 762 33.636719 -84.428067 1026.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
5 1 1 554.0 812.0 0 0 116.0 762 33.636719 -84.428067 1026.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
6 1 1 554.0 812.0 0 0 116.0 762 33.636719 -84.428067 1026.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
8 1 1 555.0 913.0 1 1 158.0 1065 26.072583 -80.152750 9.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
9 1 1 555.0 913.0 1 1 158.0 1065 26.072583 -80.152750 9.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
10 1 1 555.0 913.0 1 1 158.0 1065 26.072583 -80.152750 9.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
11 1 1 557.0 709.0 2 2 53.0 229 38.944533 -77.455811 313.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
12 1 1 557.0 709.0 2 2 53.0 229 38.944533 -77.455811 313.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
13 1 1 557.0 709.0 2 2 53.0 229 38.944533 -77.455811 313.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
14 1 1 557.0 838.0 1 3 140.0 944 28.429394 -81.308994 96.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
15 1 1 557.0 838.0 1 3 140.0 944 28.429394 -81.308994 96.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
16 1 1 557.0 838.0 1 3 140.0 944 28.429394 -81.308994 96.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
17 1 1 558.0 753.0 3 4 138.0 733 41.978603 -87.904842 668.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
18 1 1 558.0 753.0 3 4 138.0 733 41.978603 -87.904842 668.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
19 1 1 558.0 753.0 3 4 138.0 733 41.978603 -87.904842 668.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
20 1 1 558.0 849.0 1 5 149.0 1028 26.683161 -80.095589 19.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
21 1 1 558.0 849.0 1 5 149.0 1028 26.683161 -80.095589 19.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
22 1 1 558.0 849.0 1 5 149.0 1028 26.683161 -80.095589 19.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
23 1 1 558.0 853.0 1 6 158.0 1005 27.975472 -82.533250 26.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
24 1 1 558.0 853.0 1 6 158.0 1005 27.975472 -82.533250 26.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
25 1 1 558.0 853.0 1 6 158.0 1005 27.975472 -82.533250 26.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
26 1 1 558.0 924.0 4 7 345.0 2475 33.942536 -118.408075 126.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
27 1 1 558.0 924.0 4 7 345.0 2475 33.942536 -118.408075 126.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
28 1 1 558.0 924.0 4 7 345.0 2475 33.942536 -118.408075 126.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
29 1 1 558.0 923.0 4 8 361.0 2565 37.618972 -122.374889 13.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
30 1 1 558.0 923.0 4 8 361.0 2565 37.618972 -122.374889 13.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
31 1 1 558.0 923.0 4 8 361.0 2565 37.618972 -122.374889 13.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
32 1 1 559.0 941.0 3 9 257.0 1389 32.896828 -97.037997 607.0 26.06 59.37 10.35702 11.918651 0.0 1012.0 10.0
33 1 1 559.0 941.0 3 9 257.0 1389 32.896828 -97.037997 607.0 26.06 59.37 12.65858 14.567241 0.0 1012.6 10.0
34 1 1 559.0 941.0 3 9 257.0 1389 32.896828 -97.037997 607.0 26.06 57.33 13.80936 15.891535 0.0 1011.9 10.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1007540 9 30 2207.0 2257.0 5 45 97.0 764 36.124472 -86.678194 599.0 53.06 54.94 5.75390 6.621473 0.0 1015.0 10.0
1007541 9 30 2207.0 2257.0 5 45 97.0 764 36.124472 -86.678194 599.0 55.04 67.69 6.90468 7.945768 0.0 1015.6 10.0
1007542 9 30 2207.0 2257.0 5 45 97.0 764 36.124472 -86.678194 599.0 51.08 52.67 6.90468 7.945768 0.0 1015.0 10.0
1007543 9 30 2211.0 2339.0 2 40 120.0 872 38.748697 -90.370028 618.0 51.98 49.36 6.90468 7.945768 0.0 1015.2 10.0
1007544 9 30 2211.0 2339.0 2 40 120.0 872 38.748697 -90.370028 618.0 53.06 58.80 8.05546 9.270062 0.0 1015.8 10.0
1007545 9 30 2211.0 2339.0 2 40 120.0 872 38.748697 -90.370028 618.0 51.08 46.04 6.90468 7.945768 0.0 1015.1 10.0
1007546 9 30 2231.0 2335.0 1 51 48.0 273 43.646161 -70.309281 77.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007547 9 30 2231.0 2335.0 1 51 48.0 273 43.646161 -70.309281 77.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007548 9 30 2231.0 2335.0 1 51 48.0 273 43.646161 -70.309281 77.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0
1007549 9 30 2233.0 112.0 4 8 318.0 2565 37.618972 -122.374889 13.0 53.06 54.94 5.75390 6.621473 0.0 1015.0 10.0
1007550 9 30 2233.0 112.0 4 8 318.0 2565 37.618972 -122.374889 13.0 55.04 67.69 6.90468 7.945768 0.0 1015.6 10.0
1007551 9 30 2233.0 112.0 4 8 318.0 2565 37.618972 -122.374889 13.0 51.08 52.67 6.90468 7.945768 0.0 1015.0 10.0
1007552 9 30 2235.0 59.0 1 3 123.0 944 28.429394 -81.308994 96.0 51.98 49.36 6.90468 7.945768 0.0 1015.2 10.0
1007553 9 30 2235.0 59.0 1 3 123.0 944 28.429394 -81.308994 96.0 53.06 58.80 8.05546 9.270062 0.0 1015.8 10.0
1007554 9 30 2235.0 59.0 1 3 123.0 944 28.429394 -81.308994 96.0 51.08 46.04 6.90468 7.945768 0.0 1015.1 10.0
1007555 9 30 2237.0 2345.0 1 47 43.0 266 44.471861 -73.153278 335.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007556 9 30 2237.0 2345.0 1 47 43.0 266 44.471861 -73.153278 335.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007557 9 30 2237.0 2345.0 1 47 43.0 266 44.471861 -73.153278 335.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0
1007558 9 30 2240.0 2334.0 1 29 41.0 209 43.111187 -76.106311 421.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007559 9 30 2240.0 2334.0 1 29 41.0 209 43.111187 -76.106311 421.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007560 9 30 2240.0 2334.0 1 29 41.0 209 43.111187 -76.106311 421.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0
1007561 9 30 2240.0 2347.0 1 20 52.0 301 42.940525 -78.732167 724.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007562 9 30 2240.0 2347.0 1 20 52.0 301 42.940525 -78.732167 724.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007563 9 30 2240.0 2347.0 1 20 52.0 301 42.940525 -78.732167 724.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0
1007564 9 30 2241.0 2345.0 1 28 47.0 264 43.118866 -77.672389 559.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007565 9 30 2241.0 2345.0 1 28 47.0 264 43.118866 -77.672389 559.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007566 9 30 2241.0 2345.0 1 28 47.0 264 43.118866 -77.672389 559.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0
1007567 9 30 2307.0 2359.0 1 19 33.0 187 42.364347 -71.005181 19.0 53.06 58.80 5.75390 6.621473 0.0 1015.4 10.0
1007568 9 30 2307.0 2359.0 1 19 33.0 187 42.364347 -71.005181 19.0 55.94 74.94 6.90468 7.945768 0.0 1016.0 10.0
1007569 9 30 2307.0 2359.0 1 19 33.0 187 42.364347 -71.005181 19.0 51.98 58.65 9.20624 10.594357 0.0 1015.4 10.0

838182 rows × 18 columns

7.2 对特征向量进行标准化(Scaling the feature vector)


In [50]:
# 因为各个特征的维度各不相同,我们需要做标准化
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_v)

In [51]:
scaled_features


Out[51]:
array([[-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
        -0.79438586,  0.29213712],
       [-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
        -0.71389571,  0.29213712],
       [-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
        -0.80780089,  0.29213712],
       ..., 
       [ 0.71764151,  1.6218685 ,  1.98106144, ..., -0.11420569,
        -0.33827498,  0.29213712],
       [ 0.71764151,  1.6218685 ,  1.98106144, ..., -0.11420569,
        -0.25778483,  0.29213712],
       [ 0.71764151,  1.6218685 ,  1.98106144, ..., -0.11420569,
        -0.33827498,  0.29213712]])

7.3 特征降维(Reducing Dimensions)

我们使用PCA(Principle Component Analysis主成分析)把特征降维为2个


In [53]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_r = pca.fit(scaled_features).transform(scaled_features)

In [54]:
X_r


Out[54]:
array([[-0.32039321, -0.88647489],
       [-0.32303154, -0.72462068],
       [-0.32447915, -0.59863037],
       ..., 
       [-2.11022832,  0.78923842],
       [-2.10108117,  0.41150146],
       [-2.11397223,  1.04517555]])

7.4 画图(Plotting)


In [57]:
import matplotlib.pyplot as plt

print('explained variance ratio (first two components): %s' 
      % str(pca.explained_variance_ratio_))

plt.figure()
lw = 2

plt.scatter(X_r[:,0], X_r[:,1], alpha=.8, lw=lw)
plt.title('PCA of flights dataset')


explained variance ratio (first two components): [ 0.16961823  0.14269873]
Out[57]:
Text(0.5,1,'PCA of flights dataset')

8 构建分类器(Build a classifier)

我们来预测一个航班是否会晚点


In [61]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn import linear_model, cross_validation, metrics, svm, ensemble
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [62]:
flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')

df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')

df = df.dropna()

In [63]:
df


Out[63]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... visib time_hour_y faa name lat lon alt tz dst tzone
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
5 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
6 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... 10.0 2013-01-01 01:00:00 ATL Hartsfield Jackson Atlanta Intl 33.636719 -84.428067 1026.0 -5.0 A America/New_York
8 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
9 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
10 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 ... 10.0 2013-01-01 01:00:00 FLL Fort Lauderdale Hollywood Intl 26.072583 -80.152750 9.0 -5.0 A America/New_York
11 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
12 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
13 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV ... 10.0 2013-01-01 01:00:00 IAD Washington Dulles Intl 38.944533 -77.455811 313.0 -5.0 A America/New_York
14 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
15 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
16 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 ... 10.0 2013-01-01 01:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
17 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
18 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
19 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA ... 10.0 2013-01-01 01:00:00 ORD Chicago Ohare Intl 41.978603 -87.904842 668.0 -6.0 A America/Chicago
20 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
21 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
22 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 ... 10.0 2013-01-01 01:00:00 PBI Palm Beach Intl 26.683161 -80.095589 19.0 -5.0 A America/New_York
23 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
24 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
25 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 ... 10.0 2013-01-01 01:00:00 TPA Tampa Intl 27.975472 -82.533250 26.0 -5.0 A America/New_York
26 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
27 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
28 2013 1 1 558.0 600 -2.0 924.0 917 7.0 UA ... 10.0 2013-01-01 01:00:00 LAX Los Angeles Intl 33.942536 -118.408075 126.0 -8.0 A America/Los_Angeles
29 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
30 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
31 2013 1 1 558.0 600 -2.0 923.0 937 -14.0 UA ... 10.0 2013-01-01 01:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
32 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
33 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
34 2013 1 1 559.0 600 -1.0 941.0 910 31.0 AA ... 10.0 2013-01-01 01:00:00 DFW Dallas Fort Worth Intl 32.896828 -97.037997 607.0 -6.0 A America/Chicago
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1007540 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007541 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007542 2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 MQ ... 10.0 2013-09-30 16:00:00 BNA Nashville Intl 36.124472 -86.678194 599.0 -6.0 A America/Chicago
1007543 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007544 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007545 2013 9 30 2211.0 2059 72.0 2339.0 2242 57.0 EV ... 10.0 2013-09-30 15:00:00 STL Lambert St Louis Intl 38.748697 -90.370028 618.0 -6.0 A America/Chicago
1007546 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007547 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007548 2013 9 30 2231.0 2245 -14.0 2335.0 2356 -21.0 B6 ... 10.0 2013-09-30 17:00:00 PWM Portland Intl Jetport 43.646161 -70.309281 77.0 -5.0 A America/New_York
1007549 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007550 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007551 2013 9 30 2233.0 2113 80.0 112.0 30 42.0 UA ... 10.0 2013-09-30 16:00:00 SFO San Francisco Intl 37.618972 -122.374889 13.0 -8.0 A America/Los_Angeles
1007552 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007553 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007554 2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 B6 ... 10.0 2013-09-30 15:00:00 MCO Orlando Intl 28.429394 -81.308994 96.0 -5.0 A America/New_York
1007555 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007556 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007557 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 ... 10.0 2013-09-30 17:00:00 BTV Burlington Intl 44.471861 -73.153278 335.0 -5.0 A America/New_York
1007558 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007559 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007560 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 ... 10.0 2013-09-30 17:00:00 SYR Syracuse Hancock Intl 43.111187 -76.106311 421.0 -5.0 A America/New_York
1007561 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007562 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007563 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 ... 10.0 2013-09-30 17:00:00 BUF Buffalo Niagara Intl 42.940525 -78.732167 724.0 -5.0 A America/New_York
1007564 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007565 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007566 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 ... 10.0 2013-09-30 17:00:00 ROC Greater Rochester Intl 43.118866 -77.672389 559.0 -5.0 A America/New_York
1007567 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007568 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York
1007569 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 ... 10.0 2013-09-30 17:00:00 BOS General Edward Lawrence Logan Intl 42.364347 -71.005181 19.0 -5.0 A America/New_York

838182 rows × 38 columns


In [64]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
           'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
           'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]

how_late_is_late = 15.0

pd.options.mode.chained_assignment = None #default='warn'


# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]

# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]

scaler = StandardScaler()
scaled_features_v = scaler.fit_transform(features_v)

features_train, features_test, pred_train, pred_test = train_test_split(
    scaled_features_v, pred_v, test_size=0.30, random_state=0)

In [66]:
# 使用logistic回归来执行分类

clf_lr = sklearn.linear_model.LogisticRegression(penalty='l2',
                                                class_weight='balanced')
logistic_fit = clf_lr.fit(features_train, np.where(pred_train >= how_late_is_late, 1, 0))

predictions = clf_lr.predict(features_test)

In [67]:
# summary Report

# Confusion Matrix
cm_lr = confusion_matrix(np.where(pred_test >= how_late_is_late, 1, 0),
                        predictions)
print("Confusion Matrix")
print(pd.DataFrame(cm_lr))

# 获取精确值
report_lr = precision_recall_fscore_support(
    list(np.where(pred_test >= how_late_is_late, 1, 0)),
    list(predictions), average='binary')

#打印精度值
print("\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f"
     % (report_lr[0], report_lr[1], report_lr[2],
       accuracy_score(list(np.where(pred_test >= how_late_is_late, 1, 0)),
                     list(predictions))))


Confusion Matrix
        0      1
0  135181  64481
1   17910  33883

precision = 0.34, recall = 0.65, F1 = 0.45, accuracy = 0.67

9 聚合数据(Cluster data)

最简单的聚类方法是K-Means


In [68]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn import linear_model, cross_validation, cluster
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score, ShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder

flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')

df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')

df = df.dropna()

In [69]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
           'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
           'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]

how_late_is_late = 15.0

pd.options.mode.chained_assignment = None #default='warn'

# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]

# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]

scaler = StandardScaler()
scaled_features_v = scaler.fit_transform(features_v)

features_train, features_test, pred_train, pred_test = train_test_split(
    scaled_features_v, pred_v, test_size=0.30, random_state=0)

In [71]:
cluster = sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, verbose=0)
cluster.fit(features_train)


Out[71]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [72]:
# 预测测试数据
result = cluster.predict(features_test)

In [73]:
result


Out[73]:
array([7, 0, 7, ..., 7, 5, 0])

In [78]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

reduced_data = PCA(n_components=2).fit_transform(features_train)
kmeans = KMeans(init='k-means++', n_clusters=8, n_init=10)
kmeans.fit(reduced_data)

# mesh的步长
h = .02

x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

z = z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(z, interpolation='nearest',
          extend=(xx.min(), xx.max(), yy.min(), yy.max()),
          cmap=plt.cm.Paired
          #aspect='auto' 
          # origin='lower'
          )

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)

centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='w', zorder=10)
plt.title('K-Means clustering on the dataset (PCA-reduced data)\n'
         'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-78-605c50db6054> in <module>()
     20 plt.imshow(z, interpolation='nearest',
     21           extend=(xx.min(), xx.max(), yy.min(), yy.max()),
---> 22           cmap=plt.cm.Paired
     23           #aspect='auto'
     24           # origin='lower'

d:\app\Anaconda3\lib\site-packages\matplotlib\pyplot.py in imshow(X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, shape, filternorm, filterrad, imlim, resample, url, hold, data, **kwargs)
   3078                         filternorm=filternorm, filterrad=filterrad,
   3079                         imlim=imlim, resample=resample, url=url, data=data,
-> 3080                         **kwargs)
   3081     finally:
   3082         ax._hold = washold

d:\app\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
   1708                     warnings.warn(msg % (label_namer, func.__name__),
   1709                                   RuntimeWarning, stacklevel=2)
-> 1710             return func(ax, *args, **kwargs)
   1711         pre_doc = inner.__doc__
   1712         if pre_doc is None:

d:\app\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, shape, filternorm, filterrad, imlim, resample, url, **kwargs)
   5190         im = mimage.AxesImage(self, cmap, norm, interpolation, origin, extent,
   5191                               filternorm=filternorm, filterrad=filterrad,
-> 5192                               resample=resample, **kwargs)
   5193 
   5194         im.set_data(X)

d:\app\Anaconda3\lib\site-packages\matplotlib\image.py in __init__(self, ax, cmap, norm, interpolation, origin, extent, filternorm, filterrad, resample, **kwargs)
    755             filterrad=filterrad,
    756             resample=resample,
--> 757             **kwargs
    758         )
    759 

d:\app\Anaconda3\lib\site-packages\matplotlib\image.py in __init__(self, ax, cmap, norm, interpolation, origin, filternorm, filterrad, resample, **kwargs)
    233         self._imcache = None
    234 
--> 235         self.update(kwargs)
    236 
    237     def __getstate__(self):

d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in update(self, props)
    845         try:
    846             ret = [_update_property(self, k, v)
--> 847                    for k, v in props.items()]
    848         finally:
    849             self.eventson = store

d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in <listcomp>(.0)
    845         try:
    846             ret = [_update_property(self, k, v)
--> 847                    for k, v in props.items()]
    848         finally:
    849             self.eventson = store

d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in _update_property(self, k, v)
    838                 func = getattr(self, 'set_' + k, None)
    839                 if not callable(func):
--> 840                     raise AttributeError('Unknown property %s' % k)
    841                 return func(v)
    842 

AttributeError: Unknown property extend

10 PySpark简介

扩展我们的算法:有时我们需要处理大量数据,并且采样已经无效,这个时候可以通过把数据分到多个机器来处理。

Spark是一个用来并行进行大数据处理的API。它将数据切割到集群来处理。在开发阶段,我们可以只在本地运行。

我们使用PySpark Shell来连接到集群。

运行下面路径的pyspark,会启动PySpark Shell

~/spark/bin/pyspark (Max/Linux)
C:\spark\bin\pyspark (Windows)

此时,可以在Shell中运行文件加载:

lines = sc.textFile("README.md")
lines.first() # 加载第一行

可以在http://localhost:4040查看PySpark运行的Job

大多数情况下,我们希望能够在Jupyter Notebook中运行PySpark,为此,我们需要设置环境变量:

PYSPARK_PYTHON=python3
PYSPARK_DRIVER_PYTHON="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"

然后运行~/spark/bin/pyspark,最后一个命令会启动一个jupyter server,样子跟我们用的一样。


In [1]:
lines = sc.text('README.md')
lines.take(5)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-2f627e434df0> in <module>()
----> 1 lines = sc.text('README.md')
      2 lines.take(5)

NameError: name 'sc' is not defined

我们看看http://localhost:4040 可以查看运行的Job


In [2]:
linesWithSpark = lines.filter(lambda line: 'spark' in line)
linesWithSpark.count()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-a9d79c569bce> in <module>()
----> 1 linesWithSpark = lines.filter(lambda line: 'spark' in line)
      2 linesWithSpark.count()

NameError: name 'lines' is not defined

Spark的基本类型是RDD(resilient distributed dataset),它是基本分布式数据类型。RDD有两类操作,第一个是变换(transformation),返回值仍然是RDD,另外一种是动作(action),用来计算结果。Spark的操作是Lazy的,也就是说只有在执行action时才会真正的开始处理。


In [ ]: