C:\Hadoop\bin
C:\Hadoop\bin
下面HADOOP_HOME
环境变量,指向C:\Hadoop
C:\temp\hive
文件夹c:\hadoop\bin\winutils chmod 777 \temp\hive
C:\SPARK
目录下,其它操作系统的放到home
目录SPARK_HOME
,指向C:\SPARK
c:\spark\bin\spark-shell
看看是否安装成功conda --version
conda list
conda update conda
输入jupyter notebook
In [1]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 3*np.pi, 500)
plt.plot(x, np.sin(x**2))
plt.title('Sine wave')
Out[1]:
Text(0.5,1,'Sine wave')
In [2]:
import numpy as np
import pandas as pd
In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s[4] # 6.0
Out[3]:
6.0
In [4]:
df = pd.DataFrame({'data': ['2016-01-01', '2016-01-02', '2016-01-03'], 'qty': [20, 30, 40]})
df
Out[4]:
data
qty
0
2016-01-01
20
1
2016-01-02
30
2
2016-01-03
40
更大的数据应当从文件里获取
In [5]:
rain = pd.read_csv('data/rainfall/rainfall.csv')
rain
Out[5]:
City
Month
Rainfall
0
San Francisco
Jan
10.0
1
Seattle
Jan
30.0
2
Los Angeles
Jan
2.0
3
Seattle
Feb
20.0
4
San Francisco
Feb
4.0
5
Los Angeles
Feb
0.0
6
Seattle
Mar
22.0
7
San Francisco
Mar
4.0
8
Los Angeles
Mar
NaN
9
Seattle
Apr
NaN
10
San Francisco
Apr
5.0
11
Los Angeles
Apr
4.0
In [6]:
# 加载一列
rain['City']
Out[6]:
0 San Francisco
1 Seattle
2 Los Angeles
3 Seattle
4 San Francisco
5 Los Angeles
6 Seattle
7 San Francisco
8 Los Angeles
9 Seattle
10 San Francisco
11 Los Angeles
Name: City, dtype: object
In [7]:
# 加载一行(第二行)
rain.loc[[1]]
Out[7]:
City
Month
Rainfall
1
Seattle
Jan
30.0
In [8]:
# 第一行和第二行
rain.loc[0:1]
Out[8]:
City
Month
Rainfall
0
San Francisco
Jan
10.0
1
Seattle
Jan
30.0
In [10]:
# 查找所有降雨量小于10的数据
rain[rain['Rainfall'] < 10]
Out[10]:
City
Month
Rainfall
2
Los Angeles
Jan
2.0
4
San Francisco
Feb
4.0
5
Los Angeles
Feb
0.0
7
San Francisco
Mar
4.0
10
San Francisco
Apr
5.0
11
Los Angeles
Apr
4.0
查找4月份的降雨
In [11]:
rain[rain['Month'] == 'Apr']
Out[11]:
City
Month
Rainfall
9
Seattle
Apr
NaN
10
San Francisco
Apr
5.0
11
Los Angeles
Apr
4.0
查找Los Angeles的数据
In [12]:
rain[rain['City'] == 'Los Angeles']
Out[12]:
City
Month
Rainfall
2
Los Angeles
Jan
2.0
5
Los Angeles
Feb
0.0
8
Los Angeles
Mar
NaN
11
Los Angeles
Apr
4.0
In [13]:
rain = rain.set_index(rain['City'] + rain['Month'])
注意,当我们修改dataframe时,其实是在创建一个副本,因此要把这个值再赋值给原有的dataframe
In [14]:
rain.loc['San FranciscoApr']
Out[14]:
City San Francisco
Month Apr
Rainfall 5
Name: San FranciscoApr, dtype: object
In [15]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
In [16]:
df = pd.read_csv('data/nycflights13/flights.csv.gz')
In [17]:
df
Out[17]:
year
month
day
dep_time
sched_dep_time
dep_delay
arr_time
sched_arr_time
arr_delay
carrier
flight
tailnum
origin
dest
air_time
distance
hour
minute
time_hour
0
2013
1
1
517.0
515
2.0
830.0
819
11.0
UA
1545
N14228
EWR
IAH
227.0
1400
5
15
2013-01-01 05:00:00
1
2013
1
1
533.0
529
4.0
850.0
830
20.0
UA
1714
N24211
LGA
IAH
227.0
1416
5
29
2013-01-01 05:00:00
2
2013
1
1
542.0
540
2.0
923.0
850
33.0
AA
1141
N619AA
JFK
MIA
160.0
1089
5
40
2013-01-01 05:00:00
3
2013
1
1
544.0
545
-1.0
1004.0
1022
-18.0
B6
725
N804JB
JFK
BQN
183.0
1576
5
45
2013-01-01 05:00:00
4
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
461
N668DN
LGA
ATL
116.0
762
6
0
2013-01-01 06:00:00
5
2013
1
1
554.0
558
-4.0
740.0
728
12.0
UA
1696
N39463
EWR
ORD
150.0
719
5
58
2013-01-01 05:00:00
6
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
507
N516JB
EWR
FLL
158.0
1065
6
0
2013-01-01 06:00:00
7
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
5708
N829AS
LGA
IAD
53.0
229
6
0
2013-01-01 06:00:00
8
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
79
N593JB
JFK
MCO
140.0
944
6
0
2013-01-01 06:00:00
9
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
301
N3ALAA
LGA
ORD
138.0
733
6
0
2013-01-01 06:00:00
10
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
49
N793JB
JFK
PBI
149.0
1028
6
0
2013-01-01 06:00:00
11
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
71
N657JB
JFK
TPA
158.0
1005
6
0
2013-01-01 06:00:00
12
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
194
N29129
JFK
LAX
345.0
2475
6
0
2013-01-01 06:00:00
13
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
1124
N53441
EWR
SFO
361.0
2565
6
0
2013-01-01 06:00:00
14
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
707
N3DUAA
LGA
DFW
257.0
1389
6
0
2013-01-01 06:00:00
15
2013
1
1
559.0
559
0.0
702.0
706
-4.0
B6
1806
N708JB
JFK
BOS
44.0
187
5
59
2013-01-01 05:00:00
16
2013
1
1
559.0
600
-1.0
854.0
902
-8.0
UA
1187
N76515
EWR
LAS
337.0
2227
6
0
2013-01-01 06:00:00
17
2013
1
1
600.0
600
0.0
851.0
858
-7.0
B6
371
N595JB
LGA
FLL
152.0
1076
6
0
2013-01-01 06:00:00
18
2013
1
1
600.0
600
0.0
837.0
825
12.0
MQ
4650
N542MQ
LGA
ATL
134.0
762
6
0
2013-01-01 06:00:00
19
2013
1
1
601.0
600
1.0
844.0
850
-6.0
B6
343
N644JB
EWR
PBI
147.0
1023
6
0
2013-01-01 06:00:00
20
2013
1
1
602.0
610
-8.0
812.0
820
-8.0
DL
1919
N971DL
LGA
MSP
170.0
1020
6
10
2013-01-01 06:00:00
21
2013
1
1
602.0
605
-3.0
821.0
805
16.0
MQ
4401
N730MQ
LGA
DTW
105.0
502
6
5
2013-01-01 06:00:00
22
2013
1
1
606.0
610
-4.0
858.0
910
-12.0
AA
1895
N633AA
EWR
MIA
152.0
1085
6
10
2013-01-01 06:00:00
23
2013
1
1
606.0
610
-4.0
837.0
845
-8.0
DL
1743
N3739P
JFK
ATL
128.0
760
6
10
2013-01-01 06:00:00
24
2013
1
1
607.0
607
0.0
858.0
915
-17.0
UA
1077
N53442
EWR
MIA
157.0
1085
6
7
2013-01-01 06:00:00
25
2013
1
1
608.0
600
8.0
807.0
735
32.0
MQ
3768
N9EAMQ
EWR
ORD
139.0
719
6
0
2013-01-01 06:00:00
26
2013
1
1
611.0
600
11.0
945.0
931
14.0
UA
303
N532UA
JFK
SFO
366.0
2586
6
0
2013-01-01 06:00:00
27
2013
1
1
613.0
610
3.0
925.0
921
4.0
B6
135
N635JB
JFK
RSW
175.0
1074
6
10
2013-01-01 06:00:00
28
2013
1
1
615.0
615
0.0
1039.0
1100
-21.0
B6
709
N794JB
JFK
SJU
182.0
1598
6
15
2013-01-01 06:00:00
29
2013
1
1
615.0
615
0.0
833.0
842
-9.0
DL
575
N326NB
EWR
ATL
120.0
746
6
15
2013-01-01 06:00:00
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
336746
2013
9
30
2123.0
2125
-2.0
2223.0
2247
-24.0
EV
5489
N712EV
LGA
CHO
45.0
305
21
25
2013-09-30 21:00:00
336747
2013
9
30
2127.0
2129
-2.0
2314.0
2323
-9.0
EV
3833
N16546
EWR
CLT
72.0
529
21
29
2013-09-30 21:00:00
336748
2013
9
30
2128.0
2130
-2.0
2328.0
2359
-31.0
B6
97
N807JB
JFK
DEN
213.0
1626
21
30
2013-09-30 21:00:00
336749
2013
9
30
2129.0
2059
30.0
2230.0
2232
-2.0
EV
5048
N751EV
LGA
RIC
45.0
292
20
59
2013-09-30 20:00:00
336750
2013
9
30
2131.0
2140
-9.0
2225.0
2255
-30.0
MQ
3621
N807MQ
JFK
DCA
36.0
213
21
40
2013-09-30 21:00:00
336751
2013
9
30
2140.0
2140
0.0
10.0
40
-30.0
AA
185
N335AA
JFK
LAX
298.0
2475
21
40
2013-09-30 21:00:00
336752
2013
9
30
2142.0
2129
13.0
2250.0
2239
11.0
EV
4509
N12957
EWR
PWM
47.0
284
21
29
2013-09-30 21:00:00
336753
2013
9
30
2145.0
2145
0.0
115.0
140
-25.0
B6
1103
N633JB
JFK
SJU
192.0
1598
21
45
2013-09-30 21:00:00
336754
2013
9
30
2147.0
2137
10.0
30.0
27
3.0
B6
1371
N627JB
LGA
FLL
139.0
1076
21
37
2013-09-30 21:00:00
336755
2013
9
30
2149.0
2156
-7.0
2245.0
2308
-23.0
UA
523
N813UA
EWR
BOS
37.0
200
21
56
2013-09-30 21:00:00
336756
2013
9
30
2150.0
2159
-9.0
2250.0
2306
-16.0
EV
3842
N10575
EWR
MHT
39.0
209
21
59
2013-09-30 21:00:00
336757
2013
9
30
2159.0
1845
194.0
2344.0
2030
194.0
9E
3320
N906XJ
JFK
BUF
50.0
301
18
45
2013-09-30 18:00:00
336758
2013
9
30
2203.0
2205
-2.0
2339.0
2331
8.0
EV
5311
N722EV
LGA
BGR
61.0
378
22
5
2013-09-30 22:00:00
336759
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
3660
N532MQ
LGA
BNA
97.0
764
21
40
2013-09-30 21:00:00
336760
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
4672
N12145
EWR
STL
120.0
872
20
59
2013-09-30 20:00:00
336761
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
108
N193JB
JFK
PWM
48.0
273
22
45
2013-09-30 22:00:00
336762
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
471
N578UA
EWR
SFO
318.0
2565
21
13
2013-09-30 21:00:00
336763
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
1083
N804JB
JFK
MCO
123.0
944
20
1
2013-09-30 20:00:00
336764
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
234
N318JB
JFK
BTV
43.0
266
22
45
2013-09-30 22:00:00
336765
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
1816
N354JB
JFK
SYR
41.0
209
22
45
2013-09-30 22:00:00
336766
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
2002
N281JB
JFK
BUF
52.0
301
22
50
2013-09-30 22:00:00
336767
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
486
N346JB
JFK
ROC
47.0
264
22
46
2013-09-30 22:00:00
336768
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
718
N565JB
JFK
BOS
33.0
187
22
55
2013-09-30 22:00:00
336769
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
745
N516JB
JFK
PSE
196.0
1617
23
59
2013-09-30 23:00:00
336770
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
5274
N740EV
LGA
BNA
NaN
764
18
42
2013-09-30 18:00:00
336771
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
3393
NaN
JFK
DCA
NaN
213
14
55
2013-09-30 14:00:00
336772
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
3525
NaN
LGA
SYR
NaN
198
22
0
2013-09-30 22:00:00
336773
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
3461
N535MQ
LGA
BNA
NaN
764
12
10
2013-09-30 12:00:00
336774
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
3572
N511MQ
LGA
CLE
NaN
419
11
59
2013-09-30 11:00:00
336775
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
3531
N839MQ
LGA
RDU
NaN
431
8
40
2013-09-30 08:00:00
336776 rows × 19 columns
这里我们主要关注统计数据和可视化。我们来看一下按月统计的晚点时间的均值。
In [18]:
mean_delay_by_month = df.groupby(['month'])['arr_delay'].mean()
mean_delay_by_month
Out[18]:
month
1 6.129972
2 5.613019
3 5.807577
4 11.176063
5 3.521509
6 16.481330
7 16.711307
8 6.040652
9 -4.018364
10 -0.167063
11 0.461347
12 14.870355
Name: arr_delay, dtype: float64
In [19]:
mean_month_plt = mean_delay_by_month.plot(kind='bar', title='Mean Delay By Month')
mean_month_plt
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b18d6ac8>
注意,这里9、10月均值会有负值。
In [20]:
mean_delay_by_month_ord = df[(df.dest == 'ORD')].groupby(['month'])['arr_delay'].mean()
print("Flights to Chicago (ORD)")
print(mean_delay_by_month_ord)
mean_month_plt_ord = mean_delay_by_month_ord.plot(kind='bar', title="Mean Delay By Month (Chicago)")
mean_month_plt_ord
Flights to Chicago (ORD)
month
1 7.287694
2 3.680794
3 -2.702473
4 19.179352
5 7.938280
6 13.299376
7 8.405514
8 4.256851
9 -4.745370
10 -1.597090
11 2.071058
12 16.462817
Name: arr_delay, dtype: float64
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b19e7f98>
In [21]:
# 再看看Los Angeles进行比较一下
mean_delay_by_month_lax = df[(df.dest == 'LAX')].groupby(['month'])['arr_delay'].mean()
print("Flights to Chicago (LAX)")
print(mean_delay_by_month_lax)
mean_month_plt_lax = mean_delay_by_month_lax.plot(kind='bar', title="Mean Delay By Month (Los Angeles)")
mean_month_plt_lax
Flights to Chicago (LAX)
month
1 -4.160312
2 -7.601824
3 -5.280928
4 3.085153
5 -7.150657
6 13.007027
7 8.191432
8 1.028667
9 -8.719044
10 -1.205694
11 -0.103290
12 10.724460
Name: arr_delay, dtype: float64
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x259b19d8550>
从上面的图表中我们可以直观的看到一些特征。现在我们再来看看每个航空公司晚点的情况,并进行一些可视化。
In [22]:
# 看看是否不同的航空公司对晚点会有不同的影响
df[['carrier', 'arr_delay']].groupby('carrier').mean().plot(kind='bar', figsize=(12, 8))
plt.xticks(rotation=0)
plt.xlabel('Carrier')
plt.ylabel('Average Delay in Min')
plt.title('Average Arrival Delay by Carrier in 2008, All airports')
df[['carrier', 'dep_delay']].groupby('carrier').mean().plot(kind='bar', figsize=(12, 8))
plt.xticks(rotation=0)
plt.xlabel('Carrier')
plt.ylabel('Average Delay in Min')
plt.title('Average Departure Delay by Carrier in 2008, All airports')
Out[22]:
Text(0.5,1,'Average Departure Delay by Carrier in 2008, All airports')
从上面的图表里我们可以看到F9(Front Airlines)几乎是最经常晚点的,而夏威夷(HA)在这方面表现最好。
我们有多个数据集,天气、机场的。现在我们来看一下如何把两个表连接在一起
In [25]:
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
weather
Out[25]:
origin
year
month
day
hour
temp
dewp
humid
wind_dir
wind_speed
wind_gust
precip
pressure
visib
time_hour
0
EWR
2013
1
1
0
37.04
21.92
53.97
230.0
10.35702
11.918651
0.00
1013.9
10.00
2012-12-31 19:00:00
1
EWR
2013
1
1
1
37.04
21.92
53.97
230.0
13.80936
15.891535
0.00
1013.0
10.00
2012-12-31 20:00:00
2
EWR
2013
1
1
2
37.94
21.92
52.09
230.0
12.65858
14.567241
0.00
1012.6
10.00
2012-12-31 21:00:00
3
EWR
2013
1
1
3
37.94
23.00
54.51
230.0
13.80936
15.891535
0.00
1012.7
10.00
2012-12-31 22:00:00
4
EWR
2013
1
1
4
37.94
24.08
57.04
240.0
14.96014
17.215830
0.00
1012.8
10.00
2012-12-31 23:00:00
5
EWR
2013
1
1
6
39.02
26.06
59.37
270.0
10.35702
11.918651
0.00
1012.0
10.00
2013-01-01 01:00:00
6
EWR
2013
1
1
7
39.02
26.96
61.63
250.0
8.05546
9.270062
0.00
1012.3
10.00
2013-01-01 02:00:00
7
EWR
2013
1
1
8
39.02
28.04
64.43
240.0
11.50780
13.242946
0.00
1012.5
10.00
2013-01-01 03:00:00
8
EWR
2013
1
1
9
39.92
28.04
62.21
250.0
12.65858
14.567241
0.00
1012.2
10.00
2013-01-01 04:00:00
9
EWR
2013
1
1
10
39.02
28.04
64.43
260.0
12.65858
14.567241
0.00
1011.9
10.00
2013-01-01 05:00:00
10
EWR
2013
1
1
11
37.94
28.04
67.21
240.0
11.50780
13.242946
0.00
1012.4
10.00
2013-01-01 06:00:00
11
EWR
2013
1
1
12
39.02
28.04
64.43
240.0
14.96014
17.215830
0.00
1012.2
10.00
2013-01-01 07:00:00
12
EWR
2013
1
1
13
39.92
28.04
62.21
250.0
10.35702
11.918651
0.00
1012.2
10.00
2013-01-01 08:00:00
13
EWR
2013
1
1
14
39.92
28.04
62.21
260.0
14.96014
17.215830
0.00
1012.7
10.00
2013-01-01 09:00:00
14
EWR
2013
1
1
15
41.00
28.04
59.65
260.0
13.80936
15.891535
0.00
1012.4
10.00
2013-01-01 10:00:00
15
EWR
2013
1
1
16
41.00
26.96
57.06
260.0
14.96014
17.215830
0.00
1011.4
10.00
2013-01-01 11:00:00
16
EWR
2013
1
1
17
39.20
28.40
64.93
270.0
16.11092
18.540125
0.00
NaN
10.00
2013-01-01 12:00:00
17
EWR
2013
1
1
18
39.20
28.40
64.93
330.0
14.96014
17.215830
0.00
NaN
10.00
2013-01-01 13:00:00
18
EWR
2013
1
1
19
39.02
24.08
54.68
280.0
13.80936
15.891535
0.00
1010.8
10.00
2013-01-01 14:00:00
19
EWR
2013
1
1
20
37.94
24.08
57.04
290.0
9.20624
10.594357
0.00
1011.9
10.00
2013-01-01 15:00:00
20
EWR
2013
1
1
21
37.04
19.94
49.62
300.0
13.80936
15.891535
0.00
1012.1
10.00
2013-01-01 16:00:00
21
EWR
2013
1
1
22
35.96
19.04
49.83
330.0
11.50780
13.242946
0.00
1013.2
10.00
2013-01-01 17:00:00
22
EWR
2013
1
1
23
33.98
15.08
45.43
310.0
12.65858
14.567241
0.00
1014.1
10.00
2013-01-01 18:00:00
23
EWR
2013
1
2
0
33.08
12.92
42.84
320.0
10.35702
11.918651
0.00
1014.4
10.00
2013-01-01 19:00:00
24
EWR
2013
1
2
1
32.00
15.08
49.19
310.0
14.96014
17.215830
0.00
1015.2
10.00
2013-01-01 20:00:00
25
EWR
2013
1
2
2
30.02
12.92
48.48
320.0
18.41248
21.188714
0.00
1016.0
10.00
2013-01-01 21:00:00
26
EWR
2013
1
2
3
28.94
12.02
48.69
320.0
18.41248
21.188714
0.00
1016.5
10.00
2013-01-01 22:00:00
27
EWR
2013
1
2
4
28.04
10.94
48.15
310.0
16.11092
18.540125
0.00
1016.4
10.00
2013-01-01 23:00:00
28
EWR
2013
1
2
5
26.96
10.94
50.34
310.0
14.96014
17.215830
0.00
1016.3
10.00
2013-01-02 00:00:00
29
EWR
2013
1
2
6
26.06
10.94
52.25
330.0
12.65858
14.567241
0.00
1016.3
10.00
2013-01-02 01:00:00
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
26100
LGA
2013
12
29
18
42.80
37.40
81.07
70.0
9.20624
10.594357
0.06
NaN
2.50
2013-12-29 13:00:00
26101
LGA
2013
12
29
19
39.20
37.40
93.19
40.0
16.11092
18.540125
0.06
NaN
1.75
2013-12-29 14:00:00
26102
LGA
2013
12
29
20
41.00
39.02
92.59
40.0
13.80936
15.891535
0.37
999.9
1.50
2013-12-29 15:00:00
26103
LGA
2013
12
29
21
41.00
37.94
88.76
350.0
8.05546
9.270062
0.28
998.7
1.50
2013-12-29 16:00:00
26104
LGA
2013
12
29
22
41.00
39.20
93.24
50.0
20.71404
23.837303
0.03
NaN
2.00
2013-12-29 17:00:00
26105
LGA
2013
12
29
23
42.08
39.02
88.81
330.0
14.96014
17.215830
0.00
997.2
3.00
2013-12-29 18:00:00
26106
LGA
2013
12
30
0
42.80
37.40
81.07
320.0
17.26170
19.864419
0.00
NaN
8.00
2013-12-29 19:00:00
26107
LGA
2013
12
30
1
41.00
37.40
86.89
310.0
19.56326
22.513008
0.00
NaN
6.00
2013-12-29 20:00:00
26108
LGA
2013
12
30
2
42.80
37.40
81.07
320.0
16.11092
18.540125
0.00
NaN
10.00
2013-12-29 21:00:00
26109
LGA
2013
12
30
3
42.98
37.04
79.38
300.0
9.20624
10.594357
0.00
1003.8
10.00
2013-12-29 22:00:00
26110
LGA
2013
12
30
4
42.98
35.06
73.39
310.0
17.26170
19.864419
0.00
1005.1
10.00
2013-12-29 23:00:00
26111
LGA
2013
12
30
5
42.08
33.98
72.78
320.0
11.50780
13.242946
0.00
1005.9
10.00
2013-12-30 00:00:00
26112
LGA
2013
12
30
6
42.08
33.98
72.78
250.0
9.20624
10.594357
0.00
1007.6
10.00
2013-12-30 01:00:00
26113
LGA
2013
12
30
7
41.00
33.98
75.88
240.0
8.05546
9.270062
0.00
1008.3
10.00
2013-12-30 02:00:00
26114
LGA
2013
12
30
8
42.98
33.98
70.30
270.0
9.20624
10.594357
0.00
1008.2
10.00
2013-12-30 03:00:00
26115
LGA
2013
12
30
9
41.00
33.08
73.19
0.0
0.00000
0.000000
0.00
1008.9
10.00
2013-12-30 04:00:00
26116
LGA
2013
12
30
10
42.98
33.08
67.81
250.0
10.35702
11.918651
0.00
1009.2
10.00
2013-12-30 05:00:00
26117
LGA
2013
12
30
11
42.98
33.98
70.30
230.0
6.90468
7.945768
0.00
1010.8
10.00
2013-12-30 06:00:00
26118
LGA
2013
12
30
12
44.06
35.06
70.42
240.0
11.50780
13.242946
0.00
1011.9
10.00
2013-12-30 07:00:00
26119
LGA
2013
12
30
13
44.06
33.98
67.45
260.0
11.50780
13.242946
0.00
1012.9
10.00
2013-12-30 08:00:00
26120
LGA
2013
12
30
14
44.06
33.08
65.07
260.0
13.80936
15.891535
0.00
1013.7
10.00
2013-12-30 09:00:00
26121
LGA
2013
12
30
15
42.80
33.80
70.28
330.0
16.11092
18.540125
0.00
NaN
10.00
2013-12-30 10:00:00
26122
LGA
2013
12
30
16
41.00
28.40
60.54
340.0
13.80936
15.891535
0.00
NaN
10.00
2013-12-30 11:00:00
26123
LGA
2013
12
30
17
37.94
23.00
54.51
330.0
21.86482
25.161598
0.00
1015.7
10.00
2013-12-30 12:00:00
26124
LGA
2013
12
30
18
37.04
21.92
53.97
340.0
17.26170
19.864419
0.00
1016.5
10.00
2013-12-30 13:00:00
26125
LGA
2013
12
30
19
35.96
19.94
51.78
340.0
13.80936
15.891535
0.00
1017.1
10.00
2013-12-30 14:00:00
26126
LGA
2013
12
30
20
33.98
17.06
49.51
330.0
17.26170
19.864419
0.00
1018.8
10.00
2013-12-30 15:00:00
26127
LGA
2013
12
30
21
32.00
15.08
49.19
340.0
14.96014
17.215830
0.00
1019.5
10.00
2013-12-30 16:00:00
26128
LGA
2013
12
30
22
30.92
12.92
46.74
320.0
17.26170
19.864419
0.00
1019.9
10.00
2013-12-30 17:00:00
26129
LGA
2013
12
30
23
28.94
10.94
46.41
330.0
18.41248
21.188714
0.00
1020.9
10.00
2013-12-30 18:00:00
26130 rows × 15 columns
In [26]:
df_withweather = pd.merge(df, weather, how='left', on=['year', 'month', 'day', 'hour'])
df_withweather
Out[26]:
year
month
day
dep_time
sched_dep_time
dep_delay
arr_time
sched_arr_time
arr_delay
carrier
...
temp
dewp
humid
wind_dir
wind_speed
wind_gust
precip
pressure
visib
time_hour_y
0
2013
1
1
517.0
515
2.0
830.0
819
11.0
UA
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1
2013
1
1
533.0
529
4.0
850.0
830
20.0
UA
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2
2013
1
1
542.0
540
2.0
923.0
850
33.0
AA
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
3
2013
1
1
544.0
545
-1.0
1004.0
1022
-18.0
B6
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
4
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
5
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
6
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
7
2013
1
1
554.0
558
-4.0
740.0
728
12.0
UA
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
8
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
9
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
10
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
11
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
12
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
13
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
14
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
15
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
16
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
17
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
18
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
19
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
20
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
21
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
22
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
23
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
24
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
25
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
26
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
27
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
39.02
26.06
59.37
260.0
12.65858
14.567241
0.0
1012.6
10.0
2013-01-01 01:00:00
28
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
39.92
26.06
57.33
260.0
13.80936
15.891535
0.0
1011.9
10.0
2013-01-01 01:00:00
29
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
39.02
26.06
59.37
270.0
10.35702
11.918651
0.0
1012.0
10.0
2013-01-01 01:00:00
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1007561
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
68.00
53.06
58.80
150.0
5.75390
6.621473
0.0
1015.4
10.0
2013-09-30 17:00:00
1007562
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
64.04
55.94
74.94
180.0
6.90468
7.945768
0.0
1016.0
10.0
2013-09-30 17:00:00
1007563
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
66.92
51.98
58.65
160.0
9.20624
10.594357
0.0
1015.4
10.0
2013-09-30 17:00:00
1007564
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
68.00
53.06
58.80
150.0
5.75390
6.621473
0.0
1015.4
10.0
2013-09-30 17:00:00
1007565
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
64.04
55.94
74.94
180.0
6.90468
7.945768
0.0
1016.0
10.0
2013-09-30 17:00:00
1007566
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
66.92
51.98
58.65
160.0
9.20624
10.594357
0.0
1015.4
10.0
2013-09-30 17:00:00
1007567
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
68.00
53.06
58.80
150.0
5.75390
6.621473
0.0
1015.4
10.0
2013-09-30 17:00:00
1007568
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
64.04
55.94
74.94
180.0
6.90468
7.945768
0.0
1016.0
10.0
2013-09-30 17:00:00
1007569
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
66.92
51.98
58.65
160.0
9.20624
10.594357
0.0
1015.4
10.0
2013-09-30 17:00:00
1007570
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
66.02
53.96
65.07
160.0
4.60312
5.297178
0.0
1015.6
10.0
2013-09-30 18:00:00
1007571
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
62.06
55.94
80.34
190.0
5.75390
6.621473
0.0
1016.0
10.0
2013-09-30 18:00:00
1007572
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
64.94
53.96
67.57
200.0
9.20624
10.594357
0.0
1015.5
10.0
2013-09-30 18:00:00
1007573
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
71.96
46.94
40.90
90.0
5.75390
6.621473
0.0
1015.7
10.0
2013-09-30 13:00:00
1007574
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
68.00
55.04
63.21
190.0
11.50780
13.242946
0.0
1016.6
10.0
2013-09-30 13:00:00
1007575
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
71.96
46.94
40.90
NaN
3.45234
3.972884
0.0
1015.8
10.0
2013-09-30 13:00:00
1007576
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
66.92
50.00
54.51
50.0
3.45234
3.972884
0.0
1018.3
10.0
2013-09-30 09:00:00
1007577
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
68.00
48.92
50.44
20.0
5.75390
6.621473
0.0
1018.4
10.0
2013-09-30 09:00:00
1007578
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
64.94
50.00
58.39
NaN
3.45234
3.972884
0.0
1018.2
10.0
2013-09-30 09:00:00
1007579
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
68.00
53.06
58.80
150.0
5.75390
6.621473
0.0
1015.4
10.0
2013-09-30 17:00:00
1007580
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
64.04
55.94
74.94
180.0
6.90468
7.945768
0.0
1016.0
10.0
2013-09-30 17:00:00
1007581
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
66.92
51.98
58.65
160.0
9.20624
10.594357
0.0
1015.4
10.0
2013-09-30 17:00:00
1007582
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
59.00
53.96
83.34
10.0
6.90468
7.945768
0.0
1018.8
10.0
2013-09-30 07:00:00
1007583
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
62.06
53.96
74.75
40.0
5.75390
6.621473
0.0
1018.8
10.0
2013-09-30 07:00:00
1007584
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
60.98
51.08
69.86
NaN
5.75390
6.621473
0.0
1018.6
10.0
2013-09-30 07:00:00
1007585
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
55.94
53.96
93.05
20.0
5.75390
6.621473
0.0
1018.7
10.0
2013-09-30 06:00:00
1007586
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
57.92
53.96
86.63
360.0
6.90468
7.945768
0.0
1018.9
10.0
2013-09-30 06:00:00
1007587
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
60.08
51.98
74.56
360.0
4.60312
5.297178
0.0
1018.5
10.0
2013-09-30 06:00:00
1007588
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
53.96
51.08
89.95
350.0
4.60312
5.297178
0.0
1018.1
10.0
2013-09-30 03:00:00
1007589
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
53.06
50.00
89.31
30.0
3.45234
3.972884
0.0
1018.1
10.0
2013-09-30 03:00:00
1007590
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
59.00
53.06
80.64
10.0
3.45234
3.972884
0.0
1017.8
10.0
2013-09-30 03:00:00
1007591 rows × 30 columns
In [27]:
airports = pd.read_csv('data/nycflights13/airports.csv.gz')
airports
Out[27]:
faa
name
lat
lon
alt
tz
dst
tzone
0
04G
Lansdowne Airport
41.130472
-80.619583
1044
-5
A
America/New_York
1
06A
Moton Field Municipal Airport
32.460572
-85.680028
264
-6
A
America/Chicago
2
06C
Schaumburg Regional
41.989341
-88.101243
801
-6
A
America/Chicago
3
06N
Randall Airport
41.431912
-74.391561
523
-5
A
America/New_York
4
09J
Jekyll Island Airport
31.074472
-81.427778
11
-5
A
America/New_York
5
0A9
Elizabethton Municipal Airport
36.371222
-82.173417
1593
-5
A
America/New_York
6
0G6
Williams County Airport
41.467306
-84.506778
730
-5
A
America/New_York
7
0G7
Finger Lakes Regional Airport
42.883565
-76.781232
492
-5
A
America/New_York
8
0P2
Shoestring Aviation Airfield
39.794824
-76.647191
1000
-5
U
America/New_York
9
0S9
Jefferson County Intl
48.053809
-122.810644
108
-8
A
America/Los_Angeles
10
0W3
Harford County Airport
39.566838
-76.202403
409
-5
A
America/New_York
11
10C
Galt Field Airport
42.402889
-88.375111
875
-6
U
America/Chicago
12
17G
Port Bucyrus-Crawford County Airport
40.781556
-82.974806
1003
-5
A
America/New_York
13
19A
Jackson County Airport
34.175864
-83.561597
951
-5
U
America/New_York
14
1A3
Martin Campbell Field Airport
35.015806
-84.346833
1789
-5
A
America/New_York
15
1B9
Mansfield Municipal
42.000133
-71.196771
122
-5
A
America/New_York
16
1C9
Frazier Lake Airpark
54.013333
-124.768333
152
-8
A
America/Vancouver
17
1CS
Clow International Airport
41.695974
-88.129231
670
-6
U
America/Chicago
18
1G3
Kent State Airport
41.151389
-81.415111
1134
-5
A
America/New_York
19
1G4
Grand Canyon West Airport
35.899904
-113.815674
4813
-7
A
America/Phoenix
20
1H2
Effingham Memorial Airport
39.070000
-88.534000
585
-6
A
America/Chicago
21
1OH
Fortman Airport
40.555325
-84.386619
885
-5
U
America/New_York
22
1RL
Point Roberts Airpark
48.979722
-123.078889
10
-8
A
America/Los_Angeles
23
23M
Clarke CO
32.051700
-88.443400
320
-6
A
America/Chicago
24
24C
Lowell City Airport
42.953920
-85.343906
681
-5
A
America/New_York
25
24J
Suwannee County Airport
30.300125
-83.024694
104
-5
A
America/New_York
26
25D
Forest Lake Airport
45.247746
-92.994385
925
-6
A
America/Chicago
27
29D
Grove City Airport
41.146028
-80.167750
1371
-5
A
America/New_York
28
2A0
Mark Anton Airport
35.486250
-84.931083
718
-5
A
America/New_York
29
2B2
Plum Island Airport
42.795361
-70.839444
11
-5
A
America/New_York
...
...
...
...
...
...
...
...
...
1428
X59
Valkaria Municipal
27.960861
-80.558333
26
-5
A
America/New_York
1429
XFL
Flagler County Airport
29.282100
-81.121200
33
-5
A
America/New_York
1430
XNA
NW Arkansas Regional
36.281869
-94.306811
1287
-6
A
America/Chicago
1431
XZK
Amherst Amtrak Station AMM
42.375000
-72.511389
258
-5
A
America/New_York
1432
Y51
Municipal Airport
43.579360
-90.896474
1292
-6
A
America/Chicago
1433
Y72
Bloyer Field
43.976222
-90.480611
966
-6
A
America/Chicago
1434
YAK
Yakutat
59.301200
-139.393700
33
-9
A
\N
1435
YIP
Willow Run
42.237928
-83.530408
716
-5
A
America/New_York
1436
YKM
Yakima Air Terminal McAllister Field
46.568200
-120.544000
1095
-8
A
America/Los_Angeles
1437
YKN
Chan Gurney
42.871100
-97.396900
1200
-6
A
America/Chicago
1438
YNG
Youngstown Warren Rgnl
41.260736
-80.679097
1196
-5
A
America/New_York
1439
YUM
Yuma Mcas Yuma Intl
32.656578
-114.605980
216
-7
N
America/Phoenix
1440
Z84
Clear
64.301203
-149.120144
552
-9
A
America/Anchorage
1441
ZBP
Penn Station
39.307222
-76.615556
66
-5
A
America/New_York
1442
ZFV
Philadelphia 30th St Station
39.955700
-75.182000
0
-5
A
America/New_York
1443
ZPH
Municipal Airport
28.228056
-82.155833
90
-5
A
America/New_York
1444
ZRA
Atlantic City Rail Terminal
39.366500
-74.442000
8
-5
A
America/New_York
1445
ZRD
Train Station
37.534300
-77.429450
26
-5
A
America/New_York
1446
ZRP
Newark Penn Station
40.734722
-74.164167
0
-5
A
America/New_York
1447
ZRT
Hartford Union Station
41.768880
-72.681500
0
-5
A
America/New_York
1448
ZRZ
New Carrollton Rail Station
38.948000
-76.871900
39
-5
A
America/New_York
1449
ZSF
Springfield Amtrak Station
42.106000
-72.593054
65
-5
A
America/New_York
1450
ZSY
Scottsdale Airport
33.622889
-111.910528
1519
-7
A
America/Phoenix
1451
ZTF
Stamford Amtrak Station
41.046937
-73.541493
0
-5
A
America/New_York
1452
ZTY
Boston Back Bay Station
42.347800
-71.075000
20
-5
A
America/New_York
1453
ZUN
Black Rock
35.083228
-108.791778
6454
-7
A
America/Denver
1454
ZVE
New Haven Rail Station
41.298669
-72.925992
7
-5
A
America/New_York
1455
ZWI
Wilmington Amtrak Station
39.736667
-75.551667
0
-5
A
America/New_York
1456
ZWU
Washington Union Station
38.897460
-77.006430
76
-5
A
America/New_York
1457
ZYP
Penn Station
40.750500
-73.993500
35
-5
A
America/New_York
1458 rows × 8 columns
In [28]:
df_withairport = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')
df_withairport
Out[28]:
year
month
day
dep_time
sched_dep_time
dep_delay
arr_time
sched_arr_time
arr_delay
carrier
...
visib
time_hour_y
faa
name
lat
lon
alt
tz
dst
tzone
0
2013
1
1
517.0
515
2.0
830.0
819
11.0
UA
...
NaN
NaN
IAH
George Bush Intercontinental
29.984433
-95.341442
97.0
-6.0
A
America/Chicago
1
2013
1
1
533.0
529
4.0
850.0
830
20.0
UA
...
NaN
NaN
IAH
George Bush Intercontinental
29.984433
-95.341442
97.0
-6.0
A
America/Chicago
2
2013
1
1
542.0
540
2.0
923.0
850
33.0
AA
...
NaN
NaN
MIA
Miami Intl
25.793250
-80.290556
8.0
-5.0
A
America/New_York
3
2013
1
1
544.0
545
-1.0
1004.0
1022
-18.0
B6
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
4
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
5
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
6
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
7
2013
1
1
554.0
558
-4.0
740.0
728
12.0
UA
...
NaN
NaN
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
8
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
9
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
10
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
11
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
12
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
13
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
14
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
15
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
16
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
17
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
18
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
19
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
20
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
21
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
22
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
23
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
24
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
25
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
26
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
27
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
28
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
29
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1007561
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007562
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007563
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007564
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007565
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007566
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007567
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007568
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007569
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007570
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
10.0
2013-09-30 18:00:00
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1007571
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
10.0
2013-09-30 18:00:00
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1007572
2013
9
30
2349.0
2359
-10.0
325.0
350
-25.0
B6
...
10.0
2013-09-30 18:00:00
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1007573
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
10.0
2013-09-30 13:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007574
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
10.0
2013-09-30 13:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007575
2013
9
30
NaN
1842
NaN
NaN
2019
NaN
EV
...
10.0
2013-09-30 13:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007576
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
10.0
2013-09-30 09:00:00
DCA
Ronald Reagan Washington Natl
38.852083
-77.037722
15.0
-5.0
A
America/New_York
1007577
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
10.0
2013-09-30 09:00:00
DCA
Ronald Reagan Washington Natl
38.852083
-77.037722
15.0
-5.0
A
America/New_York
1007578
2013
9
30
NaN
1455
NaN
NaN
1634
NaN
9E
...
10.0
2013-09-30 09:00:00
DCA
Ronald Reagan Washington Natl
38.852083
-77.037722
15.0
-5.0
A
America/New_York
1007579
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007580
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007581
2013
9
30
NaN
2200
NaN
NaN
2312
NaN
9E
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007582
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
10.0
2013-09-30 07:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007583
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
10.0
2013-09-30 07:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007584
2013
9
30
NaN
1210
NaN
NaN
1330
NaN
MQ
...
10.0
2013-09-30 07:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007585
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
10.0
2013-09-30 06:00:00
CLE
Cleveland Hopkins Intl
41.411689
-81.849794
791.0
-5.0
A
America/New_York
1007586
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
10.0
2013-09-30 06:00:00
CLE
Cleveland Hopkins Intl
41.411689
-81.849794
791.0
-5.0
A
America/New_York
1007587
2013
9
30
NaN
1159
NaN
NaN
1344
NaN
MQ
...
10.0
2013-09-30 06:00:00
CLE
Cleveland Hopkins Intl
41.411689
-81.849794
791.0
-5.0
A
America/New_York
1007588
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
10.0
2013-09-30 03:00:00
RDU
Raleigh Durham Intl
35.877639
-78.787472
435.0
-5.0
A
America/New_York
1007589
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
10.0
2013-09-30 03:00:00
RDU
Raleigh Durham Intl
35.877639
-78.787472
435.0
-5.0
A
America/New_York
1007590
2013
9
30
NaN
840
NaN
NaN
1020
NaN
MQ
...
10.0
2013-09-30 03:00:00
RDU
Raleigh Durham Intl
35.877639
-78.787472
435.0
-5.0
A
America/New_York
1007591 rows × 38 columns
In [29]:
import numpy as np
a = np.array([1, 2, 3])
a
Out[29]:
array([1, 2, 3])
注意这里我们传的是列表,而不是np.array(1, 2, 3)
。
现在我们创建一个arange
In [30]:
np.arange(10)
Out[30]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [31]:
# 给序列乘以一个系数
np.arange(10) * np.pi
Out[31]:
array([ 0. , 3.14159265, 6.28318531, 9.42477796,
12.56637061, 15.70796327, 18.84955592, 21.99114858,
25.13274123, 28.27433388])
我们也可以使用shape
方法从一维数组创建多维数组
In [32]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape = (2, 3)
a
Out[32]:
array([[1, 2, 3],
[4, 5, 6]])
In [35]:
np.matrix('1 2; 3 4')
Out[35]:
matrix([[1, 2],
[3, 4]])
In [40]:
#矩阵乘
a1 = np.matrix('1 2; 3 4')
a2 = np.matrix('3 4; 5 7')
a1 * a2
Out[40]:
matrix([[13, 18],
[29, 40]])
In [41]:
#array转换为矩阵
mat_a = np.mat(a1)
mat_a
Out[41]:
matrix([[1, 2],
[3, 4]])
In [42]:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) #50%稀疏矩阵
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
x_dok
Out[42]:
<1x100000 sparse matrix of type '<class 'numpy.float64'>'
with 50077 stored elements in Dictionary Of Keys format>
In [43]:
import csv
with open('data/array/array.csv', 'r') as csvfile:
csvreader = csv.reader(csvfile)
data = []
for row in csvreader:
row = [float(x) for x in row]
data.append(row)
data
Out[43]:
[[2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0], [7.0, 9.0, 9.0, 10.0]]
In [44]:
import numpy as np
import scipy as sp
a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])
x = np.linalg.solve(a, b)
x
Out[44]:
array([ 2., -2., 9.])
In [45]:
#检查结果是否正确
np.dot(a, x) == b
Out[45]:
array([ True, True, True], dtype=bool)
前面我们介绍了pandas和numpy、scipy。现在我们来介绍python机器库Scikit。首先需要先知道机器学习的两种:
Scikit-learn有一下特性:
这里我们还是看nycflights13的数据集。
In [47]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')
df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')
df = df.dropna()
df
Out[47]:
year
month
day
dep_time
sched_dep_time
dep_delay
arr_time
sched_arr_time
arr_delay
carrier
...
visib
time_hour_y
faa
name
lat
lon
alt
tz
dst
tzone
4
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
5
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
6
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
8
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
9
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
10
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
11
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
12
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
13
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
14
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
15
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
16
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
17
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
18
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
19
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
20
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
21
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
22
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
23
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
24
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
25
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
26
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
27
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
28
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
29
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
30
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
31
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
32
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
33
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
34
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1007540
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007541
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007542
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007543
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007544
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007545
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007546
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007547
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007548
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007549
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007550
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007551
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007552
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007553
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007554
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007555
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007556
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007557
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007558
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007559
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007560
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007561
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007562
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007563
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007564
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007565
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007566
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007567
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007568
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007569
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
838182 rows × 38 columns
In [48]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]
pd.options.mode.chained_assignment = None #default='warn'
# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]
# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]
In [49]:
features_v
Out[49]:
month
day
dep_time
arr_time
carrier
dest
air_time
distance
lat
lon
alt
dewp
humid
wind_speed
wind_gust
precip
pressure
visib
4
1
1
554.0
812.0
0
0
116.0
762
33.636719
-84.428067
1026.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
5
1
1
554.0
812.0
0
0
116.0
762
33.636719
-84.428067
1026.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
6
1
1
554.0
812.0
0
0
116.0
762
33.636719
-84.428067
1026.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
8
1
1
555.0
913.0
1
1
158.0
1065
26.072583
-80.152750
9.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
9
1
1
555.0
913.0
1
1
158.0
1065
26.072583
-80.152750
9.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
10
1
1
555.0
913.0
1
1
158.0
1065
26.072583
-80.152750
9.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
11
1
1
557.0
709.0
2
2
53.0
229
38.944533
-77.455811
313.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
12
1
1
557.0
709.0
2
2
53.0
229
38.944533
-77.455811
313.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
13
1
1
557.0
709.0
2
2
53.0
229
38.944533
-77.455811
313.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
14
1
1
557.0
838.0
1
3
140.0
944
28.429394
-81.308994
96.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
15
1
1
557.0
838.0
1
3
140.0
944
28.429394
-81.308994
96.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
16
1
1
557.0
838.0
1
3
140.0
944
28.429394
-81.308994
96.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
17
1
1
558.0
753.0
3
4
138.0
733
41.978603
-87.904842
668.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
18
1
1
558.0
753.0
3
4
138.0
733
41.978603
-87.904842
668.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
19
1
1
558.0
753.0
3
4
138.0
733
41.978603
-87.904842
668.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
20
1
1
558.0
849.0
1
5
149.0
1028
26.683161
-80.095589
19.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
21
1
1
558.0
849.0
1
5
149.0
1028
26.683161
-80.095589
19.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
22
1
1
558.0
849.0
1
5
149.0
1028
26.683161
-80.095589
19.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
23
1
1
558.0
853.0
1
6
158.0
1005
27.975472
-82.533250
26.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
24
1
1
558.0
853.0
1
6
158.0
1005
27.975472
-82.533250
26.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
25
1
1
558.0
853.0
1
6
158.0
1005
27.975472
-82.533250
26.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
26
1
1
558.0
924.0
4
7
345.0
2475
33.942536
-118.408075
126.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
27
1
1
558.0
924.0
4
7
345.0
2475
33.942536
-118.408075
126.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
28
1
1
558.0
924.0
4
7
345.0
2475
33.942536
-118.408075
126.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
29
1
1
558.0
923.0
4
8
361.0
2565
37.618972
-122.374889
13.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
30
1
1
558.0
923.0
4
8
361.0
2565
37.618972
-122.374889
13.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
31
1
1
558.0
923.0
4
8
361.0
2565
37.618972
-122.374889
13.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
32
1
1
559.0
941.0
3
9
257.0
1389
32.896828
-97.037997
607.0
26.06
59.37
10.35702
11.918651
0.0
1012.0
10.0
33
1
1
559.0
941.0
3
9
257.0
1389
32.896828
-97.037997
607.0
26.06
59.37
12.65858
14.567241
0.0
1012.6
10.0
34
1
1
559.0
941.0
3
9
257.0
1389
32.896828
-97.037997
607.0
26.06
57.33
13.80936
15.891535
0.0
1011.9
10.0
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1007540
9
30
2207.0
2257.0
5
45
97.0
764
36.124472
-86.678194
599.0
53.06
54.94
5.75390
6.621473
0.0
1015.0
10.0
1007541
9
30
2207.0
2257.0
5
45
97.0
764
36.124472
-86.678194
599.0
55.04
67.69
6.90468
7.945768
0.0
1015.6
10.0
1007542
9
30
2207.0
2257.0
5
45
97.0
764
36.124472
-86.678194
599.0
51.08
52.67
6.90468
7.945768
0.0
1015.0
10.0
1007543
9
30
2211.0
2339.0
2
40
120.0
872
38.748697
-90.370028
618.0
51.98
49.36
6.90468
7.945768
0.0
1015.2
10.0
1007544
9
30
2211.0
2339.0
2
40
120.0
872
38.748697
-90.370028
618.0
53.06
58.80
8.05546
9.270062
0.0
1015.8
10.0
1007545
9
30
2211.0
2339.0
2
40
120.0
872
38.748697
-90.370028
618.0
51.08
46.04
6.90468
7.945768
0.0
1015.1
10.0
1007546
9
30
2231.0
2335.0
1
51
48.0
273
43.646161
-70.309281
77.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007547
9
30
2231.0
2335.0
1
51
48.0
273
43.646161
-70.309281
77.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007548
9
30
2231.0
2335.0
1
51
48.0
273
43.646161
-70.309281
77.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
1007549
9
30
2233.0
112.0
4
8
318.0
2565
37.618972
-122.374889
13.0
53.06
54.94
5.75390
6.621473
0.0
1015.0
10.0
1007550
9
30
2233.0
112.0
4
8
318.0
2565
37.618972
-122.374889
13.0
55.04
67.69
6.90468
7.945768
0.0
1015.6
10.0
1007551
9
30
2233.0
112.0
4
8
318.0
2565
37.618972
-122.374889
13.0
51.08
52.67
6.90468
7.945768
0.0
1015.0
10.0
1007552
9
30
2235.0
59.0
1
3
123.0
944
28.429394
-81.308994
96.0
51.98
49.36
6.90468
7.945768
0.0
1015.2
10.0
1007553
9
30
2235.0
59.0
1
3
123.0
944
28.429394
-81.308994
96.0
53.06
58.80
8.05546
9.270062
0.0
1015.8
10.0
1007554
9
30
2235.0
59.0
1
3
123.0
944
28.429394
-81.308994
96.0
51.08
46.04
6.90468
7.945768
0.0
1015.1
10.0
1007555
9
30
2237.0
2345.0
1
47
43.0
266
44.471861
-73.153278
335.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007556
9
30
2237.0
2345.0
1
47
43.0
266
44.471861
-73.153278
335.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007557
9
30
2237.0
2345.0
1
47
43.0
266
44.471861
-73.153278
335.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
1007558
9
30
2240.0
2334.0
1
29
41.0
209
43.111187
-76.106311
421.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007559
9
30
2240.0
2334.0
1
29
41.0
209
43.111187
-76.106311
421.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007560
9
30
2240.0
2334.0
1
29
41.0
209
43.111187
-76.106311
421.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
1007561
9
30
2240.0
2347.0
1
20
52.0
301
42.940525
-78.732167
724.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007562
9
30
2240.0
2347.0
1
20
52.0
301
42.940525
-78.732167
724.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007563
9
30
2240.0
2347.0
1
20
52.0
301
42.940525
-78.732167
724.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
1007564
9
30
2241.0
2345.0
1
28
47.0
264
43.118866
-77.672389
559.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007565
9
30
2241.0
2345.0
1
28
47.0
264
43.118866
-77.672389
559.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007566
9
30
2241.0
2345.0
1
28
47.0
264
43.118866
-77.672389
559.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
1007567
9
30
2307.0
2359.0
1
19
33.0
187
42.364347
-71.005181
19.0
53.06
58.80
5.75390
6.621473
0.0
1015.4
10.0
1007568
9
30
2307.0
2359.0
1
19
33.0
187
42.364347
-71.005181
19.0
55.94
74.94
6.90468
7.945768
0.0
1016.0
10.0
1007569
9
30
2307.0
2359.0
1
19
33.0
187
42.364347
-71.005181
19.0
51.98
58.65
9.20624
10.594357
0.0
1015.4
10.0
838182 rows × 18 columns
In [50]:
# 因为各个特征的维度各不相同,我们需要做标准化
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_v)
In [51]:
scaled_features
Out[51]:
array([[-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
-0.79438586, 0.29213712],
[-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
-0.71389571, 0.29213712],
[-1.62498171, -1.69693092, -1.62908082, ..., -0.11420569,
-0.80780089, 0.29213712],
...,
[ 0.71764151, 1.6218685 , 1.98106144, ..., -0.11420569,
-0.33827498, 0.29213712],
[ 0.71764151, 1.6218685 , 1.98106144, ..., -0.11420569,
-0.25778483, 0.29213712],
[ 0.71764151, 1.6218685 , 1.98106144, ..., -0.11420569,
-0.33827498, 0.29213712]])
In [53]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_r = pca.fit(scaled_features).transform(scaled_features)
In [54]:
X_r
Out[54]:
array([[-0.32039321, -0.88647489],
[-0.32303154, -0.72462068],
[-0.32447915, -0.59863037],
...,
[-2.11022832, 0.78923842],
[-2.10108117, 0.41150146],
[-2.11397223, 1.04517555]])
In [57]:
import matplotlib.pyplot as plt
print('explained variance ratio (first two components): %s'
% str(pca.explained_variance_ratio_))
plt.figure()
lw = 2
plt.scatter(X_r[:,0], X_r[:,1], alpha=.8, lw=lw)
plt.title('PCA of flights dataset')
explained variance ratio (first two components): [ 0.16961823 0.14269873]
Out[57]:
Text(0.5,1,'PCA of flights dataset')
In [61]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn import linear_model, cross_validation, metrics, svm, ensemble
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
In [62]:
flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')
df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')
df = df.dropna()
In [63]:
df
Out[63]:
year
month
day
dep_time
sched_dep_time
dep_delay
arr_time
sched_arr_time
arr_delay
carrier
...
visib
time_hour_y
faa
name
lat
lon
alt
tz
dst
tzone
4
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
5
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
6
2013
1
1
554.0
600
-6.0
812.0
837
-25.0
DL
...
10.0
2013-01-01 01:00:00
ATL
Hartsfield Jackson Atlanta Intl
33.636719
-84.428067
1026.0
-5.0
A
America/New_York
8
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
9
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
10
2013
1
1
555.0
600
-5.0
913.0
854
19.0
B6
...
10.0
2013-01-01 01:00:00
FLL
Fort Lauderdale Hollywood Intl
26.072583
-80.152750
9.0
-5.0
A
America/New_York
11
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
12
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
13
2013
1
1
557.0
600
-3.0
709.0
723
-14.0
EV
...
10.0
2013-01-01 01:00:00
IAD
Washington Dulles Intl
38.944533
-77.455811
313.0
-5.0
A
America/New_York
14
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
15
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
16
2013
1
1
557.0
600
-3.0
838.0
846
-8.0
B6
...
10.0
2013-01-01 01:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
17
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
18
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
19
2013
1
1
558.0
600
-2.0
753.0
745
8.0
AA
...
10.0
2013-01-01 01:00:00
ORD
Chicago Ohare Intl
41.978603
-87.904842
668.0
-6.0
A
America/Chicago
20
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
21
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
22
2013
1
1
558.0
600
-2.0
849.0
851
-2.0
B6
...
10.0
2013-01-01 01:00:00
PBI
Palm Beach Intl
26.683161
-80.095589
19.0
-5.0
A
America/New_York
23
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
24
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
25
2013
1
1
558.0
600
-2.0
853.0
856
-3.0
B6
...
10.0
2013-01-01 01:00:00
TPA
Tampa Intl
27.975472
-82.533250
26.0
-5.0
A
America/New_York
26
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
27
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
28
2013
1
1
558.0
600
-2.0
924.0
917
7.0
UA
...
10.0
2013-01-01 01:00:00
LAX
Los Angeles Intl
33.942536
-118.408075
126.0
-8.0
A
America/Los_Angeles
29
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
30
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
31
2013
1
1
558.0
600
-2.0
923.0
937
-14.0
UA
...
10.0
2013-01-01 01:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
32
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
33
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
34
2013
1
1
559.0
600
-1.0
941.0
910
31.0
AA
...
10.0
2013-01-01 01:00:00
DFW
Dallas Fort Worth Intl
32.896828
-97.037997
607.0
-6.0
A
America/Chicago
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1007540
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007541
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007542
2013
9
30
2207.0
2140
27.0
2257.0
2250
7.0
MQ
...
10.0
2013-09-30 16:00:00
BNA
Nashville Intl
36.124472
-86.678194
599.0
-6.0
A
America/Chicago
1007543
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007544
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007545
2013
9
30
2211.0
2059
72.0
2339.0
2242
57.0
EV
...
10.0
2013-09-30 15:00:00
STL
Lambert St Louis Intl
38.748697
-90.370028
618.0
-6.0
A
America/Chicago
1007546
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007547
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007548
2013
9
30
2231.0
2245
-14.0
2335.0
2356
-21.0
B6
...
10.0
2013-09-30 17:00:00
PWM
Portland Intl Jetport
43.646161
-70.309281
77.0
-5.0
A
America/New_York
1007549
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007550
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007551
2013
9
30
2233.0
2113
80.0
112.0
30
42.0
UA
...
10.0
2013-09-30 16:00:00
SFO
San Francisco Intl
37.618972
-122.374889
13.0
-8.0
A
America/Los_Angeles
1007552
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007553
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007554
2013
9
30
2235.0
2001
154.0
59.0
2249
130.0
B6
...
10.0
2013-09-30 15:00:00
MCO
Orlando Intl
28.429394
-81.308994
96.0
-5.0
A
America/New_York
1007555
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007556
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007557
2013
9
30
2237.0
2245
-8.0
2345.0
2353
-8.0
B6
...
10.0
2013-09-30 17:00:00
BTV
Burlington Intl
44.471861
-73.153278
335.0
-5.0
A
America/New_York
1007558
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007559
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007560
2013
9
30
2240.0
2245
-5.0
2334.0
2351
-17.0
B6
...
10.0
2013-09-30 17:00:00
SYR
Syracuse Hancock Intl
43.111187
-76.106311
421.0
-5.0
A
America/New_York
1007561
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007562
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007563
2013
9
30
2240.0
2250
-10.0
2347.0
7
-20.0
B6
...
10.0
2013-09-30 17:00:00
BUF
Buffalo Niagara Intl
42.940525
-78.732167
724.0
-5.0
A
America/New_York
1007564
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007565
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007566
2013
9
30
2241.0
2246
-5.0
2345.0
1
-16.0
B6
...
10.0
2013-09-30 17:00:00
ROC
Greater Rochester Intl
43.118866
-77.672389
559.0
-5.0
A
America/New_York
1007567
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007568
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
1007569
2013
9
30
2307.0
2255
12.0
2359.0
2358
1.0
B6
...
10.0
2013-09-30 17:00:00
BOS
General Edward Lawrence Logan Intl
42.364347
-71.005181
19.0
-5.0
A
America/New_York
838182 rows × 38 columns
In [64]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]
how_late_is_late = 15.0
pd.options.mode.chained_assignment = None #default='warn'
# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]
# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]
scaler = StandardScaler()
scaled_features_v = scaler.fit_transform(features_v)
features_train, features_test, pred_train, pred_test = train_test_split(
scaled_features_v, pred_v, test_size=0.30, random_state=0)
In [66]:
# 使用logistic回归来执行分类
clf_lr = sklearn.linear_model.LogisticRegression(penalty='l2',
class_weight='balanced')
logistic_fit = clf_lr.fit(features_train, np.where(pred_train >= how_late_is_late, 1, 0))
predictions = clf_lr.predict(features_test)
In [67]:
# summary Report
# Confusion Matrix
cm_lr = confusion_matrix(np.where(pred_test >= how_late_is_late, 1, 0),
predictions)
print("Confusion Matrix")
print(pd.DataFrame(cm_lr))
# 获取精确值
report_lr = precision_recall_fscore_support(
list(np.where(pred_test >= how_late_is_late, 1, 0)),
list(predictions), average='binary')
#打印精度值
print("\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f"
% (report_lr[0], report_lr[1], report_lr[2],
accuracy_score(list(np.where(pred_test >= how_late_is_late, 1, 0)),
list(predictions))))
Confusion Matrix
0 1
0 135181 64481
1 17910 33883
precision = 0.34, recall = 0.65, F1 = 0.45, accuracy = 0.67
In [68]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn import linear_model, cross_validation, cluster
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score, ShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
flights = pd.read_csv('data/nycflights13/flights.csv.gz')
weather = pd.read_csv('data/nycflights13/weather.csv.gz')
airports = pd.read_csv('data/nycflights13/airports.csv.gz')
df_withweather = pd.merge(flights, weather, how='left', on=['year', 'month', 'day', 'hour'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')
df = df.dropna()
In [69]:
pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time',
'distance', 'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
'precip', 'pressure', 'visib']
features_v = df[features]
pred_v = df[pred]
how_late_is_late = 15.0
pd.options.mode.chained_assignment = None #default='warn'
# 因为航空公司不是一个数字,我们把它转化为数字哑变量
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]
# dest也不是一个数字,我们也把它转为数字
features_v['dest'] = pd.factorize(features_v['dest'])[0]
scaler = StandardScaler()
scaled_features_v = scaler.fit_transform(features_v)
features_train, features_test, pred_train, pred_test = train_test_split(
scaled_features_v, pred_v, test_size=0.30, random_state=0)
In [71]:
cluster = sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, verbose=0)
cluster.fit(features_train)
Out[71]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
In [72]:
# 预测测试数据
result = cluster.predict(features_test)
In [73]:
result
Out[73]:
array([7, 0, 7, ..., 7, 5, 0])
In [78]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
reduced_data = PCA(n_components=2).fit_transform(features_train)
kmeans = KMeans(init='k-means++', n_clusters=8, n_init=10)
kmeans.fit(reduced_data)
# mesh的步长
h = .02
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(z, interpolation='nearest',
extend=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired
#aspect='auto'
# origin='lower'
)
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=10)
plt.title('K-Means clustering on the dataset (PCA-reduced data)\n'
'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-78-605c50db6054> in <module>()
20 plt.imshow(z, interpolation='nearest',
21 extend=(xx.min(), xx.max(), yy.min(), yy.max()),
---> 22 cmap=plt.cm.Paired
23 #aspect='auto'
24 # origin='lower'
d:\app\Anaconda3\lib\site-packages\matplotlib\pyplot.py in imshow(X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, shape, filternorm, filterrad, imlim, resample, url, hold, data, **kwargs)
3078 filternorm=filternorm, filterrad=filterrad,
3079 imlim=imlim, resample=resample, url=url, data=data,
-> 3080 **kwargs)
3081 finally:
3082 ax._hold = washold
d:\app\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1708 warnings.warn(msg % (label_namer, func.__name__),
1709 RuntimeWarning, stacklevel=2)
-> 1710 return func(ax, *args, **kwargs)
1711 pre_doc = inner.__doc__
1712 if pre_doc is None:
d:\app\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, shape, filternorm, filterrad, imlim, resample, url, **kwargs)
5190 im = mimage.AxesImage(self, cmap, norm, interpolation, origin, extent,
5191 filternorm=filternorm, filterrad=filterrad,
-> 5192 resample=resample, **kwargs)
5193
5194 im.set_data(X)
d:\app\Anaconda3\lib\site-packages\matplotlib\image.py in __init__(self, ax, cmap, norm, interpolation, origin, extent, filternorm, filterrad, resample, **kwargs)
755 filterrad=filterrad,
756 resample=resample,
--> 757 **kwargs
758 )
759
d:\app\Anaconda3\lib\site-packages\matplotlib\image.py in __init__(self, ax, cmap, norm, interpolation, origin, filternorm, filterrad, resample, **kwargs)
233 self._imcache = None
234
--> 235 self.update(kwargs)
236
237 def __getstate__(self):
d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in update(self, props)
845 try:
846 ret = [_update_property(self, k, v)
--> 847 for k, v in props.items()]
848 finally:
849 self.eventson = store
d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in <listcomp>(.0)
845 try:
846 ret = [_update_property(self, k, v)
--> 847 for k, v in props.items()]
848 finally:
849 self.eventson = store
d:\app\Anaconda3\lib\site-packages\matplotlib\artist.py in _update_property(self, k, v)
838 func = getattr(self, 'set_' + k, None)
839 if not callable(func):
--> 840 raise AttributeError('Unknown property %s' % k)
841 return func(v)
842
AttributeError: Unknown property extend
扩展我们的算法:有时我们需要处理大量数据,并且采样已经无效,这个时候可以通过把数据分到多个机器来处理。
Spark是一个用来并行进行大数据处理的API。它将数据切割到集群来处理。在开发阶段,我们可以只在本地运行。
我们使用PySpark Shell来连接到集群。
运行下面路径的pyspark,会启动PySpark Shell
~/spark/bin/pyspark (Max/Linux)
C:\spark\bin\pyspark (Windows)
此时,可以在Shell中运行文件加载:
lines = sc.textFile("README.md")
lines.first() # 加载第一行
可以在http://localhost:4040
查看PySpark运行的Job
大多数情况下,我们希望能够在Jupyter Notebook中运行PySpark,为此,我们需要设置环境变量:
PYSPARK_PYTHON=python3
PYSPARK_DRIVER_PYTHON="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"
然后运行~/spark/bin/pyspark
,最后一个命令会启动一个jupyter server,样子跟我们用的一样。
In [1]:
lines = sc.text('README.md')
lines.take(5)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-2f627e434df0> in <module>()
----> 1 lines = sc.text('README.md')
2 lines.take(5)
NameError: name 'sc' is not defined
我们看看http://localhost:4040 可以查看运行的Job
In [2]:
linesWithSpark = lines.filter(lambda line: 'spark' in line)
linesWithSpark.count()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-a9d79c569bce> in <module>()
----> 1 linesWithSpark = lines.filter(lambda line: 'spark' in line)
2 linesWithSpark.count()
NameError: name 'lines' is not defined
Spark的基本类型是RDD(resilient distributed dataset),它是基本分布式数据类型。RDD有两类操作,第一个是变换(transformation),返回值仍然是RDD,另外一种是动作(action),用来计算结果。Spark的操作是Lazy的,也就是说只有在执行action时才会真正的开始处理。
In [ ]:
Content source: liulixiang1988/documents
Similar notebooks: