WARNING: This Jupyter notebook comes from a Databricks notebook version. I have not fully tested it here, and the part where Random Forests is applied may raise an error (although I do not think so). The results I discuss below are the ones I obtained when applying the algorithm; re-running it may change them slightly (but not significantly). The PDF and the Databricks version that accompany this notebook do include the final results of running the algorithm.
We start from the following data file (07/2009-03/2017), which comes from the air quality monitor installed at the Palacio de Congresos de Granada.
Out[11]:
| | date | NO2 | CO | PART | O3 | SO2 |
|---|---|---|---|---|---|---|
| 0 | 2009-07-20 00:10:00 | 28 | 1708 | 50 | 74 | 6 |
| 1 | 2009-07-20 00:20:00 | 28 | 1694 | 51 | 76 | 6 |
| 2 | 2009-07-20 00:30:00 | 27 | 1700 | 45 | 77 | 6 |
| 3 | 2009-07-20 00:40:00 | 25 | 1662 | 33 | 79 | 6 |
| 4 | 2009-07-20 00:50:00 | 26 | 1680 | 23 | 76 | 6 |
| 5 | 2009-07-20 01:00:00 | 25 | 1673 | 23 | 78 | 6 |
| 6 | 2009-07-20 01:10:00 | 28 | 1692 | 29 | 66 | 6 |
| 7 | 2009-07-20 01:20:00 | 34 | 1729 | 31 | 48 | 6 |
| 8 | 2009-07-20 01:30:00 | 35 | 1723 | 35 | 53 | 6 |
| 9 | 2009-07-20 01:40:00 | 33 | 1717 | 42 | 58 | 6 |
| 10 | 2009-07-20 01:50:00 | 30 | 1695 | 46 | 63 | 6 |
| 11 | 2009-07-20 02:00:00 | 31 | 1719 | 59 | 57 | 6 |
| 12 | 2009-07-20 02:10:00 | 31 | 1722 | 74 | 54 | 6 |
| 13 | 2009-07-20 02:20:00 | 30 | 1702 | 76 | 56 | 6 |
| 14 | 2009-07-20 02:30:00 | 28 | 1678 | 70 | 61 | 6 |
| 15 | 2009-07-20 02:40:00 | 27 | 1683 | 62 | 60 | 6 |
| 16 | 2009-07-20 02:50:00 | 27 | 1685 | 50 | 58 | 6 |
| 17 | 2009-07-20 03:00:00 | 28 | 1691 | 40 | 55 | 6 |
| 18 | 2009-07-20 03:10:00 | 28 | 1698 | 43 | 50 | 6 |
| 19 | 2009-07-20 03:20:00 | 28 | 1697 | 49 | 50 | 6 |
| 20 | 2009-07-20 03:30:00 | 26 | 1650 | 48 | 57 | 6 |
| 21 | 2009-07-20 03:40:00 | 25 | 1663 | 47 | 60 | 6 |
| 22 | 2009-07-20 03:50:00 | 24 | 1660 | 45 | 57 | 5 |
| 23 | 2009-07-20 04:00:00 | 24 | 1649 | 40 | 56 | 5 |
| 24 | 2009-07-20 04:10:00 | 23 | 1665 | 38 | 58 | 6 |
| 25 | 2009-07-20 04:20:00 | 23 | 1652 | 45 | 54 | 6 |
| 26 | 2009-07-20 04:30:00 | 24 | 1671 | 42 | 51 | 6 |
| 27 | 2009-07-20 04:40:00 | 24 | 1662 | 30 | 50 | 6 |
| 28 | 2009-07-20 04:50:00 | 24 | 1677 | 25 | 47 | 6 |
| 29 | 2009-07-20 05:00:00 | 25 | 1680 | 19 | 46 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 388862 | 2017-03-02 19:10:00 | 84 | 334 | 14 | 31 | 11 |
| 388863 | 2017-03-02 19:20:00 | 85 | 350 | 15 | 24 | 13 |
| 388864 | 2017-03-02 19:30:00 | 87 | 350 | 16 | 24 | 13 |
| 388865 | 2017-03-02 19:40:00 | 74 | 319 | 10 | 45 | 11 |
| 388866 | 2017-03-02 19:50:00 | 52 | 283 | 9 | 57 | 9 |
| 388867 | 2017-03-02 20:00:00 | 53 | 334 | 9 | 53 | 9 |
| 388868 | 2017-03-02 20:10:00 | 70 | 396 | 8 | 34 | 13 |
| 388869 | 2017-03-02 20:20:00 | 71 | 334 | 7 | 37 | 9 |
| 388870 | 2017-03-02 20:30:00 | 90 | 533 | 13 | 12 | 12 |
| 388871 | 2017-03-02 20:40:00 | 88 | 461 | 16 | 12 | 11 |
| 388872 | 2017-03-02 20:50:00 | 85 | 438 | 15 | 18 | 11 |
| 388873 | 2017-03-02 21:00:00 | 74 | 395 | 15 | 16 | 10 |
| 388874 | 2017-03-02 21:10:00 | 89 | 494 | 16 | 8 | 10 |
| 388875 | 2017-03-02 21:20:00 | 86 | 674 | 19 | 9 | 9 |
| 388876 | 2017-03-02 21:30:00 | 87 | 604 | 20 | 11 | 11 |
| 388877 | 2017-03-02 21:40:00 | 78 | 440 | 21 | 14 | 10 |
| 388878 | 2017-03-02 21:50:00 | 73 | 369 | 19 | 22 | 9 |
| 388879 | 2017-03-02 22:00:00 | 63 | 350 | 18 | 27 | 11 |
| 388880 | 2017-03-02 22:10:00 | 51 | 316 | 18 | 36 | 9 |
| 388881 | 2017-03-02 22:20:00 | 55 | 336 | 21 | 27 | 8 |
| 388882 | 2017-03-02 22:30:00 | 60 | 334 | 19 | 30 | 7 |
| 388883 | 2017-03-02 22:40:00 | 40 | 233 | 15 | 45 | 8 |
| 388884 | 2017-03-02 22:50:00 | 35 | 233 | 14 | 48 | 8 |
| 388885 | 2017-03-02 23:00:00 | 35 | 233 | 13 | 45 | 7 |
| 388886 | 2017-03-02 23:10:00 | 44 | 233 | 12 | 42 | 7 |
| 388887 | 2017-03-02 23:20:00 | 37 | 233 | 11 | 48 | 6 |
| 388888 | 2017-03-02 23:30:00 | 32 | 233 | 11 | 52 | 7 |
| 388889 | 2017-03-02 23:40:00 | 30 | 233 | 10 | 54 | 7 |
| 388890 | 2017-03-02 23:50:00 | 27 | 233 | 11 | 55 | 6 |
| 388891 | 2017-03-02 00:00:00 | 25 | 233 | 9 | 58 | 7 |
388892 rows × 6 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388892 entries, 0 to 388891
Data columns (total 7 columns):
NO2 388892 non-null object
CO 388892 non-null object
PART 388892 non-null object
O3 388892 non-null object
SO2 388892 non-null object
WEEKDAY 388892 non-null int64
TIME_minutesofday 388892 non-null int64
dtypes: int64(2), object(5)
memory usage: 20.8+ MB
Our data has close to 390,000 rows. As shown below, the particle concentration column ("PART") has the largest number of empty or invalid rows. Even so, the data loss is not very significant: ~3-4%.
Out[22]:
NO2 True
CO True
PART True
O3 True
SO2 True
WEEKDAY False
TIME_minutesofday False
dtype: bool
Out[23]:
NO2 7156
CO 5280
PART 14772
O3 2824
SO2 4103
WEEKDAY 0
TIME_minutesofday 0
dtype: int64
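The ~3-4% figure can be checked directly from the counts in Out[23] and the 388892 total rows. The following is a minimal sketch in plain Python; the variable names (`missing`, `total_rows`, `pct_missing`, `worst_column`) are illustrative and not taken from the notebook.

```python
# Missing-value counts copied from Out[23]; total rows from the DataFrame summary.
missing = {"NO2": 7156, "CO": 5280, "PART": 14772, "O3": 2824, "SO2": 4103}
total_rows = 388892

# Share of rows without a usable value, per pollutant column, in percent.
pct_missing = {col: 100 * n / total_rows for col, n in missing.items()}

# Column that lost the most values.
worst_column = max(pct_missing, key=pct_missing.get)

for col, pct in pct_missing.items():
    print(f"{col}: {pct:.2f}%")
```

Only "PART" falls in the 3-4% range; the other columns each lose less than 2% of their rows.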
We force numeric values. Where that is not possible, it is because there was no data or the entries were text. Those become np.nan, and further below we fill them with the mean of the column they belong to. As can be seen above, though, the only column really affected is 'PART', which has lost almost 15,000 values; the rest are in fairly good shape.
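The coercion and mean-fill step could look roughly like the sketch below. The notebook's own code cells are not part of this export, so this is an assumption about the approach rather than the author's code: it uses `pd.to_numeric` with `errors="coerce"` and `fillna` on a small hypothetical frame instead of the real sensor file.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sensor data (values read as text).
df = pd.DataFrame({
    "NO2": ["28", "n/a", "27"],
    "PART": ["50", "", "45"],
})
pollutants = ["NO2", "PART"]

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
df[pollutants] = df[pollutants].apply(pd.to_numeric, errors="coerce")

# Count the values lost per column before filling them in.
missing_per_column = df[pollutants].isnull().sum()

# Replace each NaN with the mean of its own column.
df[pollutants] = df[pollutants].fillna(df[pollutants].mean())
```

Mean imputation keeps every row but flattens the variance of the filled column, which matters most for "PART", where the most values are replaced.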
Next we do the following:
Clean the dataset and prepare it so that Random Forests can be applied.
Add columns for the day of the week, the time of day, and whether it is a weekend or not (considering Fri-Sun and only Sat-Sun).
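The time-derived columns can be sketched with the pandas `.dt` accessor. The column names match the output table, but the logic is an assumption: in particular, the convention that 1 marks a weekend day is chosen here for illustration and need not match the flag values in the notebook's output.

```python
import pandas as pd

# Hypothetical sample timestamps; the real notebook uses the "date" column of the file.
df = pd.DataFrame({"date": pd.to_datetime([
    "2009-07-20 00:10:00",  # Monday
    "2009-07-24 12:30:00",  # Friday
    "2009-07-25 23:50:00",  # Saturday
])})

# Day of the week: Monday=0 ... Sunday=6.
df["WEEKDAY"] = df["date"].dt.weekday

# Minutes elapsed since midnight.
df["TIME_minutesofday"] = df["date"].dt.hour * 60 + df["date"].dt.minute

# Assumed convention: 1 = weekend, 0 = working day.
df["WEEKEND_VD"] = (df["WEEKDAY"] >= 4).astype(int)  # Friday to Sunday
df["WEEKEND_SD"] = (df["WEEKDAY"] >= 5).astype(int)  # Saturday and Sunday only
```

Keeping both weekend definitions lets the model pick whichever separates traffic patterns better.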
Out[26]:
| | NO2 | CO | PART | O3 | SO2 | WEEKDAY | TIME_minutesofday | WEEKEND_VD | WEEKEND_SD |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.0 | 1708.0 | 50.0 | 74.0 | 6.0 | 0 | 10 | 1 | 1 |
| 1 | 28.0 | 1694.0 | 51.0 | 76.0 | 6.0 | 0 | 20 | 1 | 1 |
| 2 | 27.0 | 1700.0 | 45.0 | 77.0 | 6.0 | 0 | 30 | 1 | 1 |
| 3 | 25.0 | 1662.0 | 33.0 | 79.0 | 6.0 | 0 | 40 | 1 | 1 |
| 4 | 26.0 | 1680.0 | 23.0 | 76.0 | 6.0 | 0 | 50 | 1 | 1 |
| 5 | 25.0 | 1673.0 | 23.0 | 78.0 | 6.0 | 0 | 60 | 1 | 1 |
| 6 | 28.0 | 1692.0 | 29.0 | 66.0 | 6.0 | 0 | 70 | 1 | 1 |
| 7 | 34.0 | 1729.0 | 31.0 | 48.0 | 6.0 | 0 | 80 | 1 | 1 |
| 8 | 35.0 | 1723.0 | 35.0 | 53.0 | 6.0 | 0 | 90 | 1 | 1 |
| 9 | 33.0 | 1717.0 | 42.0 | 58.0 | 6.0 | 0 | 100 | 1 | 1 |
| 10 | 30.0 | 1695.0 | 46.0 | 63.0 | 6.0 | 0 | 110 | 1 | 1 |
| 11 | 31.0 | 1719.0 | 59.0 | 57.0 | 6.0 | 0 | 120 | 1 | 1 |
| 12 | 31.0 | 1722.0 | 74.0 | 54.0 | 6.0 | 0 | 130 | 1 | 1 |
| 13 | 30.0 | 1702.0 | 76.0 | 56.0 | 6.0 | 0 | 140 | 1 | 1 |
| 14 | 28.0 | 1678.0 | 70.0 | 61.0 | 6.0 | 0 | 150 | 1 | 1 |
| 15 | 27.0 | 1683.0 | 62.0 | 60.0 | 6.0 | 0 | 160 | 1 | 1 |
| 16 | 27.0 | 1685.0 | 50.0 | 58.0 | 6.0 | 0 | 170 | 1 | 1 |
| 17 | 28.0 | 1691.0 | 40.0 | 55.0 | 6.0 | 0 | 180 | 1 | 1 |
| 18 | 28.0 | 1698.0 | 43.0 | 50.0 | 6.0 | 0 | 190 | 1 | 1 |
| 19 | 28.0 | 1697.0 | 49.0 | 50.0 | 6.0 | 0 | 200 | 1 | 1 |
| 20 | 26.0 | 1650.0 | 48.0 | 57.0 | 6.0 | 0 | 210 | 1 | 1 |
| 21 | 25.0 | 1663.0 | 47.0 | 60.0 | 6.0 | 0 | 220 | 1 | 1 |
| 22 | 24.0 | 1660.0 | 45.0 | 57.0 | 5.0 | 0 | 230 | 1 | 1 |
| 23 | 24.0 | 1649.0 | 40.0 | 56.0 | 5.0 | 0 | 240 | 1 | 1 |
| 24 | 23.0 | 1665.0 | 38.0 | 58.0 | 6.0 | 0 | 250 | 1 | 1 |
| 25 | 23.0 | 1652.0 | 45.0 | 54.0 | 6.0 | 0 | 260 | 1 | 1 |
| 26 | 24.0 | 1671.0 | 42.0 | 51.0 | 6.0 | 0 | 270 | 1 | 1 |
| 27 | 24.0 | 1662.0 | 30.0 | 50.0 | 6.0 | 0 | 280 | 1 | 1 |
| 28 | 24.0 | 1677.0 | 25.0 | 47.0 | 6.0 | 0 | 290 | 1 | 1 |
| 29 | 25.0 | 1680.0 | 19.0 | 46.0 | 5.0 | 0 | 300 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 388862 | 84.0 | 334.0 | 14.0 | 31.0 | 11.0 | 3 | 1150 | 0 | 0 |
| 388863 | 85.0 | 350.0 | 15.0 | 24.0 | 13.0 | 3 | 1160 | 0 | 0 |
| 388864 | 87.0 | 350.0 | 16.0 | 24.0 | 13.0 | 3 | 1170 | 0 | 0 |
| 388865 | 74.0 | 319.0 | 10.0 | 45.0 | 11.0 | 3 | 1180 | 0 | 0 |
| 388866 | 52.0 | 283.0 | 9.0 | 57.0 | 9.0 | 3 | 1190 | 0 | 0 |
| 388867 | 53.0 | 334.0 | 9.0 | 53.0 | 9.0 | 3 | 1200 | 0 | 0 |
| 388868 | 70.0 | 396.0 | 8.0 | 34.0 | 13.0 | 3 | 1210 | 0 | 0 |
| 388869 | 71.0 | 334.0 | 7.0 | 37.0 | 9.0 | 3 | 1220 | 0 | 0 |
| 388870 | 90.0 | 533.0 | 13.0 | 12.0 | 12.0 | 3 | 1230 | 0 | 0 |
| 388871 | 88.0 | 461.0 | 16.0 | 12.0 | 11.0 | 3 | 1240 | 0 | 0 |
| 388872 | 85.0 | 438.0 | 15.0 | 18.0 | 11.0 | 3 | 1250 | 0 | 0 |
| 388873 | 74.0 | 395.0 | 15.0 | 16.0 | 10.0 | 3 | 1260 | 0 | 0 |
| 388874 | 89.0 | 494.0 | 16.0 | 8.0 | 10.0 | 3 | 1270 | 0 | 0 |
| 388875 | 86.0 | 674.0 | 19.0 | 9.0 | 9.0 | 3 | 1280 | 0 | 0 |
| 388876 | 87.0 | 604.0 | 20.0 | 11.0 | 11.0 | 3 | 1290 | 0 | 0 |
| 388877 | 78.0 | 440.0 | 21.0 | 14.0 | 10.0 | 3 | 1300 | 0 | 0 |
| 388878 | 73.0 | 369.0 | 19.0 | 22.0 | 9.0 | 3 | 1310 | 0 | 0 |
| 388879 | 63.0 | 350.0 | 18.0 | 27.0 | 11.0 | 3 | 1320 | 0 | 0 |
| 388880 | 51.0 | 316.0 | 18.0 | 36.0 | 9.0 | 3 | 1330 | 0 | 0 |
| 388881 | 55.0 | 336.0 | 21.0 | 27.0 | 8.0 | 3 | 1340 | 0 | 0 |
| 388882 | 60.0 | 334.0 | 19.0 | 30.0 | 7.0 | 3 | 1350 | 0 | 0 |
| 388883 | 40.0 | 233.0 | 15.0 | 45.0 | 8.0 | 3 | 1360 | 0 | 0 |
| 388884 | 35.0 | 233.0 | 14.0 | 48.0 | 8.0 | 3 | 1370 | 0 | 0 |
| 388885 | 35.0 | 233.0 | 13.0 | 45.0 | 7.0 | 3 | 1380 | 0 | 0 |
| 388886 | 44.0 | 233.0 | 12.0 | 42.0 | 7.0 | 3 | 1390 | 0 | 0 |
| 388887 | 37.0 | 233.0 | 11.0 | 48.0 | 6.0 | 3 | 1400 | 0 | 0 |
| 388888 | 32.0 | 233.0 | 11.0 | 52.0 | 7.0 | 3 | 1410 | 0 | 0 |
| 388889 | 30.0 | 233.0 | 10.0 | 54.0 | 7.0 | 3 | 1420 | 0 | 0 |
| 388890 | 27.0 | 233.0 | 11.0 | 55.0 | 6.0 | 3 | 1430 | 0 | 0 |
| 388891 | 25.0 | 233.0 | 9.0 | 58.0 | 7.0 | 3 | 0 | 0 | 0 |
388892 rows × 9 columns
WE NOW ANALYZE WHICH VARIABLES MOST INFLUENCE NO2 LEVELS
The amount of O3 (48%) and the time of day (20%) have the greatest influence on NO2 levels. CO also has some influence. The remaining variables are negligible.
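Percentages like these are typically read from the feature importances of a fitted random forest. The sketch below assumes scikit-learn's `RandomForestRegressor`; the export does not show which library or settings the notebook used, and the data is synthetic, built so that O3 dominates. Its numbers are therefore not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the cleaned dataset (hypothetical values).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "O3": rng.normal(50, 15, n),
    "CO": rng.normal(400, 100, n),
    "TIME_minutesofday": rng.integers(0, 144, n) * 10,
})
# Target constructed to depend mostly on O3, plus a little noise.
y = 80 - 0.9 * X["O3"] + rng.normal(0, 2, n)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Importances sum to 1; multiplied by 100 they read as percentages.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print((100 * importances).round(1))
```

The same pattern applies to each target in turn: drop the pollutant being predicted from the feature set, fit, and inspect the ranking.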
WE NOW ANALYZE WHICH VARIABLES MOST INFLUENCE PARTICLE LEVELS
Five variables appear to matter for particle levels: CO, NO2, time of day, O3, and SO2.
WE NOW ANALYZE WHICH VARIABLES MOST INFLUENCE SO2 LEVELS
Five variables appear to matter for SO2 levels, above all NO2 and time of day, followed by CO, O3, and particles.
WE NOW ANALYZE WHICH VARIABLES MOST INFLUENCE O3 LEVELS
The amount of NO2 (47%) has the greatest influence on O3 levels. Time of day (25%) and the amount of CO (11%) also matter. The remaining parameters are negligible.
The NO2 and O3 analyses, whose models reached an accuracy above 80%, are acceptable. However, the SO2 analysis (~65% accuracy) and especially the particle analysis (45%) are too poor; several iterations with different algorithm parameters would be needed to reach higher accuracy.
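One way to run such iterations is a small grid search over the forest's parameters, scored on held-out data. This is a sketch under assumptions: it uses scikit-learn's `GridSearchCV` on synthetic data, the parameter grid is arbitrary, and the score is R², which may or may not be the metric behind the percentages quoted above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem standing in for the SO2 or PART target.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid; real runs would explore wider ranges.
param_grid = {"n_estimators": [20, 50], "max_depth": [4, None]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
# R^2 of the best model on data it has not seen during the search.
test_score = search.score(X_test, y_test)
print(best_params, round(test_score, 3))
```

If tuning does not help for "PART", the limit may be the features rather than the parameters, since none of the available columns explains particle levels strongly.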