Environment preparation

The first line is the magic command enabling inline matplotlib plots:


In [1]:
%matplotlib inline

Then we have a round of imports:

  • pandas is our main data storage module
  • glob and os are used for filename manipulations
  • dask.dataframe will be tried at the end as a backend for the combined frame

In [2]:
import dask.dataframe as dd
import pandas as pd
import glob
import os

File index and identification


In [3]:
filenames = [ os.path.splitext(wholeFilename)[0] for wholeFilename in 
             [ os.path.basename(wholePath) for wholePath in glob.glob("../input/2*.xlsx") ] ]
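The same stems can be extracted with pathlib, which handles the basename and extension stripping in one step (an equivalent sketch, not part of the original notebook; the `../input` path is the same assumption as above):

```python
from pathlib import Path

# Path.stem drops both the directory and the extension in one step
filenames = [p.stem for p in Path("../input").glob("2*.xlsx")]
```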

In [4]:
dataFiles = pd.DataFrame({"filename": filenames})
dataFiles[["year", "pollutant", "resolution"]] = dataFiles["filename"].str.split('_', n=2, expand=True)
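The split behaviour can be checked on a small sample (a sketch with made-up input, not the real file list): `n=2` caps the number of splits at two, so a well-formed name yields three parts, while a name like `2015_depozycja` leaves the third part empty.

```python
import pandas as pd

sample = pd.Series(["2010_Pb(PM10)_24g", "2015_depozycja"])
parts = sample.str.split("_", n=2, expand=True)
parts.columns = ["year", "pollutant", "resolution"]
# "2015_depozycja" has no third component, so its resolution is missing
print(parts)
```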

In [5]:
dataFiles.head()


Out[5]:
filename year pollutant resolution
0 2010_Pb(PM10)_24g 2010 Pb(PM10) 24g
1 2007_As(PM10)_24g 2007 As(PM10) 24g
2 2005_PM2.5_24g 2005 PM2.5 24g
3 2009_PM2.5_1g 2009 PM2.5 1g
4 2011_NH4+(PM2.5)_24g 2011 NH4+(PM2.5) 24g

In [6]:
dataFiles["year"].value_counts()


Out[6]:
2012    35
2011    35
2014    35
2013    34
2015    29
2010    25
2009    24
2008    24
2007    18
2006    18
2005    17
2004    17
2003    17
2002    14
2001    11
2000     6
Name: year, dtype: int64

In [7]:
dataFiles["pollutant"].value_counts()


Out[7]:
NO2            32
SO2            31
PM10           29
C6H6           28
PM2.5          22
O3             16
NOx            16
Cd(PM10)       15
Ni(PM10)       15
BaP(PM10)      15
Pb(PM10)       14
As(PM10)       14
CO             13
IP(PM10)        8
BjF(PM10)       8
BbF(PM10)       8
BkF(PM10)       8
BaA(PM10)       8
DBahA(PM10)     7
formaldehyd     5
Na+(PM2.5)      4
NH4+(PM2.5)     4
NO3-(PM2.5)     4
Ca2+(PM2.5)     4
OC(PM2.5)       4
Mg2+(PM2.5)     4
Cl              4
EC(PM2.5)       4
SO42            4
K+(PM2.5)       4
PM25            2
Hg(TGM)         2
DBah(PM10)      1
depozycja       1
Jony            1
Name: pollutant, dtype: int64

In [8]:
dataFiles["resolution"].value_counts()


Out[8]:
24g            236
1g             113
(PM2.5)-24g      4
(PM2.5)_24g      4
w_PM25_24g       1
Name: resolution, dtype: int64

Fixing the data file identification


In [9]:
dataFiles.loc[dataFiles["resolution"] == "(PM2.5)-24g", 'pollutant'] = "Cl_(PM2.5)"
dataFiles.loc[dataFiles["resolution"] == "(PM2.5)-24g", 'resolution'] = "24g"

In [10]:
dataFiles.loc[dataFiles["resolution"] == "(PM2.5)_24g", 'pollutant'] = "SO42_(PM2.5)"
dataFiles.loc[dataFiles["resolution"] == "(PM2.5)_24g", 'resolution'] = "24g"

In [11]:
dataFiles.loc[dataFiles["resolution"] == "w_PM25_24g", 'pollutant'] = "Jony_w_PM25"
dataFiles.loc[dataFiles["resolution"] == "w_PM25_24g", 'resolution'] = "24g"
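The three repairs above follow one pattern (a malformed resolution string implies both the intended pollutant name and the `24g` resolution), so they could equally be driven by a small mapping. A sketch on a hypothetical frame that mimics just the broken rows:

```python
import pandas as pd

# hypothetical frame containing only the three malformed rows
dataFiles = pd.DataFrame({
    "pollutant":  ["Cl", "SO42", "Jony"],
    "resolution": ["(PM2.5)-24g", "(PM2.5)_24g", "w_PM25_24g"],
})

# each malformed resolution maps to the intended pollutant name
fixes = {
    "(PM2.5)-24g": "Cl_(PM2.5)",
    "(PM2.5)_24g": "SO42_(PM2.5)",
    "w_PM25_24g":  "Jony_w_PM25",
}
for bad, pollutant in fixes.items():
    mask = dataFiles["resolution"] == bad
    dataFiles.loc[mask, "pollutant"] = pollutant
    dataFiles.loc[mask, "resolution"] = "24g"
```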

Now the resolution column should be correct:


In [12]:
dataFiles["resolution"].value_counts()


Out[12]:
24g    245
1g     113
Name: resolution, dtype: int64

Let's also fix a wrongly labelled pollutant:


In [13]:
dataFiles.loc[dataFiles["pollutant"] == "PM25", 'pollutant'] = "PM2.5"

In [14]:
dataFiles.describe()


Out[14]:
filename year pollutant resolution
count 359 359 359 358
unique 359 16 34 2
top 2012_NO2_24g 2012 NO2 24g
freq 1 35 32 245

There is still one empty cell in the resolution column. Let's identify it:


In [15]:
dataFiles[dataFiles["resolution"].isnull()]


Out[15]:
filename year pollutant resolution
312 2015_depozycja 2015 depozycja NaN

After manually examining the 2015_depozycja file I found that it contains a new type of data, which will be useless in the planned analysis. I decided to remove it from working memory.


In [16]:
dataFiles.drop(dataFiles[dataFiles["filename"] == "2015_depozycja"].index, inplace=True)

In [17]:
dataFiles.describe()


Out[17]:
filename year pollutant resolution
count 358 358 358 358
unique 358 16 33 2
top 2012_NO2_24g 2012 NO2 24g
freq 1 35 32 245

Looking for the worst measuring station for each pollutant in 2015


In [18]:
importantPollutants = ["PM10", "PM2.5", "O3", "NO2", "SO2", "C6H6", "CO"]
pollutants2015 = dataFiles[(dataFiles["year"] == "2015") & (dataFiles["resolution"] == "1g") & 
                           (dataFiles["pollutant"].isin(importantPollutants))]

In [19]:
pollutants2015


Out[19]:
filename year pollutant resolution
14 2015_NO2_1g 2015 NO2 1g
106 2015_CO_1g 2015 CO 1g
138 2015_SO2_1g 2015 SO2 1g
141 2015_O3_1g 2015 O3 1g
207 2015_C6H6_1g 2015 C6H6 1g
310 2015_PM10_1g 2015 PM10 1g
326 2015_PM25_1g 2015 PM2.5 1g

In [20]:
from tqdm import tqdm

In [21]:
from collections import Counter

In [22]:
# Commented out in this run. For each pollutant it would read the file,
# index it by hour ("Godzina" means hour), and record the station with the
# single highest reading.
#worstStation = {}
#for index, dataRow in tqdm(pollutants2015.iterrows(), total=len(pollutants2015.index)):
#    dataFromFile = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])
#    dataFromFile = dataFromFile.rename(columns={"Kod stacji":"Godzina"})
#    dataFromFile = dataFromFile.set_index("Godzina")
#    worstStation[dataRow["pollutant"]] = dataFromFile.max().sort_values(ascending = False).index[0]
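The disabled cell's core logic reduces to: take the column-wise maximum per station, then pick the station with the overall peak. A sketch of the same reduction on hypothetical synthetic readings, without the Excel reads:

```python
import pandas as pd

# hypothetical hourly readings; columns are station codes
readings = pd.DataFrame({
    "StationA": [10.0, 22.0, 15.0],
    "StationB": [40.0, 35.0, 38.0],
    "StationC": [5.0, 80.0, 7.0],
})

# highest single reading per station, then the station with the overall peak
worst = readings.max().sort_values(ascending=False).index[0]
print(worst)  # → StationC
```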

Building one big data frame


In [23]:
pollutants = importantPollutants
years = sorted(list(dataFiles["year"].unique()))

In [ ]:
pollutantsYears = dataFiles[(dataFiles["year"].isin(years)) & (dataFiles["resolution"] == "1g") & 
                           (dataFiles["pollutant"].isin(pollutants))]

In [ ]:
bigDataFrame = pd.DataFrame()
for dataYear in years: 
    print(dataYear)
    yearDataFrame = pd.DataFrame()
    yearFiles = pollutantsYears[pollutantsYears["year"] == dataYear]
    for index, dataRow in tqdm(yearFiles.iterrows(), total=len(yearFiles.index)):
        data = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])
        # the first column is headed "Kod stacji" (station code) but actually holds timestamps
        data = data.rename(columns={"Kod stacji":"Hour"})

        year = int(dataRow["year"])
        rng = pd.date_range(start = str(year) + '-01-01 01:00:00', end = str(year+1) + '-01-01 00:00:00', freq='H')

        # workaround for 2006_PM2.5_1g, 2012_PM10_1g, 2012_O3_1g
        try:
            data["Hour"] = rng
        except ValueError:
            print("File {} has some mess with timestamps".format(dataRow["filename"]))
            continue

        data = data.set_index("Hour")
        data = data.stack()
        data = pd.DataFrame(data, columns=[dataRow["pollutant"]])
        data.index.set_names(['Hour', 'Station'], inplace=True)

        yearDataFrame = pd.concat([yearDataFrame, data], axis=1)
    
    bigDataFrame = pd.concat([bigDataFrame, yearDataFrame])


  0%|          | 0/2 [00:00<?, ?it/s]
2000
100%|██████████| 2/2 [00:08<00:00,  3.86s/it]
  0%|          | 0/3 [00:00<?, ?it/s]
2001
100%|██████████| 3/3 [00:12<00:00,  3.84s/it]
  0%|          | 0/3 [00:00<?, ?it/s]
2002
100%|██████████| 3/3 [00:13<00:00,  4.16s/it]
  0%|          | 0/6 [00:00<?, ?it/s]
2003
100%|██████████| 6/6 [00:57<00:00, 10.64s/it]
  0%|          | 0/6 [00:00<?, ?it/s]
2004
100%|██████████| 6/6 [01:24<00:00, 14.12s/it]
  0%|          | 0/6 [00:00<?, ?it/s]
2005
100%|██████████| 6/6 [01:52<00:00, 20.58s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2006
 14%|█▍        | 1/7 [00:00<00:04,  1.46it/s]
File 2006_PM2.5_1g has some mess with timestamps
100%|██████████| 7/7 [02:05<00:00, 17.49s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2007
100%|██████████| 7/7 [02:34<00:00, 22.75s/it]
2008
100%|██████████| 7/7 [02:35<00:00, 19.62s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2009
100%|██████████| 7/7 [02:21<00:00, 21.75s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2010
100%|██████████| 7/7 [02:25<00:00, 20.74s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2011
100%|██████████| 7/7 [02:24<00:00, 21.05s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2012
 57%|█████▋    | 4/7 [01:10<00:52, 17.36s/it]
File 2012_PM10_1g has some mess with timestamps
100%|██████████| 7/7 [02:11<00:00, 19.16s/it]
File 2012_O3_1g has some mess with timestamps
  0%|          | 0/7 [00:00<?, ?it/s]
2013
100%|██████████| 7/7 [02:34<00:00, 24.34s/it]
100%|██████████| 7/7 [04:18<00:00, 31.53s/it]
  0%|          | 0/7 [00:00<?, ?it/s]
2015
100%|██████████| 7/7 [02:54<00:00, 23.20s/it]
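The reshaping inside the loop (one wide per-file frame into a long frame indexed by hour and station) can be reproduced on a tiny synthetic frame. A sketch under the same column layout, not the real data:

```python
import pandas as pd

# hypothetical wide frame: one row per hour, one column per station
rng = pd.date_range("2015-01-01 01:00:00", periods=3, freq="h")
wide = pd.DataFrame({"StationA": [1.0, 2.0, None],
                     "StationB": [4.0, None, 6.0]}, index=rng)
wide.index.name = "Hour"

# stack moves station columns into an index level; dropna removes missing readings
long = wide.stack().dropna()
long = long.to_frame("NO2")          # name the value column after the pollutant
long.index.names = ["Hour", "Station"]
print(long)
```

Concatenating such frames column-wise per year, then row-wise across years, yields the (Hour, Station) MultiIndex frame shown below.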

In [ ]:
bigDataFrame.to_pickle("../output/bigDataFrame.pkl")

In [ ]:
bigDataFrame


Out[ ]:
C6H6 CO NO2 O3 PM10 PM2.5 SO2
Hour Station
2000-01-01 01:00:00 Pm.a01a NaN NaN 25.000000 NaN NaN NaN NaN
Pm.a03a NaN NaN NaN 38.000000 NaN NaN NaN
2000-01-01 02:00:00 DsCzer02 NaN NaN 6.000000 35.000000 NaN NaN NaN
DsJelw05 NaN NaN 14.000000 29.000000 NaN NaN NaN
DsSniezka NaN NaN NaN 61.000000 NaN NaN NaN
LbJarczew NaN NaN NaN 25.000000 NaN NaN NaN
MpKrakowWIOSAKra6117 NaN NaN 62.000000 NaN NaN NaN NaN
MpKrakowWIOSBulw6118 NaN NaN 48.000000 NaN NaN NaN NaN
MpKrakowWIOSPrad6115 NaN NaN 38.000000 NaN NaN NaN NaN
MzWarPodIMGW NaN NaN NaN 9.000000 NaN NaN NaN
MzWarszKrucza NaN NaN NaN 12.000000 NaN NaN NaN
Pm.a01a NaN NaN 24.000000 NaN NaN NaN NaN
Pm.a02a NaN NaN 27.000000 NaN NaN NaN NaN
Pm.a03a NaN NaN 13.000000 39.000000 NaN NaN NaN
Pm.a04a NaN NaN 11.000000 NaN NaN NaN NaN
Pm.a06a NaN NaN 13.000000 NaN NaN NaN NaN
Pm.a07a NaN NaN 7.000000 NaN NaN NaN NaN
Pm.a08a NaN NaN 11.000000 NaN NaN NaN NaN
Pm.a09a NaN NaN 10.000000 39.000000 NaN NaN NaN
Pm08LEBAiEMEPa NaN NaN NaN 33.000000 NaN NaN NaN
SlBytomByto_modrz NaN NaN 55.000000 NaN NaN NaN NaN
SlKatowKato_kossu NaN NaN NaN 11.000000 NaN NaN NaN
SlSosnoSosn_narut NaN NaN 55.000000 NaN NaN NaN NaN
SlUstroCies_sanat NaN NaN NaN 19.000000 NaN NaN NaN
WmPuszcz_IOS_Borecka NaN NaN NaN 40.000000 NaN NaN NaN
2000-01-01 03:00:00 DsCzer02 NaN NaN 3.000000 42.000000 NaN NaN NaN
DsJelw05 NaN NaN 12.000000 34.000000 NaN NaN NaN
DsSniezka NaN NaN NaN 62.000000 NaN NaN NaN
LbJarczew NaN NaN NaN 20.000000 NaN NaN NaN
MpKrakowWIOSAKra6117 NaN NaN 56.000000 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
2016-01-01 00:00:00 SlKatoKossut NaN NaN 48.290000 5.449680 157.855000 133.811000 26.860100
SlKatoPlebA4 NaN 1.295760 63.356200 NaN NaN NaN 36.425100
SlRybniBorki 10.985600 0.837930 20.880100 38.192600 72.472000 NaN 32.779800
SlSosnoLubel NaN NaN 51.530000 NaN 197.927000 NaN 57.767100
SlTychyTolst NaN NaN 33.202800 NaN 317.277000 NaN 60.939800
SlUstronSana NaN NaN 2.934920 65.855900 10.680200 NaN 7.359760
SlWodzGalczy NaN 2.095970 45.876400 18.443900 223.011000 NaN 80.302100
SlZabSkloCur NaN 1.491160 39.449300 18.626100 187.469000 NaN 59.658100
SlZlotPotLes NaN NaN 8.782320 5.976600 53.907800 51.620700 8.746380
SlZorySikors NaN 1.438560 NaN NaN NaN NaN 68.223300
SlZywieKoper NaN NaN 20.399000 NaN 130.338000 NaN 44.293300
WmElbBazynsk 1.679240 0.290620 6.775208 46.842045 35.258751 NaN 4.430197
WmGoldJacwie NaN 0.381811 10.383073 38.974819 50.992085 NaN 4.488664
WmMragParkow NaN NaN NaN 43.796345 42.431252 NaN NaN
WmOlsPuszkin 0.658847 0.462765 11.895022 38.350243 47.043335 40.265392 2.864131
WmOstrChrobr NaN 0.554376 10.615822 35.995342 43.242916 NaN 2.747365
WmPuszczaBor NaN NaN NaN 36.000000 NaN NaN NaN
WpBoroDrapal NaN NaN 18.765600 17.502500 64.171900 NaN 9.077800
WpKaliSawick NaN 0.644600 19.310900 NaN 95.235000 87.123600 101.044000
WpKoniWyszyn NaN 0.550630 14.855800 33.285800 54.186900 NaN 15.420500
WpPiaskiKrzy NaN NaN 14.105100 31.850400 NaN NaN 15.532500
WpPilaKusoci NaN 0.751580 21.919400 NaN 112.891000 NaN 26.197800
WpPoznDabrow 4.455180 0.722890 30.756600 15.803900 87.058900 NaN 15.778800
WpPoznPolank NaN 0.823150 26.569600 NaN 68.198500 55.935100 16.245400
ZpKoszArKraj NaN NaN 16.816300 NaN 49.697500 NaN 12.236100
ZpSzczAndr01 NaN NaN 10.232900 31.618800 83.180500 76.853900 8.514580
ZpSzczLacz04 NaN NaN NaN NaN 46.621600 NaN NaN
ZpSzczPils02 1.969150 0.480210 28.611000 NaN NaN 29.842200 9.950620
ZpSzczecPrze NaN NaN 29.005200 NaN 61.945100 NaN 19.032500
ZpWiduBulRyb NaN NaN 7.560370 39.732600 NaN NaN 4.129960

16550991 rows × 7 columns


In [ ]:
daskBigDataFrame = dd.from_pandas(bigDataFrame)


---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-28-128c0d7d8d46> in <module>()
----> 1 daskBigDataFrame = dd.from_pandas(bigDataFrame)

/home/damian/.virtualenvs/kaggle/local/lib/python2.7/site-packages/dask/dataframe/io/io.pyc in from_pandas(data, npartitions, chunksize, sort, name)
    165     """
    166     if isinstance(getattr(data, 'index', None), pd.MultiIndex):
--> 167         raise NotImplementedError("Dask does not support MultiIndex Dataframes.")
    168 
    169     if not isinstance(data, (pd.Series, pd.DataFrame)):

NotImplementedError: Dask does not support MultiIndex Dataframes.