Ejercicios de análisis de datos con pandas

Fuente: https://github.com/PyDataMadrid2016/Conference-Info/tree/master/workshops_materials/20160408_1100_Pandas_for_beginners/tutorial por Kiko Correoso, licencia MIT

En la carpeta de datos tenemos un fichero que se llama model.txt que contiene datos de medidas de viento: velocidad, orientación, temperatura...


In [1]:
!head ../data/model.txt


Lat=0  Lon=0  Hub-Height=987  Timezone=88   ASL-Height(avg. 3km-grid)=0
Provider (www.provider.noway) - Computed at 3km resolution based on GLOBAL data (designed for correlation purposes)
 
YYYYMMDD HHMM  M(m/s) D(deg)  T(C)  De(k/m3) PRE(hPa)      RiNumber  RH(%)
19840101 0000   20.8    243    7.3    1.25    1002.8           0.04   86.3
19840101 0100   20.8    243    7.3    1.25    1002.2           0.04   86.9
19840101 0200   20.6    243    7.4    1.24    1001.8           0.04   87.0
19840101 0300   20.4    244    7.5    1.24    1001.5           0.05   86.2
19840101 0400   20.3    245    7.5    1.24    1001.2           0.05   85.6
19840101 0500   20.4    245    7.6    1.24    1000.9           0.06   85.2

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

from IPython.display import display

In [3]:
model = pd.read_csv(
    "../data/model.txt", delim_whitespace=True, skiprows = 3,
    parse_dates = {'Timestamp': [0, 1]}, index_col = 'Timestamp')

In [4]:
model.head()


Out[4]:
M(m/s) D(deg) T(C) De(k/m3) PRE(hPa) RiNumber RH(%)
Timestamp
1984-01-01 00:00:00 20.8 243 7.3 1.25 1002.8 0.04 86.3
1984-01-01 01:00:00 20.8 243 7.3 1.25 1002.2 0.04 86.9
1984-01-01 02:00:00 20.6 243 7.4 1.24 1001.8 0.04 87.0
1984-01-01 03:00:00 20.4 244 7.5 1.24 1001.5 0.05 86.2
1984-01-01 04:00:00 20.3 245 7.5 1.24 1001.2 0.05 85.6

Ejercicios

Sobre el conjunto de datos model:

  1. Representar la matriz scatter de la velocidad y orientación del viento de los primeros mil registros.
  2. Misma matriz scatter para los 1000 registros con mayor velocidad, ordenados.
  3. Histograma de la velocidad del viento con 36 particiones.
  4. Histórico de la velocidad media, con los datos agrupados por años y meses.
  5. Tabla de velocidades medias en función del año (filas) y del mes (columnas).
  6. Gráfica con los históricos de cada año, agrupados por meses, superpuestos.

Representamos la matriz scatter de la velocidad y orientación del viento de los primeros mil registros:


In [5]:
pd.tools.plotting.scatter_matrix(model.loc[model.index[:1000], 'M(m/s)':'D(deg)'])


Out[5]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8ed38940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8edc7208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8eda34a8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8ecbcef0>]], dtype=object)

Misma matriz scatter para los 1000 registros con mayor velocidad:


In [6]:
pd.tools.plotting.scatter_matrix(
    model.loc[model.sort_values('M(m/s)', ascending=False).index[:1000],
              'M(m/s)':'D(deg)']
)


Out[6]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8ebccc88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8ebad1d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8eccc710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7e8eb2e828>]], dtype=object)

In [7]:
model.loc[:, 'M(m/s)'].plot.hist(bins=np.arange(0, 35))


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e8ea98a90>

In [8]:
model['month'] = model.index.month
model['year'] = model.index.year

Histórico de la velocidad media:


In [9]:
model.groupby(by = ['year', 'month']).mean().head(24)


Out[9]:
M(m/s) D(deg) T(C) De(k/m3) PRE(hPa) RiNumber RH(%)
year month
1984 1 13.821505 233.603495 4.621505 1.247419 993.974328 -2.823320 81.038575
2 10.043534 164.089080 3.522701 1.269368 1007.659339 -29.335790 83.860201
3 8.708199 153.081989 4.138306 1.261425 1003.437231 0.250726 84.740591
4 7.939583 162.113889 6.781111 1.255986 1008.487083 -1.655014 74.165417
5 8.283871 123.590054 8.034677 1.238763 998.939113 -0.384704 87.813306
6 7.330833 231.652778 12.068194 1.228417 1005.009861 11.744944 81.090694
7 7.710618 222.166667 14.508199 1.219906 1006.686962 3.071788 78.730376
8 7.324194 147.930108 16.510484 1.210497 1006.095565 4.732742 79.204704
9 9.335694 231.116667 13.960000 1.210347 997.008889 -6.703986 81.153889
10 10.985215 225.610215 12.109140 1.223790 1001.571505 -9.919731 83.965188
11 10.789306 189.393056 9.877639 1.223389 993.419167 -183.106069 86.480694
12 9.600941 213.807796 6.825000 1.253763 1006.855242 -28.626452 87.389651
1985 1 10.424059 157.752688 1.499059 1.269126 1000.074059 -21.708602 82.745430
2 8.649851 150.980655 1.323512 1.283512 1010.598214 1.706592 82.009226
3 9.097043 190.330645 3.800000 1.259046 1000.441667 -0.927110 87.532258
4 10.895833 217.081944 7.155694 1.242097 998.885694 -5.238778 82.452083
5 9.147984 166.154570 9.715726 1.233145 1001.046371 6.960806 87.404570
6 8.583056 201.036111 11.734444 1.225500 1001.283194 -0.569833 82.841944
7 8.865860 216.310484 15.546640 1.211720 1003.435484 2.259113 80.224194
8 10.379704 224.643817 14.996774 1.210685 1000.788306 0.740094 80.920296
9 8.181250 228.433333 14.561250 1.220333 1007.310000 -2.072278 80.053889
10 8.021640 156.831989 12.262231 1.235470 1011.755645 -126.452137 82.420565
11 10.734167 196.769444 5.608333 1.252042 1000.843611 -18.886847 78.602639
12 12.567339 220.547043 7.274328 1.243387 1000.259409 -5.167849 88.165457

In [10]:
model.groupby(by=['year', 'month']).mean().plot(y='M(m/s)', figsize=(15, 5))


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e8e9fc908>

Media móvil de los datos agrupados por mes y año:


In [11]:
monthly = model.groupby(by=['year', 'month']).mean()
monthly['ma'] = monthly.loc[:, 'M(m/s)'].rolling(5, center=True).mean()
monthly.head()


Out[11]:
M(m/s) D(deg) T(C) De(k/m3) PRE(hPa) RiNumber RH(%) ma
year month
1984 1 13.821505 233.603495 4.621505 1.247419 993.974328 -2.823320 81.038575 NaN
2 10.043534 164.089080 3.522701 1.269368 1007.659339 -29.335790 83.860201 NaN
3 8.708199 153.081989 4.138306 1.261425 1003.437231 0.250726 84.740591 9.759339
4 7.939583 162.113889 6.781111 1.255986 1008.487083 -1.655014 74.165417 8.461204
5 8.283871 123.590054 8.034677 1.238763 998.939113 -0.384704 87.813306 7.994621

In [12]:
monthly.loc[:, ['M(m/s)', 'ma']].plot(figsize=(15, 6))


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e8e918908>

In [13]:
monthly.loc[:, 'M(m/s)'].reset_index().pivot(index='year', columns='month')


Out[13]:
M(m/s)
month 1 2 3 4 5 6 7 8 9 10 11 12
year
1984 13.821505 10.043534 8.708199 7.939583 8.283871 7.330833 7.710618 7.324194 9.335694 10.985215 10.789306 9.600941
1985 10.424059 8.649851 9.097043 10.895833 9.147984 8.583056 8.865860 10.379704 8.181250 8.021640 10.734167 12.567339
1986 13.830972 10.303423 10.832527 9.217222 11.092742 9.570000 6.824866 8.847043 7.099028 10.331855 12.446944 13.712500
1987 8.762231 9.502530 10.386962 9.063056 9.670161 7.983889 7.672312 7.507796 8.731528 10.418011 9.445417 10.355376
1988 12.345430 13.447557 10.783333 8.511806 9.107661 7.594167 10.986022 8.857124 10.319167 9.518683 8.726111 11.036559
1989 9.891935 12.448958 11.776478 9.206944 8.491667 7.417083 7.892876 8.117876 8.210139 11.445430 9.406389 9.345430
1990 13.390726 16.317262 11.582661 9.273889 7.007796 8.217778 8.460618 7.672581 8.973333 12.126210 9.382361 12.120161
1991 11.082124 8.971131 9.425000 10.917083 8.013441 10.012361 8.981317 7.620565 7.437778 9.339382 11.320278 9.228898
1992 9.519624 9.928879 11.797849 9.496667 9.234140 7.170139 7.781048 10.290054 9.600278 10.010215 12.911806 9.820027
1993 13.553629 8.300298 9.620296 9.226944 10.691801 7.086528 8.857527 8.148925 8.027778 9.998925 8.559306 13.214651
1994 12.743817 9.300446 13.058468 10.817361 8.309677 8.898056 7.125806 8.196505 9.523194 9.311290 9.085000 12.423925
1995 14.024866 13.239137 12.197043 9.416944 7.321909 9.166389 8.548118 8.264382 8.653056 9.555376 10.049861 8.207661
1996 10.754839 11.089511 8.685753 8.493611 10.265726 7.620417 8.129167 9.162634 9.287639 10.456048 11.589306 9.658468
1997 8.347715 14.924107 9.867876 8.124444 8.584274 9.646667 7.433871 7.436156 8.018889 9.434005 10.480139 11.958468
1998 13.109677 11.514286 10.942339 9.500417 8.895833 10.409583 9.251882 8.233871 9.785556 13.341532 9.758194 12.096640
1999 12.528763 11.269048 8.439247 9.330000 9.694220 7.580694 8.846640 7.202285 9.258333 10.698253 11.538472 13.665591
2000 11.494892 12.468247 10.172715 9.103472 8.910349 8.529167 7.106048 6.768145 9.485417 12.016129 12.598889 12.427823
2001 10.253495 9.584673 9.664516 10.243333 9.074731 8.480000 7.584005 7.993952 9.700417 12.024194 10.040139 10.264516
2002 11.843817 14.686756 9.946237 9.673611 10.154435 8.813611 8.414382 6.051613 7.703750 10.212903 10.006389 10.368280
2003 11.521909 9.675744 8.608065 9.995833 9.328091 7.710972 8.186559 6.483871 6.286944 10.084005 10.470139 10.409140
2004 12.250134 11.419828 10.138710 7.813889 6.522446 8.699167 7.430780 8.822043 10.807083 11.328091 8.940000 9.402823
2005 13.587500 10.366071 9.831720 8.748194 9.718414 7.874444 7.609946 7.816129 7.758333 10.028763 10.983056 10.680376
2006 9.297984 10.054167 10.873118 9.237639 9.704032 6.799444 7.897984 8.187634 8.796111 10.690323 12.469722 11.947446
2007 14.101075 9.645387 10.546640 8.262500 9.168145 8.840694 9.709005 8.060215 9.180694 7.139113 11.032639 11.233602
2008 14.116129 9.943103 12.370161 9.460139 9.980376 7.843889 8.795296 10.270161 8.654861 10.218548 10.714444 9.353360
2009 10.223387 8.581101 9.415591 8.026250 9.916532 6.409583 9.587366 8.451210 8.748472 9.085215 12.744306 10.260215
2010 8.289919 9.695387 9.692876 7.802083 7.061559 7.421528 7.606586 9.204167 8.591944 10.626747 9.750833 9.293548
2011 10.500941 11.319940 8.006317 8.849028 9.512500 9.312500 8.217742 8.245430 10.351806 10.840323 10.015833 13.737366
2012 11.287500 9.791810 7.621102 9.605000 9.024597 10.365000 8.137097 8.434812 9.145694 9.307258 10.382639 12.034812
2013 9.740323 9.398512 9.320027 10.267222 9.640860 9.429167 6.978495 7.817070 8.170417 11.430376 9.734028 12.709946
2014 12.961156 14.528125 9.064785 8.428750 8.235349 6.188056 8.442473 8.891801 6.275000 10.676882 9.564861 12.112500
2015 12.050134 10.035863 10.561156 8.027083 9.739382 9.018056 9.500134 8.748522 8.111250 NaN NaN NaN

In [14]:
monthly.loc[:, 'M(m/s)'].reset_index().pivot(
    index='year', columns='month'
).T.loc['M(m/s)'].plot(
    figsize=(15, 5), legend=False
)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e8e8d8b70>