Copyright 2015 Enthought, Inc. All Rights Reserved

Solar power and cloudy skies

In this exercise, we would like to compare the distribution of solar power when the sky is clear and when it is cloudy.

Part 1 - missing data

First of all, we need to deal with missing data in the dataset.

  1. Execute the code below to read the data from disk. The DataFrame contains the date and time of the measurements, the solar power in W, and a flag indicating if the sky is cloudy.
  2. Drop all the rows with missing data for the cloudy column, since we will not be able to use them in the analysis.
  3. Interpolate the missing data in the power column, since the power varies relatively smoothly. (Hint: look at the options for the method keyword of DataFrame.interpolate to find an appropriate interpolation function.)
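
Before cleaning, it can help to count how many values are missing in each column. The sketch below does this on a small hand-built DataFrame; the name `sample` and its rounded values are assumptions for illustration, and in the exercise the same call would be made on `solar`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: four hourly measurements,
# where the first row lacks both the power value and the cloudy flag.
index = pd.to_datetime([
    '2014-11-01 11:00', '2014-11-01 12:00',
    '2014-11-01 13:00', '2014-11-01 14:00',
])
sample = pd.DataFrame(
    {'power (W)': [np.nan, 258056.8, 256733.1, 211958.3],
     'cloudy': [np.nan, 0, 0, 0]},
    index=index,
)

# isnull() marks each missing cell; summing the booleans counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```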

In [6]:
%matplotlib inline

import pandas as pd
# Load the data and plot its columns.
solar = pd.read_csv('solar.csv', index_col=0, parse_dates=True)
solar.plot(subplots=True, figsize=(15, 5));



In [7]:
solar.head(20)


Out[7]:
                         power (W)  cloudy
time
2014-11-01 11:00:00            NaN     NaN
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00            NaN       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 12:00:00   24674.166667     NaN
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00            NaN       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-04 14:00:00  157144.416667     NaN
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 13:00:00  265415.916667     NaN
2014-11-05 14:00:00            NaN       0

In [8]:
# 2. Drop all the rows with missing data for the cloudy column.
solar = solar.dropna(subset=['cloudy'])
# Alternative: solar = solar[solar.cloudy.notnull()]
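
The two approaches in the cell above should select the same rows. A minimal sketch on a hypothetical three-row frame (the name `sample` and its values are made up for illustration) shows that only rows missing the cloudy flag are removed, while a row missing only the power value is kept for the interpolation step:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: the second row lacks the cloudy flag,
# the third row lacks the power value.
sample = pd.DataFrame({
    'power (W)': [62213.75, 24674.17, np.nan],
    'cloudy': [1, np.nan, 1],
})

# Only rows with a missing 'cloudy' value are dropped;
# the row with missing power is kept for later interpolation.
by_dropna = sample.dropna(subset=['cloudy'])
by_mask = sample[sample['cloudy'].notnull()]

print(by_dropna)
print(by_dropna.equals(by_mask))
```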

In [9]:
solar.head(20)


Out[9]:
                         power (W)  cloudy
time
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00            NaN       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00            NaN       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 14:00:00            NaN       0
2014-11-06 11:00:00  287645.000000       0
2014-11-06 12:00:00            NaN       0
2014-11-06 13:00:00  256735.833333       0
2014-11-07 11:00:00  258159.916667       0

In [3]:
solar.plot(subplots=True, figsize=(15, 5));



In [11]:
# 3. Interpolate the missing data from the column containing the power.
solar = solar.interpolate(method='time')
solar.plot(subplots=True, figsize=(15, 5));



In [12]:
solar.head(20)


Out[12]:
                         power (W)  cloudy
time
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00  239660.500000       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00   27184.890152       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 14:00:00  292054.315217       0
2014-11-06 11:00:00  287645.000000       0
2014-11-06 12:00:00  272190.416667       0
2014-11-06 13:00:00  256735.833333       0
2014-11-07 11:00:00  258159.916667       0
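
The choice of method matters here because the measurements are not evenly spaced: there is an overnight gap between each day's last and next day's first reading. As a rough sketch, the snippet below takes the three rows around the gap at 2014-11-03 14:00 from the table above and compares method='linear' with method='time' (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Three consecutive rows from the cleaned data: the 14:00 value is missing,
# and the next measurement is 21 hours later, on the following day.
power = pd.Series(
    [19988.833333, np.nan, 178302.083333],
    index=pd.to_datetime([
        '2014-11-03 13:00', '2014-11-03 14:00', '2014-11-04 11:00',
    ]),
)

# 'linear' ignores the index and treats the rows as equally spaced,
# so the gap is filled with the midpoint of its neighbours.
linear = power.interpolate(method='linear')

# 'time' weights by elapsed time: 14:00 is 1 hour into a 22-hour gap.
by_time = power.interpolate(method='time')

print(linear.iloc[1])
print(by_time.iloc[1])
```

The time-based result stays close to the 13:00 reading, which agrees with the interpolated value shown for 2014-11-03 14:00, whereas the linear result lands halfway between a cloudy afternoon and the next morning.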

Part 2 - cloudy days power

  1. Group the data by the cloudy flag.
  2. Compute the mean and standard deviation of each group, in two separate commands.
  3. Create a new dataframe with a column for the mean and one for the standard deviation in a single command.

In [13]:
# 1. Group the data by the cloudy flag.
g = solar.groupby('cloudy')

In [14]:
# 2. Compute the mean and standard deviation of each group, in two separate commands.
g.mean()


Out[14]:
            power (W)
cloudy
0       251829.315135
1       130718.476240

In [15]:
g.std()


Out[15]:
           power (W)
cloudy
0       32934.612433
1       56075.938960

In [16]:
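# 3. Create a new dataframe with a column for the mean and one for the standard deviation in a single command.
# Passing the aggregation names as strings avoids the deprecation warning that
# recent pandas versions emit for NumPy callables such as np.mean and np.std.
g.agg(['mean', 'std'])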


Out[16]:
            power (W)
                 mean           std
cloudy
0       251829.315135  32934.612433
1       130718.476240  56075.938960
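
The aggregated result has two levels of column labels: the original column name on top and the statistic below it. The following sketch, on hypothetical toy data (the names `sample` and `stats` and all values are assumptions for illustration), shows one way to pull individual values or a flat sub-table out of such a result:

```python
import pandas as pd

# Hypothetical toy data: three clear-sky and three cloudy measurements.
sample = pd.DataFrame({
    'power (W)': [250000.0, 260000.0, 240000.0, 60000.0, 20000.0, 130000.0],
    'cloudy': [0, 0, 0, 1, 1, 1],
})

stats = sample.groupby('cloudy').agg(['mean', 'std'])

# The columns form a two-level index: (original column, statistic).
print(stats.columns.tolist())

# Select one statistic with a tuple, or a whole sub-table with the top-level name.
clear_mean = stats.loc[0, ('power (W)', 'mean')]
power_stats = stats['power (W)']
print(clear_mean)
print(power_stats)
```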
