Solar power and cloudy skies

In this exercise, we would like to compare the distribution of solar power when the sky is clear and when it is cloudy.

Part 1 - missing data

First of all, we need to deal with missing data in the dataset.

1. Execute the code below to read the data from disk. The DataFrame contains the date and time of the measurements, the solar power in W, and a flag indicating whether the sky is cloudy.
2. Drop all the rows with missing data in the cloudy column, since we will not be able to use them in the analysis.
3. Interpolate the missing data in the column containing the power, since it varies relatively smoothly. (Hint: look at the options for the method keyword of interpolate to find an appropriate interpolation function.)



In [6]:

    
%matplotlib inline

import pandas as pd
# Load the data and plot its columns.
solar = pd.read_csv('solar.csv', index_col=0, parse_dates=True)
solar.plot(subplots=True, figsize=(15, 5));
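
To see what index_col=0 and parse_dates=True do, here is a minimal sketch that reads a few rows from an in-memory string instead of from disk. The rows are copied from the table shown in this notebook; the exact layout of solar.csv (header names, empty fields for missing values) is an assumption.

```python
import io

import pandas as pd

# A few rows in CSV form. The real file layout is assumed, not known:
# empty fields stand in for the missing values.
csv_text = """time,power (W),cloudy
2014-11-01 11:00:00,,
2014-11-01 12:00:00,258056.833333,0
2014-11-01 13:00:00,256733.083333,0
2014-11-01 14:00:00,211958.333333,0
"""

# index_col=0 uses the first column as the index;
# parse_dates=True converts that index to timestamps.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)
# Count the missing entries in each column.
print(sample.isna().sum())
```

A datetime index is what later makes time-aware operations, such as interpolation weighted by elapsed time, possible.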



In [7]:

    
solar.head(20)









    Out[7]:

                         power (W)  cloudy
time
2014-11-01 11:00:00            NaN     NaN
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00            NaN       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 12:00:00   24674.166667     NaN
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00            NaN       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-04 14:00:00  157144.416667     NaN
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 13:00:00  265415.916667     NaN
2014-11-05 14:00:00            NaN       0


In [8]:

    
# 2. Drop all the rows with missing data for the cloudy column.
solar = solar.dropna(subset=['cloudy'])
# Alternative: solar = solar[solar.cloudy.notnull()]
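
The two ways of dropping rows mentioned in the cell above can be compared on a toy table. This is only a sketch: the DataFrame toy and its values are made up for illustration. Note that rows with a missing power value are kept as long as the cloudy flag is present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has no cloudy flag at all,
# one row has a flag but no power reading.
toy = pd.DataFrame(
    {
        "power (W)": [np.nan, 10.0, np.nan, 30.0],
        "cloudy": [np.nan, 0.0, 1.0, np.nan],
    },
    index=pd.to_datetime(
        [
            "2014-11-01 11:00",
            "2014-11-01 12:00",
            "2014-11-01 13:00",
            "2014-11-01 14:00",
        ]
    ),
)

# Drop rows only when the cloudy column is missing.
by_dropna = toy.dropna(subset=["cloudy"])
# The same selection written as a boolean mask.
by_mask = toy[toy["cloudy"].notnull()]

print(by_dropna)
print(by_dropna.equals(by_mask))
```

Calling dropna() without subset would also discard the row whose power is missing, which we want to keep for interpolation.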



In [9]:

    
solar.head(20)









    Out[9]:

                         power (W)  cloudy
time
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00            NaN       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00            NaN       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 14:00:00            NaN       0
2014-11-06 11:00:00  287645.000000       0
2014-11-06 12:00:00            NaN       0
2014-11-06 13:00:00  256735.833333       0
2014-11-07 11:00:00  258159.916667       0


In [3]:

    
solar.plot(subplots=True, figsize=(15, 5));



In [11]:

    
# 3. Interpolate the missing data from the column containing the power.
solar = solar.interpolate(method='time')
solar.plot(subplots=True, figsize=(15, 5));
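
The choice of method='time' matters because the rows are not evenly spaced: in the rows shown in this notebook, consecutive measurements can be one hour or almost a day apart. The sketch below uses three values copied from the table in this notebook and compares method='time' with the default method='linear', which ignores the index and treats the rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Three consecutive rows from this notebook: the missing reading is
# 1 hour after the previous value and 21 hours before the next one.
power = pd.Series(
    [19988.833333, np.nan, 178302.083333],
    index=pd.to_datetime(
        ["2014-11-03 13:00", "2014-11-03 14:00", "2014-11-04 11:00"]
    ),
)

# 'linear' ignores the index: the gap is filled with the midpoint.
by_position = power.interpolate(method="linear")
# 'time' weights by elapsed time: 1/22 of the way to the next value.
by_time = power.interpolate(method="time")

print(by_position.iloc[1])
print(by_time.iloc[1])
```

The time-weighted result is about 27184.89, the value that appears for 2014-11-03 14:00 in the interpolated table of this notebook, while the position-based result is the midpoint, about 99145.46.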



In [12]:

    
solar.head(20)









    Out[12]:

                         power (W)  cloudy
time
2014-11-01 12:00:00  258056.833333       0
2014-11-01 13:00:00  256733.083333       0
2014-11-01 14:00:00  211958.333333       0
2014-11-02 11:00:00  214609.333333       0
2014-11-02 12:00:00  239660.500000       0
2014-11-02 13:00:00  264711.666667       0
2014-11-02 14:00:00  214235.250000       0
2014-11-03 11:00:00   62213.750000       1
2014-11-03 13:00:00   19988.833333       1
2014-11-03 14:00:00   27184.890152       1
2014-11-04 11:00:00  178302.083333       1
2014-11-04 12:00:00  178779.583333       1
2014-11-04 13:00:00  191248.500000       0
2014-11-05 11:00:00  290787.583333       0
2014-11-05 12:00:00  292474.250000       0
2014-11-05 14:00:00  292054.315217       0
2014-11-06 11:00:00  287645.000000       0
2014-11-06 12:00:00  272190.416667       0
2014-11-06 13:00:00  256735.833333       0
2014-11-07 11:00:00  258159.916667       0

Part 2 - cloudy days power

1. Group the data by the cloudy flag.
2. Compute the mean and standard deviation of each group, in two separate commands.
3. Create a new dataframe with a column for the mean and one for the standard deviation in a single command.



In [13]:

    
# 1. Group the data by the cloudy flag.
g = solar.groupby('cloudy')



In [14]:

    
# 2. Compute the mean and standard deviation of each group, in two separate commands.
g.mean()









    Out[14]:

            power (W)
cloudy
0       251829.315135
1       130718.476240



In [15]:

    
g.std()
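
A side note on what "standard deviation" means here: with default arguments, the pandas std methods (including the groupby one) compute the sample standard deviation (ddof=1), whereas NumPy's np.std called directly on an array computes the population standard deviation (ddof=0). A small sketch on made-up numbers shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to make the arithmetic easy to follow.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas default: divide by n - 1 (ddof=1).
sample_std = values.std()
# NumPy default on a plain array: divide by n (ddof=0).
population_std = np.std(values.to_numpy())

print(sample_std, population_std)
```

For groups with many measurements the two values are close, but they are not interchangeable when comparing results across libraries.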









    Out[15]:

           power (W)
cloudy
0       32934.612433
1       56075.938960



In [16]:

    
# 3. Create a new dataframe with a column for the mean and one for the standard deviation in a single command.
# Passing the statistics by name avoids the deprecation warning that recent pandas versions emit for np.mean / np.std.
g.agg(['mean', 'std'])
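
Aggregating with a list of functions returns a DataFrame whose columns have two levels: the original column name and the name of the statistic. The sketch below shows this on a made-up table (toy and its values are hypothetical) and uses string names for the statistics, which agg accepts.

```python
import pandas as pd

# Hypothetical toy data with two clear-sky and three cloudy readings.
toy = pd.DataFrame(
    {
        "power (W)": [100.0, 120.0, 40.0, 60.0, 80.0],
        "cloudy": [0, 0, 1, 1, 1],
    }
)

# One command, two statistics per group.
stats = toy.groupby("cloudy").agg(["mean", "std"])

# Columns are (original column, statistic) pairs.
print(stats.columns.tolist())

# Selecting the top level gives a flat frame with 'mean' and 'std' columns.
power_stats = stats["power (W)"]
print(power_stats)
```

Selecting the top-level column first is a convenient way to get back to a simple table when only one measured quantity is involved, as is the case here.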









    Out[16]:

            power (W)
                 mean           std
cloudy
0       251829.315135  32934.612433
1       130718.476240  56075.938960


