# Combined Cycle Power Plant Regression

This is an analysis of the Combined Cycle Power Plant dataset from the UCI Machine Learning Repository.

``````

In [1]:

import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')
import seaborn as sns

``````
``````

Populating the interactive namespace from numpy and matplotlib

``````
``````

In [2]:

``````
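The data-loading cell did not survive the export. A minimal sketch of how `pp_data` could be loaded, assuming a local copy of the UCI archive's `Folds5x2_pp.xlsx` spreadsheet (the file name and location are assumptions, not taken from the original notebook):

```python
from pathlib import Path

import pandas as pd

# Hypothetical local copy of the UCI CCPP spreadsheet (an assumption);
# the file ships inside CCPP.zip on the UCI archive
DATA_PATH = Path('Folds5x2_pp.xlsx')

if DATA_PATH.exists():
    pp_data = pd.read_excel(DATA_PATH)
else:
    pp_data = None  # download and extract CCPP.zip from the UCI archive first
```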
``````

In [3]:

pp_data.head()

``````
``````

Out[3]:

      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11  39.40  1012.16  92.14  488.56
3  20.86  57.32  1010.24  76.64  446.48
4  10.82  37.50  1009.23  96.62  473.90

``````

## Attribute Information

Features consist of hourly averages of the following ambient variables:

- Temperature (AT) in the range 1.81 °C to 37.11 °C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.
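The five rows shown above can be used to sanity-check these ranges; a quick sketch using `describe()` on that head sample (the full dataset would be needed to reproduce the quoted minima and maxima exactly):

```python
import pandas as pd

# The five rows from Out[3], as a small stand-in for the full pp_data
pp_head = pd.DataFrame({
    'AT': [14.96, 25.18, 5.11, 20.86, 10.82],
    'V': [41.76, 62.96, 39.40, 57.32, 37.50],
    'AP': [1024.07, 1020.04, 1012.16, 1010.24, 1009.23],
    'RH': [73.17, 59.08, 92.14, 76.64, 96.62],
    'PE': [463.26, 444.37, 488.56, 446.48, 473.90],
})

# Per-column min/max; on the full dataset these should match the quoted ranges
print(pp_head.describe().loc[['min', 'max']])
```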

## Univariate Analysis

``````

In [4]:

# One histogram per column
for c in pp_data.columns:
    _ = pylab.figure()
    pp_data.loc[:, c].plot(kind='hist')

``````

## Correlation with the Target Column

``````

In [5]:

feature_data = pp_data.drop('AT', axis=1)
corrs = feature_data.corrwith(pp_data.loc[:, 'AT'])
corrs.sort_values(ascending=False).plot(kind='barh')

``````
``````

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x230a44e3be0>

``````

## Feature Correlations

``````

In [6]:

f_corrs = feature_data.corr()
sns.heatmap(f_corrs, annot=True)

``````
``````

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x230a4699978>

``````

## Bivariate Analysis

``````

In [7]:

fig, axes = pylab.subplots(1, 4, figsize=(16, 8))

for i, c in enumerate(feature_data.columns):
    sns.regplot(x=c, y='AT', data=pp_data, ax=axes[i])

``````
``````

In [8]:

from sklearn.feature_selection import f_regression

``````
``````

In [9]:

f_scores, f_probs = f_regression(feature_data, pp_data.loc[:, 'AT'])
f_imp_df = pd.DataFrame({'scores': f_scores, 'probs': f_probs}, index=feature_data.columns)
f_imp_df.plot(kind='barh', subplots=True)

``````
``````

Out[9]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000230A5834A20>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000230A5EC4588>], dtype=object)

``````
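The univariate F-scores above are a monotone transform of each feature's correlation with the target: for a single centered feature, F = r²·(n − 2)/(1 − r²). A small sketch on synthetic data (the variable names are illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Pearson correlation between feature and target
r = np.corrcoef(x, y)[0, 1]
# F-statistic implied by that correlation
f_manual = r ** 2 * (n - 2) / (1 - r ** 2)

# f_regression computes the same quantity
f_score, _ = f_regression(x.reshape(-1, 1), y)
print(f_manual, f_score[0])
```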

## OLS Regression

``````

In [10]:

import statsmodels.formula.api as sm

model = sm.ols(formula='AT ~ PE + RH + V', data=pp_data)
result = model.fit()
result.summary()

``````
``````

Out[10]:

OLS Regression Results
======================
Dep. Variable:            AT    R-squared:             0.937
Method:        Least Squares    F-statistic:       4.724e+04
Date:       Sat, 05 Aug 2017    Prob (F-statistic):     0.00
Time:               23:00:43    Log-Likelihood:      -19584.
No. Observations:       9568    AIC:               3.918e+04
Df Residuals:           9564    BIC:               3.921e+04
Df Model:                  3
Covariance Type:   nonrobust

               coef   std err         t   P>|t|   [0.025    0.975]
Intercept  178.4983     1.194   149.505   0.000  176.158   180.839
PE          -0.3397     0.002  -144.495   0.000   -0.344    -0.335
RH          -0.1058     0.001   -74.086   0.000   -0.109    -0.103
V            0.0602     0.003    19.677   0.000    0.054     0.066

Omnibus:         619.587    Durbin-Watson:        2.044
Prob(Omnibus):     0.000    Jarque-Bera (JB):  2165.856
Skew:             -0.264    Prob(JB):              0.00
Kurtosis:          5.270    Cond. No.          2.89e+04

``````
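The fitted `result` object can score new observations via `predict`. A minimal sketch on synthetic data (the coefficients below are chosen only to roughly mirror the summary above, and the variable values are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in for pp_data
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'PE': rng.normal(450, 15, n),
    'RH': rng.uniform(30, 100, n),
    'V': rng.uniform(25, 80, n),
})
df['AT'] = (178.5 - 0.34 * df['PE'] - 0.106 * df['RH']
            + 0.06 * df['V'] + rng.normal(0, 1, n))

result = sm.ols('AT ~ PE + RH + V', data=df).fit()

# Predict ambient temperature for a new set of readings
new_obs = pd.DataFrame({'PE': [455.0], 'RH': [70.0], 'V': [50.0]})
print(result.predict(new_obs))
```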

## Cross-Validation

``````

In [11]:

from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

``````
``````

In [20]:

n_splits = 10
fold = KFold(n_splits=n_splits, shuffle=True)

scores = []

for train_idx, test_idx in fold.split(pp_data):
    # KFold yields positional indices, so use iloc rather than loc
    model = sm.ols(formula='AT ~ PE + RH + V', data=pp_data.iloc[train_idx])
    result = model.fit()
    test_features = pp_data.iloc[test_idx].drop('AT', axis=1)

    predictions = result.predict(test_features)
    actual = pp_data['AT'].iloc[test_idx]

    score = r2_score(actual, predictions)
    scores.append(score)

scores = pd.Series(scores)

``````
``````

In [22]:

scores.plot(kind='bar')

``````
``````

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x230a6379128>

``````
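The manual loop above can also be expressed with scikit-learn's `cross_val_score`. A sketch on synthetic data standing in for `pp_data` (the column names match the notebook, but the values are simulated):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for pp_data
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    'PE': rng.normal(450, 15, n),
    'RH': rng.uniform(30, 100, n),
    'V': rng.uniform(25, 80, n),
})
target = (178.5 - 0.34 * features['PE'] - 0.106 * features['RH']
          + 0.06 * features['V'] + rng.normal(0, 1, n))

# One r2 score per fold, as in the manual loop
fold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), features, target,
                         cv=fold, scoring='r2')
print(scores.mean().round(3), scores.std().round(3))
```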