Analysis ZR-Data

Load the libraries


In [1]:
import pandas as pd
import numpy as np
import pylab as plt
import statsmodels.formula.api as sm
%matplotlib inline
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', 999)

Load the data


In [2]:
ZR=pd.ExcelFile('ZR_Daten_DDR_1976_1990.xlsx')
df=ZR.parse("Tabelle1")

Describe the data


In [3]:
df=df[(df['ERTRAG (dt/ha)']<600) & (df['ERTRAG (dt/ha)']>100)]

In [4]:
print df.describe()


       ERNTEJAHR  Mittlere Ackerzahl  TAG_AUSSAAT  TAG_AUFGANG  \
count   4474.000            4474.000     4474.000     4474.000   
mean    1982.478              50.162      111.010      128.035   
std        3.993              14.513       11.116        9.731   
min     1976.000              24.000       75.000       94.000   
25%     1979.000              36.000      104.000      122.000   
50%     1982.000              48.000      110.000      128.000   
75%     1986.000              56.000      118.000      134.000   
max     1989.000              80.000      159.000      177.000   

       Tage Aussaat bis Aufgang  N_DUENGUNG  TAG_ERNTEABSCHLUSS  \
count                  4474.000    4474.000            4474.000   
mean                     17.025     175.150             303.910   
std                       6.008      39.984              16.133   
min                       1.000      33.000             254.000   
25%                      13.000     152.000             293.000   
50%                      16.000     176.000             303.000   
75%                      21.000     200.000             314.000   
max                      47.000     437.000             357.000   

       ERTRAG (dt/ha)      LT_1      LT_2      LT_3      LT_4      LT_5  \
count        4474.000  4474.000  4474.000  4474.000  4474.000  4474.000   
mean          300.750    -0.897    -0.738     3.454     7.068    12.581   
std            79.588     3.453     2.827     2.239     1.170     1.562   
min           101.000    -8.100    -8.700    -2.500     4.000     8.600   
25%           248.000    -3.900    -2.100     2.200     6.300    11.500   
50%           304.950    -0.250    -0.600     3.200     7.000    12.700   
75%           352.000     1.500     1.100     5.200     7.800    13.800   
max           599.000     5.000     4.800     7.700    10.000    15.700   

           LT_6      LT_7      LT_8      LT_9     LT_10     LT_11     LT_12  \
count  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000   
mean     15.556    17.209    16.896    13.563     9.489     4.286     1.385   
std       1.238     1.644     1.067     1.402     1.006     1.543     1.928   
min      11.700    13.700    13.400    10.400     6.400    -0.800    -3.900   
25%      14.800    15.900    16.200    12.700     8.900     3.300     0.100   
50%      15.700    17.300    16.800    13.700     9.500     4.300     1.800   
75%      16.400    18.300    17.700    14.300    10.100     5.400     2.800   
max      19.300    21.200    19.400    17.300    11.700     7.400     4.600   

        LT_4-10      NI_1      NI_2      NI_3      NI_4      NI_5      NI_6  \
count  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000   
mean     13.194    44.088    27.492    41.959    42.342    49.789    62.011   
std       0.867    25.396    16.130    23.333    26.475    30.085    31.956   
min      10.543     4.800     2.400     5.300     1.000     3.200     8.600   
25%      12.586    26.300    14.800    24.300    20.300    28.900    40.600   
50%      13.114    38.400    25.800    39.200    39.900    43.800    57.200   
75%      13.757    56.500    35.100    54.400    61.700    66.300    79.000   
max      15.043   179.500    92.300   117.100   157.000   171.700   248.000   

           NI_7      NI_8      NI_9     NI_10     NI_11     NI_12   NI_4-10  
count  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000  4474.000  
mean     58.684    55.530    45.897    35.755    42.952    50.090    50.001  
std      31.761    31.382    25.088    21.885    20.500    23.467    13.914  
min       1.000     5.500     4.900     3.200     4.900     7.700    19.000  
25%      36.000    32.600    28.500    16.700    29.975    33.400    39.371  
50%      53.100    49.900    42.600    34.100    42.200    44.000    49.986  
75%      77.600    69.600    60.000    47.600    53.700    64.200    59.086  
max     168.500   198.600   127.400   122.500   152.700   150.400    93.871  

[8 rows x 34 columns]

Corrcoef between yield and the rest


In [5]:
corr=df.corr()
print corr['ERTRAG (dt/ha)']


ERNTEJAHR                   0.144
Mittlere Ackerzahl          0.135
TAG_AUSSAAT                -0.140
TAG_AUFGANG                -0.164
Tage Aussaat bis Aufgang   -0.006
N_DUENGUNG                  0.022
TAG_ERNTEABSCHLUSS          0.096
ERTRAG (dt/ha)              1.000
LT_1                       -0.163
LT_2                       -0.106
LT_3                        0.050
LT_4                        0.058
LT_5                       -0.023
LT_6                       -0.185
LT_7                       -0.136
LT_8                       -0.137
LT_9                       -0.110
LT_10                       0.106
LT_11                       0.008
LT_12                       0.043
LT_4-10                    -0.101
NI_1                       -0.137
NI_2                        0.038
NI_3                        0.003
NI_4                        0.078
NI_5                        0.090
NI_6                        0.107
NI_7                        0.120
NI_8                        0.242
NI_9                        0.189
NI_10                       0.012
NI_11                       0.032
NI_12                       0.070
NI_4-10                     0.252
Name: ERTRAG (dt/ha), dtype: float64

Regression for Erntejahr


In [6]:
data=pd.DataFrame({'Y':df['ERTRAG (dt/ha)'],'X':df['ERNTEJAHR']})

In [7]:
model= sm.ols(formula="Y ~ X", data=data).fit()

In [8]:
print model.params


Intercept   -5383.564
X               2.867
dtype: float64

In [9]:
print model.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     94.50
Date:                Thu, 03 Mar 2016   Prob (F-statistic):           4.05e-22
Time:                        10:11:59   Log-Likelihood:                -25883.
No. Observations:                4474   AIC:                         5.177e+04
Df Residuals:                    4472   BIC:                         5.178e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept  -5383.5639    584.731     -9.207      0.000     -6529.927 -4237.201
X              2.8673      0.295      9.721      0.000         2.289     3.446
==============================================================================
Omnibus:                       10.652   Durbin-Watson:                   1.232
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               12.055
Skew:                           0.060   Prob(JB):                      0.00241
Kurtosis:                       3.224   Cond. No.                     9.84e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.84e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

In [10]:
df['ERTRAG (dt/ha)']-=(2.867*(df['ERNTEJAHR']-1976))

In [11]:
df['ERTRAG (dt/ha)'].describe()


Out[11]:
count    4474.000
mean      282.178
std        78.760
min        66.596
25%       230.322
50%       287.064
75%       334.266
max       596.133
Name: ERTRAG (dt/ha), dtype: float64

In [12]:
data1=pd.DataFrame({'Y':df['ERTRAG (dt/ha)'],'X':df['TAG_AUSSAAT']})

In [13]:
model1= sm.ols(formula="Y ~ X", data=data1).fit()
print model1.params


Intercept    365.712
X             -0.752
dtype: float64

In [14]:
print model1.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     51.02
Date:                Thu, 03 Mar 2016   Prob (F-statistic):           1.06e-12
Time:                        10:12:07   Log-Likelihood:                -25858.
No. Observations:                4474   AIC:                         5.172e+04
Df Residuals:                    4472   BIC:                         5.173e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    365.7124     11.753     31.116      0.000       342.671   388.754
X             -0.7525      0.105     -7.143      0.000        -0.959    -0.546
==============================================================================
Omnibus:                       17.663   Durbin-Watson:                   1.224
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.513
Skew:                           0.049   Prob(JB):                     1.29e-05
Kurtosis:                       3.333   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [15]:
df['ERTRAG (dt/ha)']-=(0.752*(df['TAG_AUSSAAT']-75)) # wrong due to errors in sowing date

In [16]:
df['ERTRAG (dt/ha)'].describe()


Out[16]:
count    4474.000
mean      255.099
std        80.080
min        25.163
25%       201.619
50%       259.074
75%       307.868
max       558.533
Name: ERTRAG (dt/ha), dtype: float64

Histograms to support model building


In [17]:
plt.hist(df['ERTRAG (dt/ha)'].dropna(),bins=50)
plt.xlabel('ERTRAG (dt/ha)')
plt.grid(True)
plt.show()



In [18]:
plt.hist(df['Mittlere Ackerzahl'].dropna(),bins=50)
plt.xlabel('Mittlere Ackerzahl')
plt.grid(True)
plt.show()



In [19]:
plt.hist(df['LT_6']+df['LT_7']+df['LT_8']+df['LT_9'],bins=50)
plt.xlabel('LT6..LT9')
plt.grid(True)
plt.show()



In [20]:
plt.hist(df['NI_6']+df['NI_7']+df['NI_8']+df['NI_9'],bins=50)
plt.xlabel('NI_6..NI_9')
plt.grid(True)
plt.show()



In [20]:


In [20]: