version 1.1
This notebook is supplementary material to the article demonstrating examples of data analysis and linear regression, published on Habrahabr - https://habrahabr.ru/post/343216/
Given possible errors introduced by technical and human factors during data collection and processing, this dataset is recommended for demonstration purposes only and not for serious research.
Data on how the executive authorities of Moscow handle citizens' appeals. Collected manually from the official portal of the Mayor and the Government of Moscow - https://www.mos.ru/feedback/reviews/
num - record index
year - year of the record
month - month of the record
total_appeals - total number of appeals for the month
appeals_to_mayor - total number of appeals addressed to the Mayor
res_positive - number of appeals with a positive decision
res_explained - number of appeals that received an explanation
res_negative - number of appeals with a negative decision
El_form_to_mayor - number of appeals to the Mayor submitted in electronic form
Pap_form_to_mayor - number of appeals to the Mayor submitted on paper
to_10K_total_VAO ... to_10K_total_YUZAO - number of appeals per 10,000 residents in the various districts of Moscow
to_10K_mayor_VAO ... to_10K_mayor_YUZAO - number of appeals to the Mayor and the Government of Moscow per 10,000 residents in the various districts of the city
Let's import the libraries and load the data
In [1]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import requests, bs4
import time
from sklearn import model_selection
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
#load data
df = pd.read_csv('msc_appel_data.csv', sep='\t', index_col='num')
In [3]:
df.tail(12)
Out[3]:
Let's look at the correlations between some of the columns (all except those related to the districts of Moscow) and build their pairwise scatter plots.
In [4]:
columns_to_show = ['res_positive', 'res_explained', 'res_negative',
'total_appeals', 'appeals_to_mayor','El_form_to_mayor', 'Pap_form_to_mayor']
data=df[columns_to_show]
In [5]:
grid = sns.pairplot(df[columns_to_show])
plt.savefig('1.png')
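In addition to the pair plots, the same correlations can be summarized as a matrix; below is a small sketch (not part of the original article) using seaborn's heatmap on the columns selected above:
plt.figure(figsize=(10, 8))
# correlation matrix for the selected columns, annotated with the coefficients
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')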
Let's take a closer look at a few combinations that hint at a linear dependence and get a quantitative estimate in the form of the Pearson correlation coefficient.
In [6]:
print("Correlation coefficient for a explained review result to the total number of appeals =",
df.res_explained.corr(df.total_appeals) )
print("Corr.coeff. for a total number of appeals to mayor to the total number of appeals to mayor in electronic form =",
df.appeals_to_mayor.corr(df.El_form_to_mayor) )
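pandas' .corr() returns only the coefficient itself; if a p-value is also of interest, scipy (assumed to be installed alongside the other libraries) provides pearsonr:
from scipy.stats import pearsonr

# Pearson correlation with its p-value for one of the pairs above
r, p_value = pearsonr(df.res_explained, df.total_appeals)
print('Pearson r = {:.3f}, p-value = {:.4f}'.format(r, p_value))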
On the one hand, it is obvious that the larger the total number of appeals (or of appeals in electronic form), the larger the related counts. On the other hand, this dependence is not completely linear, and we certainly could not account for everything.
Let's look at something else. For example, let's find the district of Moscow with the most citizens' appeals per 10,000 residents over the year.
In [7]:
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']
In [8]:
y_pos = np.arange(len(district_columns))
short_district_columns = [c.replace('to_10K_total_', '') for c in district_columns]
distr_sum = df[district_columns].sum()
plt.figure(figsize=(16,9))
plt.bar(y_pos, distr_sum, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per 10,000 people for all time')
plt.savefig('2.png')
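The leading district can also be found programmatically instead of being read off the bar chart; a short sketch:
# column with the largest total of appeals per 10,000 residents
top_district = distr_sum.idxmax().replace('to_10K_total_', '')
print('District with the most appeals per 10,000 residents:', top_district)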
In [9]:
""" To remind
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']
"""
# we will collect the data manually from
# https://ru.wikipedia.org/wiki/%D0%90%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D0%BE-%D1%82%D0%B5%D1%80%D1%80%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D0%B5_%D0%B4%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B
#the data is filled in the same order as the district_columns
district_population=[1507198,1368731,239861,1160576,1415283,990696,339231,769630,1776789,1385385,1427284]
#convert from appeals per 10,000 residents to the number of appeals for the entire district population
total_appel_dep=district_population*distr_sum/10000
plt.figure(figsize=(16,9))
plt.bar(y_pos, total_appel_dep, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per total population of district for all time')
plt.savefig('3.png')
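Relying on district_population being listed in exactly the same order as district_columns is fragile; a possible (not used in the original) way to make the pairing explicit is a Series keyed by the column names:
# hypothetical helper: tie each population figure to its district column by name
district_population_s = pd.Series(district_population, index=district_columns)
# multiplication now aligns on the index rather than on list position
total_appeals_by_district = district_population_s * distr_sum / 10000
print(total_appeals_by_district.sort_values(ascending=False).head())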
Let's see whether the number of positive decisions on appeals is somehow related to oil prices. We will collect the data automatically by writing a simple scraper for the site.
In [10]:
#we use beautifulsoup
oil_page=requests.get('https://worldtable.info/yekonomika/cena-na-neft-marki-brent-tablica-s-1986-po-20.html')
b=bs4.BeautifulSoup(oil_page.text, "html.parser")
table = b.find('div', {'class': 'item-description'})
table_tr = table.find_all('tr')
d_parse = OrderedDict()
# skip the first and last rows of the table
for tr in table_tr[1:len(table_tr)-1]:
    td = tr.find_all('td')
    d_parse[td[0].get_text()] = float(td[1].get_text())
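Since the page layout can change at any time, a quick sanity check of what was parsed is worthwhile before using it (a small addition, not in the original notebook):
# quick sanity check of the scraped dictionary
print('parsed rows:', len(d_parse))
print('last entries:', list(d_parse.items())[-3:])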
In [11]:
# dictionary selection boundaries
d_start=358
d_end=379 #because the site has no data for October
#d_end=380 if the authors in the data source fill in the values for October, you must enter 380
# Uncomment all if grabber doesn't work
#d_parse=[("январь 2016", 30.8), ("февраль 2016", 33.2), ("март 2016", 39.25), ("апрель 2016", 42.78), ("май 2016", 47.09),
# ("июнь 2016", 49.78), ("июль 2016", 46.63), ("август 2016", 46.37), ("сентябрь 2016", 47.68), ("октябрь 2016", 51.1),
# ("ноябрь 2016", 47.97), ("декабрь 2016", 54.44), ("январь 2017", 55.98), ("февраль 2017", 55.95), ("март 2017", 53.38),
# ("апрель 2017", 53.54), ("май 2017", 50.66), ("июнь 2017", 47.91), ("июль 2017", 49.51), ("август 2017", 51.82) , ("сентябрь 2017", 55.74)]
#d_parse=dict(d_parse)
#d_start=0
#d_end=20
In [12]:
# values from January 2016 to October 2017
oil_price=list(d_parse.values())[d_start:d_end]
oil_price.append(57.64) #delete this when the source site shows data for October
#In the collected data the October's the data was calculated manually,
#in the future if it is fixed in the source, you can delete these lines and the code (oil_price.append(57.64)) above
df['oil_price']=oil_price
df.tail(5)
Out[12]:
In [13]:
print("Correlation coefficient for the total number of appeals result to the oil price (in US $) =",
df.total_appeals.corr(df.oil_price) )
print("Correlation coefficient for a positive review result to the oil price (in US $) =",
df.res_positive.corr(df.oil_price) )
Let's perform some manipulations on the original data in order to build a linear regression model in the cells below.
In [14]:
df2=df.copy()
#Let's make a separate column for each value of our categorical variable
df2=pd.get_dummies(df2,prefix=['month'])
#Let's code the month with numbers
d={'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
month=df.month.map(d)
#We paste the information about the date from several columns
dt = list()
for year, mon in zip(df2.year.values, month.values):
    s = str(year) + ' ' + str(mon) + ' 1'
    dt.append(s)
#convert the received data into the DateTime type and replace them with a column year
df2.rename(columns={'year': 'DateTime'}, inplace=True)
df2['DateTime']=pd.to_datetime(dt, format='%Y %m %d')
df2.head(5)
Out[14]:
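As an alternative to assembling date strings by hand, pandas can build the dates directly from year/month/day components; a sketch of an equivalent approach:
# equivalent way to build the DateTime column from its components
df2['DateTime'] = pd.to_datetime(pd.DataFrame({'year': df.year,
                                               'month': df.month.map(d),
                                               'day': 1}))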
We will build the model using most of the table columns as features, without the month data. Let's see how well this helps us predict the number of positive decisions on citizens' appeals.
In [15]:
#Prepare the data
cols_for_regression=columns_to_show+district_columns
cols_for_regression.remove('res_positive')
cols_for_regression.remove('total_appeals')
X=df2[cols_for_regression].values
y=df2['res_positive']
#Scale the data
scaler = StandardScaler()
X_scal = scaler.fit_transform(X)
# StandardScaler expects a 2D array, so reshape the 1D target before scaling
y_scal = scaler.fit_transform(y.values.reshape(-1, 1)).ravel()
We will use linear regression with Ridge regularization. The data is split in the ratio 80% / 20% (train / test), and we will also check the quality of the model with cross-validation (in this case leave-one-out: each split holds out a single sample).
In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_scal, y_scal, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])
loo = model_selection.LeaveOneOut()
#alpha coefficient is taken at a rough guess
lr = linear_model.Ridge(alpha=55.0)
# 'neg_mean_squared_error' is the scoring name in current scikit-learn versions
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))
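Instead of guessing alpha, it could be selected automatically; a minimal sketch using RidgeCV (which by default performs an efficient leave-one-out cross-validation over the given grid):
from sklearn.linear_model import RidgeCV

# hypothetical alpha grid; the chosen value may differ from the guess of 55.0
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 100))
ridge_cv.fit(X_train, y_train)
print('best alpha:', ridge_cv.alpha_)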
Let's see how the oil price affects the quality of the prediction.
In [17]:
X_oil=df2[cols_for_regression+['oil_price']].values
y_oil=df2['res_positive']
scaler = StandardScaler()
X_scal_oil = scaler.fit_transform(X_oil)
y_scal_oil = scaler.fit_transform(y_oil.values.reshape(-1, 1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(X_scal_oil, y_scal_oil, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=55.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))
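If you want the hold-out error in the same terms as the CV score, it can be computed explicitly with sklearn.metrics (a small addition to the original code):
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)
print('Test MSE:', mean_squared_error(y_test, y_pred))
print('Test R^2:', r2_score(y_test, y_pred))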
In [18]:
# plot for test data
plt.figure(figsize=(16,9))
plt.scatter(lr.predict(X_test), y_test, color='black')
plt.plot(y_test, y_test, '-', color='green',
linewidth=1)
plt.xlabel('relative number of positive results (predict)')
plt.ylabel('relative number of positive results (test)')
plt.title("Regression on test data")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('4.png')
With a perfectly accurate prediction, all 4 points would lie on the line.
So far we have taken random points as the hold-out for checking the prediction. Let's now look at the same thing in the context of a time trend and predict the number of positive decisions in the "future".
First, let's look at our previous model with oil prices.
In [19]:
l_bord = 18
r_bord = 22
X_train=X_scal_oil[0:l_bord]
X_test=X_scal_oil[l_bord:r_bord]
y_train=y_scal_oil[0:l_bord]
y_test=y_scal_oil[l_bord:r_bord]
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))
# plot for test data
plt.figure(figsize=(19,10))
#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
linewidth=3)
#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
linewidth=3)
#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]], [y_train[l_bord-1],y_test[0]] ,
color='magenta',linewidth=2, label='train to test')
plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.1.png')
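The manual l_bord/r_bord split above is one way to respect the time ordering; scikit-learn also offers TimeSeriesSplit, which could be used to cross-validate this kind of data (a sketch, reusing the scaled oil-price matrices from above):
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X_scal_oil):
    lr_ts = linear_model.Ridge(alpha=7.0)
    lr_ts.fit(X_scal_oil[train_idx], y_scal_oil[train_idx])
    print('train size:', len(train_idx),
          'test score:', lr_ts.score(X_scal_oil[test_idx], y_scal_oil[test_idx]))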
Let's remove the oil prices but add the encoded month data.
In [20]:
l_bord = 18
r_bord = 22
cols_months=['month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November',
'month_October','month_September','month_April','month_August']
X_month=df2[cols_for_regression+cols_months].values
y_month=df2['res_positive']
scaler = StandardScaler()
X_scal_month = scaler.fit_transform(X_month)
y_scal_month = scaler.fit_transform(y_month.values.reshape(-1, 1)).ravel()
X_train=X_scal_month[0:l_bord]
X_test=X_scal_month[l_bord:r_bord]
y_train=y_scal_month[0:l_bord]
y_test=y_scal_month[l_bord:r_bord]
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))
# plot for test data
plt.figure(figsize=(19,10))
#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
linewidth=3)
#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
linewidth=3)
#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]], [y_train[l_bord-1],y_test[0]] , color='magenta',linewidth=2, label='train to test')
plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.2.png')
Perhaps you could apply the Statsmodels library (http://www.statsmodels.org/stable/index.html) to analyze the time trend, but it seems to me that at the moment there is not quite enough data for a good analysis.
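For reference, a minimal sketch of such a trend fit with statsmodels (assuming it is installed); with only about two years of monthly points this is illustrative rather than conclusive:
import statsmodels.api as sm

t = np.arange(len(df2))             # simple linear time index
X_trend = sm.add_constant(t)        # intercept + linear trend
trend_model = sm.OLS(df2['res_positive'].values, X_trend).fit()
print(trend_model.summary())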