version 1.1

Introduction

This notebook is supplementary material to a Habrahabr article demonstrating examples of data analysis and linear regression: https://habrahabr.ru/post/343216/

Given possible errors introduced by technical and human factors during data collection and processing, this dataset is recommended for demonstration purposes only; the materials may contain mistakes and are not suitable for serious research.

Data description

Data on how the executive authorities of Moscow handle citizens' appeals. Collected manually from the official portal of the Mayor and the Government of Moscow: https://www.mos.ru/feedback/reviews/

num - record index
year - year of the record
month - month of the record
total_appeals - total number of appeals for the month
appeals_to_mayor - total number of appeals addressed to the Mayor
res_positive - number of appeals with a positive decision
res_explained - number of appeals that received an explanation
res_negative - number of appeals with a negative decision
El_form_to_mayor - number of appeals to the Mayor submitted in electronic form
Pap_form_to_mayor - number of appeals to the Mayor submitted on paper
to_10K_total_VAO ... to_10K_total_YUZAO - number of appeals per 10,000 residents in each administrative district of Moscow
to_10K_mayor_VAO ... to_10K_mayor_YUZAO - number of appeals to the Mayor and the Government of Moscow per 10,000 residents in each district

Let's import the libraries and load the data


In [1]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import requests, bs4
import time
from sklearn import model_selection
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
#load data
df = pd.read_csv('msc_appel_data.csv', sep='\t', index_col='num')

In [3]:
df.tail(12)


Out[3]:
year month total_appeals appeals_to_mayor res_positive res_explained res_negative El_form_to_mayor Pap_form_to_mayor to_10K_total_VAO ... to_10K_total_TiNAO to_10K_mayor_TiNAO to_10K_total_CAO to_10K_mayor_СAO to_10K_total_YUAO to_10K_mayor_YUAO to_10K_total_YUVAO to_10K_mayor_YUVAO to_10K_total_YUZAO to_10K_mayor_YUZAO
num
11 2016 November 115505 24609 19225 43490 5303 18894 5727 61 ... 83 20 103 23 54 7 54 9 63 9
12 2016 December 121014 26161 30224 74575 8666 22097 4064 69 ... 85 29 129 35 68 13 58 14 71 16
13 2017 January 91687 19192 25206 56111 6649 16447 2745 56 ... 58 18 86 22 43 9 52 12 59 12
14 2017 February 101381 21666 24933 63953 8113 18284 3382 63 ... 67 22 97 28 50 12 55 14 58 18
15 2017 March 135820 29142 33418 85030 9924 24259 4883 80 ... 90 35 150 32 69 17 75 18 82 19
16 2017 April 124645 28222 29975 81685 7797 22879 5343 74 ... 78 27 128 34 61 16 71 18 76 16
17 2017 May 132311 33344 30373 88947 7589 27621 5723 84 ... 80 28 141 44 62 15 79 24 75 17
18 2017 June 123521 27173 29427 81738 7049 22410 4763 78 ... 75 26 134 37 56 13 66 78 78 18
19 2017 July 117468 25336 29803 76352 6256 21174 4162 72 ... 101 51 127 32 55 12 60 14 70 14
20 2017 August 123718 25633 31333 79803 7191 20977 4655 80 ... 101 51 127 32 55 12 60 14 70 14
21 2017 September 116687 26181 28510 76011 6932 21962 4219 71 ... 78 33 121 33 56 11 60 14 74 19
22 2017 October 152210 36046 30545 67793 7214 31504 4542 92 ... 94 42 146 36 74 17 80 20 88 19

12 rows × 31 columns

First look at the data

Let's look at the correlation between some of the columns, namely all of them except those related to the districts of Moscow, and construct their pairwise scatter plots.


In [4]:
columns_to_show = ['res_positive', 'res_explained', 'res_negative',
                   'total_appeals', 'appeals_to_mayor','El_form_to_mayor', 'Pap_form_to_mayor']
data=df[columns_to_show]

In [5]:
grid = sns.pairplot(df[columns_to_show])
plt.savefig('1.png')


Let's look more closely at some combinations that hint at a linear dependence, and obtain a quantitative estimate in the form of the Pearson correlation coefficient.


In [6]:
print("Correlation coefficient for a explained review result to the total number of appeals =",
       df.res_explained.corr(df.total_appeals) )

print("Corr.coeff. for a  total number of appeals to mayor to the total number of appeals to mayor in electronic form =",
       df.appeals_to_mayor.corr(df.El_form_to_mayor) )


Correlation coefficient for explained review results vs. the total number of appeals = 0.674507032284
Corr. coeff. for the total number of appeals to the Mayor vs. appeals to the Mayor in electronic form = 0.742010766317

On the one hand, it is obvious that the greater the total number of appeals, or of appeals in electronic form, the greater the corresponding outcome counts. On the other hand, this dependence is clearly not fully linear, and we certainly have not accounted for everything.
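
As a sanity check, Pearson's r can be computed by hand: it is the covariance of the two columns divided by the product of their standard deviations. A minimal sketch with NumPy, which should reproduce the pandas .corr() value above:

a = df.res_explained.values
b = df.total_appeals.values

# covariance over the product of standard deviations
r = ((a - a.mean()) * (b - b.mean())).sum() / np.sqrt(((a - a.mean())**2).sum() * ((b - b.mean())**2).sum())
print(r)                        # same as df.res_explained.corr(df.total_appeals)
print(np.corrcoef(a, b)[0, 1])  # the same via NumPy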

Let's look at something else: for example, let's find the district of Moscow with the most citizens' appeals per 10,000 residents.


In [7]:
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
        'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
        'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']

In [8]:
y_pos = np.arange(len(district_columns))

short_district_columns=district_columns.copy()
for i in range(len(short_district_columns)):
    short_district_columns[i] = short_district_columns[i].replace('to_10K_total_','')


distr_sum = df[district_columns].sum()

plt.figure(figsize=(16,9))
plt.bar(y_pos, distr_sum, align='center', alpha=0.5)

plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per 10,000 people for all time')

plt.savefig('2.png')


Add other data from the network


In [9]:
""" To remind
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
        'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
        'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']  
"""
# we will collect the data manually from 
# https://ru.wikipedia.org/wiki/%D0%90%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D0%BE-%D1%82%D0%B5%D1%80%D1%80%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D0%B5_%D0%B4%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B

#the data is filled in the same order as the district_columns
district_population=[1507198,1368731,239861,1160576,1415283,990696,339231,769630,1776789,1385385,1427284]

#convert appeals per 10,000 residents into an estimate of the total number of appeals per district
#(elementwise: rate * population / 10000)
total_appel_dep=district_population*distr_sum/10000

plt.figure(figsize=(16,9))
plt.bar(y_pos, total_appel_dep, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per total population of district for all time')

plt.savefig('3.png')


Let's see whether the number of positive decisions on appeals is in any way related to oil prices. We will collect the data automatically by writing a simple scraper for the site.


In [10]:
#we use beautifulsoup

oil_page=requests.get('https://worldtable.info/yekonomika/cena-na-neft-marki-brent-tablica-s-1986-po-20.html')
b=bs4.BeautifulSoup(oil_page.text, "html.parser")
table = b.find('div', {'class': 'item-description'})
table_tr=table.find_all('tr') 

d_parse=OrderedDict()
for tr in table_tr[1:len(table_tr)-1]:
    td=tr.find_all('td')
    d_parse[td[0].get_text()]=float(td[1].get_text())

In [11]:
# slice boundaries for the parsed dictionary
d_start=358
d_end=379 # the site has no data for October yet
#d_end=380 once the source publishes the October value


# Uncomment the lines below if the scraper stops working
#d_parse=[("январь 2016", 30.8), ("февраль 2016", 33.2), ("март 2016", 39.25), ("апрель 2016", 42.78), ("май 2016", 47.09),
# ("июнь 2016", 49.78), ("июль 2016", 46.63), ("август 2016", 46.37), ("сентябрь 2016", 47.68), ("октябрь 2016", 51.1),
# ("ноябрь 2016", 47.97), ("декабрь 2016", 54.44), ("январь 2017", 55.98), ("февраль 2017", 55.95), ("март 2017", 53.38),
# ("апрель 2017", 53.54), ("май 2017", 50.66), ("июнь 2017", 47.91), ("июль 2017", 49.51), ("август 2017", 51.82) , ("сентябрь 2017", 55.74)]
#d_parse=dict(d_parse)
#d_start=0
#d_end=20

In [12]:
# values from January 2016 to October 2017 


oil_price=list(d_parse.values())[d_start:d_end]
oil_price.append(57.64) # October 2017 value was obtained manually;
# once the source site publishes it, remove this append and set d_end=380 above

df['oil_price']=oil_price
df.tail(5)


Out[12]:
year month total_appeals appeals_to_mayor res_positive res_explained res_negative El_form_to_mayor Pap_form_to_mayor to_10K_total_VAO ... to_10K_mayor_TiNAO to_10K_total_CAO to_10K_mayor_СAO to_10K_total_YUAO to_10K_mayor_YUAO to_10K_total_YUVAO to_10K_mayor_YUVAO to_10K_total_YUZAO to_10K_mayor_YUZAO oil_price
num
18 2017 June 123521 27173 29427 81738 7049 22410 4763 78 ... 26 134 37 56 13 66 78 78 18 47.91
19 2017 July 117468 25336 29803 76352 6256 21174 4162 72 ... 51 127 32 55 12 60 14 70 14 49.51
20 2017 August 123718 25633 31333 79803 7191 20977 4655 80 ... 51 127 32 55 12 60 14 70 14 51.82
21 2017 September 116687 26181 28510 76011 6932 21962 4219 71 ... 33 121 33 56 11 60 14 74 19 55.74
22 2017 October 152210 36046 30545 67793 7214 31504 4542 92 ... 42 146 36 74 17 80 20 88 19 57.64

5 rows × 32 columns


In [13]:
print("Correlation coefficient for the total number of appeals result to the oil price (in US $) =",
       df.total_appeals.corr(df.oil_price) )
print("Correlation coefficient for a positive review result to the oil price (in US $) =",
       df.res_positive.corr(df.oil_price) )


Correlation coefficient for the total number of appeals vs. the oil price (in US $) = 0.511729317856
Correlation coefficient for positive review results vs. the oil price (in US $) = -0.0143742853151

Linear regression

Let's perform some manipulations on the initial data in order to build a linear regression model in the cells below.


In [14]:
df2=df.copy()

#Let's make a separate column for each value of our categorical variable
df2=pd.get_dummies(df2,prefix=['month'])

#Let's code the month with numbers
d={'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
       'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
month=df.month.map(d)

#Assemble the date string from several columns
dt=list()
for year,mont in zip(df2.year.values, month.values):
    s=str(year)+' '+str(mont)+' 1'
    dt.append(s)
#convert the assembled strings to datetime and store them in place of the year column
df2.rename(columns={'year': 'DateTime'}, inplace=True)
df2['DateTime']=pd.to_datetime(dt, format='%Y %m %d')

df2.head(5)


Out[14]:
DateTime total_appeals appeals_to_mayor res_positive res_explained res_negative El_form_to_mayor Pap_form_to_mayor to_10K_total_VAO to_10K_mayor_VAO ... month_December month_February month_January month_July month_June month_March month_May month_November month_October month_September
num
1 2016-01-01 79217 22110 26950 42764 6405 19418 2692 44 14 ... 0 0 1 0 0 0 0 0 0 0
2 2016-02-01 102704 26736 30071 59325 8239 22846 3890 54 15 ... 0 1 0 0 0 0 0 0 0 0
3 2016-03-01 112527 26972 30820 67568 8435 22250 4722 62 16 ... 0 0 0 0 0 1 0 0 0 0
4 2016-04-01 121050 30179 31289 75471 8359 26086 4093 67 17 ... 0 0 0 0 0 0 0 0 0 0
5 2016-05-01 119504 40300 26433 80753 6969 22677 17623 65 24 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 43 columns
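
As a side note, the same date assembly can be done in one line, assuming the month column holds full English month names (just a sketch, not used below):

# '%Y %B' parses strings like '2016 January'; the day defaults to the 1st
df2['DateTime'] = pd.to_datetime(df.year.astype(str) + ' ' + df.month, format='%Y %B')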

We will build the model using most of the table's columns as features, excluding the month data. Let's see how well this helps us predict the number of positive decisions on citizens' appeals.


In [15]:
#Prepare the data

cols_for_regression=columns_to_show+district_columns

cols_for_regression.remove('res_positive')
cols_for_regression.remove('total_appeals')

X=df2[cols_for_regression].values
y=df2['res_positive']

#Scale the data
scaler = StandardScaler()
X_scal=scaler.fit_transform(X)
#note: newer versions of scikit-learn require 2D input here,
#e.g. scaler.fit_transform(y.values.reshape(-1, 1)).ravel()
y_scal=scaler.fit_transform(y)

We will use linear regression with Ridge (L2) regularization. The data is split 80% / 20% (train / test), and we will also check the quality of the model with cross-validation (leave-one-out: each iteration holds out a single sample).
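
To illustrate what leave-one-out does, here is a tiny sketch: each of the n samples is held out as the test set exactly once.

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_idx, test_idx in loo.split(np.arange(4).reshape(-1, 1)):
    print(train_idx, test_idx)
# [1 2 3] [0]
# [0 2 3] [1]
# [0 1 3] [2]
# [0 1 2] [3]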


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_scal, y_scal, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])

loo = model_selection.LeaveOneOut()

#the alpha coefficient is taken at a rough guess

lr = linear_model.Ridge(alpha=55.0)
#note: newer versions of scikit-learn name this scorer 'neg_mean_squared_error'
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))


CV Score: -0.815581551634
Coefficients: [ 0.12182525  0.12288722 -0.00529611  0.04583662 -0.05581949  0.04276631
  0.05321937 -0.03056365  0.06787864  0.02837965 -0.02939786  0.03727351
  0.03720912  0.02695507  0.04852694  0.02903891]
Test Score: 0.169584745627
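
The alpha above was picked by eye. As a sketch of a less arbitrary approach, scikit-learn's RidgeCV can select the regularization strength over a grid of candidates (by default it uses an efficient form of leave-one-out cross-validation); this reuses X_train and y_train from the cell above:

from sklearn.linear_model import RidgeCV

alphas = np.logspace(-2, 3, 50)    # logarithmic grid of candidate strengths
ridge_cv = RidgeCV(alphas=alphas)  # leave-one-out CV by default
ridge_cv.fit(X_train, y_train)
print('best alpha:', ridge_cv.alpha_)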

Let's see how the oil price affects the quality of the prediction.


In [17]:
X_oil=df2[cols_for_regression+['oil_price']].values
y_oil=df2['res_positive']
scaler =StandardScaler()
X_scal_oil=scaler.fit_transform(X_oil)
y_scal_oil=scaler.fit_transform(y_oil)

X_train, X_test, y_train, y_test = train_test_split(X_scal_oil, y_scal_oil, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=55.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))


CV Score: -0.818032969557
Coefficients: [ 0.12173974  0.12273574 -0.00573119  0.04561351 -0.05623505  0.04338597
  0.05325354 -0.02974894  0.06789575  0.02955784 -0.02873256  0.03734026
  0.03784244  0.02769245  0.04873924  0.02973559 -0.01962187]
Test Score: 0.216108757647

In [18]:
# plot for test data
plt.figure(figsize=(16,9))

plt.scatter(lr.predict(X_test), y_test,  color='black')
plt.plot(y_test, y_test, '-', color='green',
         linewidth=1)


plt.xlabel('relative number of positive results (predict)')
plt.ylabel('relative number of positive results (test)')
plt.title("Regression on test data")

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))

plt.savefig('4.png')


predict: [-1.02297898 -0.33270701 -0.11068435 -0.19509654  0.36931812] 
real: [-0.60191197 -1.27237515  0.86566113  0.43552759  0.4036166 ] 

With a perfectly accurate prediction, all 5 points would lie on the line.
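
The test error can also be summarized numerically; for example, the mean squared error of the predictions above (a sketch using scikit-learn's metrics module):

from sklearn.metrics import mean_squared_error

print('MSE on test:', mean_squared_error(y_test, lr.predict(X_test)))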

Time series

Previously, we held out random points to test the prediction. Now let's consider the same problem in the context of a time trend: we will predict the number of positive decisions in the "future".

First, let's look at our previous model with oil prices.


In [19]:
l_bord = 18
r_bord = 22

X_train=X_scal_oil[0:l_bord]
X_test=X_scal_oil[l_bord:r_bord]
y_train=y_scal_oil[0:l_bord]
y_test=y_scal_oil[l_bord:r_bord]

loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))


# plot for test data
plt.figure(figsize=(19,10))

#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train),  color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
         linewidth=3)

#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test),  color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
         linewidth=3)

#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]],  [y_train[l_bord-1],y_test[0]] , 
         color='magenta',linewidth=2, label='train to test')

plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))

plt.savefig('5.1.png')


CV Score: -1.02561995329
Coefficients: [ 0.33279532  0.2238806  -0.0485607   0.09061464 -0.1674544  -0.04906857
  0.00490234 -0.06593315  0.23676848  0.17651176 -0.09812897 -0.04787692
 -0.0110044  -0.01962801  0.1399906   0.00992883 -0.20845613]
Test Score: -0.30433612959
predict: [-0.02762611  0.240008   -0.0187591   0.91056472] 
real: [ 0.34644274  0.85502414 -0.08335839  0.5930881 ] 

Now let's remove the oil prices and instead add the one-hot encoded month columns. (Recall that lr.score returns the coefficient of determination R², so a negative test score means the model does worse than simply predicting the mean.)


In [20]:
l_bord = 18
r_bord = 22

cols_months=['month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November',
'month_October','month_September','month_April','month_August']

X_month=df2[cols_for_regression+cols_months].values
y_month=df2['res_positive']
scaler =StandardScaler()
X_scal_month=scaler.fit_transform(X_month)
y_scal_month=scaler.fit_transform(y_month)


X_train=X_scal_month[0:l_bord]
X_test=X_scal_month[l_bord:r_bord]
y_train=y_scal_month[0:l_bord]
y_test=y_scal_month[l_bord:r_bord]


loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))


# plot for test data
plt.figure(figsize=(19,10))

#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train),  color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
         linewidth=3)
#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test),  color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
         linewidth=3)
#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]],  [y_train[l_bord-1],y_test[0]] , color='magenta',linewidth=2, label='train to test')

plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))

plt.savefig('5.2.png')


CV Score: -0.840389139824
Coefficients: [ 0.11060554  0.12126109  0.05229622  0.19837054 -0.12913572 -0.02836399
  0.06109993 -0.12671021  0.14328666  0.07217274 -0.09248108  0.01684538
  0.00488502 -0.01321408  0.10103911  0.01138832  0.04120886 -0.06630141
 -0.00164204 -0.07582667  0.02054273  0.1213505  -0.153386   -0.31654704
  0.04172504  0.15709389  0.0006044   0.15534103]
Test Score: -2.71838992675
predict: [-0.19952794  0.71283705  0.67317836  1.53424908] 
real: [ 0.34644274  0.85502414 -0.08335839  0.5930881 ] 

Perhaps you could apply the Statsmodels library (http://www.statsmodels.org/stable/index.html) to analyze the time trend, but it seems to me that at the moment there is not quite enough data for a good analysis.
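
As a starting point, a minimal sketch of fitting a linear trend with statsmodels OLS (only an illustration; with just 22 monthly points, anything more elaborate, such as seasonal decomposition, is unlikely to be reliable):

import statsmodels.api as sm

# regress the target on a simple time index to estimate a linear trend
t = np.arange(len(df2))        # 0, 1, ..., 21
X_trend = sm.add_constant(t)   # add an intercept term
trend_model = sm.OLS(df2['res_positive'].values, X_trend).fit()
print(trend_model.summary())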