version 1.1
This notebook is supplementary material to the article demonstrating examples of data analysis and linear regression, published on Habrahabr - https://habrahabr.ru/post/343216/
Given possible errors introduced by technical and human factors during data collection and processing, this dataset is recommended for demonstration purposes only and not for serious research.
Data on how the executive authorities of Moscow handle citizens' appeals. Collected manually from the official portal of the Mayor and the Government of Moscow - https://www.mos.ru/feedback/reviews/
num - record index
year - year of the record
month - month of the record
total_appeals - total number of appeals for the month
appeals_to_mayor - total number of appeals addressed to the Mayor
res_positive - number of appeals with a positive decision
res_explained - number of appeals that received an explanation
res_negative - number of appeals with a negative decision
El_form_to_mayor - number of appeals to the Mayor submitted in electronic form
Pap_form_to_mayor - number of appeals to the Mayor submitted on paper
to_10K_total_VAO ... to_10K_total_YUZAO - number of appeals per 10,000 residents in the various districts of Moscow
to_10K_mayor_VAO ... to_10K_mayor_YUZAO - number of appeals to the Mayor and the Government of Moscow per 10,000 residents in the various districts of the city
Let's import the libraries and load the data
In [1]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import requests, bs4
import time
from sklearn import model_selection
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
#load data
df = pd.read_csv('msc_appel_data.csv', sep='\t', index_col='num')
In [3]:
df.tail(12)
Out[3]:
Let's look at the correlations between some of the columns (all except those related to the districts of Moscow) and build their pairwise scatter plots.
In [4]:
columns_to_show = ['res_positive', 'res_explained', 'res_negative',
'total_appeals', 'appeals_to_mayor','El_form_to_mayor', 'Pap_form_to_mayor']
data=df[columns_to_show]
In [5]:
grid = sns.pairplot(df[columns_to_show])
plt.savefig('1.png')
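In addition to the pair plots, the same correlations can be summarized as a matrix; below is a small sketch (not part of the original article) using seaborn's heatmap on the columns selected above:
plt.figure(figsize=(10, 8))
# correlation matrix for the selected columns, annotated with the coefficients
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')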
Let's take a closer look at a few combinations that hint at a linear dependence and get a quantitative estimate in the form of the Pearson correlation coefficient.
In [6]:
print("Correlation coefficient for a explained review result to the total number of appeals =",
df.res_explained.corr(df.total_appeals) )
print("Corr.coeff. for a total number of appeals to mayor to the total number of appeals to mayor in electronic form =",
df.appeals_to_mayor.corr(df.El_form_to_mayor) )
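pandas' .corr() returns only the coefficient itself; if a p-value is also of interest, scipy (assumed to be installed alongside the other libraries) provides pearsonr:
from scipy.stats import pearsonr

# Pearson correlation with its p-value for one of the pairs above
r, p_value = pearsonr(df.res_explained, df.total_appeals)
print('Pearson r = {:.3f}, p-value = {:.4f}'.format(r, p_value))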
On the one hand, it is obvious that the larger the total number of appeals (or of appeals in electronic form), the larger the related counts. On the other hand, this dependence is not completely linear, and we certainly could not account for everything.
Let's look at something else. For example, let's find the district of Moscow with the most citizens' appeals per 10,000 residents over the year.
In [7]:
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']
In [8]:
y_pos = np.arange(len(district_columns))
short_district_columns = [c.replace('to_10K_total_', '') for c in district_columns]
distr_sum = df[district_columns].sum()
plt.figure(figsize=(16,9))
plt.bar(y_pos, distr_sum, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per 10,000 people for all time')
plt.savefig('2.png')
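The leading district can also be found programmatically instead of being read off the bar chart; a short sketch:
# column with the largest total of appeals per 10,000 residents
top_district = distr_sum.idxmax().replace('to_10K_total_', '')
print('District with the most appeals per 10,000 residents:', top_district)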
In [9]:
""" To remind
district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
'to_10K_total_SAO','to_10K_total_SVAO','to_10K_total_SZAO','to_10K_total_TiNAO','to_10K_total_CAO',
'to_10K_total_YUAO','to_10K_total_YUVAO','to_10K_total_YUZAO']
"""
# we will collect the data manually from
# https://ru.wikipedia.org/wiki/%D0%90%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D0%BE-%D1%82%D0%B5%D1%80%D1%80%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D0%B5_%D0%B4%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B
#the data is filled in the same order as the district_columns
district_population=[1507198,1368731,239861,1160576,1415283,990696,339231,769630,1776789,1385385,1427284]
#convert from appeals per 10,000 residents to the number of appeals for the entire district population
total_appel_dep=district_population*distr_sum/10000
plt.figure(figsize=(16,9))
plt.bar(y_pos, total_appel_dep, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per total population of district for all time')
plt.savefig('3.png')
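Relying on district_population being listed in exactly the same order as district_columns is fragile; a possible (not used in the original) way to make the pairing explicit is a Series keyed by the column names:
# hypothetical helper: tie each population figure to its district column by name
district_population_s = pd.Series(district_population, index=district_columns)
# multiplication now aligns on the index rather than on list position
total_appeals_by_district = district_population_s * distr_sum / 10000
print(total_appeals_by_district.sort_values(ascending=False).head())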
Let's see whether the number of positive decisions on appeals is somehow related to oil prices. We will collect the data automatically by writing a simple scraper for the site.
In [10]:
#we use beautifulsoup
oil_page=requests.get('https://worldtable.info/yekonomika/cena-na-neft-marki-brent-tablica-s-1986-po-20.html')
b=bs4.BeautifulSoup(oil_page.text, "html.parser")
table = b.find('div', {'class': 'item-description'})
table_tr = table.find_all('tr')
d_parse = OrderedDict()
# skip the first and last rows of the table
for tr in table_tr[1:len(table_tr)-1]:
    td = tr.find_all('td')
    d_parse[td[0].get_text()] = float(td[1].get_text())
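Since the page layout can change at any time, a quick sanity check of what was parsed is worthwhile before using it (a small addition, not in the original notebook):
# quick sanity check of the scraped dictionary
print('parsed rows:', len(d_parse))
print('last entries:', list(d_parse.items())[-3:])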
In [11]:
# dictionary selection boundaries
d_start=358
d_end=379 #because the site has no data for October
#d_end=380 if the authors in the data source fill in the values for October, you must enter 380
# Uncomment all if grabber doesn't work
#d_parse=[("январь 2016", 30.8), ("февраль 2016", 33.2), ("март 2016", 39.25), ("апрель 2016", 42.78), ("май 2016", 47.09),
# ("июнь 2016", 49.78), ("июль 2016", 46.63), ("август 2016", 46.37), ("сентябрь 2016", 47.68), ("октябрь 2016", 51.1),
# ("ноябрь 2016", 47.97), ("декабрь 2016", 54.44), ("январь 2017", 55.98), ("февраль 2017", 55.95), ("март 2017", 53.38),
# ("апрель 2017", 53.54), ("май 2017", 50.66), ("июнь 2017", 47.91), ("июль 2017", 49.51), ("август 2017", 51.82) , ("сентябрь 2017", 55.74)]
#d_parse=dict(d_parse)
#d_start=0
#d_end=20
In [12]:
# values from January 2016 to October 2017
oil_price=list(d_parse.values())[d_start:d_end]
oil_price.append(57.64) #delete this when the source site shows data for October
#In the collected data the October's the data was calculated manually,
#in the future if it is fixed in the source, you can delete these lines and the code (oil_price.append(57.64)) above
df['oil_price']=oil_price
df.tail(5)
Out[12]:
In [13]:
print("Correlation coefficient for the total number of appeals result to the oil price (in US $) =",
df.total_appeals.corr(df.oil_price) )
print("Correlation coefficient for a positive review result to the oil price (in US $) =",
df.res_positive.corr(df.oil_price) )
Let's perform some manipulations on the original data in order to build a linear regression model in the cells below.
In [14]:
df2=df.copy()
#Let's make a separate column for each value of our categorical variable
df2=pd.get_dummies(df2,prefix=['month'])
#Let's code the month with numbers
d={'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
month=df.month.map(d)
#We paste the information about the date from several columns
dt = list()
for year, mon in zip(df2.year.values, month.values):
    s = str(year) + ' ' + str(mon) + ' 1'
    dt.append(s)
#convert the received data into the DateTime type and replace them with a column year
df2.rename(columns={'year': 'DateTime'}, inplace=True)
df2['DateTime']=pd.to_datetime(dt, format='%Y %m %d')
df2.head(5)
Out[14]:
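As an alternative to assembling date strings by hand, pandas can build the dates directly from year/month/day components; a sketch of an equivalent approach:
# equivalent way to build the DateTime column from its components
df2['DateTime'] = pd.to_datetime(pd.DataFrame({'year': df.year,
                                               'month': df.month.map(d),
                                               'day': 1}))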
We will build the model using most of the table columns as features, without the month data. Let's see how well this helps us predict the number of positive decisions on citizens' appeals.
In [15]:
#Prepare the data
cols_for_regression=columns_to_show+district_columns
cols_for_regression.remove('res_positive')
cols_for_regression.remove('total_appeals')
X=df2[cols_for_regression].values
y=df2['res_positive']
#Scale the data
scaler = StandardScaler()
X_scal = scaler.fit_transform(X)
# StandardScaler expects a 2D array, so reshape the 1D target before scaling
y_scal = scaler.fit_transform(y.values.reshape(-1, 1)).ravel()
We will use linear regression with Ridge regularization. The data is split in the ratio 80% / 20% (train / test), and we will also check the quality of the model with cross-validation (in this case leave-one-out: each split holds out a single sample).
In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_scal, y_scal, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])
loo = model_selection.LeaveOneOut()
#alpha coefficient is taken at a rough guess
lr = linear_model.Ridge(alpha=55.0)
# 'neg_mean_squared_error' is the scoring name in current scikit-learn versions
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))
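Instead of guessing alpha, it could be selected automatically; a minimal sketch using RidgeCV (which by default performs an efficient leave-one-out cross-validation over the given grid):
from sklearn.linear_model import RidgeCV

# hypothetical alpha grid; the chosen value may differ from the guess of 55.0
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 100))
ridge_cv.fit(X_train, y_train)
print('best alpha:', ridge_cv.alpha_)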
Let's see how the oil price affects the quality of the prediction.
In [17]:
X_oil=df2[cols_for_regression+['oil_price']].values
y_oil=df2['res_positive']
scaler = StandardScaler()
X_scal_oil = scaler.fit_transform(X_oil)
y_scal_oil = scaler.fit_transform(y_oil.values.reshape(-1, 1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(X_scal_oil, y_scal_oil, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=55.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))
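If you want the hold-out error in the same terms as the CV score, it can be computed explicitly with sklearn.metrics (a small addition to the original code):
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)
print('Test MSE:', mean_squared_error(y_test, y_pred))
print('Test R^2:', r2_score(y_test, y_pred))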
In [18]:
# plot for test data
plt.figure(figsize=(16,9))
plt.scatter(lr.predict(X_test), y_test, color='black')
plt.plot(y_test, y_test, '-', color='green',
linewidth=1)
plt.xlabel('relative number of positive results (predict)')
plt.ylabel('relative number of positive results (test)')
plt.title("Regression on test data")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('4.png')
With a perfectly accurate prediction, all 4 points would lie on the line.
So far we have taken random points as the hold-out for checking the prediction. Let's now look at the same thing in the context of a time trend and predict the number of positive decisions in the "future".
First, let's look at our previous model with oil prices.
In [19]:
l_bord = 18
r_bord = 22
X_train=X_scal_oil[0:l_bord]
X_test=X_scal_oil[l_bord:r_bord]
y_train=y_scal_oil[0:l_bord]
y_test=y_scal_oil[l_bord:r_bord]
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))
# plot for test data
plt.figure(figsize=(19,10))
#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
linewidth=3)
#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
linewidth=3)
#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]], [y_train[l_bord-1],y_test[0]] ,
color='magenta',linewidth=2, label='train to test')
plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.1.png')
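The manual l_bord/r_bord split above is one way to respect the time ordering; scikit-learn also offers TimeSeriesSplit, which could be used to cross-validate this kind of data (a sketch, reusing the scaled oil-price matrices from above):
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X_scal_oil):
    lr_ts = linear_model.Ridge(alpha=7.0)
    lr_ts.fit(X_scal_oil[train_idx], y_scal_oil[train_idx])
    print('train size:', len(train_idx),
          'test score:', lr_ts.score(X_scal_oil[test_idx], y_scal_oil[test_idx]))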
Let's remove the oil prices but add the encoded month data.
In [20]:
l_bord = 18
r_bord = 22
cols_months=['month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November',
'month_October','month_September','month_April','month_August']
X_month=df2[cols_for_regression+cols_months].values
y_month=df2['res_positive']
scaler = StandardScaler()
X_scal_month = scaler.fit_transform(X_month)
y_scal_month = scaler.fit_transform(y_month.values.reshape(-1, 1)).ravel()
X_train=X_scal_month[0:l_bord]
X_test=X_scal_month[l_bord:r_bord]
y_train=y_scal_month[0:l_bord]
y_test=y_scal_month[l_bord:r_bord]
loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=loo)
print('CV Score:', scores.mean())
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test,y_test))
# plot for test data
plt.figure(figsize=(19,10))
#trainline
plt.scatter(df2.DateTime.values[0:l_bord], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:l_bord], y_train, '--', color='green',
linewidth=3)
#test line
plt.scatter(df2.DateTime.values[l_bord:r_bord], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[l_bord:r_bord], y_test, '--', color='blue',
linewidth=3)
#connecting line
plt.plot([df2.DateTime.values[l_bord-1],df2.DateTime.values[l_bord]], [y_train[l_bord-1],y_test[0]] , color='magenta',linewidth=2, label='train to test')
plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title("Time series")
print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.2.png')
Perhaps you could apply the Statsmodels library (http://www.statsmodels.org/stable/index.html) to analyze the time trend, but it seems to me that at the moment there is not quite enough data for a good analysis.
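For reference, a minimal sketch of such a trend fit with statsmodels (assuming it is installed); with only about two years of monthly points this is illustrative rather than conclusive:
import statsmodels.api as sm

t = np.arange(len(df2))             # simple linear time index
X_trend = sm.add_constant(t)        # intercept + linear trend
trend_model = sm.OLS(df2['res_positive'].values, X_trend).fit()
print(trend_model.summary())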