This notebook is supplementary material for an article published on Habrahabr that demonstrates examples of linear regression -
Because the data processing may contain errors caused by technical and human factors, this dataset and notebook are recommended for demonstration purposes only, not for serious research.
P.S. English text from google translate :)
Data on the registration of civil status acts in Moscow from 2010 to the present, broken down by month: for example, registrations of marriages, births, deaths, paternity establishments, name changes, and so on.
Detailed description of the data: https://data.mos.ru/opendata/7704111479-dinamika-registratsii-aktov-grajdanskogo-sostoyaniya/description?versionNumber=2&releaseNumber=33
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
#download
df = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')
#look at the data
df.head(12)
Encode the months as numbers and drop the columns that are not needed for the analysis.
In [3]:
#code months
d={'январь':1, 'февраль':2, 'март':3, 'апрель':4, 'май':5, 'июнь':6, 'июль':7,
'август':8, 'сентябрь':9, 'октябрь':10, 'ноябрь':11, 'декабрь':12}
df.Month=df.Month.map(d)
#delete some unuseful columns
df.drop(['ID','global_id','Unnamed: 12'],axis=1,inplace=True)
#look at the data
df.head(12)
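A dictionary lookup like the one above only works for names that match the keys exactly; with `Series.map`, any other spelling becomes NaN without an error. The following plain-Python sketch shows one way to make the encoding stricter. The helper `encode_months` and the sample list are hypothetical and not part of the original notebook; whether the real Month column contains differently cased names is not known.

```python
# Mapping from Russian month names to month numbers (same keys as the notebook's dictionary).
MONTH_TO_NUMBER = {
    'январь': 1, 'февраль': 2, 'март': 3, 'апрель': 4,
    'май': 5, 'июнь': 6, 'июль': 7, 'август': 8,
    'сентябрь': 9, 'октябрь': 10, 'ноябрь': 11, 'декабрь': 12,
}

def encode_months(names):
    """Return month numbers, normalising case and surrounding whitespace.

    Hypothetical helper: raises ValueError instead of silently producing
    a missing value when a name is not in the dictionary.
    """
    encoded = []
    for name in names:
        key = name.strip().lower()
        if key not in MONTH_TO_NUMBER:
            raise ValueError(f"unknown month name: {name!r}")
        encoded.append(MONTH_TO_NUMBER[key])
    return encoded

# Made-up sample values, including a capitalised name with a trailing space.
sample = ['январь', 'Февраль ', 'декабрь']
codes = encode_months(sample)
print(codes)
```

In the notebook itself, a comparable check would be to look for missing values in `df.Month` right after the `map` call.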
Build pairwise scatter plots of the features; for readability, only a subset of the features is used.
In [4]:
columns_to_show = ['StateRegistrationOfBirth', 'StateRegistrationOfMarriage',
'StateRegistrationOfPaternityExamination', 'StateRegistrationOfDivorce','StateRegistrationOfDeath']
data=df[columns_to_show]
In [5]:
grid = sns.pairplot(data)
Let's see whether scaling changes anything.
In [6]:
# change scale of features
scaler = MinMaxScaler()
df2=pd.DataFrame(scaler.fit_transform(df))
df2.columns=df.columns
data2=df2[columns_to_show]
In [7]:
grid2 = sns.pairplot(data2)
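Almost no difference: min-max scaling is a linear rescaling of each column, so the shapes of the pairwise plots stay the same and only the axis ranges change.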
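One way to check why the plots look the same is to compare the correlation between two columns before and after min-max scaling. The sketch below uses synthetic numbers, not the Moscow data, and a hand-written `min_max` helper in place of `MinMaxScaler`; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for two columns; the values are invented.
births = rng.uniform(8000, 12000, size=50)
paternity = 0.15 * births + rng.normal(0, 50, size=50)

def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    return (values - values.min()) / (values.max() - values.min())

# Pearson correlation is unchanged by a positive linear rescaling of either variable.
corr_raw = np.corrcoef(births, paternity)[0, 1]
corr_scaled = np.corrcoef(min_max(births), min_max(paternity))[0, 1]
print(corr_raw, corr_scaled)
```

The two printed values agree up to floating-point error, which is consistent with the scaled pair plots having the same shape as the unscaled ones.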
Consider the two features with the most pronounced linear dependence: StateRegistrationOfBirth and StateRegistrationOfPaternityExamination.
In [8]:
#get data for model
X = data2['StateRegistrationOfBirth'].values
y = data2['StateRegistrationOfPaternityExamination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train=np.reshape(X_train,[X_train.shape[0],1])
y_train=np.reshape(y_train,[y_train.shape[0],1])
X_test=np.reshape(X_test,[X_test.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
#teach model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
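For a single feature, the fit and the score printed above can be reproduced by hand: the slope and intercept come from the ordinary least squares formulas, and `score` for a regressor is the coefficient of determination R². The sketch below works on synthetic data with made-up coefficients, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    return slope, y_mean - slope * x_mean

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
# Synthetic scaled data: true slope 0.8, true intercept 0.1, small noise.
x = rng.uniform(0, 1, 80)
y = 0.8 * x + 0.1 + rng.normal(0, 0.05, 80)

# Simple hold-out split: first 60 points for training, last 20 for testing.
x_train, x_test = x[:60], x[60:]
y_train, y_test = y[:60], y[60:]

slope, intercept = fit_line(x_train, y_train)
score = r_squared(y_test, slope * x_test + intercept)
print('slope:', slope, 'intercept:', intercept, 'R^2 on test:', score)
```

A score close to 1 means the line explains most of the variance in the test targets; a score near 0 means it does no better than predicting the mean.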
Plot of the fitted line against the training data
In [9]:
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue',
linewidth=3)
plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('StateRegistrationOfPaternityExamination')
# set the title on the current axes (assigning to plt.title would overwrite the function)
plt.gca().set_title("Regression on train data")
Plot of the fitted line against the test data
In [10]:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), color='green',
linewidth=3)
plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('StateRegistrationOfPaternityExamination')
# set the title on the current axes (assigning to plt.title would overwrite the function)
plt.gca().set_title("Regression on test data")
Let's try to predict a different quantity, the number of registered marriages (StateRegistrationOfMarriage), from the remaining features shown in the pair plots above ('StateRegistrationOfBirth', 'StateRegistrationOfPaternityExamination', 'StateRegistrationOfDivorce', 'StateRegistrationOfDeath').
In [11]:
#get main data
columns_to_show2=columns_to_show.copy()
columns_to_show2.remove("StateRegistrationOfMarriage")
#get data for a model
X = data2[columns_to_show2].values
y = data2['StateRegistrationOfMarriage'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
Train a plain linear regression on the 4-dimensional feature vector.
In [12]:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Now consider linear regression with Lasso (L1) regularization.
In [13]:
# compare several values of the alpha parameter
# large alpha
lasso=linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print(' Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
# small alpha
lasso=linear_model.Lasso(alpha=0.000000001)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
# best of the values tried (judged on this test set)
lasso=linear_model.Lasso(alpha=0.00025)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
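To see why a larger alpha pushes coefficients to exactly zero, it can help to look at a bare-bones Lasso solver. The sketch below is a simplified coordinate descent for the objective (1/(2n))·||y − Xw||² + alpha·||w||₁, written for illustration only: it has no intercept and no convergence check, it is not scikit-learn's implementation, and the data and the names `soft_threshold` and `lasso_coordinate_descent` are made up.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; give 0 if |value| <= threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*||w||_1 (no intercept, fixed iteration count)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of feature j removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            # Exact one-dimensional minimiser for coordinate j.
            w[j] = soft_threshold(rho, alpha) / z
    return w

rng = np.random.default_rng(0)
# Synthetic data: one strong feature, one weak feature, one pure-noise feature.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, 200)

w_large = lasso_coordinate_descent(X, y, alpha=0.5)
w_small = lasso_coordinate_descent(X, y, alpha=1e-9)
print('alpha=0.5 :', w_large)
print('alpha=1e-9:', w_small)
```

With the large alpha the weak and noise coefficients are set exactly to zero and the strong one is shrunk; with the tiny alpha the result is close to ordinary least squares.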
Add an obviously useless feature.
In [14]:
columns_to_show3=columns_to_show2.copy()
columns_to_show3.append("TotalNumber")
columns_to_show3
X = df2[columns_to_show3].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
First, look at the results without regularization.
In [15]:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Now with regularization (Lasso). For small values of the regularization coefficient we get a slight improvement.
In [16]:
# best of the values tried (judged on this test set)
lasso=linear_model.Lasso(alpha=0.00015)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
With large alpha values we can watch feature selection in action.
In [17]:
# large alpha
lasso=linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
The sharp increase in prediction quality can be explained by the fact that marriage registrations are one of the components of the total count (TotalNumber). Let's see how much of the marriage registrations can be predicted from the total number of registrations alone.
In [18]:
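# column 4 of X is TotalNumber (the last entry of columns_to_show3)
X_train=np.reshape(X_train[:,4],[X_train.shape[0],1])
X_test=np.reshape(X_test[:,4],[X_test.shape[0],1])
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
# note: unlike the cells above, this score is computed on the training data
print('Score:', lr.score(X_train,y_train))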
And look at the plots.
In [19]:
# plot for train data
plt.figure(figsize=(8,10))
plt.subplot(211)
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue',
linewidth=3)
plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.gca().set_title("Regression on train data")
# plot for test data
plt.subplot(212)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), '--', color='green',
linewidth=3)
plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.gca().set_title("Regression on test data")
Add another feature of little use: StateRegistrationOfNameChange.
In [20]:
columns_to_show4=columns_to_show2.copy()
columns_to_show4.append("StateRegistrationOfNameChange")
X = df2[columns_to_show4].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
As we can see, it only gets in the way.
Add a useful feature: the encoded month in which the registrations were recorded.
In [21]:
#get data
columns_to_show5=columns_to_show2.copy()
columns_to_show5.append("Month")
#get data for model
X = df2[columns_to_show5].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
#teach model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Let's return to the original data, but now take the change over time into account. First, replace the Year column with the total number of months elapsed since the start date. This time we will not scale the data, since it would bring little benefit.
In [22]:
#get data
df3=df.copy()
#get new column
df3.Year=df.Year.map(lambda x: (x-2010)*12)+df.Month
df3.rename(columns={'Year': 'Months'}, inplace=True)
#get data for model
X=df3[columns_to_show5].values
y=df3['StateRegistrationOfMarriage'].values
# boolean masks: months 1..72 (2010-2015) for training, later months for testing
train=(df3.Months<=72).values
test=(df3.Months>72).values
X_train=X[train]
y_train=y[train]
X_test=X[test]
y_test=y[test]
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
#teach model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
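The time-based split above relies on a running month index and on boolean masks. The sketch below reproduces that logic on a synthetic calendar of eight years; the calendar and the stand-in feature matrix are assumptions, not the real table. It also shows that a plain boolean array (rather than a list wrapping one) is what NumPy expects for row selection.

```python
import numpy as np

# Synthetic calendar: 8 years x 12 months, in chronological order.
years = np.repeat(np.arange(2010, 2018), 12)
months = np.tile(np.arange(1, 13), 8)

# Running month index: 1 for January 2010, 72 for December 2015.
month_index = (years - 2010) * 12 + months

# Boolean masks with one entry per row.
train = month_index <= 72
test = ~train

# Stand-in feature matrix with a single column.
X = month_index.reshape(-1, 1).astype(float)
X_train, X_test = X[train], X[test]
print(X_train.shape, X_test.shape)
```

Splitting by time instead of at random keeps all test rows strictly after the training rows, which matches the forecasting setting of this section.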
The prediction is not very good, but arguably better than guessing at random. Let's look at the data graphically, first separately and then together. The model is not very accurate, but it captures the main features of the trend, which makes forecasting possible.
In [23]:
plt.figure(figsize=(9,23))
# plot for train data
plt.subplot(311)
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
plt.gca().set_title("Regression on train data")
# plot for test data
plt.subplot(312)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
plt.gca().set_title("Regression (prediction) on test data")
# plot for all data
plt.subplot(313)
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.gca().set_title("Regression (prediction) on all data")
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot line for link train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]) , color='magenta',linewidth=2, label='train to test')
plt.legend()
To begin, reload the original table.
In [24]:
df_base = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')
Let's try applying one-hot encoding to the Month column.
In [25]:
#get data for model
df4=df_base.copy()
df4.drop(['Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber','StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)
df4=pd.get_dummies(df4,prefix=['Month'])
X=df4.values
X_train=X[train]
X_test=X[test]
#teach model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.gca().set_title("Regression (prediction) on all data")
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot line for link train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]) , color='magenta',linewidth=2, label='train to test')
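A numeric month code lets a linear model learn only one slope for the whole year, whereas one-hot encoding gives every month its own coefficient, which suits seasonal data. The sketch below builds the same kind of indicator columns with NumPy instead of `pd.get_dummies`; the month values are invented for illustration.

```python
import numpy as np

# Hypothetical month numbers (1..12) for five rows.
month_numbers = np.array([1, 2, 12, 1, 7])

# Row i of the result has a single 1 in column (month - 1) and 0 elsewhere.
one_hot = np.eye(12, dtype=float)[month_numbers - 1]
print(one_hot.shape)
print(one_hot[2])
```

Each row sums to 1, so the twelve indicator columns together carry the same information as the original month column.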
The prediction quality has improved sharply.
Now, instead of the month itself, let's encode the mean number of marriage registrations for that month, computed from the training data.
In [26]:
#get data for pandas data frame
df5=df_base.copy()
d=dict()
# mean number of marriage registrations per month, computed on the training rows only
marriage_train=df5.StateRegistrationOfMarriage.values[train]
month_train=df5.Month.values[train]
for mon in df5.Month.unique():
    d[mon]=marriage_train[month_train==mon].mean()
df5['MeanMarriagePerMonth']=df5.Month.map(d)
df5.drop(['Month','Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber',
'StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)
#get data for model
X=df5.values
X_train=X[train]
X_test=X[test]
#teach model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.gca().set_title("Regression (prediction) on all data")
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot line for link train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]) , color='magenta',linewidth=2, label='train to test')
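The encoding used in this cell is a form of mean (target) encoding: each month is replaced by the average target value observed for that month in the training rows. The sketch below shows the idea on a tiny invented dataset; the arrays and the name `month_mean` are hypothetical. The means are computed from training rows only, so the test targets do not leak into the feature.

```python
import numpy as np

# Invented data: two months repeated over three periods; the last period is the test set.
months = np.array([1, 2, 1, 2, 1, 2])
marriages = np.array([100.0, 300.0, 120.0, 320.0, 500.0, 900.0])
train = np.array([True, True, True, True, False, False])

# Mean target per month, using training rows only.
month_mean = {
    int(m): marriages[train & (months == m)].mean()
    for m in np.unique(months)
}

# Replace every month (train and test rows alike) by its training mean.
encoded = np.array([month_mean[int(m)] for m in months])
print(month_mean)
print(encoded)
```

A month that never occurs in the training rows would have no mean to map to, so a real pipeline would need a fallback value for that case.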
The prediction quality has improved a little more.