Introduction

This notebook is supplementary material to the article demonstrating examples of linear regression, published on the Habrahabr portal –

Given the possible errors introduced by technical and human factors during data processing, this dataset should be used for demonstration purposes only; it is not recommended for serious research.

Data description

Monthly records of civil status registrations in Moscow from 2010 to the present: registrations of marriages, births, deaths, paternity establishments, name changes, and so on.
A detailed description of the data is available at: https://data.mos.ru/opendata/7704111479-dinamika-registratsii-aktov-grajdanskogo-sostoyaniya/description?versionNumber=2&releaseNumber=33


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Loading and preprocessing


In [2]:
#download
df = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')
#look at the data
df.head(12)


Out[2]:
ID global_id Year Month StateRegistrationOfBirth StateRegistrationOfDeath StateRegistrationOfMarriage StateRegistrationOfDivorce StateRegistrationOfPaternityExamination StateRegistrationOfAdoption StateRegistrationOfNameChange TotalNumber Unnamed: 12
0 1 37591658 2010 январь 9206 10430 4997 3302 1241 95 491 29762 NaN
1 2 37591659 2010 февраль 9060 9573 4873 2937 1326 97 639 28505 NaN
2 3 37591660 2010 март 10934 10528 3642 4361 1644 147 717 31973 NaN
3 4 37591661 2010 апрель 10140 9501 9698 3943 1530 128 642 35572 NaN
4 5 37591662 2010 май 9457 9482 3726 3554 1397 96 492 28204 NaN
5 6 62353812 2010 июнь 11253 9529 9148 3666 1570 130 556 35852 NaN
6 7 62353813 2010 июль 11477 14340 12473 3675 1568 123 564 44220 NaN
7 8 62353814 2010 август 10302 15016 10882 3496 1512 134 578 41920 NaN
8 9 62353816 2010 сентябрь 10140 9573 10736 3738 1480 101 686 36454 NaN
9 10 62353817 2010 октябрь 10776 9350 8862 3899 1504 89 687 35167 NaN
10 11 62353818 2010 ноябрь 10293 9091 6080 3923 1355 97 568 31407 NaN
11 12 62353819 2010 декабрь 10600 9664 6023 4145 1556 124 681 32793 NaN
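
It can also be useful to check the table's dimensions and column types (a small optional cell, using the df loaded above):

In [ ]:
# dimensions and dtypes of the loaded table
print(df.shape)
df.dtypes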

Let's encode the months as numeric values and drop the columns we do not need for the analysis.


In [3]:
#code months
d={'январь':1, 'февраль':2, 'март':3, 'апрель':4, 'май':5, 'июнь':6, 'июль':7,
       'август':8, 'сентябрь':9, 'октябрь':10, 'ноябрь':11, 'декабрь':12}
df.Month=df.Month.map(d)

#delete some unuseful columns
df.drop(['ID','global_id','Unnamed: 12'],axis=1,inplace=True)

#look at the data
df.head(12)


Out[3]:
Year Month StateRegistrationOfBirth StateRegistrationOfDeath StateRegistrationOfMarriage StateRegistrationOfDivorce StateRegistrationOfPaternityExamination StateRegistrationOfAdoption StateRegistrationOfNameChange TotalNumber
0 2010 1 9206 10430 4997 3302 1241 95 491 29762
1 2010 2 9060 9573 4873 2937 1326 97 639 28505
2 2010 3 10934 10528 3642 4361 1644 147 717 31973
3 2010 4 10140 9501 9698 3943 1530 128 642 35572
4 2010 5 9457 9482 3726 3554 1397 96 492 28204
5 2010 6 11253 9529 9148 3666 1570 130 556 35852
6 2010 7 11477 14340 12473 3675 1568 123 564 44220
7 2010 8 10302 15016 10882 3496 1512 134 578 41920
8 2010 9 10140 9573 10736 3738 1480 101 686 36454
9 2010 10 10776 9350 8862 3899 1504 89 687 35167
10 2010 11 10293 9091 6080 3923 1355 97 568 31407
11 2010 12 10600 9664 6023 4145 1556 124 681 32793

Let's build pairwise dependency plots; for clarity, we take only a subset of the features.


In [4]:
columns_to_show = ['StateRegistrationOfBirth', 'StateRegistrationOfMarriage', 
                   'StateRegistrationOfPaternityExamination', 'StateRegistrationOfDivorce','StateRegistrationOfDeath']
data=df[columns_to_show]

In [5]:
grid = sns.pairplot(data)


Let's see whether scaling changes anything.


In [6]:
# change scale of features
scaler = MinMaxScaler()
df2=pd.DataFrame(scaler.fit_transform(df))
df2.columns=df.columns
data2=df2[columns_to_show]

In [7]:
grid2 = sns.pairplot(data2)


Almost no difference. This is expected: MinMaxScaler rescales each feature independently with an affine map, which does not change the shape of the pairwise scatter plots.
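
As a quick sanity check (a minimal sketch, assuming df and df2 from the cells above are still in scope), the scaled column can be reproduced by hand as x' = (x - min) / (max - min):

In [ ]:
# MinMaxScaler applies exactly this per-feature affine map
col = df['StateRegistrationOfBirth'].values.astype(float)
manual = (col - col.min()) / (col.max() - col.min())
print(np.allclose(manual, df2['StateRegistrationOfBirth'].values))  # expect True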

Simple regression on one feature

Consider the two parameters with the most pronounced linear dependence: StateRegistrationOfBirth and StateRegistrationOfPaternityExamination.


In [8]:
# get data for the model
X = data2['StateRegistrationOfBirth'].values
y = data2['StateRegistrationOfPaternityExamination'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# sklearn expects 2D feature arrays, so reshape the 1D columns
X_train = X_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# fit the model and evaluate it on the test data
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))


Coefficients: [[ 0.78600258]]
Score: 0.611493944197
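
Here Score is the coefficient of determination R² on the test data. As a cross-check (a minimal sketch, reusing lr, X_test and y_test from the cell above), the same number can be reproduced with sklearn.metrics.r2_score:

In [ ]:
from sklearn.metrics import r2_score

# R² computed from explicit predictions should match lr.score(X_test, y_test)
print(r2_score(y_test, lr.predict(X_test)))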

The fitted line plotted over the training data


In [9]:
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue',
         linewidth=3)

plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('State Registration Of Paternity Examination')
plt.title('Regression on train data')


The fitted line plotted over the test data


In [10]:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), color='green',
         linewidth=3)

plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('State Registration Of Paternity Examination')
plt.title('Regression on test data')


Regression on several features and Lasso regularization

Let's try to predict another parameter, the number of registered marriages, based on the features we plotted earlier ('StateRegistrationOfBirth', 'StateRegistrationOfPaternityExamination', 'StateRegistrationOfDivorce', 'StateRegistrationOfDeath'), with StateRegistrationOfMarriage itself excluded since it is now the target.


In [11]:
# the feature list: same as before, minus the target itself
columns_to_show2 = columns_to_show.copy()
columns_to_show2.remove("StateRegistrationOfMarriage")

# get data for the model
X = data2[columns_to_show2].values
y = data2['StateRegistrationOfMarriage'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

We fit a plain linear regression on the 4-dimensional feature vector.


In [12]:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))


Coefficients: [[-0.03475475  0.97143632 -0.44298685 -0.18245718]]
Score: 0.38137432065

Now consider linear regression with regularization: Lasso.


In [13]:
# let's look at different values of the alpha parameter:

# large
lasso = linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print(' Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))

# small
lasso = linear_model.Lasso(alpha=0.000000001)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))

# optimal (for these test data)
lasso = linear_model.Lasso(alpha=0.00025)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))


 Alpha: 0.01
 Coefficients: [ 0.          0.46642996 -0.         -0.        ]
 Score: 0.222071102783

 Alpha: 1e-09
 Coefficients: [-0.03475462  0.97143616 -0.44298679 -0.18245715]
 Score: 0.38137433837

 Alpha: 0.00025
 Coefficients: [-0.00387233  0.92989507 -0.42590052 -0.17411615]
 Score: 0.385551648602
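
The "optimal" alpha above was picked by hand for these test data. A more systematic route is a cross-validated grid search; here is a minimal sketch with sklearn's LassoCV (an addition to the original notebook, reusing X_train, y_train, X_test and y_test from above):

In [ ]:
from sklearn.linear_model import LassoCV

# search a logarithmic grid of alphas with 5-fold cross-validation on the training data
lcv = LassoCV(alphas=np.logspace(-6, -1, 50), cv=5)
lcv.fit(X_train, y_train.ravel())
print('Best alpha:', lcv.alpha_)
print('Score:', lcv.score(X_test, y_test.ravel()))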

Let's add a frankly useless feature.


In [14]:
columns_to_show3 = columns_to_show2.copy()
columns_to_show3.append("TotalNumber")

X = df2[columns_to_show3].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

First, let's look at the results without regularization.


In [15]:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))


Coefficients: [[-0.45286477 -0.08625204 -0.19375198 -0.63079401  1.57467774]]
Score: 0.999173764473

And now with regularization (Lasso). For small values of the regularization coefficient we get a slight improvement.


In [16]:
# optimal (for these test data)
lasso = linear_model.Lasso(alpha=0.00015)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))


 Alpha: 0.00015
 Coefficients: [-0.44718703 -0.07491507 -0.1944702  -0.62034146  1.55890505]
 Score: 0.999266251287

With large alpha values we can watch feature selection in action.


In [17]:
# large
lasso = linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))


 Alpha: 0.01
 Coefficients: [-0.         -0.         -0.         -0.05177979  0.87991931]
 Score: 0.802210158982

The sharp jump in prediction quality is explained by the fact that marriage registrations are one component of the total number, so the TotalNumber feature leaks the target. Let's see how much of the marriage registrations can be predicted from the total number of registrations alone.
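
A quick way to confirm the leak (a minimal sketch; it assumes TotalNumber is the row-wise sum of the individual categories, which the data description suggests):

In [ ]:
# check whether TotalNumber equals the sum of the individual registration columns
components = ['StateRegistrationOfBirth', 'StateRegistrationOfDeath',
              'StateRegistrationOfMarriage', 'StateRegistrationOfDivorce',
              'StateRegistrationOfPaternityExamination', 'StateRegistrationOfAdoption',
              'StateRegistrationOfNameChange']
print((df[components].sum(axis=1) == df['TotalNumber']).all())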


In [18]:
# keep only the TotalNumber column (index 4 in columns_to_show3)
X_train = X_train[:, 4].reshape(-1, 1)
X_test = X_test[:, 4].reshape(-1, 1)

lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
# note: this score is computed on the training data
print('Score:', lr.score(X_train,y_train))


Coefficients: [[ 1.0571131]]
Score: 0.788270672692

And let's look at the plots.


In [19]:
# plot for train data
plt.figure(figsize=(8,10))
plt.subplot(211)

plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue',
         linewidth=3)

plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression on train data')

# plot for test data
plt.subplot(212)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), '--', color='green',
         linewidth=3)

plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression on test data')


Let's add another feature of little use: StateRegistrationOfNameChange.


In [20]:
columns_to_show4 = columns_to_show2.copy()
columns_to_show4.append("StateRegistrationOfNameChange")

X = df2[columns_to_show4].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))


Coefficients: [[ 0.06583714  1.1080889  -0.35025999 -0.24473705 -0.4513887 ]]
Score: 0.285094398157

As you can see, it only gets in the way.

Now let's add a useful feature: the encoded value of the month in which the registrations were recorded.


In [21]:
# get data
columns_to_show5 = columns_to_show2.copy()
columns_to_show5.append("Month")

# get data for the model
X = df2[columns_to_show5].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# fit the model and evaluate on the test data
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))


Coefficients: [[-0.10613428  0.91315175 -0.55413198 -0.13253367  0.28536285]]
Score: 0.472057997208

Linear regression for predicting a trend

Let's return to the original data, but now take time into account. First, replace the Year column with the total number of months elapsed since the start date. This time we will not scale the data, as it would not bring much benefit.


In [22]:
# get data
df3 = df.copy()

# replace the year with the number of months elapsed since 01.2010
df3.Year = (df.Year - 2010) * 12 + df.Month
df3.rename(columns={'Year': 'Months'}, inplace=True)

# get data for the model: the first 72 months for training, the rest for testing
X = df3[columns_to_show5].values
y = df3['StateRegistrationOfMarriage'].values
train = (df3.Months <= 72).values
test = (df3.Months > 72).values
X_train = X[train]
y_train = y[train]
X_test = X[test]
y_test = y[test]
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# fit the model and evaluate on the test data
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))


Coefficients: [  2.60708376e-01   1.30751121e+01  -3.31447168e+00  -2.34368684e-01
   2.88096512e+02]
Score: 0.383195050367

The prediction quality is "not great", but better than guessing at random.

Let's look at the data graphically, first separately and then together. Our model, while not very good, captures the main features of the trend, which lets us forecast the data.


In [23]:
plt.figure(figsize=(9,23))

# plot for train data
plt.subplot(311)

plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression on train data')

# plot for test data
plt.subplot(312)

plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression (prediction) on test data')

# plot for all data
plt.subplot(313)

plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)

plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)

plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')

# line linking the last train point to the first test point
plt.plot([72,73], lr.predict(np.vstack([X_train[-1], X_test[0]])), color='magenta', linewidth=2, label='train to test')

plt.legend()


Out[23]:
<matplotlib.legend.Legend at 0x1b41a26e048>

Bonus

Improving the accuracy with a different treatment of the months

First, let's reload the original table.


In [24]:
df_base = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')

Let's try applying one-hot encoding to the Month column.
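
For reference (a small optional cell, assuming df_base from the cell above), pd.get_dummies turns the single Month column into twelve 0/1 indicator columns:

In [ ]:
# preview of the one-hot encoding of the Month column
pd.get_dummies(df_base['Month'], prefix='Month').head(3)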


In [25]:
# get data for the model
df4 = df_base.copy()
df4.drop(['Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber',
          'StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)
# one-hot encode the Month column
df4 = pd.get_dummies(df4,prefix=['Month'])

X = df4.values
X_train = X[train]
X_test = X[test]

# fit the model and evaluate on the test data
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))

# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)

plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)

plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')

# line linking the last train point to the first test point
plt.plot([72,73], lr.predict(np.vstack([X_train[-1], X_test[0]])), color='magenta', linewidth=2, label='train to test')


Coefficients: [  2.18633008e-01  -1.41397731e-01   4.56991414e-02  -5.17558633e-01
   4.48131002e+03  -2.94754108e+02  -1.14429758e+03   3.61201946e+03
   2.41208054e+03  -3.23415050e+03  -2.73587261e+03  -1.31020899e+03
   4.84757208e+02   3.37280689e+03  -2.40539320e+03  -3.23829714e+03]
Score: 0.869208071831
Out[25]:
[<matplotlib.lines.Line2D at 0x1b41a674f60>]

The quality of the prediction has improved dramatically.

Now, instead of the month number, let's encode each month with the mean number of marriage registrations in that month, computed on the training data only (so the test data does not leak into the feature).


In [26]:
# get data for a pandas DataFrame
df5 = df_base.copy()

# mean number of marriage registrations per month, computed on the training data only
d = dict()
for mon in df5.Month.unique():
    d[mon] = df5.StateRegistrationOfMarriage[train][df5.Month[train] == mon].mean()

df5['MeanMarriagePerMonth'] = df5.Month.map(d)
df5.drop(['Month','Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber',
          'StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)

# get data for the model
X = df5.values
X_train = X[train]
X_test = X[test]

# fit the model and evaluate on the test data
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))

# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)

plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)

plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')

# line linking the last train point to the first test point
plt.plot([72,73], lr.predict(np.vstack([X_train[-1], X_test[0]])), color='magenta', linewidth=2, label='train to test')


Coefficients: [ 0.16556761 -0.12746446 -0.03652408 -0.21649349  0.96971467]
Score: 0.875882918435
Out[26]:
[<matplotlib.lines.Line2D at 0x1b41ba55080>]

The quality of the prediction improved a little further.