Problem 3

Complete the Scikit-Learn code that uses the StandardScaler class to transform the data in the two-dimensional array variable X so that it has mean 0 and variance 1.


In [8]:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[3, 6], [6, 7]])
# create the scaler object
scaler = StandardScaler()
# estimate the distribution (per-column mean and variance)
scaler.fit(X)
# apply the scaling
X2 = scaler.transform(X)


C:\Anaconda3\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Anaconda3\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
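
The warning is harmless: StandardScaler silently casts the integer input to float64 before scaling. Declaring the array as float up front would avoid it, e.g.:

X = np.array([[3, 6], [6, 7]], dtype=float)   # float input, nothing for StandardScaler to convert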

In [9]:
X


Out[9]:
array([[3, 6],
       [6, 7]])

In [10]:
X2


Out[10]:
array([[-1., -1.],
       [ 1.,  1.]])
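
As a sanity check, each column of X2 now has mean 0 and standard deviation 1; with only two samples per column, every value sits exactly one standard deviation from the mean, which is why Out[10] contains only ±1. A minimal sketch of the check, using fit_transform as the usual one-step shortcut for fit followed by transform:

X2 = StandardScaler().fit_transform(X)   # fit and transform in a single call
print(X2.mean(axis=0))                   # [ 0.  0.]
print(X2.std(axis=0))                    # [ 1.  1.]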

Problem 4

Write down the result of encoding the following data with Scikit-Learn's default OneHotEncoder.


In [11]:
X = np.array([[0, 2], [1, 1]])
X


Out[11]:
array([[0, 2],
       [1, 1]])

In [12]:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(X).toarray()


Out[12]:
array([[ 1.,  0.,  0.,  1.],
       [ 0.,  1.,  1.,  0.]])
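
To see why the output has four columns: the first input column takes the values {0, 1} and the second takes {1, 2}, so each expands into two indicator columns, laid out as [col0==0, col0==1, col1==1, col1==2]. The row [0, 2] therefore encodes to [1, 0, 0, 1] and [1, 1] to [0, 1, 1, 0]. In recent scikit-learn versions (0.20 and later; this transcript predates them) the learned categories can be inspected directly, as a quick sketch:

enc = OneHotEncoder()
enc.fit(X)
print(enc.categories_)   # [array([0, 1]), array([1, 2])] -> 2 + 2 = 4 output columns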

Problem 6

Given that the Boston Housing Price sample data has been loaded as follows, complete the code that performs a regression analysis with the OLS command from the statsmodels package and prints the results report.


In [13]:
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
dfX0 = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])

In [14]:
dfX0.tail()   # DataFrame of independent variables


Out[14]:
        CRIM   ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
501  0.06263  0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67
502  0.04527  0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08
503  0.06076  0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64
504  0.10959  0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48
505  0.04741  0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88

In [15]:
dfy.tail()   # DataFrame of the dependent variable


Out[15]:
     MEDV
501  22.4
502  20.6
503  23.9
504  22.0
505  11.9

In [16]:
import statsmodels.api as sm

# augment the design matrix with a constant (intercept) column
dfX = sm.add_constant(dfX0)
# create the model object
m = sm.OLS(dfy, dfX)
# fit the model and obtain the results object
r = m.fit()
# print the report from the results object
print(r.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 09 Sep 2016   Prob (F-statistic):          6.95e-135
Time:                        09:42:37   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.4911      5.104      7.149      0.000        26.462    46.520
CRIM          -0.1072      0.033     -3.276      0.001        -0.171    -0.043
ZN             0.0464      0.014      3.380      0.001         0.019     0.073
INDUS          0.0209      0.061      0.339      0.735        -0.100     0.142
CHAS           2.6886      0.862      3.120      0.002         0.996     4.381
NOX          -17.7958      3.821     -4.658      0.000       -25.302   -10.289
RM             3.8048      0.418      9.102      0.000         2.983     4.626
AGE            0.0008      0.013      0.057      0.955        -0.025     0.027
DIS           -1.4758      0.199     -7.398      0.000        -1.868    -1.084
RAD            0.3057      0.066      4.608      0.000         0.175     0.436
TAX           -0.0123      0.004     -3.278      0.001        -0.020    -0.005
PTRATIO       -0.9535      0.131     -7.287      0.000        -1.211    -0.696
B              0.0094      0.003      3.500      0.001         0.004     0.015
LSTAT         -0.5255      0.051    -10.366      0.000        -0.625    -0.426
==============================================================================
Omnibus:                      178.029   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              782.015
Skew:                           1.521   Prob(JB):                    1.54e-170
Kurtosis:                       8.276   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
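
Beyond the printed report, the fitted results object exposes the individual statistics programmatically, e.g. via the standard statsmodels results attributes:

print(r.params["RM"])      # estimated coefficient for RM (about 3.80)
print(r.rsquared)          # coefficient of determination (about 0.741)
print(r.pvalues["INDUS"])  # p-value for INDUS (about 0.735, i.e. not significant)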

Problem 9

What is the name of the method that Scikit-Learn's preprocessing and model classes use when computing an estimated model from sample data (training)? The fit method.
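
The same convention runs through the whole library: every estimator learns from data with fit, transformers then apply transform (as in Problems 3 and 4), and supervised models apply predict. A minimal sketch on the Boston data loaded above (LinearRegression is just an illustrative choice of model):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(boston.data, boston.target)   # training: estimate coefficients from the samples
pred = model.predict(boston.data)       # the fitted model is then used for prediction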