Introduction to the patsy package

  • A preprocessing package for regression analysis
  • Provides encoding, transforms, and design matrix construction
  • Supports R-style formula strings

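Design matrix

  • dmatrix(formula[, data])
    • Takes an R-style formula string and builds the X matrix
    • Automatically adds an intercept (bias) column
    • Looks up variables in the caller's local namespace
    • If a pandas DataFrame is passed as the data parameter, variables are looked up among its column labels first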

In [11]:
import numpy as np
import pandas as pd
from patsy import dmatrix, dmatrices

In [12]:
np.random.seed(0)
x1 = np.random.rand(5) + 10
x2 = np.random.rand(5) * 10
x1, x2


Out[12]:
(array([ 10.5488135 ,  10.71518937,  10.60276338,  10.54488318,  10.4236548 ]),
 array([ 6.45894113,  4.37587211,  8.91773001,  9.63662761,  3.83441519]))

In [13]:
dmatrix("x1")


Out[13]:
DesignMatrix with shape (5, 2)
  Intercept        x1
          1  10.54881
          1  10.71519
          1  10.60276
          1  10.54488
          1  10.42365
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)

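R-style formula

Symbol  Description
+       add an explanatory variable
-       remove an explanatory variable
1, 0    intercept ("- 1" or "+ 0" removes it)
:       interaction (product)
*       a*b = a + b + a:b
/       a/b = a + a:b
~       separates the dependent variable (left) from the independent variables (right)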
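
The cells in this notebook exercise only +, -, 1 and 0, so the remaining operators from the table (:, *, /, ~) are sketched below. This is a minimal illustration, not part of the original notebook: the DataFrame demo and the response column y are hypothetical names introduced only for this example, and only the resulting column names are inspected.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix, dmatrices

# Same random draws as the x1, x2 arrays in this notebook.
np.random.seed(0)
demo = pd.DataFrame({
    "x1": np.random.rand(5) + 10,
    "x2": np.random.rand(5) * 10,
})
# Hypothetical response variable, made up only for this illustration.
demo["y"] = 1.0 + 2.0 * demo["x1"] - 0.5 * demo["x2"]

# ":" adds only the interaction (element-wise product) column.
inter_cols = dmatrix("x1:x2", data=demo).design_info.column_names

# "*" expands to both main effects plus their interaction.
star_cols = dmatrix("x1 * x2", data=demo).design_info.column_names

# "/" expands to the left-hand term plus the interaction.
slash_cols = dmatrix("x1 / x2", data=demo).design_info.column_names

# "~" separates the response from the predictors;
# dmatrices returns the y matrix and the X matrix together.
y_mat, X_mat = dmatrices("y ~ x1 + x2", data=demo)

print(inter_cols)
print(star_cols)
print(slash_cols)
print(y_mat.design_info.column_names, X_mat.design_info.column_names)
```

dmatrices is the two-sided counterpart of dmatrix: it is the function to reach for once the formula contains ~.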

In [14]:
dmatrix("x1 - 1")


Out[14]:
DesignMatrix with shape (5, 1)
        x1
  10.54881
  10.71519
  10.60276
  10.54488
  10.42365
  Terms:
    'x1' (column 0)

In [15]:
dmatrix("x1 + 0")


Out[15]:
DesignMatrix with shape (5, 1)
        x1
  10.54881
  10.71519
  10.60276
  10.54488
  10.42365
  Terms:
    'x1' (column 0)

In [16]:
dmatrix("x1 + x2")


Out[16]:
DesignMatrix with shape (5, 3)
  Intercept        x1       x2
          1  10.54881  6.45894
          1  10.71519  4.37587
          1  10.60276  8.91773
          1  10.54488  9.63663
          1  10.42365  3.83442
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'x2' (column 2)

In [17]:
dmatrix("x1 + x2 - 1")


Out[17]:
DesignMatrix with shape (5, 2)
        x1       x2
  10.54881  6.45894
  10.71519  4.37587
  10.60276  8.91773
  10.54488  9.63663
  10.42365  3.83442
  Terms:
    'x1' (column 0)
    'x2' (column 1)

In [18]:
df = pd.DataFrame(np.array([x1, x2]).T, columns=["x1", "x2"])
df


Out[18]:
          x1        x2
0  10.548814  6.458941
1  10.715189  4.375872
2  10.602763  8.917730
3  10.544883  9.636628
4  10.423655  3.834415

In [19]:
dmatrix("x1 + x2 - 1", data=df)


Out[19]:
DesignMatrix with shape (5, 2)
        x1       x2
  10.54881  6.45894
  10.71519  4.37587
  10.60276  8.91773
  10.54488  9.63663
  10.42365  3.83442
  Terms:
    'x1' (column 0)
    'x2' (column 1)

Transforms

  • NumPy function names can be used
  • User-defined functions can be used
  • patsy's own transform functions can be used
    • center(x)
    • standardize(x)
    • scale(x) (an alias of standardize)

In [20]:
dmatrix("x1 + np.log(np.abs(x2))", data=df)


Out[20]:
DesignMatrix with shape (5, 3)
  Intercept        x1  np.log(np.abs(x2))
          1  10.54881             1.86547
          1  10.71519             1.47611
          1  10.60276             2.18804
          1  10.54488             2.26557
          1  10.42365             1.34402
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'np.log(np.abs(x2))' (column 2)

In [21]:
def doubleit(x):
    return 2 * x

dmatrix("doubleit(x1)", data=df)


Out[21]:
DesignMatrix with shape (5, 2)
  Intercept  doubleit(x1)
          1      21.09763
          1      21.43038
          1      21.20553
          1      21.08977
          1      20.84731
  Terms:
    'Intercept' (column 0)
    'doubleit(x1)' (column 1)

In [22]:
dmatrix("center(x1) + standardize(x2)", data=df)


Out[22]:
DesignMatrix with shape (5, 3)
  Intercept  center(x1)  standardize(x2)
          1    -0.01825         -0.07965
          1     0.14813         -0.97279
          1     0.03570          0.97458
          1    -0.02218          1.28282
          1    -0.14341         -1.20495
  Terms:
    'Intercept' (column 0)
    'center(x1)' (column 1)
    'standardize(x2)' (column 2)
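
To make explicit what the two patsy transforms in Out[22] compute, the sketch below reproduces them with plain NumPy. It assumes patsy's default settings, under which standardize divides by the population standard deviation (ddof=0); the names centered and standardized are illustrative only.

```python
import numpy as np

# Same random draws as the x1, x2 arrays in this notebook.
np.random.seed(0)
x1 = np.random.rand(5) + 10
x2 = np.random.rand(5) * 10

# center(x): subtract the mean.
centered = x1 - x1.mean()

# standardize(x): subtract the mean, then divide by the standard deviation.
# ddof=0 is an assumption here; it matches the values shown in Out[22].
standardized = (x2 - x2.mean()) / x2.std(ddof=0)

print(np.round(centered, 5))
print(np.round(standardized, 5))
```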

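Protecting expressions with I()

  • Wrapping an expression in I() protects it from being parsed as formula operators, so + inside I() means arithmetic addition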

In [23]:
dmatrix("x1 + x2", data=df)


Out[23]:
DesignMatrix with shape (5, 3)
  Intercept        x1       x2
          1  10.54881  6.45894
          1  10.71519  4.37587
          1  10.60276  8.91773
          1  10.54488  9.63663
          1  10.42365  3.83442
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'x2' (column 2)

In [24]:
dmatrix("I(x1 + x2)", data=df)


Out[24]:
DesignMatrix with shape (5, 2)
  Intercept  I(x1 + x2)
          1    17.00775
          1    15.09106
          1    19.52049
          1    20.18151
          1    14.25807
  Terms:
    'Intercept' (column 0)
    'I(x1 + x2)' (column 1)

Polynomial linear regression


In [25]:
dmatrix("x1 + I(x1**2) + I(x1**3) + I(x1**4)", data=df)


Out[25]:
DesignMatrix with shape (5, 5)
  Intercept        x1  I(x1 ** 2)  I(x1 ** 3)   I(x1 ** 4)
          1  10.54881   111.27747  1173.84524  12382.67452
          1  10.71519   114.81528  1230.26750  13182.54925
          1  10.60276   112.41859  1191.94772  12637.93965
          1  10.54488   111.19456  1172.53366  12364.23047
          1  10.42365   108.65258  1132.55698  11805.38301
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'I(x1 ** 2)' (column 2)
    'I(x1 ** 3)' (column 3)
    'I(x1 ** 4)' (column 4)

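Categorical variables

  • String-valued columns are treated as categorical automatically and are dummy-coded
  • Numeric columns are treated as continuous unless wrapped in C()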


In [26]:
# df has 5 rows, so each new column gets exactly 5 values
df["a1"] = pd.Series(["a1", "a1", "a2", "a2", "a3"])
df["a2"] = pd.Series([1, 4, 5, 6, 8])
df


Out[26]:
          x1        x2  a1  a2
0  10.548814  6.458941  a1   1
1  10.715189  4.375872  a1   4
2  10.602763  8.917730  a2   5
3  10.544883  9.636628  a2   6
4  10.423655  3.834415  a3   8

In [27]:
dmatrix("a1", data=df)


Out[27]:
DesignMatrix with shape (5, 3)
  Intercept  a1[T.a2]  a1[T.a3]
          1         0         0
          1         0         0
          1         1         0
          1         1         0
          1         0         1
  Terms:
    'Intercept' (column 0)
    'a1' (columns 1:3)

In [28]:
dmatrix("a2", data=df)


Out[28]:
DesignMatrix with shape (5, 2)
  Intercept  a2
          1   1
          1   4
          1   5
          1   6
          1   8
  Terms:
    'Intercept' (column 0)
    'a2' (column 1)

In [29]:
dmatrix("C(a2)", data=df)


Out[29]:
DesignMatrix with shape (5, 5)
  Intercept  C(a2)[T.4]  C(a2)[T.5]  C(a2)[T.6]  C(a2)[T.8]
          1           0           0           0           0
          1           1           0           0           0
          1           0           1           0           0
          1           0           0           1           0
          1           0           0           0           1
  Terms:
    'Intercept' (column 0)
    'C(a2)' (columns 1:5)
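
The outputs above omit a column for the first level of each categorical variable, because the intercept column already accounts for it. As a hedged sketch of the alternative, removing the intercept with "+ 0" makes patsy emit one indicator column per level. The small DataFrame cat_df is a stand-in built only for this example.

```python
import pandas as pd
from patsy import dmatrix

# Stand-in data with the same a1 labels as the DataFrame above.
cat_df = pd.DataFrame({"a1": ["a1", "a1", "a2", "a2", "a3"]})

# With the intercept: the first level "a1" is the reference and gets no column.
with_intercept = dmatrix("a1", data=cat_df)

# Without the intercept: every level gets its own 0/1 indicator column.
full_dummies = dmatrix("a1 + 0", data=cat_df)

print(with_intercept.design_info.column_names)
print(full_dummies.design_info.column_names)
```

Keeping the intercept and dropping one level avoids perfectly collinear columns, which is why it is the default coding.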