17. decision trees

다양한 의사결정 경로(decision path)와 결과(outcome)를 나타내는 데 나무 구조를 사용

이해하고 해석하기 쉽고, 예측할 때 사용하는 프로세스가 명확
숫자형 데이터, 범주형 데이터를 동시에 다룰수 있고, 특정 변수 값이 누락되어도 사용 가능
학습 데이터에 대해 '최적의' 의사결정나무를 찾는 것은 계산적으로 무척 어려운 문제, 새로운 데이터에 대한 일반화 성능이 좋지 않게 오버피팅되기 쉬움

Classification tree

Regression tree

엔트로피(entropy) - '얼마만큼의 정보를 담고 있는가'

데이터셋 S

데이터 포인트 $c_1,....,c_n$

모든 데이터 포인트가 단 하나의 클래스에 속한다면 엔트로피는 낮고,

모든 데이터 포인트가 모든 클래스에 고르게 분포되어 있다면 엔트로피는 높다

클래스에 속할 확률을 $p_i$라고 하면

$H(S) = -p_1log_2p_1-...-p_nlog_2p_n$



In [2]:

    
from collections import Counter, defaultdict
from functools import partial
import math, random



In [3]:

    
def entropy(class_probabilities):
    """클래스에 속할 확률을 입력하면 엔트로피를 계산하라"""
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)



In [4]:

    
#각 클래스 레이블의 확률은 별도로 계산
#엔트로피를 구할 때는 어떤 레이블에 어떤 확률값이 주어졌는지까지는 알 필요가 없고, 
#레이블과 무관하게 확률 값들만 알면됨
def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

파티션의 엔트로피

데이터 $S$를 $q_1,...,q_m$의 비율을 가지는 파티션 $S_1,...,S_m$로 나눈 경우, 다음과 같은 가중합으로 구한다

$H = q_1H(S_1) + ... + q_mH(S_m)$

의사결정나무를 사용할 때는 다양한 값을 가질 수 있는 변수들의 경우를 최대한 피하거나, 변수에 속한 값을 적은 수의 버킷으로 나워서 선택하능한 값의 종류를 줄이는 것이 좋다



In [5]:

    
def partition_entropy(subsets):
    """subsets는 레이블이 있는 데이터의 list의 list이다.
    그에 대한 파티션 엔트로피를 구하라"""
    total_count = sum(len(subset) for subset in subsets)

    return sum( data_entropy(subset) * len(subset) / total_count
                for subset in subsets )

의사결정나무 만들기

(input, label) 쌍으로 구성된 인터뷰 후보자 데이터 - input:후보자 정보, label:인터뷰 결과
결정 노드(decision node) : 질문을 주고 답변에 따라 다른 경로로 안내
잎 노드(leaf node) : 예측값이 무엇인지 알려줌
ID3 알고리즘에 기반해서 구축



In [6]:

    
inputs = [
        ({'level':'Senior','lang':'Java','tweets':'no','phd':'no'},   False),
        ({'level':'Senior','lang':'Java','tweets':'no','phd':'yes'},  False),
        ({'level':'Mid','lang':'Python','tweets':'no','phd':'no'},     True),
        ({'level':'Junior','lang':'Python','tweets':'no','phd':'no'},  True),
        ({'level':'Junior','lang':'R','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'R','tweets':'yes','phd':'yes'},    False),
        ({'level':'Mid','lang':'R','tweets':'yes','phd':'yes'},        True),
        ({'level':'Senior','lang':'Python','tweets':'no','phd':'no'}, False),
        ({'level':'Senior','lang':'R','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'Python','tweets':'yes','phd':'no'}, True),
        ({'level':'Senior','lang':'Python','tweets':'yes','phd':'yes'},True),
        ({'level':'Mid','lang':'Python','tweets':'no','phd':'yes'},    True),
        ({'level':'Mid','lang':'Java','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'Python','tweets':'no','phd':'yes'},False)
    ]



In [7]:

    
# 1. 가장 낮은 엔트로피를 반환하는 파티션을 찾는다.
def partition_by(inputs, attribute):
    """attribute에 따라 inputs의 파티션을 나누자"""
    groups = defaultdict(list)
    for input in inputs:
        key = input[0][attribute]
        groups[key].append(input)
    return groups



In [8]:

    
# 2. 엔트로피를 계산
def partition_entropy_by(inputs,attribute):
    """주어진 파티션에 대응되는 엔트로피를 계산"""
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())



In [9]:

    
# 3. 전체 데이터셋에 대해 엔트로피를 최소화하는 파티션을 찾는다.
for key in ['level','lang','tweets','phd']:
    print(key, partition_entropy_by(inputs, key))
print()









    



level 0.6935361388961919
lang 0.8601317128547441
tweets 0.7884504573082896
phd 0.8921589282623617



In [10]:

    
#직급의 가능한 각 값에 대해 가지를 나눠 서브트리를 만들자.
#직급이 Mid인 경우는 예측값이 True
#Senior인 경우 True or False
senior_inputs = [(input, label)
    for input, label in inputs if input["level"] == "Senior"]

for key in ['lang', 'tweets', 'phd']:
    print(key, partition_entropy_by(senior_inputs, key))
print()









    



lang 0.4
tweets 0.0
phd 0.9509775004326938



In [11]:

    
junior_inputs = [(input, label)
    for input, label in inputs if input["level"] == "Junior"]

for key in ['lang', 'tweets', 'phd']:
    print(key, partition_entropy_by(junior_inputs, key))
print()









    



lang 0.9509775004326938
tweets 0.9509775004326938
phd 0.0

                  직급(level)?
        ↙(senior)     ↓(mid)     ↘(junior)
    <tweets?>       [합격!]      <phd?>
    ↙yes  ↘no                     ↙yes  ↘no
 [합격!]  [불합격!]              [불합격!] [합격!]

종합 - 일반화

True
False
(변수, 서브트리)로 구성된 tuple

('level',
    {'Junior': ('phd', {'no' : True, 'yes': False}),
     'Mid': True,
     'Senior': ('tweets', {'no': False, 'yes': True})})

만약 새로 입력받은 후보자의 level이 Intern인 경우?

None이라는 key값을 하나 추가하고 가장 빈도가 높은 클래스 레이블을 할당. (실제 데이터에 None값으로 존재한다면 그리 좋은 접근법은 아닐듯...)



In [20]:

    
def classify(tree, input):
    """의사결정나무 tree로 주어진 입력값 input을 분류"""

    # 잎 노드이면 값을 반환
    if tree in [True, False]:
        return tree

    # 그게 아니라면 데이터의 변수로 파티션을 나눔
    # key로 변수 값, 값으로 서브트리를 나타내는 dict를 사용하면됨
    attribute, subtree_dict = tree

    subtree_key = input.get(attribute)  # None if input is missing attribute

    if subtree_key not in subtree_dict: # 키에 해당하는 서브트리가 존재하지 않을때
        subtree_key = None              # None 서브트리를 사용

    subtree = subtree_dict[subtree_key] # 적절한 서브트리를 선택
    return classify(subtree, input)     # 입력된 데이터를 분류

이제 학습용 데이터로부터 실제 나무를 구축!!!



In [21]:

    
def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass,
    # all keys of the first input are split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues

    if num_trues == 0:                  # if only Falses are left
        return False                    # return a "False" leaf

    if num_falses == 0:                 # if only Trues are left
        return True                     # return a "True" leaf

    if not split_candidates:            # if no split candidates left
        return num_trues >= num_falses  # return the majority leaf

    # otherwise, split on the best attribute
    best_attribute = min(split_candidates,
        key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates
                      if a != best_attribute]

    # recursively build the subtrees
    subtrees = { attribute : build_tree_id3(subset, new_candidates)
                 for attribute, subset in partitions.items() }

    subtrees[None] = num_trues > num_falses # default case

    return (best_attribute, subtrees)



In [22]:

    
print("building the tree")
tree = build_tree_id3(inputs)
print(tree)









    



building the tree
('level', {'Mid': True, 'Senior': ('tweets', {None: False, 'no': False, 'yes': True}), None: True, 'Junior': ('phd', {None: True, 'no': True, 'yes': False})})



In [23]:

    
print("Junior / Java / tweets / no phd", classify(tree,
    { "level" : "Junior",
      "lang" : "Java",
      "tweets" : "yes",
      "phd" : "no"} ))

print("Junior / Java / tweets / phd", classify(tree,
    { "level" : "Junior",
             "lang" : "Java",
             "tweets" : "yes",
             "phd" : "yes"} ))









    



Junior / Java / tweets / no phd True
Junior / Java / tweets / phd False



In [26]:

    
#심지어 관찰된 적 없는 값이 변수에 등장하거나 변수 값 자체가 누락되더라도 분류가 가능
print("Intern", classify(tree, { "level" : "Intern" } ))
print("Senior", classify(tree, { "level" : "Senior" } ))









    



Intern True
Senior False



In [25]:

    
#심지어 관찰된 적 없는 값이 변수에 등장하거나 변수 값 자체가 누락되더라도 분류가 가능

랜덤포레스트

오버피팅을 방지할 수 있는 대표적인 방법 중 하나
여러 개의 의사결정나무를 만들고, 그들의 다수결로 결과를 결정하는 방법



In [27]:

    
def forest_classify(trees, input):
    votes = [classify(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]

랜덤하게 나무를 얻을수 있는 방법(앙상블 학습)

bootstrap aggregating, bagging(배깅) : bootstrap_sample(inputs)의 결과물을 각 나무의 입력값으로 넣어 학습.
파티션을 나누는 변수에 랜덤성을 부여 : 남아있는 모든 변수중 일부만 선택하고 그중 최적의 변수를 선택.



In [ ]: