Copyright 2019 The TensorFlow Authors.



In [ ]:

    
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

순환 신경망을 활용한 문자열 생성

TensorFlow.org에서 보기

깃허브(GitHub) 소스 보기

Download notebook

Note: 이 문서는 텐서플로 커뮤니티에서 번역했습니다. 커뮤니티 번역 활동의 특성상 정확한 번역과 최신 내용을 반영하기 위해 노력함에도 불구하고 공식 영문 문서의 내용과 일치하지 않을 수 있습니다. 이 번역에 개선할 부분이 있다면 tensorflow/docs-l10n 깃헙 저장소로 풀 리퀘스트를 보내주시기 바랍니다. 문서 번역이나 리뷰에 참여하려면 docs-ko@tensorflow.org로 메일을 보내주시기 바랍니다.

이 튜토리얼에서는 문자 기반 순환 신경망(RNN, Recurrent Neural Network)을 사용하여 어떻게 텍스트를 생성하는지 설명합니다. Andrej Karpathy의 순환 신경망의 뛰어난 효율에서 가져온 셰익스피어 데이터셋으로 작업할 것입니다. 이 데이터 셋에서 문자 시퀀스 ("Shakespear")가 주어지면, 시퀀스의 다음 문자("e")를 예측하는 모델을 훈련합니다. 모델을 반복하여 호출하면 더 긴 텍스트 시퀀스 생성이 가능합니다.

Note: 이 노트북을 더 빠르게 실행하기 위해 GPU 가속을 활성화합니다. 코랩(Colab)에서: Runtime > Change runtime type > Hardware acclerator > GPU* 탭을 선택합니다. 로컬에서 실행하려면 TensorFlow 버전이 1.11 이상이어야 합니다.

이 튜토리얼은 tf.keras와 즉시 실행(eager execution)을 활용하여 구현된 실행 가능한 코드가 포함되어 있습니다. 다음은 이 튜토리얼의 30번의 에포크(Epoch)로 훈련된 모델에서 "Q" 문자열로 시작될 때의 샘플 출력입니다.

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m

문장 중 일부는 문법적으로 맞지만 대부분 자연스럽지 않습니다. 이 모델은 단어의 의미를 학습하지는 않았지만, 고려해야 할 점으로:

모델은 문자 기반입니다. 훈련이 시작되었을 때, 이 모델은 영어 단어의 철자를 모르거나 심지어 텍스트의 단위가 단어라는 것도 모릅니다.
출력의 구조는 대본과 유사합니다. 즉, 텍스트 블록은 대개 화자의 이름으로 시작하고 이 이름들은 모든 데이터셋에서 대문자로 씌여 있습니다.
아래에 설명된 것처럼 이 모델은 작은 텍스트 배치(각 100자)로 훈련되었으며 논리적인 구조를 가진 더 긴 텍스트 시퀀스를 생성할 수 있습니다.

설정

텐서플로와 다른 라이브러리 임포트



In [ ]:

    
!pip install tensorflow-gpu==2.0.0
import tensorflow as tf

import numpy as np
import os
import time

셰익스피어 데이터셋 다운로드

다음 코드를 실행하여 데이터를 불러오세요.



In [ ]:

    
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

데이터 읽기

먼저, 텍스트를 살펴봅시다:



In [ ]:

    
# 읽은 다음 파이썬 2와 호환되도록 디코딩합니다.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# 텍스트의 길이는 그 안에 있는 문자의 수입니다.
print ('텍스트의 길이: {}자'.format(len(text)))



In [ ]:

    
# 텍스트의 처음 250자를 살펴봅니다
print(text[:250])



In [ ]:

    
# 파일의 고유 문자수를 출력합니다.
vocab = sorted(set(text))
print ('고유 문자수 {}개'.format(len(vocab)))

텍스트 처리

텍스트 벡터화

훈련 전, 문자들을 수치화할 필요가 있습니다. 두 개의 조회 테이블(lookup table)을 만듭니다: 하나는 문자를 숫자에 매핑하고 다른 하나는 숫자를 문자에 매핑하는 것입니다.



In [ ]:

    
# 고유 문자에서 인덱스로 매핑 생성
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

이제 각 문자에 대한 정수 표현을 만들었습니다. 문자를 0번 인덱스부터 고유 문자 길이까지 매핑한 것을 기억합시다.



In [ ]:

    
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')



In [ ]:

    
# 텍스트에서 처음 13개의 문자가 숫자로 어떻게 매핑되었는지를 보여줍니다
print ('{} ---- 문자들이 다음의 정수로 매핑되었습니다 ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

예측 과정

주어진 문자나 문자 시퀀스가 주어졌을 때, 다음 문자로 가장 가능성 있는 문자는 무엇일까요? 이는 모델을 훈련하여 수행할 작업입니다. 모델의 입력은 문자열 시퀀스가 될 것이고, 모델을 훈련시켜 출력을 예측합니다. 이 출력은 현재 타임 스텝(time step)의 다음 문자입니다.

RNN은 이전에 본 요소에 의존하는 내부 상태를 유지하고 있으므로, 이 순간까지 계산된 모든 문자를 감안할 때, 다음 문자는 무엇일까요?

훈련 샘플과 타깃 만들기

다음으로 텍스트를 샘플 시퀀스로 나눕니다. 각 입력 시퀀스에는 텍스트에서 나온 seq_length개의 문자가 포함될 것입니다.

각 입력 시퀀스에서, 해당 타깃은 한 문자를 오른쪽으로 이동한 것을 제외하고는 동일한 길이의 텍스트를 포함합니다.

따라서 텍스트를seq_length + 1개의 청크(chunk)로 나눕니다. 예를 들어, seq_length는 4이고 텍스트를 "Hello"이라고 가정해 봅시다. 입력 시퀀스는 "Hell"이고 타깃 시퀀스는 "ello"가 됩니다.

이렇게 하기 위해 먼저 tf.data.Dataset.from_tensor_slices 함수를 사용해 텍스트 벡터를 문자 인덱스의 스트림으로 변환합니다.



In [ ]:

    
# 단일 입력에 대해 원하는 문장의 최대 길이
seq_length = 100
examples_per_epoch = len(text)//seq_length

# 훈련 샘플/타깃 만들기
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

batch 메서드는 이 개별 문자들을 원하는 크기의 시퀀스로 쉽게 변환할 수 있습니다.



In [ ]:

    
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

각 시퀀스에서, map 메서드를 사용해 각 배치에 간단한 함수를 적용하고 입력 텍스트와 타깃 텍스트를 복사 및 이동합니다:



In [ ]:

    
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

첫 번째 샘플의 타깃 값을 출력합니다:



In [ ]:

    
for input_example, target_example in  dataset.take(1):
  print ('입력 데이터: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('타깃 데이터: ', repr(''.join(idx2char[target_example.numpy()])))

이 벡터의 각 인덱스는 하나의 타임 스텝(time step)으로 처리됩니다. 타임 스텝 0의 입력으로 모델은 "F"의 인덱스를 받고 다음 문자로 "i"의 인덱스를 예측합니다. 다음 타임 스텝에서도 같은 일을 하지만 RNN은 현재 입력 문자 외에 이전 타임 스텝의 컨텍스트(context)를 고려합니다.



In [ ]:

    
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("{:4d}단계".format(i))
    print("  입력: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  예상 출력: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

훈련 배치 생성

텍스트를 다루기 쉬운 시퀀스로 분리하기 위해 tf.data를 사용했습니다. 그러나 이 데이터를 모델에 넣기 전에 데이터를 섞은 후 배치를 만들어야 합니다.



In [ ]:

    
# 배치 크기
BATCH_SIZE = 64

# 데이터셋을 섞을 버퍼 크기
# (TF 데이터는 무한한 시퀀스와 함께 작동이 가능하도록 설계되었으며,
# 따라서 전체 시퀀스를 메모리에 섞지 않습니다. 대신에,
# 요소를 섞는 버퍼를 유지합니다).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

모델 설계

모델을 정의하려면 tf.keras.Sequential을 사용합니다. 이 간단한 예제에서는 3개의 층을 사용하여 모델을 정의합니다:

tf.keras.layers.Embedding : 입력층. embedding_dim 차원 벡터에 각 문자의 정수 코드를 매핑하는 훈련 가능한 검색 테이블.
tf.keras.layers.GRU : 크기가 units = rnn_units인 RNN의 유형(여기서 LSTM층을 사용할 수도 있습니다.)
tf.keras.layers.Dense : 크기가 vocab_size인 출력을 생성하는 출력층.



In [ ]:

    
# 문자로 된 어휘 사전의 크기
vocab_size = len(vocab)

# 임베딩 차원
embedding_dim = 256

# RNN 유닛(unit) 개수
rnn_units = 1024



In [ ]:

    
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model



In [ ]:

    
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

각 문자에 대해 모델은 임베딩을 검색하고, 임베딩을 입력으로 하여 GRU를 1개의 타임 스텝으로 실행하고, 완전연결층을 적용하여 다음 문자의 로그 가능도(log-likelihood)를 예측하는 로짓을 생성합니다:

모델 사용

이제 모델을 실행하여 원하는대로 동작하는지 확인합니다.

먼저 출력의 형태를 확인합니다:



In [ ]:

    
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (배치 크기, 시퀀스 길이, 어휘 사전 크기)")

위 예제에서 입력의 시퀀스 길이는 100이지만 모델은 임의 길이의 입력에서 실행될 수 있습니다.



In [ ]:

    
model.summary()

모델로부터 실제 예측을 얻으려면 출력 배열에서 샘플링하여 실제 문자 인덱스를 얻어야 합니다. 이 분포는 문자 어휘에 대한 로짓에 의해 정의됩니다.

Note: 배열에 argmax를 취하면 모델이 쉽게 루프에 걸릴 수 있으므로 배열에서 샘플링하는 것이 중요합니다.

배치의 첫 번째 샘플링을 시도해 봅시다:



In [ ]:

    
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

이렇게 하면 각 타임 스텝(time step)에서 다음 문자 인덱스에 대한 예측을 제공합니다:



In [ ]:

    
sampled_indices

훈련되지 않은 모델에 의해 예측된 텍스트를 보기 위해 복호화합니다.



In [ ]:

    
print("입력: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("예측된 다음 문자: \n", repr("".join(idx2char[sampled_indices ])))

모델 훈련

이 문제는 표준 분류 문제로 취급될 수 있습니다. 이전 RNN 상태와 이번 타임 스텝(time step)의 입력으로 다음 문자의 클래스를 예측합니다.

옵티마이저 붙이기, 그리고 손실 함수

표준 tf.keras.losses.sparse_softmax_crossentropy 손실 함수는 이전 차원의 예측과 교차 적용되기 때문에 이 문제에 적합합니다.

이 모델은 로짓을 반환하기 때문에 from_logits 플래그를 설정해야 합니다.



In [ ]:

    
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("예측 배열 크기(shape): ", example_batch_predictions.shape, " # (배치 크기, 시퀀스 길이, 어휘 사전 크기")
print("스칼라 손실:          ", example_batch_loss.numpy().mean())

tf.keras.Model.compile 메서드를 사용하여 훈련 절차를 설정합니다. 기본 매개변수의 tf.keras.optimizers.Adam과 손실 함수를 사용합니다.



In [ ]:

    
model.compile(optimizer='adam', loss=loss)

체크포인트 구성

tf.keras.callbacks.ModelCheckpoint를 사용하여 훈련 중 체크포인트(checkpoint)가 저장되도록 합니다:



In [ ]:

    
# 체크포인트가 저장될 디렉토리
checkpoint_dir = './training_checkpoints'
# 체크포인트 파일 이름
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

훈련 실행

훈련 시간이 너무 길지 않도록 모델을 훈련하는 데 10개의 에포크(Epoch)를 사용합니다. 코랩(Colab)에서는 런타임을 GPU로 설정하여 보다 빠르게 훈련할 수 있습니다.



In [ ]:

    
EPOCHS=10



In [ ]:

    
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

텍스트 생성

최근 체크포인트 복원

이 예측 단계를 간단히 유지하기 위해 배치 크기로 1을 사용합니다.

RNN 상태가 타임 스텝에서 타임 스텝으로 전달되는 방식이기 때문에 모델은 한 번 빌드된 고정 배치 크기만 허용합니다.

다른 배치 크기로 모델을 실행하려면 모델을 다시 빌드하고 체크포인트에서 가중치를 복원해야 합니다.



In [ ]:

    
tf.train.latest_checkpoint(checkpoint_dir)



In [ ]:

    
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))



In [ ]:

    
model.summary()

예측 루프

다음 코드 블록은 텍스트를 생성합니다:

시작 문자열 선택과 순환 신경망 상태를 초기화하고 생성할 문자 수를 설정하면 시작됩니다.
시작 문자열과 순환 신경망 상태를 사용하여 다음 문자의 예측 배열을 가져옵니다.
다음, 범주형 배열을 사용하여 예측된 문자의 인덱스를 계산합니다. 이 예측된 문자를 모델의 다음 입력으로 활용합니다.
모델에 의해 리턴된 RNN 상태는 모델로 피드백되어 이제는 하나의 단어가 아닌 더 많은 컨텍스트를 갖추게 됩니다. 다음 단어를 예측한 후 수정된 RNN 상태가 다시 모델로 피드백되어 이전에 예측된 단어에서 더 많은 컨텍스트를 얻으면서 학습하는 방식입니다.

생성된 텍스트를 보면 모델이 언제 대문자로 나타나고, 절을 만들고 셰익스피어와 유사한 어휘를 가져오는지 알 수 있습니다. 훈련 에포크(Epoch)가 적은 관계로 논리적인 문장을 형성하는 것은 아직 훈련되지 않았습니다.



In [ ]:

    
def generate_text(model, start_string):
  # 평가 단계 (학습된 모델을 사용하여 텍스트 생성)

  # 생성할 문자의 수
  num_generate = 1000

  # 시작 문자열을 숫자로 변환(벡터화)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # 결과를 저장할 빈 문자열
  text_generated = []

  # 온도가 낮으면 더 예측 가능한 텍스트가 됩니다.
  # 온도가 높으면 더 의외의 텍스트가 됩니다.
  # 최적의 세팅을 찾기 위한 실험
  temperature = 1.0

  # 여기에서 배치 크기 == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # 배치 차원 제거
      predictions = tf.squeeze(predictions, 0)

      # 범주형 분포를 사용하여 모델에서 리턴한 단어 예측
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # 예측된 단어를 다음 입력으로 모델에 전달
      # 이전 은닉 상태와 함께
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))



In [ ]:

    
print(generate_text(model, start_string=u"ROMEO: "))

결과를 개선하는 가장 쉬운 방법은 더 오래 훈련하는 것입니다(EPOCHS = 30을 시도해 봅시다).

또한 다른 시작 문자열을 시험해 보거나 모델의 정확도를 높이기 위해 다른 RNN 레이어를 추가하거나 온도 파라미터를 조정하여 많은 혹은 적은 임의의 예측을 생성할 수 있습니다.

고급: 맞춤식 훈련

위의 훈련 절차는 간단하지만 많은 권한을 부여하지는 않습니다.

이제 수동으로 모델을 실행하는 방법을 살펴 보았으니 이제 훈련 루프를 해제하고 직접 구현합시다. 이는 시작점을 제공해 주는데, 예를 들어 커리큘럼 학습(curriculum learning)을 구현하면 모델의 오픈 루프(open-loop) 출력을 안정적으로 하는 데 도움을 줍니다.

기울기 추적을 위해 tf.GradientTape을 사용합니다. 이 방법에 대한 자세한 내용은 즉시 실행 가이드를 참조하십시오.

절차는 다음과 같이 동작합니다:

먼저 RNN 상태를 초기화합니다. 우리는tf.keras.Model.reset_states 메서드를 호출하여 이를 수행합니다.
다음으로 데이터셋(배치별로)를 반복하고 각각에 연관된 예측을 계산합니다.
tf.GradientTape를 열고 그 컨텍스트에서의 예측과 손실을 계산합니다.
tf.GradientTape.grads 메서드를 사용하여 모델 변수에 대한 손실의 기울기를 계산합니다.
마지막으로 옵티마이저의 tf.train.Optimizer.apply_gradients 메서드를 사용하여 이전 단계로 이동합니다.



In [ ]:

    
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)



In [ ]:

    
optimizer = tf.keras.optimizers.Adam()



In [ ]:

    
@tf.function
def train_step(inp, target):
  with tf.GradientTape() as tape:
    predictions = model(inp)
    loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            target, predictions, from_logits=True))
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

  return loss



In [ ]:

    
# 훈련 횟수
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  # 모든 에포크(Epoch)의 시작에서 은닉 상태를 초기화
  # 초기 은닉 상태는 None
  hidden = model.reset_states()

  for (batch_n, (inp, target)) in enumerate(dataset):
    loss = train_step(inp, target)

    if batch_n % 100 == 0:
      template = '에포크 {} 배치 {} 손실 {}'
      print(template.format(epoch+1, batch_n, loss))

  # 모든 5 에포크(Epoch)마다(체크포인트) 모델 저장
  if (epoch + 1) % 5 == 0:
    model.save_weights(checkpoint_prefix.format(epoch=epoch))

  print ('에포크 {} 손실 {:.4f}'.format(epoch+1, loss))
  print ('1 에포크 당 {}초 소요\n'.format(time.time() - start))

model.save_weights(checkpoint_prefix.format(epoch=epoch))