Use Tensorflow to Generate Text for Modu

Define some helper functions


In [1]:
import pickle
import os
import re

def load_text(path):
    input_file = os.path.join(path)
    with open(input_file, 'r') as f:
        text_data = f.read()
    return text_data

def preprocess_and_save_data(text, token_lookup, create_lookup_tables):
    token_dict = token_lookup()
    for key, token in token_dict.items():
        text = text.replace(key, '{}'.format(token))
        
    text = list(text)
    vocab_to_in, int_to_vocab = create_lookup_tables(text)
    int_text = [vocab_to_in[word] for word in text]
    pickle.dump((int_text, vocab_to_in, int_to_vocab, token_dict), open('preprocess.p', 'wb'))
    
def load_preprocess():
    return pickle.load(open('preprocess.p', mode='rb'))

def save_params(params):
    pickle.dump(params, open('params.p', 'wb'))
    
def load_params():
    return pickle.load(open('params.p', mode='rb'))

Read in data & Do some cleaning


In [2]:
data_path = './text/modu.txt'
text = load_text(data_path)
text = text.replace(' ', '')

# reduce 2 or more empty line to 1
text = re.sub(r'\n{2,}', '\n', text)
# replace ...... with 。
text = re.sub(r'\.+', '。', text)
# remove the content in 《》
text = re.sub(r'《.*》', '', text)
# remove ――
text = re.sub(r'――', '', text)
text = re.sub(r'\u3000', '', text)

num_words_for_training = 100000

text = text[:num_words_for_training]
lines_of_text = text.split('\n')
print(len(lines_of_text))
print(lines_of_text[:20])


1898
['第1章序章', '真实,这残酷的真实。', '燕城花市区南平大道北一带,就像个画了半面妆的妖怪。', '宽阔笔直的双向车道把整个花市区一分为二,东区是本市最繁华的核心商圈之一,西区则是被遗忘的旧城区,城市贫民的聚集地。', '随着东区这几年接连拍出天价“地王”,亟待改造的老城区也跟着沾了光,拆迁成本水涨船高,活生生地吓跑了一帮开发商,在逼仄贫困的窄巷中生生铸起了一道资本的藩篱。', '危房里的街坊们整天幻想着能傍着这十几平方的小破房一夜暴富,精神上已经率先享受起了“我家房子拆了就是几百万”的优越感。', '当然,这些贫民窟里的百万富翁们还是要每天圾着拖鞋排队倒尿盆。', '初夏的夜里尚有凉意,白天积攒的那一点暑气很快溃不成军,西区非法占道的小烧烤摊陆续偃旗息鼓,纳凉的居民们也都早早回了家,偶尔有个旧路灯电压不稳地乱闪,多半是附近群租房的从上面私接电线的缘故。', '而一街之隔的繁华区,夜生活才刚刚开始', '傍晚时分,东区商圈临街的一家咖啡店里,刚打发完一大批客人的店员终于逮着机会出了口长气,可还不等她把笑僵的五官手动归位,玻璃门上挂的小铃铛又响了。', '店员只好重新端出八颗牙的标准微笑:“欢迎光临。”', '“一杯低因的香草拿铁,谢谢。”', '客人是个身材修长的青年男子,留着几乎及肩的长发,穿一身熨帖又严肃的正装,戴着金属框的眼镜,细细的镜框压在他高挺的鼻梁上,他低头摸钱夹,勾在下巴上的长发挡住了小半张脸,鼻梁和嘴唇在灯光下好像刷了一层苍白的釉,看起来有种格外禁欲的冷淡气质。', '爱美之心人皆有之,店员不由多看了他几眼,揣度着客人的喜好搭话:“您需要换成无糖香草吗?”', '“不,糖浆多一点。”客人递过零钱,一抬头,店员的目光正好和他撞在一起。', '客人大约是出于礼貌,冲店员笑了一下,藏在镜片后面的眼角微妙地一弯,温柔又有些暧昧的笑意顷刻就穿透了他方才严肃的假正经。', '店员这才发现,这位客人的模样虽然很好,却不是周正端庄的好,有点眼带桃花的意思,她的脸莫名有点发烫,连忙避开客人的视线,低头下单。', '幸好这时给店里补货的来了,店员赶紧给自己找了点事干,大声招呼送货的到后面核对货单。', '送货的是个年轻小伙,二十岁上下,整个人好似一团洋溢的青春,就着余晖弹进了店里,他皮肤黝黑,一笑一口小白牙,活力十足地跟店员打招呼:“美女好,美女今天气色不错,生意很好吧?”', '店员按月拿死工资,并不盼着店里生意好,听了这通拍歪的马屁,她哭笑不得地一摆手:“还行吧,你快去干活,出来我给你倒杯冰水喝。”']

In [3]:
print(lines_of_text[-10:])


['骆闻舟此时正在横穿中央广场,左耳的耳机里听着各小组的进度汇报,右耳留心着周围环境,一心二用地吩咐说:“中央广场找几个人维护一下现场秩序,人手不够让保安兄弟们帮个忙,不要让围观的人乱说话干扰她的情绪”', '这时,大屏幕上的费渡开了口:“阿姨,我自己的妈妈如果还活着,应该是跟您差不多的年纪。”', '骆闻舟听了这一句,下意识地抬头看了他一眼,但是看归看,他脚步不停,飞快地穿过广场空地,赶往下一座建筑物:“3组,临街的那几个大高楼顶楼有监控,可以直接调,不要浪费时间。陶然你那边注意疏散通道,4组跟我去东区的双子大楼,有几个楼层正在施工,重点排查。”', '费渡略微有些低沉的声音如影随形地追着他匆忙的脚步:“……我比忠义回家回得勤一些,毕竟他得辛苦攒钱给您治病,我当时只是个无所事事的学生,每周末,她都会提前在花瓶里换好鲜花,强打精神准备好我喜欢吃的东西,打扫我的房间,把我的被子拿出去晒。她不喜欢和保姆住,所以这些事都必须独自完成您也会给忠义晒被子吗?”', '王秀娟难以忍受地发出一声长长的抽泣,旋即被卷入了风中。', '而抽泣的风从高楼楼顶盘旋而下,刮过骆闻舟见汗的鬓角,像一声掠过的叹息。', '“可是有一天,我满怀期待地回到家,推开门,却发现门口的花瓶里只有一堆枯枝败叶,所有的窗帘都拉着,屋里透着一股死气沉沉的味道,等我战战兢兢地来到她房间里,发现等着我的不是晒好的被子,而是她的尸体。”费渡说到这里,略微停顿了一下,“您不久前才和我说过,‘我妈肯定每天盼着我回家’,可是当时办案的民警告诉我,她是在我回来的前一晚死于自杀我每周都是固定的时间回家,她一直都知道。”', '“妈,我一直很想问您一个问题,什么样的妈妈会掐着时间,特意把自己的尸体留给她的孩子呢?我每天都在想怎么样讨你喜欢,怎么样能让你高兴一点怎么样攒够给你治病的钱,还清当年人家借给我的手术费……钱还没有还清,我现在一个人在冰库里回不了家,你就打算把我扔在那不管了吗?你们如果都这么狠心,为什么以前还要表现出好像很在乎我们的样子?”', '王秀娟缓缓地就着跨在防护栏上的动作蹲了下来。', '']

Preprocess the text data to suit the model


In [4]:
def create_lookup_tables(input_data):
    vocab = set(input_data)
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab

def token_lookup():
    """ Lookup tables for Chinese punctuations
    """
    symbols = set(['。', ',', '“', "”", ';', '!', '?', '(', ')', '\n'])
    tokens = ["P", "C", "Q", "T", "S", "E", "M", "I", "O", "R"]
    return dict(zip(symbols, tokens))

In [5]:
# process and save the processed data
preprocess_and_save_data(''.join(lines_of_text), token_lookup, create_lookup_tables)

In [6]:
int_text, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

Check Tensorflow environment


In [7]:
import warnings
import tensorflow as tf
import numpy as np

# Check TensorFlow Version, need > 1
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))


TensorFlow Version: 1.2.1
/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/ipykernel/__main__.py:10: UserWarning: No GPU found. Please use a GPU to train your neural network.

Build Our RNN!

Set super params


In [8]:
# 训练循环次数
num_epochs = 200

# batch大小
batch_size = 256

# lstm层中包含的unit个数
rnn_size = 512

# embedding layer的大小
embed_dim = 1000

# 训练步长
seq_length = 60

# 学习率
learning_rate = 0.002

# 每多少步打印一次训练信息
show_every_n_batches = 60

# 保存session状态的位置
save_dir = './save'

Create the placeholders for input, targets and learning_Rate


In [9]:
def get_inputs():

    # inputs和targets的类型都是整数的
    inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')

    return inputs, targets, learning_rate

Create rnn cell,use lstm cell,and create lstm layers. Use dropout,and initialize lstm layer


In [10]:
def get_init_cell(batch_size, rnn_size):
    # lstm层数
    num_layers = 2

    # dropout时的保留概率
    keep_prob = 0.8

    # 创建包含rnn_size个神经元的lstm cell
    cell = tf.contrib.rnn.BasicLSTMCell(rnn_size)

    # 使用dropout机制防止overfitting等
    drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

    # 创建2层lstm层
    cell = tf.contrib.rnn.MultiRNNCell([drop for _ in range(num_layers)])

    # 初始化状态为0.0
    init_state = cell.zero_state(batch_size, tf.float32)

    # 使用tf.identify给init_state取个名字,后面生成文字的时候,要使用这个名字来找到缓存的state
    init_state = tf.identity(init_state, name='init_state')

    return cell, init_state

Create embedding layers


In [11]:
def get_embed(input_data, vocab_size, embed_dim):
    # create tf variable based on embedding layer size and vocab size
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim)), dtype=tf.float32)
    return tf.nn.embedding_lookup(embedding, input_data)

创建rnn节点,使用dynamic_rnn方法计算出output和final_state


In [12]:
def build_rnn(cell, inputs):
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(final_state, name="final_state")
    return outputs, final_state

In [13]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embed = get_embed(input_data, vocab_size, rnn_size)
    outputs, final_state = build_rnn(cell, embed)
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn=None,
                                               weights_initializer = tf.truncated_normal_initializer(stddev=0.1),
                                               biases_initializer=tf.zeros_initializer())
    return logits, final_state

Define get_batches to train batch by batch


In [14]:
def get_batches(int_text, batch_size, seq_length):
    # 计算有多少个batch可以创建
    n_batches = (len(int_text) // (batch_size * seq_length))

    # 计算每一步的原始数据,和位移一位之后的数据
    batch_origin = np.array(int_text[: n_batches * batch_size * seq_length])
    batch_shifted = np.array(int_text[1: n_batches * batch_size * seq_length + 1])

    # 将位移之后的数据的最后一位,设置成原始数据的第一位,相当于在做循环
    batch_shifted[-1] = batch_origin[0]

    batch_origin_reshape = np.split(batch_origin.reshape(batch_size, -1), n_batches, 1)
    batch_shifted_reshape = np.split(batch_shifted.reshape(batch_size, -1), n_batches, 1)

    batches = np.array(list(zip(batch_origin_reshape, batch_shifted_reshape)))

    return batches

Now, let's build the whole RNN model!


In [15]:
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    
    # 创建rnn的cell和初始状态节点,rnn的cell已经包含了lstm,dropout
    # 这里的rnn_size表示每个lstm cell中包含了多少的神经元
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    
    # 创建计算loss和finalstate的节点
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # 使用softmax计算最后的预测概率
    probs = tf.nn.softmax(logits, name='probs')

    # 计算loss
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # 使用Adam提督下降
    optimizer = tf.train.AdamOptimizer(lr)

    # 裁剪一下Gradient输出,最后的gradient都在[-1, 1]的范围内
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

Now, let's train the model


In [16]:
# 获得训练用的所有batch
batches = get_batches(int_text, batch_size, seq_length)

# 打开session开始训练,将上面创建的graph对象传递给session
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # 打印训练信息
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # 保存模型
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

save_params((seq_length, save_dir))


Epoch   0 Batch    0/6   train_loss = 7.944
Epoch  10 Batch    0/6   train_loss = 6.087
Epoch  20 Batch    0/6   train_loss = 5.427
Epoch  30 Batch    0/6   train_loss = 4.825
Epoch  40 Batch    0/6   train_loss = 4.402
Epoch  50 Batch    0/6   train_loss = 4.108
Epoch  60 Batch    0/6   train_loss = 3.798
Epoch  70 Batch    0/6   train_loss = 3.492
Epoch  80 Batch    0/6   train_loss = 3.251
Epoch  90 Batch    0/6   train_loss = 2.999
Epoch 100 Batch    0/6   train_loss = 2.834
Epoch 110 Batch    0/6   train_loss = 2.614
Epoch 120 Batch    0/6   train_loss = 2.429
Epoch 130 Batch    0/6   train_loss = 2.274
Epoch 140 Batch    0/6   train_loss = 2.097
Epoch 150 Batch    0/6   train_loss = 2.021
Epoch 160 Batch    0/6   train_loss = 1.783
Epoch 170 Batch    0/6   train_loss = 1.626
Epoch 180 Batch    0/6   train_loss = 1.512
Epoch 190 Batch    0/6   train_loss = 1.360
Model Trained and Saved

In [27]:
# Save the trained model
save_params((seq_length, save_dir))

Let's Generate Some Text!


In [2]:
import tensorflow as tf
import numpy as np

_, vocab_to_int, int_to_vocab, token_dict = load_preprocess()
seq_length, load_dir = load_params()

要使用保存的模型,我们要将保存下来的变量(tensor)通过指定的name获取到


In [3]:
def get_tensors(loaded_graph):
    inputs = loaded_graph.get_tensor_by_name("inputs:0")

    initial_state = loaded_graph.get_tensor_by_name("init_state:0")

    final_state = loaded_graph.get_tensor_by_name("final_state:0")

    probs = loaded_graph.get_tensor_by_name("probs:0")

    return inputs, initial_state, final_state, probs

def pick_word(probabilities, int_to_vocab):
    chances = []
    for idx, prob in enumerate(probabilities):    
        if prob >= 0.05:
            chances.append(int_to_vocab[idx])
    if len(chances) < 1:
        return str(int_to_vocab[np.argmax(probabilities)])
    else:
        rand = np.random.randint(0, len(chances))
        return str(chances[rand])

In [4]:
# 生成文本的长度
gen_length = 500

# 文章开头的字,指定一个即可,这个字必须是在训练词汇列表中的
prime_word = '来'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # 加载保存过的session
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)

    # 通过名称获取缓存的tensor
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # 准备开始生成文本
    gen_sentences = [prime_word]
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # 开始生成文本
    for n in range(gen_length):
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        # print(probabilities)
        # print(len(probabilities))
        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)

        gen_sentences.append(pred_word)

    # 将标点符号还原
    novel = ''.join(gen_sentences)
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '“'] else ''
        novel = novel.replace(token.upper(), key)
    novel = novel.replace('\n ', '\n')
    novel = novel.replace('( ', '(')

    print(novel)


INFO:tensorflow:Restoring parameters from ./save
来自然一脸:“我去问我怎么办的时候走你,我看你办事!”只有那是在非年排查,合适的时间,她抽了手机。那烟头倏地抬起头,正好和费渡已经跟分局那边绕着原本,是和他没人把一股轻声的不在气质,连上小时候大心点大声。“市局,”黄队牙不着凝气,然后应该客气了,骆闻舟把车停在陈振一看,听起那张外地说:“前天晚上被周关地’。”郎乔飞快地看了眼,眼皮笑不同不和:“不是你问什么,我有一些暗藏,下班不要找人,”费渡一点头:“一起……”骆闻舟一个微思走向费渡:“我不让你们就听见你,我觉得不是听谁就是他!”骆闻舟一偏眉,一脸住花白地方也不抬,“那我是一遮了。”陶然三言两语把费渡对视了一眼。“不是你打电话跟值班。”骆闻舟:“陶然?”送在这里头也就一直有明天,皱下吹,下头递,他一仰眉,一手虚地飞开了,公交车站了墙,他已经察过了车座机旁边。骆闻舟对着拉链,他的长说得好似等费渡说话,“她是不是还有什么好好的告诉他。”“我可以去干什么?”陶然一把头,伸手把骆闻舟在陶然走过去,冲他们就见这一瞬间,骆闻舟想开了口,提用居民已经成了曲。老家里面某种小心行,方才能再次感觉“一语无声……”骆队回在医年,有血腥气熏力游番,连忙避避话那么

In [37]:
vocab_size


Out[37]:
2734

In [47]:
len(probabilities)


Out[47]:
1

In [49]:
max(probabilities[0][0])


Out[49]:
0.17373672

In [ ]: