CPT Poem Generator

How to use CPT

  • In your terminal, type git clone https://github.com/NeerajSarwan/CPT.git to download CPT
    • If you don't have git, download and install it.
  • After downloading CPT, type cd CPT to enter the folder
  • Your code should be created under this folder, so that you can run the code below (see the quick check after this list).
  • Python CPT open source: https://github.com/NeerajSarwan/CPT/blob/master/CPT.py
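Assuming you run your notebook from inside the cloned folder, here is a quick import check (my own snippet, not part of CPT itself):

import os
assert os.path.exists("CPT.py"), "run this from inside the cloned CPT folder"
from CPT import CPT  # the CPT class defined in CPT.py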

Poem Generation Methods with CPT


In [1]:
from CPT import *
import pandas as pd

In [2]:
sample_poem = open('sample_sonnets.txt').read().lower().replace('\n', '')  # smaller data sample
all_poem = open('sonnets.txt').read().lower().replace('\n', '')  # larger data sample

In [3]:
def generate_char_seq(whole_str, n):
    """
    Generate a dataframe, each row contains a sequence with length n.
    Next sequence is 1 character of the previous sequence.
    param: whole_str: original text in string format
    param: n: the length of each sequence
    return: a dataframe that contains all the sequences.
    """
    dct = {}
    idx = 0  # the index of the dataframe, the key of each key-value in the dictionary
    
    for i in range(len(whole_str)-n):
        sub_str = whole_str[i:i+n]
        dct[idx] = {}
        for j in range(n):
            dct[idx][j] = sub_str[j]
        idx += 1
    df = pd.DataFrame(dct)
    return df
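To make the sliding-window behaviour concrete, here is a quick sanity check on a toy string (my own example, not the sonnets data):

demo = generate_char_seq("hello world", 4).T  # transpose so each row is one sequence
print(demo.shape)    # (7, 4): len("hello world") - 4 = 7 sequences of length 4
print(demo.head(3))  # row 0: h e l l / row 1: e l l o / row 2: l l o ' '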

Method 1 Part 1 - Character Based Poem Generator (Smaller Sample Data)


In [37]:
# I'm using 410-character sequences to train
train_seq_len = 410
training_df = generate_char_seq(sample_poem, train_seq_len).T
training_df.head()


Out[37]:
0 1 2 3 4 5 6 7 8 9 ... 400 401 402 403 404 405 406 407 408 409
0 i f r o m f a i ... a n d o n l y
1 f r o m f a i r ... a n d o n l y h
2 f r o m f a i r e ... n d o n l y h e
3 r o m f a i r e s ... d o n l y h e r
4 o m f a i r e s t ... o n l y h e r a

5 rows × 410 columns


In [77]:
# The testing data will use 20 characters, and I only choose 7 separate rows from all sequences.
## The poem output will try to predict characters based on the 20 characters in each row, 7 rows in total
test_seq_len = 20
all_testing_df = generate_char_seq(sample_poem, test_seq_len).T
testing_df = all_testing_df.iloc[[77, 99, 177, 199, 277, 299, 410],:]
testing_df.head()


Out[77]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
77 n e v e r d i e , b u t a s t
99 r i p e r s h o u l d b y t i m
177 , c o n t r a c t e d t o t h i n
199 o w n b r i g h t e y e s , f e e
277 a f a m i n e w h e r e a b u n d

In [78]:
# This Python open-source package has somewhat unusual model input requirements, so I will use its own functions to load the data.
training_df.to_csv("train.csv", index=False)
testing_df.to_csv("test.csv")

In [79]:
model = CPT()
train, test = model.load_files("train.csv", "test.csv")
model.train(train)


Out[79]:
True

In [80]:
predict_len = 10
predictions = model.predict(train,test,test_seq_len,predict_len)


100%|████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00,  1.33it/s]

In [81]:
predictions


Out[81]:
[['h', 'o', 'l', 'y', 'w', 'f', 'm', 'g', 'p', 'c'],
 ['a', 'n', 'w', ',', 'g', 'f', 'c', "'", 'v', 'k'],
 ['s', 'l', 'u', 'w', 'm', 'y', 'g', 'f', 'b', 'p'],
 ['a', 'l', 'u', 'd', 'm', 'c', "'", 'p', 'v', '.'],
 ['t', 'o', 's', 'l', ',', 'p', 'y', 'g', 'c', 'v'],
 ['o', 'a', 'n', 'r', 'u', 'd', 'b', 'w', 'm', 'g'],
 ['s', 'i', 'n', 'w', 'b', 'f', 'm', 'c', 'p', 'v']]

Generate the poem


In [82]:
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.append(' ')
    all_char_lst.extend(predictions[i])
    print(''.join(all_char_lst))


 never die, but as t holywfmgpc
 riper should by tim anw,gfc'vk
, contracted to thin sluwmygfbp
own bright eyes, fee aludmc'pv.
a famine where abund tosl,pygcv
ce lies, thy self th oanrudbwmg
herald to the gaudy  sinwbfmcpv

Observation

  • The data size is very small here, but compared with LSTM poem generation, CPT is much faster, and its output is more decent than the LSTM results.

Method 1 Part 2 - Character Based Poem Generator (Larger Sample Data)


In [84]:
# I'm using 410-character sequences to train
train_seq_len = 410
training_df = generate_char_seq(all_poem, train_seq_len).T
training_df.head()


Out[84]:
0 1 2 3 4 5 6 7 8 9 ... 400 401 402 403 404 405 406 407 408 409
0 i f r o m f a i ... a n d o n l y
1 f r o m f a i r ... a n d o n l y h
2 f r o m f a i r e ... n d o n l y h e
3 r o m f a i r e s ... d o n l y h e r
4 o m f a i r e s t ... o n l y h e r a

5 rows × 410 columns


In [91]:
# The testing data will use 30 characters, and I only choose 12 separate rows from all sequences.
## The poem output will try to predict characters based on the 30 characters in each row, 12 rows in total
test_seq_len = 30
all_testing_df = generate_char_seq(all_poem, test_seq_len).T
testing_df = all_testing_df.iloc[[1,2,3,4,5,77, 99, 177, 199, 277, 299, 410],:]
testing_df.head()


Out[91]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
1 f r o m f a i r ... r e s w e d e s
2 f r o m f a i r e ... e s w e d e s i
3 r o m f a i r e s ... s w e d e s i r
4 o m f a i r e s t ... w e d e s i r e
5 m f a i r e s t ... w e d e s i r e

5 rows × 30 columns


In [92]:
training_df.to_csv("all_train.csv", index=False)
testing_df.to_csv("all_test.csv")

In [93]:
model = CPT()
train, test = model.load_files("all_train.csv", "all_test.csv")
model.train(train)


Out[93]:
True

In [94]:
predict_len = 10  # predict the next 10 characters
predictions = model.predict(train,test,test_seq_len,predict_len)


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [05:04<00:00, 25.50s/it]

Generate the poem


In [95]:
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.append(' ')
    all_char_lst.extend(predictions[i])
    print(''.join(all_char_lst))


 from fairest creatures we des hnl,ybgpvk
from fairest creatures we desi nhl,gybvpk
rom fairest creatures we desir hnl,ygbv'p
om fairest creatures we desire nhl,ygbp'.
m fairest creatures we desire  honlbpgyvk
 never die, but as the riper s olmwfycgk'
 riper should by time decease, nwfgvk'.;x
, contracted to thine own brig slumyfvp.'
own bright eyes, feed'st thy l aumvcpk;.x
a famine where abundance lies, toygpvk'.x
ce lies, thy self thy foe, to  arnwmubdpg
herald to the gaudy spring, wi mfcbv'k;.:

Observations

  • At the very beginning, I tried to generate the poem with 7 selected rows, each row using 20 characters to predict the next 10 characters. It gave exactly the same output as the smaller data sample.
    • This may indicate that, when the selected testing data is very small and tends to be unique in the training data, a smaller data input is enough to get the results.
  • Then I changed to 12 rows, each row using 30 characters to predict the next 10 characters.
    • The first 5 rows are consecutive, which means each row is a 1-character shift from its previous row. If you check their output, although it cannot be called accurate, the 5 rows have similar predictions.
  • I think CPT can be more accurate when repeated sub-sequences appear in the training data, because the algorithm behind CPT is a prediction tree + inverted index + lookup table, similar to FP-growth in transaction prediction: the more repetition, the more accurate the predictions. A rough sketch of this idea follows this list.
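As a rough illustration of that idea, below is a toy sketch of the three structures. This is a hypothetical simplification for intuition only, not the actual code in NeerajSarwan/CPT:

from collections import defaultdict, Counter

class TinyCPT:
    def __init__(self):
        self.root = {}                    # prediction tree: nested dicts, one branch per training sequence
        self.inverted = defaultdict(set)  # inverted index: symbol -> ids of sequences containing it
        self.lookup = {}                  # lookup table: sequence id -> the stored sequence

    def train(self, sequences):
        for sid, seq in enumerate(sequences):
            node = self.root
            for sym in seq:  # insert the sequence as a branch of the tree
                node = node.setdefault(sym, {})
                self.inverted[sym].add(sid)
            self.lookup[sid] = list(seq)

    def predict(self, test_seq, k=1):
        # training sequences that contain every symbol of the test sequence are "similar"
        id_sets = [self.inverted[s] for s in test_seq if s in self.inverted]
        if not id_sets:
            return []
        similar = set.intersection(*id_sets)
        scores = Counter()
        test_syms = set(test_seq)
        for sid in similar:
            seq = self.lookup[sid]
            # count the symbols that appear after the last symbol matching the test sequence
            last = max(i for i, s in enumerate(seq) if s in test_syms)
            scores.update(seq[last + 1:])
        return [sym for sym, _ in scores.most_common(k)]

cpt = TinyCPT()
cpt.train(["abc", "abd", "bcd"])
print(cpt.predict("ab", k=2))  # e.g. ['c', 'd'] -- both follow "ab" in the training data

In this sketch, the more often a sub-sequence repeats in the training data, the more similar sequences vote for the same next symbol, which matches the observation above.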

Method 2 - Word Based Poem Generation


In [4]:
all_words = all_poem.split()
print(len(all_words))


17670

In [5]:
# Train with the selected sequence length (generate_char_seq also works on a list of words, since it only indexes its input)
train_seq_len = 1000
training_df = generate_char_seq(all_words, train_seq_len).T
training_df.head()


Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 990 991 992 993 994 995 996 997 998 999
0 i from fairest creatures we desire increase, that thereby beauty's ... commits. x for shame! deny that thou bear'st love to
1 from fairest creatures we desire increase, that thereby beauty's rose ... x for shame! deny that thou bear'st love to any,
2 fairest creatures we desire increase, that thereby beauty's rose might ... for shame! deny that thou bear'st love to any, who
3 creatures we desire increase, that thereby beauty's rose might never ... shame! deny that thou bear'st love to any, who for
4 we desire increase, that thereby beauty's rose might never die, ... deny that thou bear'st love to any, who for thy

5 rows × 1000 columns


In [15]:
# The testing data will use 20 words, and I only choose 10 separate rows from all sequences.
## The poem output will try to predict words based on the 20 words in each row, 10 rows in total
test_seq_len = 20
output_poem_rows = 10

all_testing_df = generate_char_seq(all_words, test_seq_len).T
selected_row_idx_lst = [train_seq_len*i for i in range(output_poem_rows)]
testing_df = all_testing_df.iloc[selected_row_idx_lst,:]
testing_df.head()


Out[15]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 i from fairest creatures we desire increase, that thereby beauty's rose might never die, but as the riper should by
1000 any, who for thy self art so unprovident. grant, if thou wilt, thou art belov'd of many, but that thou
2000 when in eternal lines to time thou grow'st, so long as men can breathe, or eyes can see, so long
3000 my thoughts--from far where i abide-- intend a zealous pilgrimage to thee, and keep my drooping eyelids open wide, looking
4000 must be twain, although our undivided loves are one: so shall those blots that do with me remain, without thy

In [16]:
training_df.to_csv("all_train_words.csv", index=False)
testing_df.to_csv("all_test_words.csv")

In [17]:
model = CPT()
train, test = model.load_files("all_train_words.csv", "all_test_words.csv")
model.train(train)


Out[17]:
True

In [18]:
predict_len = 10  # predict the next 10 words
predictions = model.predict(train,test,test_seq_len,predict_len)


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:31<00:00,  4.23s/it]

Generate the poem


In [21]:
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.extend(predictions[i])
    print(' '.join(all_char_lst))


i from fairest creatures we desire increase, that thereby beauty's rose might never die, but as the riper should by in world an for love children's eyes, her husband's shape
any, who for thy self art so unprovident. grant, if thou wilt, thou art belov'd of many, but that thou and you in your the i to this when with
when in eternal lines to time thou grow'st, so long as men can breathe, or eyes can see, so long and the of for my with thy i is that
my thoughts--from far where i abide-- intend a zealous pilgrimage to thee, and keep my drooping eyelids open wide, looking the of with in by that for all when night
must be twain, although our undivided loves are one: so shall those blots that do with me remain, without thy my i thou and to is love for all it
both sea and land, as soon as think the place where he would be. but, ah! thought kills me that my of i to with in from love doth mine
millions of strange shadows on you tend? since every one, hath every one, one shade, and you but one, can the your in that to shall doth be which nor
all too near. lxii sin of self-love possesseth all mine eye and all my soul, and all my every part; i the his with to when so for is in
to the world that i am fled from this vile world with vilest worms to dwell: nay, if you read and my of me in love thou for which is
praise to thee, but what in thee doth live. then thank him not for that which he doth say, since your the of and i my you shall a as

Summary

  • After trying LSTM and CPT on the same data (Shakespeare's sonnets), with both character-based and the simplest word-based methods, we can see that word-based CPT creates the most decent results. Looking at the final output, it just looks like a poem (difficult to understand); you won't feel anything is wrong at first glance.
  • I think the Python CPT open source is really good: you don't need to convert characters or words into numerical data, categorical data can be used as input directly, and both model training and prediction are very fast.