CPT Poem Generator

How to use CPT

In your terminal, type git clone https://github.com/NeerajSarwan/CPT.git to download CPT.
- If you don't have git, download and install it first.
After downloading CPT, type cd CPT to enter the folder.
Your code should be created inside this folder, so that it can import CPT and run the code below.
Python CPT open source: https://github.com/NeerajSarwan/CPT/blob/master/CPT.py

Poem Generation Methods with CPT

Character based poem generation
Word based poem generation
To compare with LSTM poem generator: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/sequencial_analysis/try_poem_generator.ipynb

Download the sonnets text from: https://github.com/pranjal52/text_generators/blob/master/sonnets.txt



In [1]:

    
from CPT import *
import pandas as pd



In [2]:

    
sample_poem = open('sample_sonnets.txt').read().lower().replace('\n', '')  # smaller data sample
all_poem = open('sonnets.txt').read().lower().replace('\n', '')  # larger data sample



In [3]:

    
def generate_char_seq(whole_str, n):
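    """
    Generate a dataframe in which each column holds one sequence of length n
    (callers transpose it with .T so that each row is a sequence).
    Each sequence is shifted by 1 element from the previous sequence.
    Note: the loop runs over range(len(whole_str) - n), so the final window is not included.
    param: whole_str: original text as a string (a list of words also works)
    param: n: the length of each sequence
    return: a dataframe that contains all the sequences.
    """
    dct = {}
    idx = 0  # sequence counter, used as the key of each entry in the dictionary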
    
    for i in range(len(whole_str)-n):
        sub_str = whole_str[i:i+n]
        dct[idx] = {}
        for j in range(n):
            dct[idx][j] = sub_str[j]
        idx += 1
    df = pd.DataFrame(dct)
    return df
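
To see what generate_char_seq produces, here is a minimal pandas-free sketch of the same sliding-window idea on a toy string. The helper name sliding_windows is an assumption made for illustration; it is not part of the notebook or of CPT.

```python
def sliding_windows(items, n):
    """Return windows of length n, each shifted one position from the previous one."""
    # Works for a string (characters) or a list (words).
    # range(len(items) - n) mirrors the notebook's loop, which stops one window short of the end.
    return [list(items[i:i + n]) for i in range(len(items) - n)]


windows = sliding_windows("from fairest", 5)
for row in windows[:3]:
    print(row)
```

Each printed row starts one character later than the row before it, which is the same layout as the transposed dataframes shown in the following cells.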

Method 1 Part 1 - Character Based Poem Generator (Smaller Sample Data)



In [37]:

    
# Train on sequences of 410 characters
train_seq_len = 410
training_df = generate_char_seq(sample_poem, train_seq_len).T
training_df.head()









    Out[37]:

      0   1   2   3   4   5   6   7   8   9  ...  400  401  402  403  404  405  406  407  408  409
0     i       f   r   o   m       f   a   i  ...         a    n    d         o    n    l    y
1         f   r   o   m       f   a   i   r  ...    a    n    d         o    n    l    y         h
2     f   r   o   m       f   a   i   r   e  ...    n    d         o    n    l    y         h    e
3     r   o   m       f   a   i   r   e   s  ...    d         o    n    l    y         h    e    r
4     o   m       f   a   i   r   e   s   t  ...         o    n    l    y         h    e    r    a

5 rows × 410 columns



In [77]:

    
# The testing data uses 20 characters per row, and I only choose 7 separate rows from all sequences.
## The poem output will try to predict the next characters based on the 20 characters in each row, 7 rows in total
test_seq_len = 20
all_testing_df = generate_char_seq(sample_poem, test_seq_len).T
testing_df = all_testing_df.iloc[[77, 99, 177, 199, 277, 299, 410],:]
testing_df.head()



In [78]:

    
# This open source Python library has somewhat unusual input requirements, so I will use its own functions to load the data.
training_df.to_csv("train.csv", index=False)
testing_df.to_csv("test.csv")



In [79]:

    
model = CPT()
train, test = model.load_files("train.csv", "test.csv")
model.train(train)









    Out[79]:





True



In [80]:

    
predict_len = 10
predictions = model.predict(train,test,test_seq_len,predict_len)









    



100%|████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00,  1.33it/s]



In [81]:

    
predictions









    Out[81]:





[['h', 'o', 'l', 'y', 'w', 'f', 'm', 'g', 'p', 'c'],
 ['a', 'n', 'w', ',', 'g', 'f', 'c', "'", 'v', 'k'],
 ['s', 'l', 'u', 'w', 'm', 'y', 'g', 'f', 'b', 'p'],
 ['a', 'l', 'u', 'd', 'm', 'c', "'", 'p', 'v', '.'],
 ['t', 'o', 's', 'l', ',', 'p', 'y', 'g', 'c', 'v'],
 ['o', 'a', 'n', 'r', 'u', 'd', 'b', 'w', 'm', 'g'],
 ['s', 'i', 'n', 'w', 'b', 'f', 'm', 'c', 'p', 'v']]

Generate the poem



In [82]:

    
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.append(' ')
    all_char_lst.extend(predictions[i])
    print(''.join(all_char_lst))









    



 never die, but as t holywfmgpc
 riper should by tim anw,gfc'vk
, contracted to thin sluwmygfbp
own bright eyes, fee aludmc'pv.
a famine where abund tosl,pygcv
ce lies, thy self th oanrudbwmg
herald to the gaudy  sinwbfmcpv

Observation

The data size is very small here. But compared with LSTM poem generation, CPT is much faster, and the output is more decent than the LSTM results.

Method 1 Part 2 - Character Based Poem Generator (Larger Sample Data)



In [84]:

    
# Train on sequences of 410 characters
train_seq_len = 410
training_df = generate_char_seq(all_poem, train_seq_len).T
training_df.head()









    Out[84]:

      0   1   2   3   4   5   6   7   8   9  ...  400  401  402  403  404  405  406  407  408  409
0     i       f   r   o   m       f   a   i  ...         a    n    d         o    n    l    y
1         f   r   o   m       f   a   i   r  ...    a    n    d         o    n    l    y         h
2     f   r   o   m       f   a   i   r   e  ...    n    d         o    n    l    y         h    e
3     r   o   m       f   a   i   r   e   s  ...    d         o    n    l    y         h    e    r
4     o   m       f   a   i   r   e   s   t  ...         o    n    l    y         h    e    r    a

5 rows × 410 columns



In [91]:

    
# The testing data uses 30 characters per row, and I choose 12 separate rows from all sequences.
## The poem output will try to predict the next characters based on the 30 characters in each row, 12 rows in total
test_seq_len = 30
all_testing_df = generate_char_seq(all_poem, test_seq_len).T
testing_df = all_testing_df.iloc[[1,2,3,4,5,77, 99, 177, 199, 277, 299, 410],:]
testing_df.head()









    Out[91]:

      0   1   2   3   4   5   6   7   8   9  ...  20  21  22  23  24  25  26  27  28  29
1         f   r   o   m       f   a   i   r  ...   r   e   s       w   e       d   e   s
2     f   r   o   m       f   a   i   r   e  ...   e   s       w   e       d   e   s   i
3     r   o   m       f   a   i   r   e   s  ...   s       w   e       d   e   s   i   r
4     o   m       f   a   i   r   e   s   t  ...       w   e       d   e   s   i   r   e
5     m       f   a   i   r   e   s   t      ...   w   e       d   e   s   i   r   e

5 rows × 30 columns



In [92]:

    
training_df.to_csv("all_train.csv", index=False)
testing_df.to_csv("all_test.csv")



In [93]:

    
model = CPT()
train, test = model.load_files("all_train.csv", "all_test.csv")
model.train(train)









    Out[93]:





True



In [94]:

    
predict_len = 10  # predict the next 10 characters
predictions = model.predict(train,test,test_seq_len,predict_len)









    



100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [05:04<00:00, 25.50s/it]

Generate the poem



In [95]:

    
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.append(' ')
    all_char_lst.extend(predictions[i])
    print(''.join(all_char_lst))









    



 from fairest creatures we des hnl,ybgpvk
from fairest creatures we desi nhl,gybvpk
rom fairest creatures we desir hnl,ygbv'p
om fairest creatures we desire nhl,ygbp'.
m fairest creatures we desire  honlbpgyvk
 never die, but as the riper s olmwfycgk'
 riper should by time decease, nwfgvk'.;x
, contracted to thine own brig slumyfvp.'
own bright eyes, feed'st thy l aumvcpk;.x
a famine where abundance lies, toygpvk'.x
ce lies, thy self thy foe, to  arnwmubdpg
herald to the gaudy spring, wi mfcbv'k;.:

Observations

At the very beginning, I tried to generate the poem with the same 7 selected rows, each row using 20 characters to predict the next 10 characters. It gave exactly the same output as the smaller data sample.
- This may indicate that, when the selected testing data is very small and tends to be unique in the training data, a smaller data input is enough to get the results.
Then I changed to 12 rows, each row using 30 characters to predict the next 10 characters.
- The first 5 rows are consecutive, which means each row is shifted by 1 character from its previous row. If you check their output, the predictions cannot be called accurate, but the 5 rows have similar predictions.
I think CPT can be more accurate when repeated sub-sequences appear in the training data, because the algorithm behind CPT is prediction tree + inverted index + lookup table. Similar to FP-growth in transaction prediction, the more repetition, the more accurate.
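
To make the three structures concrete, here is a minimal sketch of a CPT-style predictor in plain Python. It is a toy written under assumptions, not the code in CPT.py: the class name ToyCPT, its methods, and the simple count-based scoring are all made up for illustration, and the real library's scoring and API may differ.

```python
from collections import defaultdict


class Node:
    """One node of the prediction tree (a trie over training sequences)."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}


class ToyCPT:
    """Hypothetical, simplified CPT: prediction tree + inverted index + lookup table."""

    def __init__(self):
        self.root = Node()
        self.inverted_index = defaultdict(set)  # item -> ids of sequences containing it
        self.lookup_table = {}                  # sequence id -> last node of that sequence

    def train(self, sequences):
        for seq_id, seq in enumerate(sequences):
            node = self.root
            for item in seq:
                # Sequences with a shared prefix reuse the same branch of the tree.
                node = node.children.setdefault(item, Node(item, node))
                self.inverted_index[item].add(seq_id)
            self.lookup_table[seq_id] = node

    def _rebuild(self, seq_id):
        # Walk from the stored last node back up to the root to recover the sequence.
        items = []
        node = self.lookup_table[seq_id]
        while node.item is not None:
            items.append(node.item)
            node = node.parent
        return items[::-1]

    def predict(self, target, n=1):
        # Find training sequences that contain every item of the target.
        ids = None
        for item in target:
            found = self.inverted_index.get(item, set())
            ids = set(found) if ids is None else ids & found
        scores = defaultdict(int)
        for seq_id in sorted(ids or ()):
            seq = self._rebuild(seq_id)
            # Consequent: everything after the last occurrence of the target's final item.
            last = len(seq) - 1 - seq[::-1].index(target[-1])
            for item in seq[last + 1:]:
                if item not in target:
                    scores[item] += 1  # simplified scoring: a plain count
        ranked = sorted(scores, key=lambda item: (-scores[item], item))
        return ranked[:n]


model = ToyCPT()
model.train([list("abcd"), list("abce"), list("xbcd")])
print(model.predict(list("bc"), n=2))
```

In this sketch, asking for n items returns the n highest-scoring candidates rather than an ordered continuation. Sub-sequences that repeat in the training data push the same candidates up the ranking, which is one way to read the "more repetition, more accurate" intuition. If the library ranks candidates in a similar way, that would be consistent with the character outputs above containing no repeated characters within a prediction, but this has not been checked against CPT.py.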

Method 2 - Word Based Poem Generation



In [4]:

    
all_words = all_poem.split()
print(len(all_words))
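
A small sketch of the difference between the two tokenizations used in this notebook, on a single toy line (the variable names are illustrative only). Note that str.split() with no arguments only splits on whitespace, so punctuation stays attached to the word.

```python
line = "from fairest creatures we desire increase,"

chars = list(line)    # character based: every character, including spaces, is one item
words = line.split()  # word based: whitespace-separated tokens, punctuation kept

print(len(chars), chars[:5])
print(len(words), words)
```

The word-based sequences are much shorter for the same text, and tokens such as "increase," keep their trailing comma.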



In [5]:

    
# With selected sequence length to train
train_seq_len = 1000
training_df = generate_char_seq(all_words, train_seq_len).T
training_df.head()









    Out[5]:

   0          1          2          3          4          5          6          7          8          9          ...  990       991       992       993       994       995       996       997       998       999
0  i          from       fairest    creatures  we         desire     increase,  that       thereby    beauty's   ...  commits.  x         for       shame!    deny      that      thou      bear'st   love      to
1  from       fairest    creatures  we         desire     increase,  that       thereby    beauty's   rose       ...  x         for       shame!    deny      that      thou      bear'st   love      to        any,
2  fairest    creatures  we         desire     increase,  that       thereby    beauty's   rose       might      ...  for       shame!    deny      that      thou      bear'st   love      to        any,      who
3  creatures  we         desire     increase,  that       thereby    beauty's   rose       might      never      ...  shame!    deny      that      thou      bear'st   love      to        any,      who       for
4  we         desire     increase,  that       thereby    beauty's   rose       might      never      die,       ...  deny      that      thou      bear'st   love      to        any,      who       for       thy

5 rows × 1000 columns



In [15]:

    
# The testing data uses 20 words per row, and I only choose 10 separate rows from all sequences.
## The poem output will try to predict the next words based on the 20 words in each row, 10 rows in total
test_seq_len = 20
output_poem_rows = 10

all_testing_df = generate_char_seq(all_words, test_seq_len).T
selected_row_idx_lst = [train_seq_len*i for i in range(output_poem_rows)]
testing_df = all_testing_df.iloc[selected_row_idx_lst,:]
testing_df.head()









    Out[15]:

      0     1               2        3          4     5          6          7             8        9           10     11     12     13        14       15        16       17       18       19
0     i     from            fairest  creatures  we    desire     increase,  that          thereby  beauty's    rose   might  never  die,      but      as        the      riper    should   by
1000  any,  who             for      thy        self  art        so         unprovident.  grant,   if          thou   wilt,  thou   art       belov'd  of        many,    but      that     thou
2000  when  in              eternal  lines      to    time       thou       grow'st,      so       long        as     men    can    breathe,  or       eyes      can      see,     so       long
3000  my    thoughts--from  far      where      i     abide--    intend     a             zealous  pilgrimage  to     thee,  and    keep      my       drooping  eyelids  open     wide,    looking
4000  must  be              twain,   although   our   undivided  loves      are           one:     so          shall  those  blots  that      do       with      me       remain,  without  thy
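
The row selection above steps through the text with a stride of train_seq_len, so the 10 seed rows start 1000 words apart. A quick check of that arithmetic, using the same values as the notebook (seed_spans and overlaps are names introduced here for illustration):

```python
train_seq_len = 1000   # window length used for training
test_seq_len = 20      # number of seed words per output line
output_poem_rows = 10

start_positions = [train_seq_len * i for i in range(output_poem_rows)]

# Each seed covers the words in [start, start + test_seq_len).
seed_spans = [(start, start + test_seq_len) for start in start_positions]
overlaps = any(
    prev_end > next_start
    for (_, prev_end), (next_start, _) in zip(seed_spans, seed_spans[1:])
)

print(start_positions)
print(overlaps)
```

Because the stride is larger than test_seq_len, no two seeds share any words, unlike the consecutive character rows used in Method 1 Part 2.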



In [16]:

    
training_df.to_csv("all_train_words.csv", index=False)
testing_df.to_csv("all_test_words.csv")



In [17]:

    
model = CPT()
train, test = model.load_files("all_train_words.csv", "all_test_words.csv")
model.train(train)









    Out[17]:





True



In [18]:

    
predict_len = 10  # predict the next 10 words
predictions = model.predict(train,test,test_seq_len,predict_len)









    



100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:31<00:00,  4.23s/it]

Generate the poem



In [21]:

    
for i in range(testing_df.shape[0]):
    all_char_lst = testing_df.iloc[i].tolist()
    all_char_lst.extend(predictions[i])
    print(' '.join(all_char_lst))









    



i from fairest creatures we desire increase, that thereby beauty's rose might never die, but as the riper should by in world an for love children's eyes, her husband's shape
any, who for thy self art so unprovident. grant, if thou wilt, thou art belov'd of many, but that thou and you in your the i to this when with
when in eternal lines to time thou grow'st, so long as men can breathe, or eyes can see, so long and the of for my with thy i is that
my thoughts--from far where i abide-- intend a zealous pilgrimage to thee, and keep my drooping eyelids open wide, looking the of with in by that for all when night
must be twain, although our undivided loves are one: so shall those blots that do with me remain, without thy my i thou and to is love for all it
both sea and land, as soon as think the place where he would be. but, ah! thought kills me that my of i to with in from love doth mine
millions of strange shadows on you tend? since every one, hath every one, one shade, and you but one, can the your in that to shall doth be which nor
all too near. lxii sin of self-love possesseth all mine eye and all my soul, and all my every part; i the his with to when so for is in
to the world that i am fled from this vile world with vilest worms to dwell: nay, if you read and my of me in love thou for which is
praise to thee, but what in thee doth live. then thank him not for that which he doth say, since your the of and i my you shall a as

Summary

After trying LSTM and CPT on the same data (Shakespeare's sonnets), with both character based and the simplest word based methods, we can see that word based CPT creates more decent results. Look at the final output: it looks just like a poem (difficult to understand), and at first glance you won't feel anything is wrong.
I think Python CPT is really good: it's open source, you don't need to convert characters or words into numerical data because categorical data can be the input, and both model training and prediction are very fast.

